LLDB: FreeBSD Legacy Process Plugin Removed
By Michał Górny
Moritz Systems have been contracted by the FreeBSD Foundation to continue our work on modernizing the LLDB debugger’s support for FreeBSD.
The complete Project Schedule is divided into four milestones, each taking approximately one month:
- M1 Switch all the non-x86 CPUs to the LLDB FreeBSD Remote-Process-Plugin.
- M2 Iteration over regression tests on ARM64 and fixing known bugs, marking the non-trivial ones for future work. Remove the old local-only Process-Plugin.
- M3 Implement follow-fork and follow-vfork operations on par with the GNU GDB support. Cover the functionality with LLDB regression tests.
- M4 Implement SaveCore functionality for FreeBSD and enhance the regression testing of core files in LLDB. Update the FreeBSD manual.
During the past month we’ve successfully removed the legacy FreeBSD plugin and continued improving the new one. We have prepared an implementation of hardware breakpoint and watchpoint support for FreeBSD/AArch64, and iterated over all tests that currently fail on that platform. Therefore, we have concluded the second milestone.
Building FreeBSD for different ABIs
Architectures with multiple ABIs
In the previous report, we briefly explained how to perform a cross-build of FreeBSD and LLVM for a different architecture, using AArch64 as an example. We would like to expand on that a bit, giving specific examples for the other architectures we’ve been working on and covering the different ABIs on architectures that have more than one.
An Operating System Application Binary Interface (ABI) defines the interface between compiled applications, libraries and the Operating System. It covers a wide range of aspects necessary for interoperability, for example the executable format, the usage rules for registers, and the method of passing parameters and return values.
The program and all its dependent libraries must use the same ABI to work correctly, and the Operating System’s kernel must support that ABI. An ABI is generally defined per platform, though some platforms have had multiple ABIs. For example, the FreeBSD/ARM port historically used OABI, then switched to EABI.
The existence of multiple parallel ABIs for the same platform is often dictated by two factors: endianness and presence of hardware Floating-Point Unit (FPU).
Endianness specifies the order in which the bytes of a word (e.g. an integer) are stored. On Big Endian systems, the most significant byte (i.e. the one with the highest weight) is stored first, similarly to how humans write numbers starting with the most significant digit. On Little Endian systems, the opposite happens: the least significant byte is written first, and the most significant last. Other kinds of endianness do exist but they are of little significance.
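As a quick illustration (a minimal standalone C example, not code from LLDB or FreeBSD), the byte order of the host can be observed by inspecting how a 32-bit integer is laid out in memory:
#include <stdio.h>
#include <stdint.h>
#include <string.h>

int main(void) {
    uint32_t value = 0x11223344;
    unsigned char bytes[sizeof(value)];
    memcpy(bytes, &value, sizeof(value));
    /* A Little Endian host prints "44 33 22 11", a Big Endian host "11 22 33 44". */
    for (size_t i = 0; i < sizeof(bytes); i++)
        printf("%02x ", bytes[i]);
    printf("\n");
    return 0;
}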
Many modern architectures are bi-endian — that is, can either run in Big Endian or Little Endian mode. Today Little Endian is more popular, primarily because this is the mode used in the x86 architecture. It is also frequently chosen for bi-endian architectures to avoid compatibility problems with buggy software.
Interoperability between computer systems using different endianness requires conversion to a common variant. The base TCP/IP protocols use Big Endian encoding for integers, while many other protocols, particularly those originating from the x86 world, use Little Endian. Some data formats support both variants; for example, a UFS filesystem can be created as either Little Endian or Big Endian to match the host platform. Similarly, the wide-character Unicode encodings UTF-16, UCS-2 and UCS-4 exist in both Little Endian and Big Endian variants, and are often preceded by a Byte-Order Mark that can be used to recognize which one is used.
Processors with optional FPUs often support softfloat and hardfloat ABIs. The hardfloat ABI assumes that the FPU is present and floating-point parameters can be passed via the dedicated FPU registers. On the other hand, the softfloat ABI assumes that all floating-point operations must be emulated in software, and therefore floating-point parameters need to be passed elsewhere. Technically, it is also possible to use hardware FPU for arithmetic while using softfloat ABI for argument passing, with a little overhead necessary to move arguments from/to FPU registers.
; double add(double a, double b) {
; return a + b;
; }
; hardfloat version (gnueabihf)
; in: d0, d1
; out: d0
add_hardfp:
vadd.f64 d0, d0, d1
bx lr
; hardfloat version with softfloat ABI (gnueabi)
; in: {r0, r1}, {r2, r3}
; out: {r0, r1}
add_softfp:
vmov d0, r0, r1
vmov d1, r2, r3
vadd.f64 d0, d0, d1
vmov r0, r1, d0
bx lr
; softfloat version (-mfloat-abi=soft)
; in: {r0, r1}, {r2, r3}
; out: {r0, r1}
add_soft:
push {r11, lr}
bl __aeabi_dadd
pop {r11, pc}
The above snippet demonstrates a trivial function that adds two double precision numbers and returns the result. In the hardfloat ABI, the parameters are passed in the dedicated 64-bit registers d0 and d1, and the result is returned in d0. In the softfloat ABI, the doubles are split across the 32-bit register pairs r0:r1 and r2:r3 respectively, and the result is returned in the first pair. This implies that when using the softfloat ABI on hardware with an FPU, the data needs to be moved between General-Purpose Registers and FPU Registers. Finally, the pure softfloat version invokes a compiler runtime function (__aeabi_dadd) that performs the addition in software.
Choosing the ABI for cross-builds
The following table summarizes available ABIs for the processors discussed in our previous report.
| Arch | Endian | FPU? | FreeBSD TARGET_ARCH | Clang triplet (CHOST) | Extra CC args |
|---|---|---|---|---|---|
| ARMv7 | Little Endian | hard | armv7 | armv7-unknown-freebsd13.0-gnueabihf | |
| AArch64 | Little Endian | hard | aarch64 | aarch64-unknown-freebsd13.0 | |
| AArch64 | Big Endian | hard | not supported | aarch64_be-unknown-freebsd13.0 | |
| MIPS64 | Big Endian | soft | mips64 | mips64-unknown-freebsd13.0 | |
| MIPS64 | Big Endian | hard | mips64hf | mips64-unknown-freebsd13.0 | -mhard-float |
| MIPS64 | Little Endian | soft | mips64el | mips64el-unknown-freebsd13.0 | |
| MIPS64 | Little Endian | hard | mips64elhf | mips64el-unknown-freebsd13.0 | -mhard-float |
| PPC64 | Big Endian | hard | powerpc64 | powerpc64-unknown-freebsd13.0 | |
| PPC64 | Little Endian | hard | powerpc64le | powerpc64le-unknown-freebsd13.0 | |
Hardware Breakpoints and Watchpoints on AArch64
Breakpoints and Watchpoints, Hardware and Software
Breakpoints and watchpoints belong to a category of contraptions collectively called traps. A trap interrupts the program execution whenever a specific condition occurs. When the program is running under a debugger, the trap generally causes the control to be returned to the debugger.
Breakpoints are traps that are triggered when a specific code location is executed. Watchpoints are triggered when a specific memory location (e.g. a variable) is accessed. Both kinds of traps can be implemented in hardware or emulated in software. Hardware implementations are generally more performant, as the processor itself checks for the monitored condition. However, they usually have limitations, most notably in the number of hardware traps available. Whenever using a hardware implementation is not feasible, the debugger can emulate breakpoints and watchpoints in software.
Software breakpoints are easy to implement and have a relatively small performance impact. The common method of implementing them is to overwrite the memory at the monitored location with an instruction that explicitly triggers the debugger (e.g. an int3 instruction on x86). When the process executes this instruction, control is returned to the debugger as if a hardware breakpoint was triggered. When the debugger is about to resume the program, it temporarily restores the original code, single-steps through it and then reintroduces the software trap before actually resuming the execution. One particular limitation of this implementation is that it can only be used on writable memory; software breakpoints cannot be used to debug code stored in ROM.
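As a rough illustration of the technique, here is a simplified sketch built on FreeBSD’s ptrace(2) interface (not the actual LLDB implementation), which patches an int3 opcode into the traced process and remembers the original code word:
#include <sys/types.h>
#include <sys/ptrace.h>
#include <errno.h>

/* Simplified sketch: plant an int3 software breakpoint at addr in the traced
   process pid, assuming a little-endian x86 target. A real debugger keeps the
   saved word so it can restore it before resuming over the breakpoint. */
static int set_soft_breakpoint(pid_t pid, caddr_t addr, int *saved_word)
{
    errno = 0;
    int word = ptrace(PT_READ_I, pid, addr, 0);   /* read the original code word */
    if (word == -1 && errno != 0)
        return -1;
    *saved_word = word;
    int patched = (word & ~0xff) | 0xcc;          /* replace the first byte with int3 */
    return ptrace(PT_WRITE_I, pid, addr, patched);
}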
Software watchpoints are harder. LLDB does not implement them at all. GDB implements them partially, by single-stepping through the program and explicitly checking whether the variable’s value has changed. Naturally, this implies that it is impossible to monitor reads of the variable, or writes that store the same value. A more versatile implementation would require analyzing the operands of the executed instructions, and would therefore probably be best done via an emulator.
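A write watchpoint emulated in that style could look roughly like the following sketch, again built on FreeBSD’s ptrace(2) rather than on GDB’s actual internals:
#include <sys/types.h>
#include <sys/ptrace.h>
#include <sys/wait.h>

/* Simplified sketch: single-step the traced process until the watched word
   changes. Reads, and writes of the same value, go unnoticed, which is
   exactly the limitation described above. */
static void watch_for_write(pid_t pid, caddr_t addr)
{
    int old_value = ptrace(PT_READ_D, pid, addr, 0);
    for (;;) {
        int status;
        ptrace(PT_STEP, pid, (caddr_t)1, 0);      /* execute one instruction */
        waitpid(pid, &status, 0);
        if (WIFEXITED(status))
            break;                                /* the process has exited */
        if (ptrace(PT_READ_D, pid, addr, 0) != old_value)
            break;                                /* the watched value changed */
    }
}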
Hardware Breakpoints and Watchpoints on x86 and ARM
Both x86 and ARM architectures implement hardware breakpoints and watchpoints. However, their respective implementations and the exposed programmer’s interfaces are different enough to justify a comparison.
The x86 architecture provides a fixed number of 4 traps, and each of them can be used either as a breakpoint or a watchpoint. This implies that every breakpoint set reduces the number of available watchpoints, and vice versa. On the other hand, ARM keeps breakpoints and watchpoints entirely separate. The architecture permits the processor to provide between 2 and 16 breakpoints and watchpoints each.
x86 uses 6 debug registers (there are 8 on i386 and 16 on amd64 in total but others are reserved). The registers DR0 through DR3 are used to set the addresses for the four available traps, DR6 is used as a status register, while DR7 is used as a control register for all of them.
ARM provides a pair of debug registers — a value (i.e. address register) and a control register — for every breakpoint and watchpoint. On AArch64, the DBGBCRi_EL1 and DBGBVRi_EL1 registers are respectively breakpoint control registers and value registers, while DBGWCRi_EL1 and DBGWVRi_EL1 are respectively watchpoint control and value registers (on AArch32, the corresponding registers do not have the _EL1 suffix). Additionally, the ID_AA64DFR0_EL1 register provides the number of available breakpoints and watchpoints.
x86 allows breakpoints and watchpoints to be set on arbitrary memory locations, with watchpoints monitoring 1, 2, 4 or 8 bytes starting at that location (8 bytes being available only on 64-bit processors). ARM requires the monitored memory address to be aligned to a 32-bit boundary, and allows monitoring a contiguous range of bytes inside the 8-byte memory block at the specified address. Using an appropriate range makes it possible to watch unaligned variables and to set breakpoints on 16-bit Thumb instructions.
Watchpoints on x86 can be set to be triggered by writes, or by reads and writes. The ARM architecture additionally supports triggering them by reads only.
On x86, the debugger can verify which trap was triggered by inspecting the status register for the bits corresponding to the individual traps. On AArch64, the kernel passes the value of the FAR register, which contains the memory address that triggered the trap, and the debugger needs to match it against the traps that are set.
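To make the x86 side more concrete, the following sketch shows how a 4-byte write watchpoint in slot 0 could be encoded into DR0 and DR7, and how DR6 reveals whether it fired. The bit layout follows the Intel manuals; the helper functions are purely illustrative and not part of LLDB:
#include <stdint.h>

/* Illustrative sketch: program debug-register slot 0 as a 4-byte write
   watchpoint. The resulting values would be written into the inferior's
   DR0 and DR7 through the OS-specific debug register interface. */
static void encode_write_watchpoint(uint64_t addr, uint64_t *dr0, uint64_t *dr7)
{
    *dr0 = addr;                 /* DR0 holds the monitored address */
    uint64_t ctrl = 0;
    ctrl |= 1ULL << 0;           /* L0: locally enable slot 0 */
    ctrl |= 0x1ULL << 16;        /* R/W0 = 01: break on data writes */
    ctrl |= 0x3ULL << 18;        /* LEN0 = 11: monitor 4 bytes */
    *dr7 = ctrl;
}

/* After a debug trap, bit 0 of DR6 tells whether slot 0 was the one hit. */
static int slot0_was_hit(uint64_t dr6)
{
    return (dr6 & 1ULL) != 0;
}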
LLDB, pseudo-barriers and AArch64
LLDB features a number of tests that are meant to verify the debugger’s behavior while processing a number of concurrent events. During our testing we have noticed that, when compiled with a recent enough clang, these tests hang and time out on our testing machine, while they seem to work fine in a QEMU VM. Here is what we established.
To achieve the best results (i.e. test coverage), it is useful to make sure that all threads start exhibiting the tested behavior at roughly the same time. Given that normally threads are started one after another, synchronization is required to prevent the threads started earlier from executing too fast.
To achieve that, LLDB uses a so-called pseudo-barrier. A barrier is a threading primitive that requires all threads reaching it to stop until all other threads reach the same barrier. Once all threads reach it, they are all released simultaneously. Therefore, a barrier placed at the beginning of the thread function can ensure that the tested code does not execute until all threads have actually been created and started.
LLDB’s pseudo-barrier implementation is really trivial. It uses a global variable that is initialized by the main program to the number of threads that are going to be started. Every thread decrements this global variable upon reaching the barrier and waits for it to reach zero. When all threads reach the barrier, the final decrement zeroes it and causes all threads to resume.
A decrement operation on a global variable on AArch64 consists of three instructions:
- Fetching the current value of the variable from memory into a register.
- Decrementing the value of the register.
- Storing the new value from the register into the memory.
Since this involves multiple operations, the decrement is not atomic. This means it is susceptible to race conditions — if multiple threads decrement the variable at the same time, the result is undefined. For example, the following could happen:
- Thread 1 fetches x from memory to a register. Simultaneously, thread 2 fetches x from memory. Both threads get the value of 10.
- Both threads decrement the value, obtaining 9 each.
- Both threads store 9 in memory (instead of the expected 8).
The exact details may differ. Depending on the exact timing, the final result may vary from 0 (i.e. if all operations end up being serialized) to 9 (if all happen simultaneously, or a thread fetching the initial value ends up writing last).
To resolve this problem, LLDB’s pseudo-barrier uses atomic operations, that is, operations that are guaranteed to be performed as a single action without any risk of races. On AArch64, an atomic decrement is implemented similarly to a regular decrement, except that instead of regular load/store instructions, load-exclusive and store-exclusive are used. If the variable is written by any other thread between the load-exclusive and the store-exclusive, the latter returns unsuccessfully without storing the new value and the program tries again.
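The overall shape of the pseudo-barrier and of the atomic fix can be sketched in C11 roughly as follows (a minimal illustration of the approach described above; LLDB’s actual test helper is C++ and uses std::atomic):
#include <stdatomic.h>

/* Initialized by the main program to the number of threads to be started. */
static atomic_int pseudo_barrier;

static void pseudo_barrier_wait(void)
{
    /* Atomic decrement: on AArch64 (without the LSE extension) this compiles
       to an ldaxr/sub/stlxr retry loop rather than a plain ldr/sub/str
       sequence, so concurrent decrements cannot be lost. */
    atomic_fetch_sub(&pseudo_barrier, 1);

    /* Spin until every thread has checked in. */
    while (atomic_load(&pseudo_barrier) > 0)
        ;
}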
One of the restrictions of this atomicity model is that if the program performs any other memory access between the load-exclusive and store-exclusive instructions, the latter may fail. This particularly means that if the same thread issues another store between the two, the store-exclusive operation will keep failing, resulting in an infinite loop.
A recent rewrite of RegAllocFast has introduced precisely this problem. As explained in bug #48017, clang now introduces a regular store (the str instruction) between the exclusive ldaxr/stlxr pair. As a result, stlxr never succeeds, the atomic decrement loops forever and all threads block on the barrier indefinitely.
Summary of changes
Changes merged upstream
- [lldb] [Process/FreeBSDRemote] Fix clang-formatting on ppc commit
- [lldb] Remove the legacy FreeBSD plugin
- [lldb] Rename FreeBSDRemote to FreeBSD (NFC)
- [lldb] [Process/FreeBSD] Ensure that errors are always handled
- [lldb] [docs] Update platform support status
- [lldb] [test] Un-XFAIL a test that no longer fail on FreeBSD
- [lldb] [test] Un-XFAIL TestBuiltinTrap on FreeBSD/aarch64
- [lldb] Rename NativeRegisterContext{Watchpoint => DBReg}_x86
- [lldb] [test] Workaround symlink-related test failures
- [lldb] [test] Skip AVX lldb-server test on non-x86 architectures
- [lldb] [Process/FreeBSDRemote] Introduce aarch64 hw break/watchpoint support
- [lldb] [Process/FreeBSD] Introduce mips64 FPU reg support
- [lldb] [test] Update XFAILs for FreeBSD/aarch64
Future plans
Our next milestone focuses on improving support for debugging processes that spawn children (e.g. via fork(2)). We would like to implement GDB’s follow-fork and follow-vfork model. In this model, the debugger can trace only one process at a time; when a child is created, it either continues following the parent or switches to tracing the child process.
This milestone involves both design and implementation work. We will need to integrate this model into the gdb-remote protocol used by LLDB. We will also cover the new functionality with additional tests.
We are also waiting for reviews of our previous patches, and we will continue responding to feedback until they are merged.