Before the BSD Kernel starts: Part One on AMD64

By Maciej Grochowski

November 19, 2020 - 13 minutes read - 2751 words

amd64 bootloader bsd netbsd x86

System initialization is one of the niche areas that few people look into. The exact details vary considerably between platforms, firmware, CPU architectures and operating systems, which makes it difficult to learn it all. Usually, if something goes wrong during the early stages of system startup or the OS does not boot, it rarely has anything to do with the code responsible for booting. Most of the time the cause lies elsewhere, for example in the boot media or the BIOS configuration. However, understanding the early initialization process can help with debugging or with getting familiar with a new platform or hardware.

In this article, I will walk through the early kernel initialization process and define what this term means. System initialization is a broad topic that ranges from the platform’s hardware design all the way up to typical functions of an operating system such as handling I/O operations. It is not possible to cover the entire topic adequately within the scope of one article. In this first part I will describe the well-known 64-bit platform, AMD64. I am going to highlight a very interesting part of the initialization process: the early initialization of the kernel. Later, I will compare it with ARM64. In both cases I will discuss the topic in the context of NetBSD, the operating system known for its portability.

The Bigger Picture

The CPU’s starting point is called the reset vector: after reset, the CPU fetches and executes its first instruction from physical address 0xFFFFFFF0. The firmware must therefore always place a jump to its initialization code in these top 16 bytes of the address space. At this point the CPU is in a variant of real mode often called unreal mode. Ordinary 16-bit addressing with segments can reach only about 1 MiB of memory, but after reset the base field of the CS descriptor cache holds a special fixed 32-bit value: 0xFFFF0000. (In real mode software can change only the visible 16-bit part of CS; the hidden part, which includes the base, is set on reset.) Thanks to this, the instruction pointer addresses memory relative to the last 64 KiB of the 32-bit physical address space, which is usually wired to read-only flash memory holding part of the platform firmware (BIOS/UEFI).
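The address arithmetic behind the reset vector can be sketched in a few lines of C. This is only an illustration: the function names are hypothetical, and the instruction pointer value 0xFFF0 is not stated in the text above; it is the architectural reset value that, added to the hidden base, yields 0xFFFFFFF0.

```c
#include <stdint.h>

/* Address of an instruction fetch: the hidden CS base plus the
 * 16-bit instruction pointer. */
uint32_t fetch_address(uint32_t cs_base, uint16_t ip)
{
    return cs_base + ip;
}

/* State right after reset: CS base 0xFFFF0000 (from the text),
 * IP 0xFFF0 (assumed here, not taken from the article). */
uint32_t reset_vector(void)
{
    return fetch_address(0xFFFF0000u, 0xFFF0u);
}

/* Number of bytes between the reset vector and the top of the
 * 32-bit address space, i.e. the room left for the first jump. */
uint32_t bytes_above_reset_vector(void)
{
    return 0xFFFFFFFFu - reset_vector() + 1u;
}
```

The second helper shows why the firmware has room for little more than a single jump at that location.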

BIOS or UEFI?

BIOS (Basic Input/Output System) is the term used for the legacy platform initialization firmware and for the interface between the operating system and the platform. It is used mostly with IBM PC compatible machines, such as personal computers and servers. UEFI (Unified Extensible Firmware Interface), on the other hand, is a generic specification rather than a particular implementation; like the BIOS, it defines an interface between the operating system and the platform firmware. The goal of UEFI is to replace the legacy interfaces. It is also designed to be universal: it can be applied to PCs and servers as well as embedded devices. This newer standard was developed to overcome limitations of the older ones, such as the 16-bit processor mode with 1 MB of addressable space or the maximum size of a hard drive from which the operating system can be booted. It also brings new features such as Secure Boot and UEFI runtime services. Describing UEFI and how it differs from BIOS is out of the scope of this article; what is important to know is that both BIOS-based and UEFI-based firmware perform platform initialization and later load the operating system from a physical medium. The way the system is loaded differs between UEFI and BIOS: the newer standard allows for more advanced functionality, such as the GPT partition layout, whereas the BIOS operates on boot sectors. For this article, we will start with the legacy boot process based on the Master Boot Record (MBR). The topic of UEFI can be covered in the future if needed.

Legacy BIOS

When the CPU starts after reset, most of the platform hardware is not ready to use: system memory, connected as DIMM modules, is not yet detected and initialized, timers and interrupts aren’t ready, and the PCI bus is not working yet. The hardware has to be initialized, and that is the essential role of the platform firmware. A curious reader can find a more detailed description of the initialization process in Minimal Boot Loader for Intel(R) Architecture; here I will point out only the critical functionality. At the beginning, the firmware initialization code needs to initialize the CPU and the platform chipset; only then can it prepare memory for use. Once the memory is operational, in a phase called post-memory initialization, the firmware copies itself from the slow flash memory to system DRAM. The initialization code can continue only after it has prepared its software environment, such as the stack and the CPU mode. When the CPU jumps to a memory address below 1 MB in DRAM (this memory region is historically reserved for that purpose), it still has many things to do before it is able to communicate with external devices. In the last phase, I/O devices are initialized and the PCI bus is enumerated. Once that is done, the initialization code searches for a legacy operating system to boot, loads the MBR sector from disk into memory and executes it.

The BIOS loads the first sector, called the MBR (512 bytes), from the beginning of the hard disk. That sector must end with the magic number (also called a signature) 0xAA55. The sector contains instructions whose job is to load further sectors into memory and execute a higher-level bootstrap program, for a simple reason: only so many instructions fit into 512 bytes. Only in that way can we have a more complex program that will find and execute the kernel of the operating system. Before I describe the process of executing the kernel and making it operational in long mode, we need to know what the starting point of a typical UNIX kernel is.
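The signature check described above can be sketched in C. The function name is hypothetical, and the sketch assumes the usual on-disk layout in which the 16-bit value 0xAA55 is stored little-endian in the last two bytes of the sector (offsets 510 and 511).

```c
#include <stdint.h>

#define SECTOR_SIZE   512
#define MBR_SIGNATURE 0xAA55u

/* Returns 1 if the 512-byte sector ends with the boot signature.
 * The value is stored little-endian: 0x55 at offset 510 and 0xAA
 * at offset 511. */
int mbr_has_signature(const uint8_t sector[SECTOR_SIZE])
{
    uint16_t sig = (uint16_t)(sector[510] | (sector[511] << 8));
    return sig == MBR_SIGNATURE;
}
```

A loader would typically run such a check on the sector it has just read and refuse to jump into it when the signature is missing.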

[Figure: Master Boot Record]

The Kernel is an ELF file

The two most common executable file formats are ELF (Executable and Linkable Format) and PE (Portable Executable). In the UNIX environment, ELF is the typical format for program binaries, while PE is widely used on Windows. It should not be a surprise to the reader that the NetBSD kernel is also an ELF executable.

Before the main

We are used to thinking that programs start with some kind of main function. Those of us who have studied libraries or the flow of execution may recall a lower-level _start function that is called when the program is loaded into memory. In ELF executables, the program actually starts at an entry point that is defined in the header of the file (the Entry point address field). We can easily verify this claim by running the readelf program on our kernel binary:

$ readelf -h ./netbsd
ELF Header:
  Magic:   7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00 
  Class:                             ELF64
  Data:                              2's complement, little endian
  Version:                           1 (current)
  OS/ABI:                            UNIX - System V
  ABI Version:                       0
  Type:                              EXEC (Executable file)
  Machine:                           Advanced Micro Devices X86-64
  Version:                           0x1
  Entry point address:               0xffffffff80209000		<<<
  Start of program headers:          64 (bytes into file)
  Start of section headers:          219286488 (bytes into file)
  Flags:                             0x0
  Size of this header:               64 (bytes)
  Size of program headers:           56 (bytes)
  Number of program headers:         2
  Size of section headers:           64 (bytes)
  Number of section headers:         39
  Section header string table index: 37

We can now look up the name of the symbol at that address, but only if we use a non-stripped kernel!

$ readelf --syms ./netbsd.gdb | grep ffffffff80209000
 41333: ffffffff80209000    0 NOTYPE GLOBAL DEFAULT 1 __text_user_end
 48452: ffffffff80209000 1096 FUNC   GLOBAL DEFAULT 1 start

The starting function of our kernel is start. Before we look into this function, we need to understand how the CPU gets to this starting point. Before an operating system can execute a program compiled into the ELF format, it has to load it into memory. When a program runs on bare metal (without an operating system), the loading has to be taken care of without that help. This is one of the reasons why we need programs such as bootloaders.
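To make the role of the entry point more concrete, here is a minimal sketch of how a loader could read it from an ELF64 header held in memory. It is not the NetBSD boot code; the function name is hypothetical, and the sketch assumes a little-endian ELF64 image, where the entry point field sits at byte offset 24 of the header.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Stores the entry point of a little-endian ELF64 image in *entry.
 * Returns 0 on success, -1 if the buffer does not look like one. */
int elf64_entry_point(const uint8_t *image, size_t len, uint64_t *entry)
{
    static const uint8_t magic[4] = { 0x7f, 'E', 'L', 'F' };
    uint64_t value = 0;
    int i;

    if (len < 64 || memcmp(image, magic, sizeof magic) != 0)
        return -1;
    if (image[4] != 2 /* 64-bit class */ || image[5] != 1 /* little endian */)
        return -1;

    /* e_entry: 8 bytes at offset 24, least significant byte first. */
    for (i = 7; i >= 0; i--)
        value = (value << 8) | image[24 + i];

    *entry = value;
    return 0;
}
```

Fed with the header of the kernel shown earlier, this would produce the same value that readelf prints as the entry point address.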

A Few Words About Bootloaders

There are many possible ways, and many programs, to set up our platform. Different boot loader programs such as GRUB or U-Boot can be configured to work on various hardware and to support various operating systems. Early loader programs provide a lot of flexibility in configuration. I mentioned two partition schemes earlier, GPT and MBR; both can even be used together as a hybrid. I don’t want to go too deep into disk layout, as such a description would end up with multiple tables and descriptions, so I will focus on NetBSD kernel initialization in the default configuration.

After the BIOS finds a valid sector (one with the 0xAA55 signature), it loads the first disk sector (the MBR) to physical address 0x7c00. It also sets the DL register to the number of the drive from which the MBR was loaded, and once that is done, the firmware executes the loaded code. For the x86 platform, the first two bootloaders are the MBR (mbr(8)) and the PBR, whose names correspond to the sectors where they are placed: Master Boot Record and Partition Boot Record. Traditionally, the MBR code relocates itself to a different physical memory location (0x600), then locates the active partition, reads its first sector (the PBR) to address 0x7c00 and jumps to it. The PBR is designed to work both in the classical NetBSD chain, where it is loaded by NetBSD’s own MBR, and with a GPT partition; its behavior differs between the two cases. In the GPT case the EAX register contains the constant !GPT (in hex: 54504721), and DS:SI points to an MBR-style structure that contains the Logical Block Address (LBA) from which the PBR was loaded plus some extra information such as the OS type or the GPT partition entry. Otherwise, only the ESI register is used, to pass the logical block address from which the program was read. This lets the PBR code select between two NetBSD systems on the same physical drive.
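The hexadecimal value of the !GPT constant follows from placing the four ASCII characters into a 32-bit register in little-endian order. A small sketch (the helper name is hypothetical):

```c
#include <stdint.h>

/* Packs four ASCII characters into a 32-bit value the way they would
 * appear in a register loaded from memory on a little-endian CPU:
 * the first character ends up in the least significant byte. */
uint32_t tag_le32(const char tag[4])
{
    return (uint32_t)(uint8_t)tag[0]
         | (uint32_t)(uint8_t)tag[1] << 8
         | (uint32_t)(uint8_t)tag[2] << 16
         | (uint32_t)(uint8_t)tag[3] << 24;
}
```

With '!' = 0x21, 'G' = 0x47, 'P' = 0x50 and 'T' = 0x54, calling tag_le32("!GPT") gives 0x54504721, the value quoted above.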

The PBR identifies the disk it was loaded from, which in both cases is passed in the DL register. It has to find the boot2 code on disk, copy it into memory and jump to it. The purpose of boot2 is to locate and read the program called boot. This is the program that shows the boot prompt to the user and allows them to choose between different kernel files. The boot program implements the ELF file format, so it can read a kernel binary from the file system, interpret its sections and headers, and load the sections into memory. In this way the kernel finally ends up in memory and the program can execute it. The last thing that boot takes care of is parameters, for example those the user provides at the boot prompt.

Into the kernel

Start, locore.s and Machine Dependent Code

In NetBSD, the first code executed after the kernel is loaded is machine-dependent, which should not be surprising. This code is located in the assembly file locore.S. A quick search inside sys/arch shows that NetBSD has separate implementations for different architectures:

$ find ./sys/arch -iname locore.s
./sys/arch/x68k/x68k/locore.s
./sys/arch/arm/arm32/locore.S
./sys/arch/newsmips/stand/boot/locore.S
./sys/arch/amiga/amiga/locore.s
./sys/arch/ibmnws/ibmnws/locore.S
./sys/arch/i386/i386/locore.S
...
./sys/arch/sparc64/sparc64/locore.s
./sys/arch/hp300/hp300/locore.s

Let’s take a look at the NetBSD locore.S for AMD64. Again I encourage the readers to explore this topic in more detail within the source code. I will cover some of the crucial operations.

Inside the locore.S for AMD64, we can easily find the start entry, ENTRY(start). The very first operation that start performs is writing the magic value 0x1234 to address 0x472. This suspicious-looking operation tells the BIOS to bypass the memory test on the next reset, which is also known as a warm reboot. Addresses 0x400 - 0x4FF are part of the BIOS Data Area (BDA), and offset 0x72 in the BDA is a 2-byte soft reset flag:

	movw    $0x1234,0x472
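The same store can be modelled in C on an ordinary buffer that stands in for the 256-byte BIOS Data Area. The macro and function names below are hypothetical; only the base address, the offset and the magic value come from the text.

```c
#include <stdint.h>

#define BDA_BASE              0x400u   /* start of the BIOS Data Area */
#define BDA_SOFT_RESET_OFFSET 0x72u    /* 2-byte soft reset flag */
#define BDA_WARM_BOOT_MAGIC   0x1234u  /* "skip the memory test" */

/* Physical address written by the movw instruction. */
uint32_t bda_soft_reset_address(void)
{
    return BDA_BASE + BDA_SOFT_RESET_OFFSET;
}

/* Stores the magic value, little-endian, into a buffer that models
 * the BDA (index 0 corresponds to physical address 0x400). */
void bda_mark_warm_boot(uint8_t bda[256])
{
    bda[BDA_SOFT_RESET_OFFSET]     = BDA_WARM_BOOT_MAGIC & 0xffu;
    bda[BDA_SOFT_RESET_OFFSET + 1] = BDA_WARM_BOOT_MAGIC >> 8;
}
```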

The next thing we see is the kernel flags being loaded from the stack and saved in boothowto(9). These parameters were placed on the stack by the boot program in the previous stage.

	/*
	 * Load parameters from the stack (32 bits):
	 *    boothowto, [bootdev], bootinfo, esym, biosextmem, biosbasemem
	 * We are not interested in 'bootdev'.
	 */
	
	/* Load 'boothowto' */
	movl    4(%esp),%eax
	movl    %eax,RELOC(boothowto)
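Viewed from C, the stack at this point can be treated as an array of 32-bit slots. The sketch below is an assumption-laden model, not kernel code: it assumes that slot 0 holds a return address (which is why boothowto sits at 4(%esp)) and that the remaining parameters follow in consecutive 4-byte slots in the order given in the comment above. The struct and function names are hypothetical.

```c
#include <stdint.h>

/* Parameters the kernel keeps; 'bootdev' is deliberately left out. */
struct boot_args {
    uint32_t boothowto;
    uint32_t bootinfo;
    uint32_t esym;
    uint32_t biosextmem;
    uint32_t biosbasemem;
};

/* 'esp' points at the top of the stack as seen by start. */
struct boot_args read_boot_args(const uint32_t *esp)
{
    struct boot_args args;

    args.boothowto   = esp[1];  /* 4(%esp) */
    /* esp[2] would be bootdev, which is ignored */
    args.bootinfo    = esp[3];  /* 12(%esp), assumed */
    args.esym        = esp[4];  /* 16(%esp), assumed */
    args.biosextmem  = esp[5];  /* 20(%esp), assumed */
    args.biosbasemem = esp[6];  /* 24(%esp), assumed */
    return args;
}
```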

Jumping into long mode

The NetBSD start function executes in protected mode (32-bit mode) and initializes the processor up to the point where it can switch to long mode. But before the CPU can switch to 64-bit operation, a couple of things have to be done. The first task is to calculate the kernel memory layout and fill the page tables. There are several ways to configure page tables in protected mode; however, long mode explicitly requires the Physical Address Extension (PAE) to be enabled. In 32-bit protected mode PAE uses three levels of tables: the page-directory pointer table (PDPT), the page-directory table (PDT) and the page table (PT). Activating long mode without PAE enabled causes an exception on the CPU.

The kernel image has already been loaded by the bootstrap code that brought us to start. So we know where the kernel image starts and ends, and from the end of the image we need to calculate the offsets of the regions that follow it, such as the page tables, the stack of process zero and the I/O memory for legacy devices, which is mapped to virtual addresses but not allocated in physical memory. Below we present a simplified map of the kernel virtual memory, which starts at the platform-dependent value KERNBASE. We can easily check that on the AMD64 platform it is:

#define	KERNBASE 0xffffffff80000000 /* start of kernel virtual space */

We also mark sections to show the connection with the ELF binary that was loaded in the previous steps.

[Figure: Memory Layout]
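The relation between a kernel virtual address and the place where the early code finds it in physical memory can be sketched as a simple subtraction. This assumes that the early kernel image is linked at KERNBASE plus its physical load address, which is the idea behind the RELOC macro used in the assembly snippets; the function name reloc is hypothetical.

```c
#include <stdint.h>

#define KERNBASE 0xffffffff80000000ULL /* start of kernel virtual space */

/* Converts a kernel virtual address into the offset from KERNBASE,
 * which (under the assumption above) is also its physical address
 * while paging is still disabled. */
uint64_t reloc(uint64_t kva)
{
    return kva - KERNBASE;
}
```

For the entry point printed by readelf earlier, reloc(0xffffffff80209000) yields 0x209000, that is, a little above 2 MiB.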

For the AMD64 platform, we have four levels of page tables, called PML4 -> PDPT -> PD -> PT. Before we can fill them, they have to be zeroed. Once the clearing is done, we are at the end of the memory region reserved for page tables, so we start filling them from the PT (L1) all the way up to the PML4 (L4). Parts of the kernel such as the kernel stack and the kernel code have to be present and mapped, so we fill in the entries based on the known memory map. A breakdown of a 64-bit virtual address into page table indices is shown in the picture below:

[Figure: CR3 Page Tables]
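The breakdown of a virtual address can also be written down in C. The sketch assumes the standard 4 KiB paging layout on AMD64: a 12-bit page offset followed by four 9-bit indices, one for each table level. The struct and function names are hypothetical.

```c
#include <stdint.h>

/* Indices selected by one virtual address at each paging level. */
struct pt_index {
    unsigned pml4;    /* bits 47..39 */
    unsigned pdpt;    /* bits 38..30 */
    unsigned pd;      /* bits 29..21 */
    unsigned pt;      /* bits 20..12 */
    unsigned offset;  /* bits 11..0, offset inside the 4 KiB page */
};

struct pt_index split_va(uint64_t va)
{
    struct pt_index ix;

    ix.pml4   = (unsigned)((va >> 39) & 0x1ff);
    ix.pdpt   = (unsigned)((va >> 30) & 0x1ff);
    ix.pd     = (unsigned)((va >> 21) & 0x1ff);
    ix.pt     = (unsigned)((va >> 12) & 0x1ff);
    ix.offset = (unsigned)(va & 0xfff);
    return ix;
}
```

For KERNBASE (0xffffffff80000000) this gives PML4 index 511 and PDPT index 510, with the remaining fields equal to zero; in other words, the kernel lives in the last PML4 slot.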

After the page tables are filled, we can enable PAE and long mode, both of which are represented as flag bits: PAE is a bit in control register CR4, and long mode is enabled through the LME bit (bit 8, that is, the ninth bit) of the EFER register. Setting LME alone does not transfer the CPU into long mode; long mode becomes active only once paging is enabled, and a jump instruction has to be executed afterwards (a jump is the general way to complete a mode switch on x86 CPUs). Before we switch the CPU to long mode, we need to point control register 3 (CR3) at the physical address of the top-level table, the PML4. Now we are ready to enable paging. After we write the proper flags to CR0, we need to perform a jump instruction for the change to take effect.

orl     $(CR0_PE|CR0_PG|CR0_NE|CR0_TS|CR0_MP|CR0_WP|CR0_AM),%eax 
movl    %eax,%cr0
jmp 	compat
compat:
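The flag bits involved in this sequence can be spelled out in C. The bit positions below are the architectural ones; the macro names mirror those used in the assembly, but the definitions and the helper function are written here for illustration and are not copied from the NetBSD headers.

```c
#include <stdint.h>

/* CR0 bits used by the orl instruction above. */
#define CR0_PE 0x00000001u /* protected mode enable */
#define CR0_MP 0x00000002u /* monitor coprocessor */
#define CR0_TS 0x00000008u /* task switched */
#define CR0_NE 0x00000020u /* numeric error */
#define CR0_WP 0x00010000u /* write protect */
#define CR0_AM 0x00040000u /* alignment mask */
#define CR0_PG 0x80000000u /* paging enable */

#define CR4_PAE  0x00000020u /* bit 5: physical address extension */
#define MSR_EFER 0xC0000080u /* number of the EFER register */
#define EFER_LME 0x00000100u /* bit 8: long mode enable */

/* Value written back to CR0, given its previous content. */
uint32_t cr0_with_paging(uint32_t cr0)
{
    return cr0 | (CR0_PE | CR0_PG | CR0_NE | CR0_TS |
                  CR0_MP | CR0_WP | CR0_AM);
}
```

Starting from a CR0 value of zero, the combined mask works out to 0x8005002B.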

After the switch, the CPU is in a variant of long mode called compatibility mode, and we need to perform one more operation to reach 64-bit mode. Code segments and their descriptors still exist in the flat 64-bit mode because they establish the processor’s privilege level as well as its operating mode (see 4.8.1 - 4.8.2 of the AMD64 Architecture Programmer’s Manual Volume 2). To make the switch, we load the prepared Global Descriptor Table (GDT), which contains a 64-bit code segment, and perform a long jump into that segment.

_C_LABEL(farjmp64):
# RELOC gives us offset between Start of the kernel to the instruction
.long   _RELOC(longmode)

movl    $RELOC(farjmp64),%eax 
ljmp    *(%eax)

	.code64
longmode:
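What makes a code segment "64-bit" is a single bit in its GDT descriptor. The following sketch builds a minimal long mode code descriptor; the names are hypothetical and the descriptor NetBSD actually installs may set additional fields, so treat it as an illustration of the format rather than as the kernel’s GDT entry.

```c
#include <stdint.h>

/* Fields of an 8-byte code segment descriptor that matter in long mode. */
#define SEG_READABLE (1ULL << 41) /* code segment is readable */
#define SEG_CODE     (1ULL << 43) /* executable (code) segment */
#define SEG_S        (1ULL << 44) /* code/data, not a system segment */
#define SEG_PRESENT  (1ULL << 47) /* segment is present */
#define SEG_LONG     (1ULL << 53) /* L bit: 64-bit code segment */

/* Builds a flat 64-bit code descriptor for privilege level 'dpl'. */
uint64_t code64_descriptor(unsigned dpl)
{
    return SEG_READABLE | SEG_CODE | SEG_S | SEG_PRESENT | SEG_LONG |
           ((uint64_t)(dpl & 3u) << 45);
}

/* The far jump lands in 64-bit mode only if the target descriptor
 * is a code segment with the L bit set. */
int descriptor_is_code64(uint64_t desc)
{
    return (desc & SEG_CODE) != 0 && (desc & SEG_LONG) != 0;
}
```

For privilege level 0 the resulting descriptor is 0x00209A0000000000, a value commonly seen in minimal long mode setups.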

After the long jump, we are finally in the long mode! Now there are just a few steps that need to be done before we can call main, but we will discuss them in the next part of the article.

	call    _C_LABEL(init_slotspace)
	popq    %rdi
	call    _C_LABEL(init_x86_64)
	call    _C_LABEL(main)

Resources

[1] Minimal Boot Loader for Intel® Architecture
[2] 4.8.1 - 4.8.2 of the AMD64 Architecture Programmer’s Manual Volume 2
[3] NetBSD source code
[4] Intel Software Developer’s manual
[5] Most of the topics can be learnt in more detail by just searching them on OsDev