Booting an Intel Architecture System, Part I: Early Initialization

The boot sequence today is far more complex than it was even a decade ago. Here's a detailed, low-level, step-by-step walkthrough of the boot up.

By Pete Dice
December 26, 2011
URL : http://www.drdobbs.com/parallel/booting-an-intel-architecture-system-par/232300699

Taking a lot of little steps along a path is a good analogy for understanding boot flow in Intel Architecture-based systems. This article presents the minimum firmware requirements for making a system operational and for booting an operating system. The vast majority of systems perform the steps described here to do a full or cold boot. Depending on the architecture of the BIOS, there may be multiple software phases to jump through with different sets of rules, but the sequence for waking up the hardware is, at least in the early phases, very much the same.

Hardware Power Sequences: The Pre-Pre-Boot

When someone pushes the power button, the CPU can't simply jump up and start fetching code from flash memory. When external power is first applied, the hardware platform must carry out a number of tasks before the processor can be brought out of its reset state.

The first task is for the power supply to be allowed to settle down to its nominal state. Once the primary power supply settles, there are usually a number of derived voltage levels needed on the platform. For example, on the Intel Architecture reference platform the main input supply is a 12-volt source, but the platform and processor require voltage rails of 1.5, 3.3, 5, and 12 volts. Voltages must be provided in a particular order, a process known as power sequencing. The power is sequenced by controlling analog switches, typically field-effect transistors. The sequence is often driven by a Complex Programmable Logic Device (CPLD).

Platform clocks are derived from a small number of input clock and oscillator sources. The devices use phase-locked loop circuitry to generate the derived clocks used for the platform. These clocks take time to converge.

It is only after all these steps have occurred that the power-sequencing CPLD can de-assert the reset line to the processor, as illustrated in Figure 1. Depending on integration of silicon features, some of this logic may be on chip and controlled by microcontroller firmware that starts prior to the main processor.

Figure 1: An overview of power sequencing.

A variety of subsystems may begin operating prior to the main host system.

The Intel Manageability Engine (ME), available on some mainstream desktop and server-derived chips, is one such component. The main system firmware does not initialize the devices composing the ME. However, there is likely to be some level of interaction that must be taken into account in the settings of the firmware, or in the descriptors of the flash component, for the ME to start up and enable the clocks correctly. The main system firmware also has the potential to make calls and be called from the ME.

Another example is micro engines, which are telecommunications components used in the embedded-systems world. Micro engines have their own firmware that starts independently of the system BIOS. The host system's BIOS must make allowances for them in the Advanced Configuration and Power Interface (ACPI) memory map to allow for proper interaction between host drivers and the micro-engine subsystem.

Once the processor reset line has been de-asserted, the processor begins fetching instructions. The location of these instructions is known as the reset vector. The reset vector may contain instructions or a pointer to the starting instructions in flash memory. The location of the vector is architecture-specific and usually in a fixed location, depending on the processor. The initial address must be a physical address, as the Memory Management Unit (MMU), if it exists, has not yet been enabled. Instruction fetch starts at physical address 0xFFFFFFF0. Only 16 bytes remain to the top of the 32-bit address space, so these 16 bytes must contain a far jump to the remainder of the initialization code. This code is always written in assembly at this point, as there is no software stack or cache RAM available at this time.

Because the processor cache is not enabled by default, it is not uncommon to flush the cache in this step with a WBINVD instruction. The WBINVD is not needed on newer processors, but it doesn't hurt anything.

Mode Selection

IA-32 supports three operating modes (protected mode, real-address mode, and system management mode) and one quasi-operating mode, virtual-8086 mode, which is available within protected mode.

The Intel 64 architecture supports all operating modes of IA-32 architecture plus IA-32e mode. In IA-32e mode, the processor supports two sub-modes: compatibility mode and 64-bit mode. Compatibility mode allows most legacy protected-mode applications to run unchanged, while 64-bit mode provides 64-bit linear addressing and support for physical address space larger than 64 GB.

Figure 2 shows how the processor moves between operating modes.

Figure 2: Switching processor operating modes.

When the processor is first powered on, it will be in a special mode similar to real mode, but with the top 12 address lines asserted high. This aliasing allows the boot code to be accessed directly from nonvolatile RAM (physical address 0xFFFxxxxx).

Upon execution of the first long jump, these 12 address lines will be driven according to the firmware's instructions. If one of the protected modes is not entered before the first long jump, the processor will enter real mode, with only 1 MB of addressability. In order for real mode to work before DRAM is available, the chipset needs to be able to alias a range of memory below 1 MB to an equivalent range just below 4 GB. Certain chipsets do not have this aliasing and may require a switch to another operating mode before performing the first long jump. The processor also invalidates the internal caches and translation look-aside buffers.

The processor continues to boot in real mode. There is no particular technical reason for the boot sequence to occur in real mode. Some speculate that this feature is maintained in order to ensure that the platform can boot legacy code such as MS-DOS. While this is a valid issue, there are other factors that complicate a move to protected-mode booting. The change would need to be introduced and validated among a broad ecosystem of manufacturers and developers, for example. Compatibility issues would arise in test and manufacturing environments. These and other natural hurdles keep boot mode "real."

The first power-on mode is actually a special subset of real mode. The top 12 address lines are held high, thus allowing aliasing, in which the processor can execute code from nonvolatile storage (such as flash memory) located within the lowest one megabyte as if it were located at the top of memory.

Normal operation of firmware (including the BIOS) is to switch to flat protected mode as early in the boot sequence as possible. It is usually not necessary to switch back to real mode unless executing an option ROM that makes legacy software interrupt calls. Flat protected mode runs 32-bit code and physical addresses are mapped one-to-one with logical addresses (that is, paging is off). The interrupt descriptor table is used for interrupt handling. This is the recommended mode for all BIOS/boot loaders.

Early Initialization

The early phase of the BIOS/bootloader initializes the memory and processor cores. In a BIOS constructed in accord with the Unified EFI Forum's UEFI 2.0 framework, the Security (SEC) and Pre-EFI Initialization (PEI) phases are normally synonymous with "early initialization." Whether a legacy or UEFI BIOS is used, the early initialization sequence is the same from a hardware point of view.

In a multicore system, the bootstrap processor (BSP) is the CPU core (or thread) that is chosen to boot the system firmware, which is normally single-threaded. At RESET, all of the processors race for a semaphore flag bit in the chipset. The first processor finds the flag clear and, in the process of reading it, sets the flag; the other processors find the flag set and enter a wait-for-SIPI (Start-up Inter-Processor Interrupt) or halt state. The first processor initializes main memory and the Application Processors (APs), then continues with the rest of the boot process.

A multiprocessor system does not truly enter multiprocessing operation until the OS takes over. While it is possible to do a limited amount of parallel processing during the UEFI boot phase, such as during memory initialization with multiple socket designs, any true multithreading activity would require changes to be made to the Driver Execution Environment (DXE) phase of the UEFI. Without obvious benefits, such changes are unlikely to be broadly adopted.

The early initialization phase next readies the bootstrap processor and I/O peripherals' base address registers, which are needed to configure the memory controller. The device-specific portion of an Intel architecture memory map is highly configurable. Most devices are seen and accessed via a logical Peripheral Component Interconnect (PCI) bus hierarchy. Device control registers are mapped to a predefined I/O or memory-mapped I/O space, and they can be set up before the memory map is configured. This allows the early initialization firmware to configure the memory map of the device as needed to set up DRAM. Before DRAM can be configured, the firmware must establish the exact configuration of DRAM that is on the board. The Intel Architecture reference platform memory map is described in detail in Chapter 6 of my book, Quick Boot: A Guide for Embedded Firmware Developers, from Intel Press.

System-on-a-chip (SOC) devices based on other processor architectures typically provide a static address map for all internal peripherals, with external devices connected via a bus interface. Bus-based devices are mapped to a memory range within the SOC address space. These SOC devices usually provide a configurable chip-select register set to specify the base address and size of the memory range enabled by the chip select. SOCs based on Intel Architecture primarily use the logical PCI infrastructure for internal and external devices.

The location of the device in the host's memory address space is defined by the PCI Base Address Register (BAR) for each of the devices. The device initialization typically enables all the BAR registers for the devices required for system boot. The BIOS will assign all devices in the system a PCI base address by writing the appropriate BAR registers during PCI enumeration. Long before full PCI enumeration, the BIOS must enable the PCI Express (PCIe) BAR as well as the Platform Controller Hub (PCH) Root Complex Base Address Register (RCBA) BAR for memory, I/O, and memory-mapped I/O (MMIO) interactions during the early phase of boot. Depending on the chipset, there are prefetchers that can be enabled at this point to speed up data transfer from the flash device. There may also be Direct Media Interface (DMI) link settings that must be tuned for optimal performance.

The next step, initialization of the CPU, requires simple configuration of processor and machine registers, loading a microcode update, and enabling the Local APIC (LAPIC).

Microcode is a hardware layer of instructions involved in the implementation of the machine-defined architecture. It is most prevalent in CISC-based processors. Microcode is developed by the CPU vendor and incorporated into an internal CPU ROM during manufacture. Since the infamous "Pentium flaw," Intel processor architecture allows that microcode to be updated in the field either through a BIOS update or via an OS update.

Today, an Intel processor must have the latest microcode update to be considered a warranted CPU. Intel provides microcode updates that are written to the microcode store in the CPU. The updates are encrypted and signed by Intel such that only the processor that the microcode update was designed for can authenticate and load the update. On socketed systems, the BIOS may have to carry many flavors of microcode update depending on the number of processor steppings supported. It is important to load microcode updates early in the boot sequence to limit the exposure of the system to known errata in the silicon. Note that the microcode update may need to be reapplied to the CPU after certain reset events in the boot sequence.

Next, the LAPICs must be enabled to handle interrupts that occur before enabling protected mode.

Software initialization code must load a minimum number of protected-mode data structures and code modules into memory to support reliable operation of the processor in protected mode. These data structures include an Interrupt Descriptor Table (IDT), a Global Descriptor Table (GDT), a Task-State Segment (TSS), and, optionally, a Local Descriptor Table (LDT). If paging is to be used, at least one page directory and one page table must be loaded. A code segment containing the code to be executed when the processor switches to protected mode must also be loaded, as well as one or more code modules that contain necessary interrupt and exception handlers.

Initialization code must also initialize certain system registers. The global descriptor table register must be initialized, along with control registers CR1 through CR4. The IDT register may be initialized immediately after switching to protected mode, prior to enabling interrupts. Memory Type Range Registers (MTRRs) are also initialized.

With these data structures, code modules, and system registers initialized, the processor can be switched to protected mode. This is accomplished by loading control register CR0 with a value that sets the PE (protected mode enable) flag. From this point onward, it is likely that the system will not enter real mode again, legacy option ROMs and legacy OS/BIOS interface notwithstanding, until the next hardware reset is experienced.

Since no DRAM is available at this point, code initially operates in a stackless environment. Most modern processors have an internal cache that can be configured as RAM to provide a software stack. Developers must write extremely tight code when using this cache-as-RAM feature because an eviction would be unacceptable to the system at this point in the boot sequence; there is no memory to maintain coherency. That's why processors operate in "No Evict Mode" (NEM) at this point in the boot process, when they are operating on a cache-as-RAM basis. In NEM, a cache-line miss in the processor will not cause an eviction. Developing code with an available software stack is much easier, and initialization code often performs the minimal setup to use a stack even prior to DRAM initialization.

The processor may boot into a slower than optimal mode for various reasons. It may be considered less risky to run in a slower mode, or it may be done to save power. The BIOS may force the speed to something appropriate for a faster boot. This additional optimization is optional; the OS will likely have the drivers to deal with this parameter when it loads.

Memory Configuration and Initialization

The initialization of the memory controller varies slightly depending on the DRAM technology and the capabilities of the memory controller itself. The information on the DRAM controller is proprietary for SOC devices, and in such cases the initialization Memory Reference Code (MRC) is typically supplied by the vendor. Developers have to contact Intel to request access to the low-level information required. Developers who lack an MRC can utilize a standard JEDEC initialization sequence for a given DRAM technology. JEDEC refers to the JEDEC Solid State Technology Association, which was formerly known as the Joint Electron Device Engineering Council. JEDEC is a trade organization and standards body for the electronics industry.

It is likely that memory configuration will be performed by single-point-of-entry and single-point-of-exit code that has multiple boot paths contained within it. It will be 32-bit protected-mode code. Settings for buffer strengths and loading for a given number of banks of memory are chipset specific.

There is a very wide range of DRAM configuration parameters, including the number of ranks, 8-bit or 16-bit addresses, overall memory size and configuration, soldered-down or add-in module configurations, page-closing policy, and power management. Given that most embedded systems populate soldered-down DRAM on the board, the firmware may not need to discover the configuration at boot time. These configurations are known as memory-down. The firmware is specifically built for the target configuration. This is the case for the Intel reference platform from the Embedded Computing Group. At current DRAM speeds, the wires between the memory controller and the DRAM devices behave like transmission lines; the SOC may provide automatic calibration and run-time control of resistive compensation and delay-locked loop capabilities. These capabilities allow the memory controller to change elements such as drive strength to ensure error-free operation over time and temperature variations.

If the platform supports add-in modules for memory, it may use any of a number of standard form factors. The small-outline Dual In-line Memory Module (SO-DIMM) is often found in embedded systems. The DIMMs provide a serial EEPROM that contains DRAM configuration information known as Serial Presence Detect (SPD) data. The firmware reads the SPD data and subsequently configures the memory controller. The serial EEPROM is connected via the System Management Bus (SMBus). This means the device must be available in the early initialization phase so the software can determine what memory devices are on the board. It is also possible for memory-down motherboards to incorporate SPD EEPROMs to allow for multiple and updatable memory configurations that can be handled efficiently by a single BIOS algorithm. A hard-coded table in one of the MRC files could be used to implement an EEPROM-less design.

Once the memory controller has been initialized, a number of subsequent cleanup events take place, including tests to ensure that memory is operational. Memory testing is now part of the MRC, but it is possible to add more tests should the design require it. BIOS vendors typically provide some kind of memory test on a cold boot. Writing custom firmware requires the authors to choose a balance between thoroughness and speed, as many embedded and mobile devices require extremely fast boot times. Memory testing can take considerable time.

If testing is warranted, right after initialization is the time to do it. The system is idle, the subsystems are not actively accessing memory, and the OS has not taken over the host side of the platform. Several hardware features can assist in this testing both during boot and at run-time. These features have traditionally been thought of as high-end or server features, but over time they have moved into the client and embedded markets.

One of the most common technologies is error-correction codes. Some embedded devices use ECC memory, which may require extra initialization. After power-up, the state of the correction codes may not reflect the contents, and all memory must be written to. Writing to memory ensures that the ECC bits are valid and set to the appropriate contents. For security purposes, the memory may need to be zeroed out manually by the BIOS — or, in some cases, a memory controller may incorporate the feature into hardware to save time.

Depending on the source of the reset and security requirements, the system may or may not execute a memory wipe or ECC initialization. On a warm reset sequence, memory context can be maintained.

If there are memory timing changes or other configuration alterations that require a reset to take effect, this is normally the time to execute a warm reset. That warm reset would start the early initialization phase over again. Affected registers would need to be restored.

From the reset vector, execution starts directly from nonvolatile flash storage. This operating mode is known as execute-in-place. The read performance of nonvolatile storage is much slower than the read performance of DRAM. The performance of code running from flash is therefore much lower than code executed in RAM. Most firmware is therefore copied from slower nonvolatile storage into RAM. The firmware is then executed in RAM in a process known as shadowing.

In embedded systems, chip-select ranges are typically managed to allow the change from flash to RAM execution. Most computing systems execute-in-place as little as possible. However, some RAM-constrained embedded platforms execute all applications in place. This is a viable option on very small embedded devices.

Intel Architecture systems generally do not execute-in-place for anything but the initial boot steps before memory has been configured. The firmware is often compressed, allowing reduction of nonvolatile RAM requirements. Clearly, the processor cannot execute a compressed image in place. There is a trade-off between the size of data to be shadowed and the act of decompression. The decompression algorithm may take much longer to load and execute than it would take for the image to remain uncompressed. Prefetchers in the processor, if enabled, may speed up execution-in-place. Some SOCs have internal NVRAM cache buffers to assist in pipelining the data from the flash to the processor. Figure 3 shows the memory map at initialization in real mode.

Figure 3: The Intel Architecture memory map at power-on.

Before memory is initialized, the data and code stacks are held in the processor cache. Once memory is initialized, the system must exit that special caching mode and flush the cache. The stack will be transferred to a new location in main memory and cache reconfigured as part of AP initialization.

The stack must be set up before jumping into the shadowed portion of the BIOS that is now in memory. A memory location must be chosen for stack space. The stack counts down, so the address of the top of the stack must be loaded, and enough memory must be allocated for the maximum stack depth.

If the system is in real mode, then SS:SP must be set with the appropriate values. If protected mode is used, which is likely the case following MRC execution, then SS:ESP must be set to the correct memory location.

This is where the code makes the jump into memory. If a memory test has not been performed before this point, the jump could very well be to garbage. System failures indicated by a Power-On Self Test (POST) code between "end of memory initialization" and the first subsequent POST code almost always indicate a catastrophic memory initialization problem. If this is a new design, then chances are the problem is in the hardware and requires step-by-step debugging.

For legacy option ROMs and BIOS memory ranges, Intel chipsets usually come with memory aliasing capabilities that allow access to memory below 1 MB to be routed to or from DRAM or nonvolatile storage located just under 4 GB. The registers that control this aliasing are typically referred to as Programmable Attribute Maps (PAMs). Manipulation of these registers may be required before, during, and after firmware shadowing. The control over the redirection of memory access varies from chipset to chipset. For example, some chipsets allow control over reads and writes, while others allow control over reads only.

For shadowing, if PAM registers remain at default values (all 0s), all Firmware Hub (FWH) accesses to the E and F segments (E0000h–FFFFFh) will be directed downstream toward the flash component. This will function to boot the system, but is very slow. Shadowing can be used to improve boot speed. One method of shadowing the E and F segments of BIOS is to utilize the PAM registers. This can be done by changing the enables (HIENABLE[] and LOENABLE[]) to 10 (write only). This will direct reads to the flash device and writes to memory. Data can then be shadowed into memory by reading and writing the same address. Once BIOS code has been shadowed into memory, the enables can be changed to read-only mode so memory reads are directed to memory. This also prevents accidental overwriting of the image in memory.

AP Initialization

Even in SOCs, there is the likelihood of having multiple CPU cores. The system may be visualized as a bootstrap processor (BSP) plus one or more APs. The BSP starts and initializes the system. The APs must be initialized with identical features. Before memory is activated, the APs are uninitialized. After memory is started, the remaining processors are initialized and left in a wait-for-SIPI state. To accomplish this, the system firmware must carry out the sequence described below.

From a UEFI perspective, AP initialization may either be part of the PEI or DXE phase of the boot flow, or in the early or advanced initialization. There is some debate as to the final location.

Since Intel processors are packaged in various configurations, there are different terms that must be understood when considering processor initialization. In this context, a thread is a logical processor that shares resources with another logical processor in the same physical package. A core is a processor that coexists with another processor in the same physical package and does not share any resources with other processors. A package is a chip that contains any number of cores and threads.

Threads and cores on the same package are detectable by executing the CPUID instruction. Detection of additional packages must be done blindly. If a design must accommodate more than one physical package, the BSP needs to wait a certain amount of time for all potential APs in the system to "log in." Once a timeout occurs or the maximum expected number of processors "log in," it can be assumed that there are no more processors in the system.

In order to wake up secondary threads or cores, the BSP sends a SIPI to each thread and core. This SIPI is sent by using the BSP's LAPIC, indicating the physical address from which the AP should start executing. This address must be below 1 MB of memory and must be aligned on a 4-KB boundary. Upon receipt of the SIPI, the AP starts executing the code pointed to by the SIPI message. Unlike the BSP, the AP starts code execution in real mode. This requires that the code that the AP starts executing is located below 1 MB.

Because of the different processor combinations and the various attributes of shared processing registers between threads, care must be taken to ensure that there are no caching conflicts in the memory used throughout the system.

AP behavior during firmware initialization is dependent on the firmware implementation, but is most commonly restricted to short periods of initialization followed by a HLT instruction, during which the system awaits direction from the BSP before undertaking another operation.

Once the firmware is ready to attempt to boot an OS, all AP processors must be placed back in their power-on state. The BSP accomplishes this by sending an Init Assert IPI followed by an Init De-assert IPI to all APs in the system (except itself).

The final part of this article, which will appear in January, covers advanced device installation, memory-map configuration, and all the other steps required to prepare the hardware for loading the operating system.

This article is adapted from material in Intel Technology Journal (March 2011) "UEFI Today: Bootstrapping the Continuum," and portions of it are copyright Intel Corp.

Pete Dice is a software architect in Intel's chipset architecture group. He holds a bachelor of science degree in electrical engineering. Dice has over 19 years of experience in the computing industry, including 15 years at Intel.


Copyright © 2012 UBM Techweb