diff --git a/pixiecore/README.booting.md b/pixiecore/README.booting.md
new file mode 100644
index 0000000..a557c96
--- /dev/null
+++ b/pixiecore/README.booting.md
@@ -0,0 +1,210 @@
+# How it works
+
+Pixiecore implements four different, but related protocols in one
+binary, which together can take a PXE ROM from nothing to booting
+Linux. They are: ProxyDHCP, PXE, TFTP, and HTTP. Let's walk through
+the boot process for a PXE ROM.
+
+![Boot process graph](http://g.gravizo.com/g?
+digraph G {
+  ProxyDHCP1 [label="ProxyDHCP"];
+  ProxyDHCP2 [label="ProxyDHCP (iPXE)"];
+
+  ProxyDHCP1 -> PXE [label=< UEFI >, fontsize=11];
+  ProxyDHCP1 -> TFTP [label=< BIOS >, fontsize=11];
+  PXE -> TFTP;
+  TFTP -> ProxyDHCP2;
+  ProxyDHCP2 -> HTTP;
+}
+)
+
+## Step 1: DHCP/ProxyDHCP
+
+The first thing a PXE ROM does is request a configuration through
+DHCP, with some additional PXE options set to indicate that it wants
+to netboot. It expects a reply that mirrors some of these options, and
+includes boot instructions in addition to network configuration.
+
+The normal way of providing these options is to edit your DHCP
+server's configuration to provide them to clients that identify
+themselves as PXE clients. Unfortunately, reconfiguring your network's
+DHCP server is tedious at best, and impossible if your DHCP server is
+built into a consumer router, or managed by someone else.
+
+Pixiecore instead uses a feature of the PXE specification called
+_ProxyDHCP_. As you might guess from the name, ProxyDHCP is not a
+proxy at all (yeah, the PXE spec is like that), but a second DHCP
+server that only provides PXE configuration.
+
+When the PXE ROM sends out a `DHCPDISCOVER`, it gets two replies back:
+one containing only network configuration from the primary DHCP server
+(no PXE options), and one containing only PXE DHCP options from the
+ProxyDHCP server. The PXE firmware combines the two, and continues as
+if the primary server had provided all of the configuration.
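As a rough illustration of the first half of this exchange, the sketch below shows how a ProxyDHCP-style server might decide whether a `DHCPDISCOVER` comes from a PXE client. It is not Pixiecore's actual code (Pixiecore is written in Go), and the names `parse_options`, `is_pxe_discover`, and `build_discover` are hypothetical. It assumes the standard DHCP layout (a 236-byte BOOTP header, the magic cookie, then code/length/value options) and the convention that PXE firmwares set option 60 (vendor class identifier) to a string starting with `PXEClient`.

```python
MAGIC_COOKIE = b"\x63\x82\x53\x63"  # marks the start of the DHCP options area
OPTIONS_OFFSET = 240                # 236-byte BOOTP header + 4-byte cookie
OPT_PAD, OPT_END = 0, 255
OPT_MESSAGE_TYPE = 53
OPT_VENDOR_CLASS = 60
DHCPDISCOVER = 1


def parse_options(packet):
    """Return the DHCP options of a raw packet as a {code: value} dict."""
    if len(packet) < OPTIONS_OFFSET or packet[236:240] != MAGIC_COOKIE:
        raise ValueError("not a DHCP packet")
    options = {}
    i = OPTIONS_OFFSET
    while i < len(packet):
        code = packet[i]
        if code == OPT_END:
            break
        if code == OPT_PAD:  # padding is a single byte with no length field
            i += 1
            continue
        if i + 1 >= len(packet):
            raise ValueError("truncated option header")
        length = packet[i + 1]
        value = packet[i + 2 : i + 2 + length]
        if len(value) != length:
            raise ValueError("truncated option value")
        options[code] = value
        i += 2 + length
    return options


def is_pxe_discover(packet):
    """True if the packet is a DHCPDISCOVER sent by a PXE client."""
    options = parse_options(packet)
    is_discover = options.get(OPT_MESSAGE_TYPE) == bytes([DHCPDISCOVER])
    vendor_class = options.get(OPT_VENDOR_CLASS, b"")
    return is_discover and vendor_class.startswith(b"PXEClient")


def build_discover(vendor_class):
    """Build a minimal DHCPDISCOVER for the demo (header fields zeroed)."""
    options = bytes([OPT_MESSAGE_TYPE, 1, DHCPDISCOVER])
    if vendor_class:
        options += bytes([OPT_VENDOR_CLASS, len(vendor_class)]) + vendor_class
    return bytes(236) + MAGIC_COOKIE + options + bytes([OPT_END])


pxe_packet = build_discover(b"PXEClient:Arch:00000:UNDI:002001")
plain_packet = build_discover(b"MSFT 5.0")
```

With these helpers, `is_pxe_discover(pxe_packet)` is true and `is_pxe_discover(plain_packet)` is false, so a server built this way would stay silent for ordinary DHCP clients and leave them entirely to the primary DHCP server.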
+
+The client will finish network configuration with the primary DHCP
+server (we're not involved with that), and will then proceed with the
+next steps of booting.
+
+## Step 1.5: PXE-ish
+
+For classic BIOS clients, the ProxyDHCP response points to a TFTP
+server and filename, and we go straight to step 2. For UEFI firmwares,
+however, there's an additional step.
+
+Sadly, many UEFI firmwares in the wild don't implement PXE properly,
+and fail to chainload correctly if you send them a ProxyDHCP response
+pointing directly to a TFTP server.
+
+To get UEFI clients to boot reliably, we need to send them a ProxyDHCP
+response that is invalid according to the PXE
+specification. Specifically, a reply that lacks DHCP option 43 (PXE
+Vendor Options).
+
+Once the UEFI client has configured its network, it will then send a
+`DHCPREQUEST` packet to port 4011 of our ProxyDHCP server. This is the
+"PXE Boot Server" port, another relatively obscure part of the PXE
+specification that allows PXE firmwares to display boot menus
+natively, among other things.
+
+Like our ProxyDHCP response, the PXE boot request and response in this
+exchange are not valid according to the PXE specification, since they
+both lack DHCP option 43 but include other PXE-specific options. Our
+response to this request is essentially what we told BIOS clients in
+step 1: here's a TFTP server and filename, go boot that.
+
+So, UEFI clients need to do this little indirection before catching up
+with their BIOS cousins.
+
+### What is this strange protocol?
+
+I haven't fully verified this yet, but the protocol seems to be
+"BINL", a Microsoft proprietary fork of PXE that was introduced in the
+early days of EFI.
+
+There's no public specification for this protocol, but there is an
+open-source implementation of a BINL client in the form of the
+TianoCore EDK2 UEFI firmware.
+We can also examine packet captures of
+machines being booted by "Windows Deployment Services", the service
+that performs network installation of Windows, and see that they use
+this protocol.
+
+Both of these secondary sources strongly indicate that what we're
+actually doing here is telling the UEFI client to use BINL in our
+ProxyDHCP response, and then telling it to use TFTP in our BINL
+response.
+
+Modern UEFI firmwares (e.g. OVMF, derived from the TianoCore codebase)
+support both standard PXE and this BINL variant, if BINL is what this
+is. However, many firmwares that are still shipping in new devices
+seem to only support BINL, which makes BINL the lowest common
+denominator that has the best chance of booting all UEFI clients.
+
+This is a somewhat sad state of affairs given that Intel provides an
+open-source reference UEFI implementation that has supported PXE for a
+long time. However, industry practice seems to be to maintain
+seldom-to-never updated private forks of TianoCore, with extensive
+non-public modifications. As a result, it's likely we'll be stuck in
+this situation for a long time to come.
+
+## Step 2: TFTP
+
+TFTP is, as the name suggests, a trivial protocol for transferring
+files. I have found some PXE ROMs that manage to add unnecessary
+complexity even to that, but by and large, this step is
+straightforward.
+
+However, TFTP is quite slow, because it doesn't support transfer
+windows (well, it does, but it's an extension defined in an RFC
+published in 2015, so guess how many PXE ROMs implement it...). As a
+result, you must pay one round-trip per ~1500 bytes transferred, and
+even on a gigabit network, that slows things down.
+
+Given that some netboot images are quite large (CoreOS clocks in at
+almost 200MB), what we really want is to switch to a more efficient
+protocol. That's where iPXE comes in.
+
+iPXE is a small bootloader that knows how to boot Linux kernels, and
+can speak HTTP.
+iPXE is between 50kB and 900kB (depending on the
+architecture and BIOS vs. UEFI), which even over TFTP is very fast to
+transfer.
+
+Thus, Pixiecore uses TFTP only to transfer iPXE, and from there steers
+it to HTTP for the rest of the loading process.
+
+## Step 3: ProxyDHCP, again
+
+Unlike some other bootloaders such as PXELINUX, iPXE does not reuse the
+firmware's preexisting network settings. Instead, it starts the
+process all over again with a DHCP request. Again, we send it a
+ProxyDHCP response.
+
+To break the infinite loop here, we can detect in the DHCP request
+that the client is iPXE, and so we serve up a different response, one
+that just points to an HTTP URL as the boot filename. iPXE interprets
+this as a script (a sequence of iPXE commands, with minimal control
+flow) that it should download and run.
+
+One more catch is that iPXE has a race condition: when configuring
+DHCP, if it receives the regular DHCP response before the ProxyDHCP
+response, it will quickly finish configuring the network... and then
+complain that it has no boot instructions. To counteract this, we
+embed an iPXE script in the iPXE binary itself, telling it to retry
+network configuration until it gets a boot filename out of it. So,
+we're actually chainloading from one iPXE script (embedded) to another
+(from HTTP).
+
+## Step 4: HTTP
+
+We've finally crawled our way up to the late nineties - we can speak
+HTTP! Pixiecore's HTTP server is wonderfully familiar and normal. It
+just serves up a trivial iPXE script telling it to boot a Linux
+kernel, and the user-provided kernel and initrd files.
+
+iPXE grabs all of that, and finally, Linux boots.
+
+## Recap
+
+This is what the whole boot process looks like on the wire.
+
+### Dramatis Personae
+
+- **PXE ROM**, a brittle firmware burned into the network card.
+- **DHCP server**, a plain old DHCP server providing network configuration.
+- **Pixiecore**, the Hero and server of ProxyDHCP, PXE, TFTP and HTTP.
+- **iPXE**, an open source [bootloader](http://ipxe.org).
+
+### Timeline
+
+- PXE ROM starts, broadcasts `DHCPDISCOVER`.
+- DHCP server responds with a `DHCPOFFER` containing network configs.
+- Pixiecore's ProxyDHCP server responds with a `DHCPOFFER` listing a TFTP file (BIOS) or BINL options (UEFI).
+- PXE ROM does a `DHCPREQUEST`/`DHCPACK` exchange with the DHCP server to get a network configuration.
+- (UEFI only) PXE ROM sends a `DHCPREQUEST` to Pixiecore's "PXE" server, asking for boot instructions.
+- (UEFI only) Pixiecore's "PXE" server responds with a `DHCPACK` listing a TFTP file.
+- PXE ROM downloads iPXE from Pixiecore's TFTP server, and hands off to iPXE.
+- iPXE starts, broadcasts `DHCPDISCOVER`.
+- DHCP server responds with a `DHCPOFFER` containing network configs.
+- Pixiecore's ProxyDHCP server responds with a `DHCPOFFER` listing an HTTP URL.
+- iPXE does a `DHCPREQUEST`/`DHCPACK` exchange with the DHCP server to get a network configuration.
+- iPXE fetches its boot script from Pixiecore's HTTP server.
+- iPXE fetches a kernel and ramdisk from Pixiecore's HTTP server, and boots Linux.
+
+# Known deviations from specifications
+
+Pixiecore aims to be compliant with the relevant specifications for
+TFTP, DHCP, and PXE. This section lists the places where Pixiecore
+deliberately deviates from the spec to support buggy clients.
+
+## Missing Client Machine Identifier (GUID) option
+
+Some PXE ROMs don't send DHCP option 97, "Client Machine Identifier
+(GUID)", in their DHCP and PXE requests. According to the PXE 2.1
+specification and RFC 4578, this makes the requests non-compliant:
+
+> This option MUST be present in all DHCP and PXE packets sent by PXE-compliant clients and servers.
+
+Pixiecore's behavior implements "SHOULD" instead of "MUST": if a
+client request has a GUID, Pixiecore's response will include a
+GUID. If the client request has no GUID, Pixiecore omits option 97 in
+its response.
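The option 97 rule above fits in a few lines. This is a minimal sketch, not Pixiecore's implementation: `reply_options` is a hypothetical helper, options are modeled as a plain dict keyed by DHCP option code, the server address and filename are made up, and echoing the client's own identifier back unchanged is an assumption (the text only says the response carries a GUID).

```python
OPT_TFTP_SERVER = 66   # TFTP server name
OPT_BOOT_FILE = 67     # boot filename
OPT_CLIENT_GUID = 97   # Client Machine Identifier (GUID), RFC 4578


def reply_options(request_options, boot_options):
    """Return reply options, carrying option 97 only if the client sent it."""
    reply = dict(boot_options)  # copy, so the caller's dict is left untouched
    guid = request_options.get(OPT_CLIENT_GUID)
    if guid is None:
        # Non-compliant client: stay lenient and omit the option entirely.
        reply.pop(OPT_CLIENT_GUID, None)
    else:
        # Assumption: echo the client's identifier back unchanged.
        reply[OPT_CLIENT_GUID] = guid
    return reply


# Hypothetical boot instructions and a client identifier
# (type byte 0 followed by a 16-byte GUID).
boot = {OPT_TFTP_SERVER: b"192.0.2.10", OPT_BOOT_FILE: b"ipxe.pxe"}
guid = bytes([0]) + bytes(range(16))

with_guid = reply_options({OPT_CLIENT_GUID: guid}, boot)
without_guid = reply_options({}, boot)
```

Here `with_guid` carries option 97 and `without_guid` does not, which is the "SHOULD" behavior: strict clients get the option they expect, and clients that never sent one are not handed an identifier they did not ask for.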
diff --git a/pixiecore/README.md b/pixiecore/README.md
index 9ae5eef..35afe32 100644
--- a/pixiecore/README.md
+++ b/pixiecore/README.md
@@ -18,6 +18,9 @@ into a single binary that can cooperate with your network's existing
 DHCP server. You don't need to reconfigure anything else in the
 network.
 
+If you're curious about the whole process that Pixiecore manages, you
+can read the details in [README.booting](README.booting.md).
+
 ## Installation
 
 Install Pixiecore via `go get`:
@@ -117,148 +120,3 @@ the host network stack.
 ```shell
 sudo docker run -v .:/image --net=host danderson/pixiecore boot /image/coreos_production_pxe.vmlinuz /image/coreos_production_pxe_image.cpio.gz
 ```
-
-## How it works
-
-Pixiecore implements four different, but related protocols in one
-binary, which together can take a PXE ROM from nothing to booting
-Linux. They are: ProxyDHCP, PXE, TFTP, and HTTP. Let's walk through
-the boot process for a PXE ROM.
-
-### DHCP/ProxyDHCP
-
-The first thing a PXE ROM does is request a configuration through
-DHCP, waiting for a DHCP reply that includes PXE vendor options. The
-normal way of providing these options is to edit your DHCP server's
-configuration to provide them to clients that identify themselves as
-PXE clients. Unfortunately, reconfiguring your network's DHCP server
-is tedious at best, and impossible if you DHCP server is built into a
-consumer router, or managed by someone else.
-
-Pixiecore instead uses a feature of the PXE specification called
-_ProxyDHCP_. As you might guess from the name, ProxyDHCP is not a
-proxy at all (yeah, the PXE spec is like that), but a second DHCP
-server that only provides PXE configuration.
-
-When the PXE ROM sends out a `DHCPDISCOVER`, it gets two replies back:
-one containing network configuration from the primary DHCP server, and
-one containing only PXE DHCP options from the ProxyDHCP server. The
-PXE firmware combines the two, and continues as if the primary server
-had provided all the configuration.
-
-### PXE
-
-In theory, you'd expect the ProxyDHCP server to just provide a TFTP
-server IP and a filename to the PXE firmware, and it would proceed to
-download and boot that just like the BOOTP of old.
-
-Sadly, the average quality of PXE ROM implementations is abysmal, and
-many of them fail to chainload correctly if you try to do this from a
-ProxyDHCP server.
-
-So, instead, we make use of the spec's "PXE menu" functionality, which
-lets you tell the PXE firmware to display a boot menu. Just like
-everything else in PXE, this is quite brittle, so nobody actually uses
-it to display menus - instead, they just push a more fully featured
-bootloader over PXE, and let that bootloader do the fancy work.
-
-However, PXE menus seem to work reliably when combined with
-ProxyDHCP... And the PXE configuration can provide a timeout after
-which the first menu entry is booted... And that timeout can be set to
-zero.
-
-So, we can just provide a single-entry menu, with a zero timeout, and
-chainload that way! But wait, there's more terribleness. PXE menu
-entries don't just list a TFTP server and file to load, because that
-would be too simple. Instead, each menu entry maps to a "Boot Server
-Type", and yet another DHCP option maps that boot server type to a set
-of IP addresses.
-
-Those IP addresses aren't TFTP servers, but PXE boot servers. PXE boot
-servers listen on port 4011. They use the DHCP packet format, but only
-as a way of conveying a DHCP option that says "please tell me how to
-boot the following Boot Server Type". It's quite possibly the least
-efficient protocol encoding ever devised.
-
-At long last, when the PXE server receives that request, it can reply
-with a BOOTP-ish packet that specified next-server and a filename. And
-_those_ are, at long last, TFTP.
-
-### TFTP
-
-After navigating the eldritch horror of PXE, TFTP is a breath of fresh
-air. It is indeed a trivial protocol for transferring files. I have
-found some PXE ROMs that manage to add unnecessary complexity even to
-that, but by and large, this step is straightforward.
-
-However, TFTP is quite slow, because it doesn't support transfer
-windows (well, it does, but it's an extension defined in an RFC
-published in 2015, so guess how many PXE ROMs implement it...). As a
-result, you must pay one round-trip per ~1500 bytes transferred, and
-even on a gigabit network, that slows things down.
-
-Given that some netboot images are quite large (CoreOS clocks in at
-almost 200MB), what we really want is to switch to a more efficient
-protocol. That's where PXELINUX comes in.
-
-PXELINUX is a small bootloader that knows how to boot Linux kernels,
-and it comes in a variant that can speak HTTP. PXELINUX is 90kB, which
-even over TFTP is very fast to transfer.
-
-Thus, Pixiecore uses TFTP only to transfer PXELINUX, and from there
-steers it to HTTP for the rest of the loading process.
-
-### HTTP
-
-We've finally crawled our way up to the late nineties - we can speak
-HTTP! Pixiecore's HTTP server is wonderfully familiar and normal. It
-just serves up a support file that PXELINUX needs (`ldlinux.c32`), a
-trivial PXELINUX configuration telling it to boot a Linux kernel, and
-the user-provided kernel and initrd files.
-
-PXELINUX grabs all of that, and finally, Linux boots.
-
-### Recap
-
-This is what the whole boot process looks like on the wire.
-
-#### Dramatis Personae
-
-- **PXE ROM**, a brittle firmware burned into the network card.
-- **DHCP server**, a plain old DHCP server providing network configuration.
-- **Pixieboot**, the Hero and server of ProxyDHCP, PXE, TFTP and HTTP.
-- **PXELINUX**, an open source bootloader of the [Syslinux project](http://www.syslinux.org).
-
-#### Timeline
-
-- PXE ROM starts, broadcasts `DHCPDISCOVER`.
-- DHCP server responds with a `DHCPOFFER` containing network configs.
-- Pixiecore's ProxyDHCP server responds with a `DHCPOFFER` containing a PXE boot menu.
-- PXE ROM does a `DHCPREQUEST`/`DHCPACK` exchange with the DHCP server to get a network configuration.
-- PXE ROM processes the PXE boot menu, decides to boot menu entry 0.
-- PXE ROM sends a `DHCPREQUEST` to Pixiecore's PXE server, asking for a boot file.
-- Pixiecore's PXE server responds with a `DHCPACK` listing a TFTP
-  server, a boot filename, and a PXELINUX vendor option to make it use
-  HTTP.
-- PXE ROM downloads PXELINUX from Pixiecore's TFTP server, and hands off to PXELINUX.
-- PXELINUX fetches its configuration from Pixiecore's HTTP server.
-- PXELINUX fetches a kernel and ramdisk from Pixiecore's HTTP server, and boots Linux.
-
-## Known deviations from specifications
-
-Pixiecore aims to be compliant with the relevant specifications for
-TFTP, DHCP, and PXE. This section lists the places where Pixiecore
-deliberately deviates from the spec to support buggy clients.
-
-### Missing Client Machine Identifier (GUID) option
-
-Some PXE ROMs don't send DHCP option 97, "Client Machine Identifier
-(GUID)", in their DHCP and PXE requests. According to the PXE 2.1
-specification and RFC 4578, this makes the requests non-compliant:
-
-> This option MUST be present in all DHCP and PXE packets sent by PXE-compliant clients and servers.
-
-Pixiecore's behavior implements "SHOULD" instead of "MUST": if a
-client request has a GUID, Pixiecore's response will respond with a
-GUID. If the client request has no GUID, Pixiecore omits option 97 in
-its response.