pixiecore: update the boot walkthrough to include iPXE and UEFI.

David Anderson 2016-08-24 01:58:11 -07:00
parent fef14c66e8
commit 7929a15c6a
2 changed files with 213 additions and 145 deletions

pixiecore/README.booting.md Normal file

@ -0,0 +1,210 @@
# How it works
Pixiecore implements four different but related protocols in one
binary, which together can take a PXE ROM from nothing to booting
Linux. They are: ProxyDHCP, PXE, TFTP, and HTTP. Let's walk through
the boot process for a PXE ROM.
![Boot process graph](http://g.gravizo.com/g?
digraph G {
ProxyDHCP1 [label="ProxyDHCP"];
ProxyDHCP2 [label="ProxyDHCP (iPXE)"];
ProxyDHCP1 -> PXE [label=< <i>UEFI</i> >, fontsize=11];
ProxyDHCP1 -> TFTP [label=< <i>BIOS</i> >, fontsize=11];
PXE -> TFTP;
TFTP -> ProxyDHCP2;
ProxyDHCP2 -> HTTP;
}
)
## Step 1: DHCP/ProxyDHCP
The first thing a PXE ROM does is request a configuration through
DHCP, with some additional PXE options set to indicate that it wants
to netboot. It expects a reply that mirrors some of these options, and
includes boot instructions in addition to network configuration.
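As a rough illustration of "identifying as a PXE client": PXE firmware sets the DHCP vendor class identifier (option 60) to a string starting with `PXEClient`. The sketch below checks for that prefix. The `Options` map is a hypothetical stand-in for a decoded DHCP packet, not Pixiecore's actual data structure.

```go
package main

import (
	"fmt"
	"strings"
)

// Options is a hypothetical decoded form of a DHCP packet's options,
// keyed by option code.
type Options map[byte][]byte

// isPXEClient reports whether the vendor class identifier (option 60)
// starts with "PXEClient", which is how PXE firmware marks its requests.
func isPXEClient(opts Options) bool {
	return strings.HasPrefix(string(opts[60]), "PXEClient")
}

func main() {
	pxe := Options{60: []byte("PXEClient:Arch:00000:UNDI:002001")}
	other := Options{60: []byte("MSFT 5.0")}
	fmt.Println(isPXEClient(pxe), isPXEClient(other))
}
```

A server that only wants to talk to netbooting machines can drop every request for which this check fails.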
The normal way of providing these options is to edit your DHCP
server's configuration to provide them to clients that identify
themselves as PXE clients. Unfortunately, reconfiguring your network's
DHCP server is tedious at best, and impossible if your DHCP server is
built into a consumer router, or managed by someone else.
Pixiecore instead uses a feature of the PXE specification called
_ProxyDHCP_. As you might guess from the name, ProxyDHCP is not a
proxy at all (yeah, the PXE spec is like that), but a second DHCP
server that only provides PXE configuration.
When the PXE ROM sends out a `DHCPDISCOVER`, it gets two replies back:
one containing only network configuration from the primary DHCP server
(no PXE options), and one containing only PXE DHCP options from the
ProxyDHCP server. The PXE firmware combines the two, and continues as
if the primary server had provided all of the configuration.
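The combining step can be pictured with a small sketch. The `Offer` type and its fields are hypothetical simplifications (a real DHCPOFFER carries many more fields), and the addresses and filename are placeholders.

```go
package main

import "fmt"

// Offer is a hypothetical, heavily simplified view of a DHCPOFFER.
type Offer struct {
	ClientIP     string // address offered to the client ("" if none)
	BootServer   string // where to fetch the boot file ("" if none)
	BootFilename string // what to fetch ("" if none)
}

// combine mimics what the firmware does conceptually: network settings
// come from the primary offer, boot instructions from the ProxyDHCP offer.
func combine(primary, proxy Offer) Offer {
	return Offer{
		ClientIP:     primary.ClientIP,
		BootServer:   proxy.BootServer,
		BootFilename: proxy.BootFilename,
	}
}

func main() {
	primary := Offer{ClientIP: "192.0.2.50"}
	proxy := Offer{BootServer: "192.0.2.10", BootFilename: "boot-file"}
	fmt.Printf("%+v\n", combine(primary, proxy))
}
```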
The client will finish network configuration with the primary DHCP
server (we're not involved with that), and will then proceed with the
next steps of booting.
## Step 1.5: PXE-ish
For classic BIOS clients, the ProxyDHCP response points to a TFTP
server and filename, and we go straight to step 2. For UEFI firmwares,
however, there's an additional step.
Sadly, many UEFI firmwares in the wild don't implement PXE properly,
and fail to chainload correctly if you send them a ProxyDHCP response
pointing directly to a TFTP server.
To get UEFI clients to boot reliably, we need to send them a ProxyDHCP
response that is invalid according to the PXE
specification. Specifically, a reply that lacks DHCP option 43 (PXE
Vendor Options).
Once the UEFI client has configured its network, it will then send a
DHCPREQUEST packet to port 4011 of our ProxyDHCP server. This is the
"PXE Boot Server" port, another relatively obscure part of the PXE
specification that allows PXE firmwares to display boot menus
natively, among other things.
Like our ProxyDHCP response, the PXE boot request and response in this
exchange are not valid according to the PXE specification, since they
both lack DHCP option 43 but include other PXE-specific options. Our
response to this request is essentially what we told BIOS clients in
step 1: here's a TFTP server and filename, go boot that.
So, UEFI clients need to go through this little indirection before
catching up with their BIOS cousins.
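One way to tell the two kinds of client apart is the client system architecture option (option 93) defined in RFC 4578, where 0 means a BIOS x86 PC and 6, 7 and 9 are EFI variants. The sketch below is an assumption about how such a decision could be made, not a description of Pixiecore's code. The `Firmware` type and `nextStep` helper are hypothetical.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// Firmware is a hypothetical classification of the booting client.
type Firmware int

const (
	FirmwareUnknown Firmware = iota
	FirmwareBIOS
	FirmwareUEFI
)

// classify reads the first architecture type from option 93, which
// RFC 4578 encodes as a list of 16-bit big-endian values.
func classify(opt93 []byte) Firmware {
	if len(opt93) < 2 {
		return FirmwareUnknown
	}
	switch binary.BigEndian.Uint16(opt93[:2]) {
	case 0: // Intel x86PC
		return FirmwareBIOS
	case 6, 7, 9: // EFI IA32, EFI BC, EFI x86-64
		return FirmwareUEFI
	}
	return FirmwareUnknown
}

// nextStep names where the walkthrough above sends each kind of client.
func nextStep(fw Firmware) string {
	switch fw {
	case FirmwareBIOS:
		return "TFTP"
	case FirmwareUEFI:
		return "PXE boot server on port 4011, then TFTP"
	}
	return "ignore"
}

func main() {
	fmt.Println(nextStep(classify([]byte{0, 0})))
	fmt.Println(nextStep(classify([]byte{0, 7})))
}
```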
### What is this strange protocol?
I haven't fully verified this yet, but the protocol seems to be
"BINL", a Microsoft proprietary fork of PXE that was introduced in the
early days of EFI.
There's no public specification for this protocol, but there is an
open-source implementation of a BINL client in the form of the
TianoCore EDK2 UEFI firmware. We can also examine packet captures of
machines being booted by "Windows Deployment Services", the service
that performs network installation of Windows, and see that they use
this protocol.
Both of these secondary sources strongly indicate that what we're
actually doing here is telling the UEFI client to use BINL in our
ProxyDHCP response, and then telling it to use TFTP in our BINL
response.
Modern UEFI firmwares (e.g. OVMF, derived from the TianoCore codebase)
support both standard PXE and this BINL variant, if BINL is what this
is. However, many firmwares that are still shipping in new devices
seem to only support BINL, which makes BINL the lowest common
denominator that has the best chance of booting all UEFI clients.
This is a somewhat sad state of affairs given that Intel provides an
open-source reference UEFI implementation that has supported PXE for a
long time. However, industry practice seems to be to maintain
seldom-to-never updated private forks of TianoCore, with extensive
non-public modifications. As a result, it's likely we'll be stuck in
this situation for a long time to come.
## Step 2: TFTP
TFTP is, as the name suggests, a trivial protocol for transferring
files. I have found some PXE ROMs that manage to add unnecessary
complexity even to that, but by and large, this step is
straightforward.
However, TFTP is quite slow, because it doesn't support transfer
windows (well, it does, but it's an extension defined in an RFC
published in 2015, so guess how many PXE ROMs implement it...). As a
result, you must pay one round-trip per ~1500 bytes transferred, and
even on a gigabit network, that slows things down.
Given that some netboot images are quite large (CoreOS clocks in at
almost 200MB), what we really want is to switch to a more efficient
protocol. That's where iPXE comes in.
iPXE is a small bootloader that knows how to boot Linux kernels, and
can speak HTTP. iPXE is between 50kB and 900kB (depending on the
architecture and BIOS vs. UEFI), which even over TFTP is very fast to
transfer.
Thus, Pixiecore uses TFTP only to transfer iPXE, and from there steers
it to HTTP for the rest of the loading process.
## Step 3: ProxyDHCP, again
Unlike some other bootloaders like PXELINUX, iPXE does not reuse the
firmware's preexisting network settings. Instead, it starts the
process all over again with a DHCP request. Again, we send it a
ProxyDHCP response.
To break the infinite loop here, we can detect in the DHCP request
that the client is iPXE, and so we serve up a different response, one
that just points to an HTTP URL as the boot filename. iPXE interprets
this as a script (a sequence of iPXE commands, with minimal control
flow) that it should download and run.
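iPXE announces itself by setting the DHCP user class (option 77) to `iPXE`, so the loop-breaking decision can be sketched as below. How Pixiecore performs this check internally is an assumption here. The `Options` map, the `bootFilename` helper, the file name and the URL are all hypothetical.

```go
package main

import "fmt"

// Options is a hypothetical decoded form of a DHCP packet's options,
// keyed by option code.
type Options map[byte][]byte

// isIPXE reports whether the request carries the user class (option 77)
// that iPXE sets on its DHCP requests.
func isIPXE(opts Options) bool {
	return string(opts[77]) == "iPXE"
}

// bootFilename picks what to hand out: the iPXE binary over TFTP for
// plain firmware, or an HTTP script URL once iPXE itself is asking.
func bootFilename(opts Options, tftpFile, scriptURL string) string {
	if isIPXE(opts) {
		return scriptURL
	}
	return tftpFile
}

func main() {
	firmware := Options{60: []byte("PXEClient")}
	ipxe := Options{60: []byte("PXEClient"), 77: []byte("iPXE")}
	fmt.Println(bootFilename(firmware, "ipxe-binary", "http://192.0.2.10/boot.ipxe"))
	fmt.Println(bootFilename(ipxe, "ipxe-binary", "http://192.0.2.10/boot.ipxe"))
}
```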
One more catch is that iPXE has a race condition: when configuring
DHCP, if it receives the regular DHCP response before the ProxyDHCP
response, it will quickly finish configuring the network... and then
complain that it has no boot instructions. To counteract this, we
embed an iPXE script in the iPXE binary itself, telling it to retry
network configuration until it gets a boot filename out of it. So,
we're actually chainloading from one iPXE script (embedded) to another
(from HTTP).
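A retry script of that kind might look roughly like the following, shown here as a Go constant such as one might embed at build time. This is a sketch of the idea, not the script Pixiecore actually ships. It uses only standard iPXE commands (`dhcp`, `isset`, `goto`, `chain`) and the `filename` setting.

```go
package main

import "fmt"

// embeddedScript is a hypothetical iPXE script: keep redoing DHCP until
// a boot filename has been received, then chainload it.
const embeddedScript = `#!ipxe
:retry
dhcp || goto retry
isset ${filename} || goto retry
chain ${filename}
`

func main() {
	fmt.Print(embeddedScript)
}
```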
## Step 4: HTTP
We've finally crawled our way up to the late nineties - we can speak
HTTP! Pixiecore's HTTP server is wonderfully familiar and normal. It
just serves up a trivial iPXE script telling it to boot a Linux
kernel, and the user-provided kernel and initrd files.
iPXE grabs all of that, and finally, Linux boots.
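The "trivial iPXE script" could be generated along these lines. The `bootScript` helper and the URLs are hypothetical, and the real script Pixiecore serves may differ. The iPXE commands used (`kernel`, `initrd`, `boot`) are standard.

```go
package main

import (
	"fmt"
	"strings"
)

// bootScript renders a minimal iPXE script that loads a kernel, any
// number of initrds, and then boots. cmdline may be empty.
func bootScript(kernelURL string, initrdURLs []string, cmdline string) string {
	var b strings.Builder
	b.WriteString("#!ipxe\n")
	b.WriteString("kernel " + kernelURL)
	if cmdline != "" {
		b.WriteString(" " + cmdline)
	}
	b.WriteString("\n")
	for _, u := range initrdURLs {
		b.WriteString("initrd " + u + "\n")
	}
	b.WriteString("boot\n")
	return b.String()
}

func main() {
	fmt.Print(bootScript(
		"http://192.0.2.10/kernel",
		[]string{"http://192.0.2.10/initrd"},
		"console=ttyS0",
	))
}
```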
## Recap
This is what the whole boot process looks like on the wire.
### Dramatis Personae
- **PXE ROM**, a brittle firmware burned into the network card.
- **DHCP server**, a plain old DHCP server providing network configuration.
- **Pixiecore**, the Hero and server of ProxyDHCP, PXE, TFTP and HTTP.
- **iPXE**, an open source [bootloader](http://ipxe.org).
### Timeline
- PXE ROM starts, broadcasts `DHCPDISCOVER`.
- DHCP server responds with a `DHCPOFFER` containing network configs.
- Pixiecore's ProxyDHCP server responds with a `DHCPOFFER` listing a TFTP file (BIOS) or BINL options (UEFI).
- PXE ROM does a `DHCPREQUEST`/`DHCPACK` exchange with the DHCP server to get a network configuration.
- (UEFI only) PXE ROM sends a `DHCPREQUEST` to Pixiecore's "PXE" server, asking for boot instructions.
- (UEFI only) Pixiecore's "PXE" server responds with a `DHCPACK` listing a TFTP file.
- PXE ROM downloads iPXE from Pixiecore's TFTP server, and hands off to iPXE.
- iPXE starts, broadcasts `DHCPDISCOVER`.
- DHCP server responds with a `DHCPOFFER` containing network configs.
- Pixiecore's ProxyDHCP server responds with a `DHCPOFFER` listing an HTTP URL.
- iPXE does a `DHCPREQUEST`/`DHCPACK` exchange with the DHCP server to get a network configuration.
- iPXE fetches its boot script from Pixiecore's HTTP server.
- iPXE fetches a kernel and ramdisk from Pixiecore's HTTP server, and boots Linux.
# Known deviations from specifications
Pixiecore aims to be compliant with the relevant specifications for
TFTP, DHCP, and PXE. This section lists the places where Pixiecore
deliberately deviates from the spec to support buggy clients.
## Missing Client Machine Identifier (GUID) option
Some PXE ROMs don't send DHCP option 97, "Client Machine Identifier
(GUID)", in their DHCP and PXE requests. According to the PXE 2.1
specification and RFC 4578, this makes the requests non-compliant:
> This option MUST be present in all DHCP and PXE packets sent by PXE-compliant clients and servers.
Pixiecore's behavior implements "SHOULD" instead of "MUST": if a
client request has a GUID, Pixiecore's response will include a
GUID. If the client request has no GUID, Pixiecore omits option 97 in
its response.
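In code, that policy amounts to something like the sketch below. It assumes the response simply echoes the client's option 97 value. The `Options` map and `echoGUID` helper are hypothetical, not Pixiecore's actual API.

```go
package main

import "fmt"

// Options is a hypothetical decoded form of a DHCP packet's options,
// keyed by option code.
type Options map[byte][]byte

// optClientGUID is DHCP option 97, "Client Machine Identifier (GUID)".
const optClientGUID = 97

// echoGUID copies option 97 from the request into the response when the
// client sent one, and leaves the response untouched otherwise.
func echoGUID(req, resp Options) {
	if guid, ok := req[optClientGUID]; ok {
		resp[optClientGUID] = append([]byte(nil), guid...)
	}
}

func main() {
	withGUID := Options{optClientGUID: make([]byte, 17)} // type byte + 16-byte UUID
	resp := Options{}
	echoGUID(withGUID, resp)
	fmt.Println(len(resp[optClientGUID]))

	resp = Options{}
	echoGUID(Options{}, resp)
	_, present := resp[optClientGUID]
	fmt.Println(present)
}
```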


@ -18,6 +18,9 @@ into a single binary that can cooperate with your network's existing
DHCP server. You don't need to reconfigure anything else in the
network.
If you're curious about the whole process that Pixiecore manages, you
can read the details in [README.booting](README.booting.md).
## Installation
Install Pixiecore via `go get`:
@ -117,148 +120,3 @@ the host network stack.
```shell
sudo docker run -v .:/image --net=host danderson/pixiecore boot /image/coreos_production_pxe.vmlinuz /image/coreos_production_pxe_image.cpio.gz
```
## How it works
Pixiecore implements four different, but related protocols in one
binary, which together can take a PXE ROM from nothing to booting
Linux. They are: ProxyDHCP, PXE, TFTP, and HTTP. Let's walk through
the boot process for a PXE ROM.
### DHCP/ProxyDHCP
The first thing a PXE ROM does is request a configuration through
DHCP, waiting for a DHCP reply that includes PXE vendor options. The
normal way of providing these options is to edit your DHCP server's
configuration to provide them to clients that identify themselves as
PXE clients. Unfortunately, reconfiguring your network's DHCP server
is tedious at best, and impossible if your DHCP server is built into a
consumer router, or managed by someone else.
Pixiecore instead uses a feature of the PXE specification called
_ProxyDHCP_. As you might guess from the name, ProxyDHCP is not a
proxy at all (yeah, the PXE spec is like that), but a second DHCP
server that only provides PXE configuration.
When the PXE ROM sends out a `DHCPDISCOVER`, it gets two replies back:
one containing network configuration from the primary DHCP server, and
one containing only PXE DHCP options from the ProxyDHCP server. The
PXE firmware combines the two, and continues as if the primary server
had provided all the configuration.
### PXE
In theory, you'd expect the ProxyDHCP server to just provide a TFTP
server IP and a filename to the PXE firmware, and it would proceed to
download and boot that just like the BOOTP of old.
Sadly, the average quality of PXE ROM implementations is abysmal, and
many of them fail to chainload correctly if you try to do this from a
ProxyDHCP server.
So, instead, we make use of the spec's "PXE menu" functionality, which
lets you tell the PXE firmware to display a boot menu. Just like
everything else in PXE, this is quite brittle, so nobody actually uses
it to display menus - instead, they just push a more fully featured
bootloader over PXE, and let that bootloader do the fancy work.
However, PXE menus seem to work reliably when combined with
ProxyDHCP... And the PXE configuration can provide a timeout after
which the first menu entry is booted... And that timeout can be set to
zero.
So, we can just provide a single-entry menu, with a zero timeout, and
chainload that way! But wait, there's more terribleness. PXE menu
entries don't just list a TFTP server and file to load, because that
would be too simple. Instead, each menu entry maps to a "Boot Server
Type", and yet another DHCP option maps that boot server type to a set
of IP addresses.
Those IP addresses aren't TFTP servers, but PXE boot servers. PXE boot
servers listen on port 4011. They use the DHCP packet format, but only
as a way of conveying a DHCP option that says "please tell me how to
boot the following Boot Server Type". It's quite possibly the least
efficient protocol encoding ever devised.
At long last, when the PXE server receives that request, it can reply
with a BOOTP-ish packet that specifies next-server and a filename. And
_those_ are, at long last, TFTP.
### TFTP
After navigating the eldritch horror of PXE, TFTP is a breath of fresh
air. It is indeed a trivial protocol for transferring files. I have
found some PXE ROMs that manage to add unnecessary complexity even to
that, but by and large, this step is straightforward.
However, TFTP is quite slow, because it doesn't support transfer
windows (well, it does, but it's an extension defined in an RFC
published in 2015, so guess how many PXE ROMs implement it...). As a
result, you must pay one round-trip per ~1500 bytes transferred, and
even on a gigabit network, that slows things down.
Given that some netboot images are quite large (CoreOS clocks in at
almost 200MB), what we really want is to switch to a more efficient
protocol. That's where PXELINUX comes in.
PXELINUX is a small bootloader that knows how to boot Linux kernels,
and it comes in a variant that can speak HTTP. PXELINUX is 90kB, which
even over TFTP is very fast to transfer.
Thus, Pixiecore uses TFTP only to transfer PXELINUX, and from there
steers it to HTTP for the rest of the loading process.
### HTTP
We've finally crawled our way up to the late nineties - we can speak
HTTP! Pixiecore's HTTP server is wonderfully familiar and normal. It
just serves up a support file that PXELINUX needs (`ldlinux.c32`), a
trivial PXELINUX configuration telling it to boot a Linux kernel, and
the user-provided kernel and initrd files.
PXELINUX grabs all of that, and finally, Linux boots.
### Recap
This is what the whole boot process looks like on the wire.
#### Dramatis Personae
- **PXE ROM**, a brittle firmware burned into the network card.
- **DHCP server**, a plain old DHCP server providing network configuration.
- **Pixiecore**, the Hero and server of ProxyDHCP, PXE, TFTP and HTTP.
- **PXELINUX**, an open source bootloader of the [Syslinux project](http://www.syslinux.org).
#### Timeline
- PXE ROM starts, broadcasts `DHCPDISCOVER`.
- DHCP server responds with a `DHCPOFFER` containing network configs.
- Pixiecore's ProxyDHCP server responds with a `DHCPOFFER` containing a PXE boot menu.
- PXE ROM does a `DHCPREQUEST`/`DHCPACK` exchange with the DHCP server to get a network configuration.
- PXE ROM processes the PXE boot menu, decides to boot menu entry 0.
- PXE ROM sends a `DHCPREQUEST` to Pixiecore's PXE server, asking for a boot file.
- Pixiecore's PXE server responds with a `DHCPACK` listing a TFTP
server, a boot filename, and a PXELINUX vendor option to make it use
HTTP.
- PXE ROM downloads PXELINUX from Pixiecore's TFTP server, and hands off to PXELINUX.
- PXELINUX fetches its configuration from Pixiecore's HTTP server.
- PXELINUX fetches a kernel and ramdisk from Pixiecore's HTTP server, and boots Linux.
## Known deviations from specifications
Pixiecore aims to be compliant with the relevant specifications for
TFTP, DHCP, and PXE. This section lists the places where Pixiecore
deliberately deviates from the spec to support buggy clients.
### Missing Client Machine Identifier (GUID) option
Some PXE ROMs don't send DHCP option 97, "Client Machine Identifier
(GUID)", in their DHCP and PXE requests. According to the PXE 2.1
specification and RFC 4578, this makes the requests non-compliant:
> This option MUST be present in all DHCP and PXE packets sent by PXE-compliant clients and servers.
Pixiecore's behavior implements "SHOULD" instead of "MUST": if a
client request has a GUID, Pixiecore's response will respond with a
GUID. If the client request has no GUID, Pixiecore omits option 97 in
its response.