pixiecore: Add a retry loop in the iPXE boot.

iPXE has an annoying race condition where it sometimes doesn't notice
the ProxyDHCP response when booting, and fails. So we embed a boot
script in the builtin iPXE binaries that implements the retry loop
recommended in the documentation. Empirically, this has resolved
flaky boots on my test machine; usually no more than a single
retry is needed.
David Anderson 2016-08-14 16:55:27 -07:00
parent b5e956b9fc
commit ff7a0b56c6
3 changed files with 62 additions and 4 deletions

pixiecore/boot.ipxe Normal file

@@ -0,0 +1,56 @@
#!ipxe
#
# This is the iPXE boot script that we embed into the iPXE binary.
#
# The entire reason for the existence of this script is that iPXE very
# eagerly configures DHCP as soon as it gets a DHCP response, and
# because of this it might miss the ProxyDHCP response that tells it
# how to boot. In this situation, `autoboot` (the default command)
# just fails and falls out of the PXE boot codepath, so we end up with
# machines that sometimes fail to "catch" the network boot.
#
# This script implements what the iPXE documentation recommends, which
# is to retry the `dhcp` command until iPXE does see a ProxyDHCP
# response, giving up after a fixed number of attempts. It's quite
# ugly, and a proper fix should really get upstreamed to iPXE, but for
# right now, this works.
set attempts:int32 10
set x:int32 0
# Try to get a filename from ProxyDHCP, retrying up to `attempts`
# times if we fail.
:loop
dhcp && isset ${filename} || goto retry
goto boot
:retry
iseq ${x} ${attempts} && goto fail ||
inc x
echo No ProxyDHCP response, retrying (attempt ${x}/${attempts})
goto loop
# Got a filename from ProxyDHCP, that's the actual boot script,
# off we go!
:boot
chain ${filename}
# Failure at this point probably means Pixiecore changed its mind
# mid-boot about whether this machine should be booted: we had
# already handed off to iPXE, but Pixiecore is no longer serving a
# boot script for it.
#
# Reboot the machine to restart the whole cycle (and presumably skip
# PXE completely this time).
#
# It's also possible we just got horribly unlucky and the network
# environment is such that we're consistently missing the ProxyDHCP
# reply. That really sucks, so give people pointers to bug filing
# here.
:fail
echo Failed to get a ProxyDHCP response after ${attempts} attempts
echo
echo If you are sure that Pixiecore is still trying to boot this machine,
echo please file a bug at https://github.com/google/netboot .
echo
echo Rebooting in 5 seconds...
sleep 5
reboot
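
The control flow of the script above can be checked with a rough Python model. This is only a sketch, not part of the commit: `simulate_boot` and `dhcp_ok` are hypothetical names, and `dhcp_ok` stands in for the iPXE expression `dhcp && isset ${filename}`. The model covers only the retry loop; it does not model `chain ${filename}` failing and falling through to `:fail`.

```python
from typing import Callable, Tuple


def simulate_boot(dhcp_ok: Callable[[], bool], attempts: int = 10) -> Tuple[str, int]:
    """Model the retry loop in boot.ipxe.

    dhcp_ok is a hypothetical stand-in for `dhcp && isset ${filename}`:
    it returns True once a ProxyDHCP response has supplied a filename.
    Returns (outcome, number of times `dhcp` was run).
    """
    x = 0      # mirrors `set x:int32 0`
    calls = 0  # bookkeeping for this model only; not in the script
    while True:                   # :loop
        calls += 1
        if dhcp_ok():
            return "boot", calls  # goto boot
        # :retry
        if x == attempts:         # iseq ${x} ${attempts} && goto fail ||
            return "fail", calls
        x += 1                    # inc x
        # The script echoes "retrying (attempt x/attempts)" here,
        # then does `goto loop`.


# A machine that never sees a ProxyDHCP response.
outcome, calls = simulate_boot(lambda: False)
print(outcome, calls)
```

Under this model, with `attempts` set to 10 the script runs `dhcp` up to 11 times (the initial try plus 10 retries) before reaching `:fail`.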


@@ -5,7 +5,9 @@ ipxe:
 git clone git://git.ipxe.org/ipxe.git
 (cd ipxe && git rev-parse HEAD >COMMIT-ID)
 rm -rf ipxe/.git
-(cd ipxe/src && make bin/undionly.kpxe bin-x86_64-efi/ipxe.efi bin-i386-efi/ipxe.efi)
+(cd ipxe/src && make bin/undionly.kpxe EMBED=../../../pixiecore/boot.ipxe)
+(cd ipxe/src && make bin-x86_64-efi/ipxe.efi EMBED=../../../pixiecore/boot.ipxe)
+(cd ipxe/src && make bin-i386-efi/ipxe.efi EMBED=../../../pixiecore/boot.ipxe)
 (cd ipxe && rm -rf bin && mkdir bin)
 mv -f ipxe/src/bin/undionly.kpxe ipxe/bin/undionly.kpxe
 mv -f ipxe/src/bin-x86_64-efi/ipxe.efi ipxe/bin/ipxe-x86_64.efi
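
The three `make` invocations in this hunk differ only in their target; each passes the same `EMBED=` path, written relative to `ipxe/src`. As an illustrative sketch only (the commit itself uses plain Makefile recipe lines), the same command lines can be generated from a target list. `ipxe_build_commands`, `TARGETS`, and `EMBED_SCRIPT` are hypothetical names introduced here.

```python
from typing import List, Sequence

# Path exactly as it appears in the Makefile, relative to ipxe/src.
EMBED_SCRIPT = "../../../pixiecore/boot.ipxe"

# The three iPXE binaries built by the Makefile.
TARGETS = (
    "bin/undionly.kpxe",
    "bin-x86_64-efi/ipxe.efi",
    "bin-i386-efi/ipxe.efi",
)


def ipxe_build_commands(
    targets: Sequence[str] = TARGETS, embed: str = EMBED_SCRIPT
) -> List[List[str]]:
    """Return one `make <target> EMBED=<script>` argv per target."""
    return [["make", target, f"EMBED={embed}"] for target in targets]


# Print the commands; they are meant to be run from ipxe/src.
for argv in ipxe_build_commands():
    print(" ".join(argv))
```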
