The kola test scripts are named after the platforms, but the matching image
format names are hard to know and remember, e.g., whether "ami" or
"ami_vmdk" is needed for AWS tests and whether it's "vmware" or
"vmware_ova".
To address this, the vms build stage now also accepts platform names as
format input and, for each platform, automatically generates the image
types needed to run the tests.
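A minimal sketch of the kind of platform-to-format mapping this implies (the
helper function and the per-platform format names are illustrative, not the
actual pipeline code):

    # Map a kola platform name to the image format(s) its tests need.
    image_formats_for_platform() {
      case "$1" in
        aws)    echo "ami_vmdk" ;;
        vmware) echo "vmware_ova" ;;
        qemu)   echo "qemu_uefi" ;;
        *)      echo "$1" ;;
      esac
    }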
The garbage collect job should also clean up kola resources if a test
job failed to do so because of forced termination or misbehavior. The
cleanup is done by "ore", which needs credentials just like kola does.
Run ore from the mantle container image. Unfortunately, Docker does not
support Podman's --env-host option, so the env vars have to be passed
explicitly. While --env-file=<(env) would work, it passes a lot of
variables that make the container behave a bit strangely.
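For illustration, the explicit passing could look roughly like this (the image
name, the variable selection, and the gc invocation shown are assumptions):

    # `-e VAR` without a value copies VAR from the host environment,
    # so only the credentials ore needs end up in the container.
    docker run --rm \
      -e AWS_ACCESS_KEY_ID -e AWS_SECRET_ACCESS_KEY -e AWS_DEFAULT_REGION \
      "${MANTLE_IMAGE}" ore aws gc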
Use upstream Ignition (coreos/ignition) and apply our patches on top of
it. This is the same approach currently used for coreos/afterburn.
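On the ebuild side this can rely on the standard Gentoo PATCHES mechanism,
sketched here with an illustrative patch name:

    # sys-apps/ignition ebuild: build upstream coreos/ignition and apply
    # our patches on top; the default src_prepare applies PATCHES.
    PATCHES=(
      "${FILESDIR}/0001-flatcar-example-change.patch"
    )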
Signed-off-by: Mathieu Tortuyaux <mtortuyaux@microsoft.com>
The removal of the mantle ebuild file also meant that dnsmasq is no longer
installed into the SDK, yet we actually need it to run kola
QEMU tests in the SDK on the original CI pipeline. As long as the
original CI pipeline is kept, we have to keep kola's dependencies
like QEMU and dnsmasq around.
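One way to keep them around is to list them explicitly as SDK dependencies; a
rough sketch (where exactly these atoms are declared is an assumption):

    # Keep kola's runtime needs installed in the SDK.
    RDEPEND="
      app-emulation/qemu
      net-dns/dnsmasq
    "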
pahole is a build-time dependency of our kernel build because we set
CONFIG_DEBUG_INFO_BTF. If pahole is missing, a `make modules_prepare` with our
kernel config results in config symbols changing. This affects people
building kernel modules against coreos-sources in the developer
container, but not the SDK, because pahole is already in sdk-depends.
pahole is now an (explicit) BDEPEND of the coreos-kernel/coreos-modules
packages, and we'll make it an RDEPEND of coreos-sources so that it is pulled
in whenever it might be necessary. Also add it to the coreos-dev package so
that it is included in the developer container by default; the uncompressed
size increase is <1MB.
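A sketch of the dependency declarations (the exact placement in the ebuilds is
an assumption; dev-util/pahole is the package atom):

    # coreos-kernel / coreos-modules: needed while building (BTF generation).
    BDEPEND="dev-util/pahole"

    # coreos-sources: pull pahole in wherever the kernel sources are used.
    RDEPEND="dev-util/pahole"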
This is the fallback path that nvidia publishes for verifying that device node
creation was successful. It now handles multiple GPUs and creates the
nvidia-uvm node with a dynamic major.
The odd thing is that nvidia-smi and nvidia-modprobe also create some device
nodes and files under /dev, but this does not appear to be well documented, so
keep the static creation.
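A condensed sketch of that fallback (major number 195 and the dynamic
nvidia-uvm major follow NVIDIA's published example; the GPU count detection
via /proc and the error handling are simplified assumptions):

    # One node per GPU plus nvidiactl; the nvidia devices use major 195.
    NUM_GPUS=$(ls /proc/driver/nvidia/gpus/ | wc -l)
    for i in $(seq 0 $((NUM_GPUS - 1))); do
      mknod -m 666 "/dev/nvidia${i}" c 195 "${i}"
    done
    mknod -m 666 /dev/nvidiactl c 195 255

    # nvidia-uvm gets a dynamic major that must be read from /proc/devices.
    UVM_MAJOR=$(awk '/nvidia-uvm/ {print $1}' /proc/devices)
    mknod -m 666 /dev/nvidia-uvm c "${UVM_MAJOR}" 0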
This involves putting libraries under /usr/lib64 and kernel modules under
/usr/lib/modules. This is an experiment in making the nvidia installation also
work as a sysext, but there are still some issues around that. The major
issue was that `systemd-sysext refresh` would remove the OEM symlink and I
don't feel comfortable with running `systemctl restart systemd-sysext` from
within another unit.
If anyone wants to try it, it's now a matter of:
ln -s /opt/nvidia/current /run/extensions/nvidia-driver
Bonus points for moving nvidia binaries from /opt/bin to
/opt/nvidia/current/usr/bin.
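For reference, the resulting tree would then roughly look like this (the
extension-release file is what systemd-sysext requires to activate an image
named nvidia-driver; the exact layout shown is an assumption):

    /opt/nvidia/current/usr/bin/nvidia-smi
    /opt/nvidia/current/usr/lib64/libnvidia-*.so*
    /opt/nvidia/current/usr/lib/modules/<kernel version>/...
    /opt/nvidia/current/usr/lib/extension-release.d/extension-release.nvidia-driver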
Since we no longer need to run emerge in the developer container, we might as
well treat the developer container more like a container image and use an
ephemeral overlay.
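As a rough illustration of what the ephemeral overlay amounts to (the paths
are illustrative, not the actual script):

    # Upper/work dirs on /run (a tmpfs), so every change made inside the
    # developer container disappears on reboot while the image stays pristine.
    mkdir -p /run/devcontainer/{upper,work,merged}
    mount -t overlay overlay \
      -o lowerdir=/var/lib/devcontainer-rootfs,upperdir=/run/devcontainer/upper,workdir=/run/devcontainer/work \
      /run/devcontainer/merged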
Currently the setup-nvidia script fails when re-executed. It should work when
the driver is already built and just needs to be loaded, and when it needs to
be rebuilt for a new kernel (even if the driver version has not changed).
To make this work, several changes were necessary:
* `./nvidia*.run -x -s` fails when the archive is already unpacked. Allow this
  so that we can rebuild.
* There are several implicit module dependencies of the nvidia modules,
  related to i2c/ipmi. Probe those explicitly (see the sketch after this list).
* `[ -f /dev/nvidia* ]` fails because those are character devices, so a
  `[ -c ... ]` check is needed.
* `nvidia-modprobe` previously always failed because it doesn't actually know
  the location of the modules and can only call modprobe (which looks into
  /lib/modules/). We now explicitly probe the important modules; at that point
  nvidia-modprobe just creates additional device nodes.
* `is_nvidia_installation_required` checks whether building and loading is needed.
Factor out the loading check so that we can reload the module after an update.
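A sketch of the explicit probing and the corrected check (the dependency
module names and the ${NVIDIA_BUILD_DIR} variable are illustrative
assumptions based on the i2c/ipmi hint above):

    # Load the implicit dependencies first, then the nvidia modules from
    # their non-standard location (modprobe would only search /lib/modules/).
    for mod in i2c-core ipmi-msghandler ipmi-devintf; do
      modprobe "${mod}" || true
    done
    insmod "${NVIDIA_BUILD_DIR}/kernel/nvidia.ko"
    insmod "${NVIDIA_BUILD_DIR}/kernel/nvidia-uvm.ko"

    # Character-device test instead of the broken regular-file test.
    [ -c /dev/nvidia0 ] && echo "device nodes present"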
Currently the script will reuse a developer container that was downloaded once,
without ensuring that it matches the version of the running image. This works
on the first boot, but wouldn't be correct after an OS update.
To resolve this, add the version number to the downloaded filename and check
for the versioned dev container file. When that file is missing, we also clean
up all other dev container files via a glob remove.
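Roughly, the versioned handling looks like this (the file name pattern and the
version source are illustrative):

    # Version of the running image, taken from os-release.
    . /usr/lib/os-release
    DEV_CONTAINER="flatcar_developer_container-${VERSION_ID}.bin"

    if [ ! -e "${DEV_CONTAINER}" ]; then
      # Drop stale downloads from older OS versions, then re-download.
      rm -f flatcar_developer_container-*.bin
      # ... download ${DEV_CONTAINER} here ...
    fi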
...by providing /etc/flatcar/nvidia-metadata. Newer driver packages do not
support some older Nvidia cards. An example is the Tesla K80 cards in
Standard_NC6 VMs on Azure, which are only supported up to the 470.x driver
version. To allow users to continue using those, give them a way to override
the driver version through /etc/flatcar/nvidia-metadata. For example, this
entry could be used to pin a specific driver version:
NVIDIA_DRIVER_VERSION=470.103.01
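A sketch of how the setup script can consume the override
(DEFAULT_NVIDIA_DRIVER_VERSION is a hypothetical placeholder for whatever
version the script ships with):

    # /etc/flatcar/nvidia-metadata, if present, overrides the built-in default.
    if [ -r /etc/flatcar/nvidia-metadata ]; then
      . /etc/flatcar/nvidia-metadata
    fi
    : "${NVIDIA_DRIVER_VERSION:=${DEFAULT_NVIDIA_DRIVER_VERSION}}"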
There are two ways to build the nvidia driver: either against a full kernel
source tree in /usr/src/linux, or against a slim kernel-devel equivalent in
/lib/modules/*/build. The /lib/modules/*/build tree is provided by
sys-kernel/coreos-modules; see `install_build_source`. The interesting thing is
that, in the absence of --kernel-source-path, nvidia-installer autodetects which
one to use and already builds against /lib/modules/*/build on Flatcar right now.
By passing --kernel-name we make that choice explicit, and this allows us to
skip the emerge steps of the build.
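A sketch of the explicit invocation (the version and kernel variables are
illustrative; --silent and --kernel-name are existing nvidia-installer
options):

    # Name the target kernel instead of pointing at a full source tree, so
    # the installer builds against /lib/modules/<kver>/build.
    ./NVIDIA-Linux-x86_64-${NVIDIA_DRIVER_VERSION}/nvidia-installer \
      --silent --kernel-name="${KERNEL_VERSION}"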
Since this runs in the developer container, there is also no point in trying to
execute systemctl or depmod, so pass the flags to disable usage of those.
Signed-off-by: Jeremi Piotrowski <jpiotrowski@microsoft.com>