To be able to distinguish changelog entries from each other, we should
write a specific project name, e.g. coreos-overlay, instead of `PR`.
Changelog entries with a bare `PR` usually cause a lot of additional
rework when doing actual releases.
The GitHub Actions were defined for the LTS stream directly, but we can
now follow the approach used for the other channels. This means that
in the future we could create new Actions for 2022 by copying
the current one and modifying it when 2023 becomes the new current LTS.
Either way, some manual work would be required to set up Actions for both
old and new at the same time (we have no "previous" symlink on Origin).
We could retire the old LTS Actions immediately because the releases
don't occur on a fixed schedule, but I think the automation is nice to
keep.
Use upstream ignition (coreos/ignition) and apply our patches on top of
it.
This is already done the same way for coreos/afterburn.
Signed-off-by: Mathieu Tortuyaux <mtortuyaux@microsoft.com>
The removal of the mantle ebuild file also meant that dnsmasq isn't
installed into the SDK anymore, yet we actually need it to run kola
QEMU tests in the SDK on the original CI pipeline. As long as the
original CI pipeline is kept, we have to keep kola's dependencies
like QEMU and dnsmasq around.
pahole is a build-time dependency of our kernel build, because we set
CONFIG_DEBUG_INFO_BTF. If pahole is missing, a `make modules_prepare` with our
kernel config results in symbols in the config changing. This will affect
people building kernel modules against coreos-sources in the developer
container, but not the SDK because pahole is already in sdk-depends.
pahole is now an (explicit) BDEPEND of all the coreos-kernel/coreos-modules
packages, and we'll make it an RDEPEND of coreos-sources so that it is pulled
in whenever it might be necessary. Also add it to the coreos-dev package so
that it is included in the developer container by default. The uncompressed
size increase is <1MB.
This is the fallback path that NVIDIA publishes for verifying that device node
creation was successful. It now handles multiple GPUs and creates the
nvidia-uvm node, with a dynamic major.
The weird thing is that nvidia-smi and nvidia-modprobe also create some device
nodes and files under /dev, but this does not appear to be well documented. So
keep the static creation.
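
A minimal sketch of the static node creation described above, assuming the
conventional static major 195 for the nvidia and nvidiactl nodes and a
/proc/devices lookup for the dynamic nvidia-uvm major. The function names
are hypothetical, and the commands are printed instead of executed so the
logic can be tried without root:

```shell
# Look up the major number registered for a device name in a
# /proc/devices-style file (first column: major, second column: name).
lookup_major() {
    awk -v name="$2" '$2 == name { print $1; exit }' "$1"
}

# Print the mknod commands for gpu_count GPUs, the control node and,
# if the nvidia-uvm module has registered a major, the UVM node.
print_mknod_commands() {
    devices_file=$1
    gpu_count=$2
    i=0
    while [ "$i" -lt "$gpu_count" ]; do
        # Assumption: static major 195, minor equal to the GPU index.
        echo "mknod -m 666 /dev/nvidia$i c 195 $i"
        i=$((i + 1))
    done
    echo "mknod -m 666 /dev/nvidiactl c 195 255"
    uvm_major=$(lookup_major "$devices_file" nvidia-uvm)
    if [ -n "$uvm_major" ]; then
        echo "mknod -m 666 /dev/nvidia-uvm c $uvm_major 0"
    fi
}

# Example call on a running system:
#   print_mknod_commands /proc/devices 2
```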
This involves putting libraries under /usr/lib64 and kernel modules under
/usr/lib/modules. This is an experiment in making the nvidia installation work
as a sysext as well, but there are still some issues around that. The major
issue was that `systemd-sysext refresh` would remove the OEM symlink and I
don't feel comfortable with `systemctl restart systemd-sysext` from within
another unit.
If anyone wants to try it, it's now a matter of:
    ln -s /opt/nvidia/current /run/extensions/nvidia-driver
Bonus points for moving nvidia binaries from /opt/bin to
/opt/nvidia/current/usr/bin.
Since we no longer need to run emerge in the developer container, we might as
well treat the developer container more like a container image and use an
ephemeral overlay.
Currently the setup-nvidia script fails when re-executed. It should work in
cases when the driver is already built and just needs to be loaded, or when it
needs to be rebuilt for a new kernel (even if the driver version has not changed).
To make this work, several changes were necessary:
* `./nvidia*.run -x -s` fails when already unpacked. Allow it so that we can
rebuild.
* there are several module dependencies for nvidia modules that are implicit,
related to i2c/ipmi. Probe those explicitly.
* `[ -f /dev/nvidia* ]` fails because those are character devices, so we need a
`[ -c ... ]` check.
* `nvidia-modprobe` previously always failed, because it doesn't actually know
the location of the modules and can only call modprobe (modprobe looks into
/lib/modules/). We now explicitly probe the important modules; at that point
nvidia-modprobe just creates additional device nodes.
* `is_nvidia_installation_required` checks whether building and loading is needed.
Factor out the loading check so that we can reload the module after an update.
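
The device node check from the list above could look roughly like this. It
is a sketch only: `has_char_device` is a hypothetical helper name, not a
function from the setup-nvidia script.

```shell
# Return success if at least one argument is a character device.
# `-f` only matches regular files, so it is always false for /dev nodes.
# Looping also avoids handing several glob matches to a single `[` test.
has_char_device() {
    for node in "$@"; do
        if [ -c "$node" ]; then
            return 0
        fi
    done
    return 1
}

# Example call (the glob is left unquoted on purpose so the shell expands it):
#   has_char_device /dev/nvidia* && echo "device nodes present"
```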
Currently the script will reuse a developer container that was downloaded once,
without ensuring that the same version is used as the running image. This works
on the first boot, but wouldn't be correct after an OS update.
To resolve this, add a version number to the downloaded filename, and check for
the versioned dev container file. When the file is missing, we also clean up all
other dev container files via glob remove.
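
The versioned lookup and cleanup can be sketched as follows. The file name
pattern and the function name are assumptions made for illustration, and
the download step is replaced by a placeholder that creates an empty file:

```shell
# Make sure a dev container file matching the given version exists in dir.
# If it is missing, remove every other dev container file first, then
# fetch the right one. Prints the path of the versioned file.
ensure_dev_container() {
    dir=$1
    version=$2
    # Hypothetical naming scheme: the version is part of the file name.
    target="$dir/flatcar_developer_container-$version.bin"
    if [ ! -f "$target" ]; then
        # Glob remove of stale files left over from earlier OS versions.
        rm -f "$dir"/flatcar_developer_container-*.bin
        # Placeholder for the real download step.
        : > "$target"
    fi
    echo "$target"
}

# Example call, with a version string taken from the running image:
#   ensure_dev_container /opt/nvidia "$VERSION"
```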
...by providing /etc/flatcar/nvidia-metadata. Newer driver packages do not
support some older Nvidia cards. An example is the Tesla K80 cards in
Standard_NC6 VMs on Azure, which are only supported up to the 470.x driver
version. To allow users to continue using those, give them a way to override
the driver version through /etc/flatcar/nvidia-metadata. For example, this
entry could be used to pin a specific driver version:
    NVIDIA_DRIVER_VERSION=470.103.01
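
One way such an override could be consumed is sketched below. The function
name and the idea of passing the default as an argument are assumptions,
not taken from the actual script:

```shell
# Print the driver version to use: the value from the metadata file if it
# sets NVIDIA_DRIVER_VERSION, otherwise the supplied default.
resolve_driver_version() {
    metadata_file=$1
    default_version=$2
    NVIDIA_DRIVER_VERSION=
    if [ -f "$metadata_file" ]; then
        # The file uses KEY=value lines, so it can be sourced directly.
        # Use a path containing a slash so `.` does not search PATH.
        . "$metadata_file"
    fi
    echo "${NVIDIA_DRIVER_VERSION:-$default_version}"
}

# Example call (the default is whatever version the image would pick):
#   resolve_driver_version /etc/flatcar/nvidia-metadata "$DEFAULT_VERSION"
```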
There are two ways to build the nvidia-driver - either against a full kernel
source tree in /usr/src/linux, or against a slim kernel-devel equivalent in
/lib/modules/*/build. The /lib/modules/*/build is provided by
sys-kernel/coreos-modules, see `install_build_source`. The interesting thing is
that in the absence of --kernel-source-path, nvidia-installer will autodetect which
to use and already builds against /lib/modules/*/build on Flatcar right now. By
passing --kernel-name, we make that choice explicit and this allows us to skip
the emerge steps of the build.
Since this runs in the developer container, there is also no point in trying to
execute systemctl or depmod, so pass the flags to disable usage of those.
Signed-off-by: Jeremi Piotrowski <jpiotrowski@microsoft.com>
With the new mantle container image referenced by the scripts repo we
don't need the mantle copy in the SDK anymore.
Drop the mantle package and the unused kola-data package.
Found this while checking why I was still seeing lots of
    !!! Section 'gentoo' in repos.conf is missing location attribute
messages while building. Turns out that after the last sync of portage we
stopped applying patches from files/. This was caused by a local variable
definition of PATCHES that was overriding the global one.
This might be a sign to drop them, or we can refresh them, as they do fix bugs
that have been hit in CoreOS in the past. I opted to refresh them and inject
them into the local variable.
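
The shadowing problem can be reproduced in a few lines of shell. This is a
simplified illustration: the patch names are hypothetical, and plain strings
are used where a real ebuild would typically use a bash array.

```shell
# Global definition, standing in for the patches shipped in files/.
PATCHES="files/fix-a.patch files/fix-b.patch"

# Stand-in for the step that applies patches: print each entry it sees.
apply_patches() {
    for p in $PATCHES; do
        echo "applying $p"
    done
}

# A local PATCHES shadows the global one for everything called from here,
# so the files/ patches are silently skipped.
src_prepare_broken() {
    local PATCHES="upstream-1.patch"
    apply_patches
}

# Injecting the global entries into the local variable restores them.
# The right-hand side is expanded before the local variable is created,
# so $PATCHES still refers to the global value at that point.
src_prepare_fixed() {
    local PATCHES="upstream-1.patch $PATCHES"
    apply_patches
}
```

Calling `src_prepare_broken` prints only the upstream entry, while
`src_prepare_fixed` prints all three.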