docs: nvidia oss drivers

Add docs on using NVIDIA OSS drivers

Part of #6127

Signed-off-by: Noel Georgi <git@frezbo.dev>
Noel Georgi 2022-08-22 22:01:07 +05:30
parent 2f2d97b6b5
commit cfe6c2bc2d
6 changed files with 285 additions and 85 deletions


@@ -5,10 +5,12 @@ linkTitle: "Documentation"
cascade:
  type: docs
  preRelease: true
  lastRelease: v1.2.0-alpha.1
  lastRelease: v1.2.0-beta.0
  kubernetesRelease: "v1.25.0-rc.1"
  prevKubernetesRelease: "1.24.3"
  theilaRelease: "v0.2.1"
  nvidiaContainerToolkitRelease: "v1.10.0"
  nvidiaDriverRelease: "515.65.01"
---
## Welcome


@@ -0,0 +1,38 @@
---
title: "NVIDIA Fabric Manager"
description: "In this guide we'll follow the procedure to enable NVIDIA Fabric Manager."
aliases:
- ../../guides/nvidia-fabricmanager
---
NVIDIA GPUs that have NVLink support (for example, A100) also need the [nvidia-fabricmanager](https://github.com/siderolabs/extensions/pkgs/container/nvidia-fabricmanager) system extension enabled in addition to the [NVIDIA drivers]({{< relref "nvidia-gpu" >}}).
For more information on Fabric Manager, refer to the [Fabric Manager User Guide](https://docs.nvidia.com/datacenter/tesla/fabric-manager-user-guide/index.html).
The published versions of the NVIDIA Fabric Manager system extension are available [here](https://github.com/siderolabs/extensions/pkgs/container/nvidia-fabricmanager).
> The `nvidia-fabricmanager` extension version has to match the NVIDIA driver version in use.
## Upgrading Talos and enabling the NVIDIA fabricmanager system extension
In addition to the patch defined in the [NVIDIA drivers]({{< relref "nvidia-gpu" >}}#upgrading-talos-and-enabling-the-nvidia-modules-and-the-system-extension) guide, we need to add the `nvidia-fabricmanager` system extension to the patch yaml `gpu-worker-patch.yaml`:
```yaml
- op: add
  path: /machine/install/extensions
  value:
    - image: ghcr.io/siderolabs/nvidia-open-gpu-kernel-modules:{{< nvidia_driver_release >}}-{{< release >}}
    - image: ghcr.io/siderolabs/nvidia-container-toolkit:{{< nvidia_driver_release >}}-{{< nvidia_container_toolkit_release >}}
    - image: ghcr.io/siderolabs/nvidia-fabricmanager:{{< nvidia_driver_release >}}
- op: add
  path: /machine/kernel
  value:
    modules:
      - name: nvidia
      - name: nvidia_uvm
      - name: nvidia_drm
      - name: nvidia_modeset
- op: add
  path: /machine/sysctls
  value:
    net.core.bpf_jit_harden: 1
```
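Having updated `gpu-worker-patch.yaml`, the apply and upgrade steps are the same as in the [NVIDIA drivers]({{< relref "nvidia-gpu" >}}) guide; a minimal sketch of those commands, using the stock installer image as in that guide:
```bash
# Apply the machine config patch to the GPU worker nodes.
talosctl patch mc --patch @gpu-worker-patch.yaml
# Upgrade Talos to the same version so the system extensions get installed.
talosctl upgrade --image=ghcr.io/siderolabs/installer:{{< release >}}
```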


@@ -0,0 +1,220 @@
---
title: "NVIDIA GPU (Proprietary drivers)"
description: "In this guide we'll follow the procedure to support NVIDIA GPU using proprietary drivers on Talos."
aliases:
- ../../guides/nvidia-gpu-proprietary
---
> Enabling NVIDIA GPU support on Talos is bound by [NVIDIA EULA](https://www.nvidia.com/en-us/drivers/nvidia-license/).
> Talos GPU support has been promoted to **beta**.
These are the steps to enable NVIDIA support in Talos:
- Talos pre-installed on a node with an NVIDIA GPU installed.
- Building a custom Talos installer image with the NVIDIA modules.
- Upgrading Talos with the custom installer and enabling the NVIDIA modules and the system extension.
This requires that the user build and maintain their own Talos installer image.
## Prerequisites
This guide assumes the user has access to a container registry with `push` permissions, Docker installed on the build machine, and that the Talos host has `pull` access to the container registry.
Set the local registry and username environment variables:
```bash
export USERNAME=<username>
export REGISTRY=<registry>
```
For example:
```bash
export USERNAME=talos-user
export REGISTRY=ghcr.io
```
> The examples below will use the sample variables set above.
Modify accordingly for your environment.
## Building the installer image
Start by cloning the [pkgs](https://github.com/siderolabs/pkgs) repository.
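For example (a plain clone of the default branch; you may want to check out the branch or tag matching your Talos release):
```bash
# Clone the Sidero Labs pkgs repository and switch into it.
git clone https://github.com/siderolabs/pkgs.git
cd pkgs
```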
Now run the following command to build and push the custom Talos kernel image and an NVIDIA image with the NVIDIA kernel modules signed by the kernel built alongside it.
```bash
make kernel nonfree-kmod-nvidia PLATFORM=linux/amd64 PUSH=true
```
> Replace the platform with `linux/arm64` if building for ARM64
Now we need to create a custom Talos installer image.
Start by creating a `Dockerfile` with the following content:
```Dockerfile
FROM scratch as customization
COPY --from=ghcr.io/talos-user/nonfree-kmod-nvidia:{{< release >}}-nvidia /lib/modules /lib/modules
FROM ghcr.io/siderolabs/installer:{{< release >}}
COPY --from=ghcr.io/talos-user/kernel:{{< release >}}-nvidia /boot/vmlinuz /usr/install/${TARGETARCH}/vmlinuz
```
Now build the image and push it to the registry.
```bash
DOCKER_BUILDKIT=0 docker build --squash --build-arg RM="/lib/modules" -t ghcr.io/talos-user/installer:{{< release >}}-nvidia .
docker push ghcr.io/talos-user/installer:{{< release >}}-nvidia
```
> Note: BuildKit has a bug ([#816](https://github.com/moby/buildkit/issues/816)); to disable it, use `DOCKER_BUILDKIT=0`.
> Replace the platform with `linux/arm64` if building for ARM64
## Upgrading Talos and enabling the NVIDIA modules and the system extension
> Make sure to use `talosctl` version {{< release >}} or later
First create a patch yaml `gpu-worker-patch.yaml` to update the machine config similar to below:
```yaml
- op: add
  path: /machine/install/extensions
  value:
    - image: ghcr.io/siderolabs/nvidia-container-toolkit:{{< nvidia_driver_release >}}-{{< nvidia_container_toolkit_release >}}
- op: add
  path: /machine/kernel
  value:
    modules:
      - name: nvidia
      - name: nvidia_uvm
      - name: nvidia_drm
      - name: nvidia_modeset
- op: add
  path: /machine/sysctls
  value:
    net.core.bpf_jit_harden: 1
```
Now apply the patch to all Talos nodes in the cluster that have NVIDIA GPUs installed:
```bash
talosctl patch mc --patch @gpu-worker-patch.yaml
```
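If the current `talosctl` context does not already point at the GPU nodes, the global `--nodes` flag can be used to target them explicitly; the address below is just a placeholder:
```bash
# Hypothetical node address; replace with the IP of a GPU worker node.
talosctl patch mc --nodes 172.31.41.27 --patch @gpu-worker-patch.yaml
```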
Now we can proceed to upgrading Talos with the installer built previously:
```bash
talosctl upgrade --image=ghcr.io/talos-user/installer:{{< release >}}-nvidia
```
Once the node reboots, the NVIDIA modules should be loaded and the system extension should be installed.
This can be confirmed by running:
```bash
talosctl read /proc/modules
```
which should produce an output similar to below:
```text
nvidia_uvm 1146880 - - Live 0xffffffffc2733000 (PO)
nvidia_drm 69632 - - Live 0xffffffffc2721000 (PO)
nvidia_modeset 1142784 - - Live 0xffffffffc25ea000 (PO)
nvidia 39047168 - - Live 0xffffffffc00ac000 (PO)
```
```bash
talosctl get extensions
```
which should produce an output similar to below:
```text
NODE           NAMESPACE   TYPE              ID                                                              VERSION   NAME                       VERSION
172.31.41.27   runtime     ExtensionStatus   000.ghcr.io-frezbo-nvidia-container-toolkit-510.60.02-v1.9.0   1         nvidia-container-toolkit   510.60.02-v1.9.0
```
```bash
talosctl read /proc/driver/nvidia/version
```
which should produce an output similar to below:
```text
NVRM version: NVIDIA UNIX x86_64 Kernel Module 510.60.02 Wed Mar 16 11:24:05 UTC 2022
GCC version: gcc version 11.2.0 (GCC)
```
## Deploying NVIDIA device plugin
First, we need to create the `RuntimeClass`.
Apply the following manifest to create a runtime class that uses the extension:
```yaml
---
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: nvidia
handler: nvidia
```
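One way to apply it inline, mirroring the heredoc pattern used for the test pod later in this guide:
```bash
cat <<EOF | kubectl apply -f -
---
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: nvidia
handler: nvidia
EOF
```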
Install the NVIDIA device plugin:
```bash
helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
helm repo update
helm install nvidia-device-plugin nvdp/nvidia-device-plugin --version=0.11.0 --set=runtimeClassName=nvidia
```
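To confirm the device plugin has advertised the GPU to the kubelet, one option is to check the node's capacity and allocatable resources (the node name below is a placeholder):
```bash
# Replace <node-name> with the name of a GPU worker node.
kubectl describe node <node-name> | grep -i 'nvidia.com/gpu'
```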
Apply the following manifest to run a CUDA test pod via the `nvidia` runtime:
```bash
cat <<EOF | kubectl apply -f -
---
apiVersion: v1
kind: Pod
metadata:
  name: gpu-operator-test
spec:
  restartPolicy: OnFailure
  runtimeClassName: nvidia
  containers:
    - name: cuda-vector-add
      image: "nvidia/samples:vectoradd-cuda11.6.0"
      resources:
        limits:
          nvidia.com/gpu: 1
EOF
```
The status can be viewed by running:
```bash
kubectl get pods
```
which should produce an output similar to below:
```text
NAME                READY   STATUS      RESTARTS   AGE
gpu-operator-test   0/1     Completed   0          13s
```
```bash
kubectl logs gpu-operator-test
```
which should produce an output similar to below:
```text
[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done
```


@@ -1,87 +1,19 @@
---
title: "NVIDIA GPU"
description: "In this guide we'll follow the procedure to support NVIDIA GPU on Talos."
title: "NVIDIA GPU (OSS drivers)"
description: "In this guide we'll follow the procedure to support NVIDIA GPU using OSS drivers on Talos."
aliases:
- ../../guides/nvidia-gpu
---
> Enabling NVIDIA GPU support on Talos is bound by [NVIDIA EULA](https://www.nvidia.com/en-us/drivers/nvidia-license/)
> Talos GPU support is an **alpha** feature.
> Enabling NVIDIA GPU support on Talos is bound by [NVIDIA EULA](https://www.nvidia.com/en-us/drivers/nvidia-license/).
> Talos GPU support has been promoted to **beta**.
> The Talos-published NVIDIA OSS drivers are bound to a specific Talos release.
> The extension versions also need to be updated when upgrading Talos.
These are the steps to enabling NVIDIA support in Talos.
The published versions of the NVIDIA system extensions can be found here:
- Talos pre-installed on a node with NVIDIA GPU installed.
- Building a custom Talos installer image with NVIDIA modules
- Building NVIDIA container toolkit system extension which allows to register a custom runtime with containerd
- Upgrading Talos with the custom installer and enabling NVIDIA modules and the system extension
Both these components require that the user build and maintain their own Talos installer image and the NVIDIA container toolkit [Talos System Extension]({{< relref "system-extensions" >}}).
## Prerequisites
This guide assumes the user has access to a container registry with `push` permissions, Docker installed on the build machine, and that the Talos host has `pull` access to the container registry.
Set the local registry and username environment variables:
```bash
export USERNAME=<username>
export REGISTRY=<registry>
```
For example:
```bash
export USERNAME=talos-user
export REGISTRY=ghcr.io
```
> The examples below will use the sample variables set above.
Modify accordingly for your environment.
## Building the installer image
Start by cloning the [pkgs](https://github.com/siderolabs/pkgs) repository.
Now run the following command to build and push the custom Talos kernel image and an NVIDIA image with the NVIDIA kernel modules signed by the kernel built alongside it.
```bash
make kernel nonfree-kmod-nvidia PLATFORM=linux/amd64 PUSH=true
```
> Replace the platform with `linux/arm64` if building for ARM64
Now we need to create a custom Talos installer image.
Start by creating a `Dockerfile` with the following content:
```Dockerfile
FROM scratch as customization
COPY --from=ghcr.io/talos-user/nonfree-kmod-nvidia:{{< release >}}-nvidia /lib/modules /lib/modules
FROM ghcr.io/siderolabs/installer:{{< release >}}
COPY --from=ghcr.io/talos-user/kernel:{{< release >}}-nvidia /boot/vmlinuz /usr/install/${TARGETARCH}/vmlinuz
```
Now build the image and push it to the registry.
```bash
DOCKER_BUILDKIT=0 docker build --squash --build-arg RM="/lib/modules" -t ghcr.io/talos-user/installer:{{< release >}}-nvidia .
docker push ghcr.io/talos-user/installer:{{< release >}}-nvidia
```
> Note: BuildKit has a bug ([#816](https://github.com/moby/buildkit/issues/816)); to disable it, use `DOCKER_BUILDKIT=0`.
## Building the system extension
Start by cloning the [extensions](https://github.com/siderolabs/extensions) repository.
Now run the following command to build and push the system extension.
```bash
make nvidia-container-toolkit PLATFORM=linux/amd64 PUSH=true TAG=510.60.02-v1.9.0
```
> Replace the platform with `linux/arm64` if building for ARM64
- [nvidia-open-gpu-kernel-modules](https://github.com/siderolabs/extensions/pkgs/container/nvidia-open-gpu-kernel-modules)
- [nvidia-container-toolkit](https://github.com/siderolabs/extensions/pkgs/container/nvidia-container-toolkit)
## Upgrading Talos and enabling the NVIDIA modules and the system extension
@@ -93,7 +25,8 @@ First create a patch yaml `gpu-worker-patch.yaml` to update the machine config s
- op: add
  path: /machine/install/extensions
  value:
    - image: ghcr.io/talos-user/nvidia-container-toolkit:510.60.02-v1.9.0
    - image: ghcr.io/siderolabs/nvidia-open-gpu-kernel-modules:{{< nvidia_driver_release >}}-{{< release >}}
    - image: ghcr.io/siderolabs/nvidia-container-toolkit:{{< nvidia_driver_release >}}-{{< nvidia_container_toolkit_release >}}
- op: add
  path: /machine/kernel
  value:
@@ -108,16 +41,20 @@ First create a patch yaml `gpu-worker-patch.yaml` to update the machine config s
    net.core.bpf_jit_harden: 1
```
> Update the driver version and Talos release in the above patch yaml from the published versions if there is a newer one available.
> Make sure the driver version matches for both the `nvidia-open-gpu-kernel-modules` and `nvidia-container-toolkit` extensions.
> The `nvidia-open-gpu-kernel-modules` extension is versioned as `<nvidia-driver-version>-<talos-release-version>` and the `nvidia-container-toolkit` extension is versioned as `<nvidia-driver-version>-<nvidia-container-toolkit-version>`.
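For example, with the driver, Talos, and container toolkit versions shown in the extension status output later in this guide, the image references expand to:
```yaml
- image: ghcr.io/siderolabs/nvidia-open-gpu-kernel-modules:515.65.01-v1.2.0
- image: ghcr.io/siderolabs/nvidia-container-toolkit:515.65.01-v1.10.0
```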
Now apply the patch to all Talos nodes in the cluster that have NVIDIA GPUs installed:
```bash
talosctl patch mc --patch @gpu-worker-patch.yaml
```
Now we can proceed to upgrading Talos with the installer built previously:
Now we can proceed to upgrading Talos to the same version to enable the system extension:
```bash
talosctl upgrade --image=ghcr.io/talos-user/installer:{{< release >}}-nvidia
talosctl upgrade --image=ghcr.io/siderolabs/installer:{{< release >}}
```
Once the node reboots, the NVIDIA modules should be loaded and the system extension should be installed.
@@ -144,8 +81,9 @@ talosctl get extensions
which should produce an output similar to below:
```text
NODE           NAMESPACE   TYPE              ID                                                                        VERSION   NAME                             VERSION
172.31.41.27   runtime     ExtensionStatus   000.ghcr.io-frezbo-nvidia-container-toolkit-510.60.02-v1.9.0             1         nvidia-container-toolkit         510.60.02-v1.9.0
NODE           NAMESPACE   TYPE              ID                                                                        VERSION   NAME                             VERSION
172.31.41.27   runtime     ExtensionStatus   000.ghcr.io-siderolabs-nvidia-container-toolkit-515.65.01-v1.10.0        1         nvidia-container-toolkit         515.65.01-v1.10.0
172.31.41.27   runtime     ExtensionStatus   000.ghcr.io-siderolabs-nvidia-open-gpu-kernel-modules-515.65.01-v1.2.0   1         nvidia-open-gpu-kernel-modules   515.65.01-v1.2.0
```
```bash
@@ -155,8 +93,8 @@ talosctl read /proc/driver/nvidia/version
which should produce an output similar to below:
```text
NVRM version: NVIDIA UNIX x86_64 Kernel Module 510.60.02 Wed Mar 16 11:24:05 UTC 2022
GCC version: gcc version 11.2.0 (GCC)
NVRM version: NVIDIA UNIX x86_64 Kernel Module 515.65.01 Wed Mar 16 11:24:05 UTC 2022
GCC version: gcc version 12.2.0 (GCC)
```
## Deploying NVIDIA device plugin


@@ -0,0 +1 @@
{{ .Page.FirstSection.Params.nvidiaContainerToolkitRelease -}}


@@ -0,0 +1 @@
{{ .Page.FirstSection.Params.nvidiaDriverRelease -}}
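Assuming these templates back the `nvidia_driver_release` and `nvidia_container_toolkit_release` shortcodes used in the guides above, they resolve to the values set in the documentation front matter, for example:
```text
{{< nvidia_driver_release >}}            -> 515.65.01
{{< nvidia_container_toolkit_release >}} -> v1.10.0
```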