docs: nvidia oss drivers

Add docs on using NVIDIA OSS drivers

Part of #6127

Signed-off-by: Noel Georgi <git@frezbo.dev>
Noel Georgi 2022-08-22 22:01:07 +05:30
parent 2f2d97b6b5
commit cfe6c2bc2d
6 changed files with 285 additions and 85 deletions


@@ -5,10 +5,12 @@ linkTitle: "Documentation"
cascade:
  type: docs
  preRelease: true
  lastRelease: v1.2.0-alpha.1
  lastRelease: v1.2.0-beta.0
  kubernetesRelease: "v1.25.0-rc.1"
  prevKubernetesRelease: "1.24.3"
  theilaRelease: "v0.2.1"
  nvidiaContainerToolkitRelease: "v1.10.0"
  nvidiaDriverRelease: "515.65.01"
---
## Welcome


@@ -0,0 +1,38 @@
---
title: "NVIDIA Fabric Manager"
description: "In this guide we'll follow the procedure to enable NVIDIA Fabric Manager."
aliases:
- ../../guides/nvidia-fabricmanager
---
NVIDIA GPUs that have NVLink support (for example, A100) also need the [nvidia-fabricmanager](https://github.com/siderolabs/extensions/pkgs/container/nvidia-fabricmanager) system extension enabled in addition to the [NVIDIA drivers]({{< relref "nvidia-gpu" >}}).
For more information on Fabric Manager, refer to the [Fabric Manager User Guide](https://docs.nvidia.com/datacenter/tesla/fabric-manager-user-guide/index.html).
The published versions of the NVIDIA Fabric Manager system extension are available [here](https://github.com/siderolabs/extensions/pkgs/container/nvidia-fabricmanager).
> The `nvidia-fabricmanager` extension version has to match the NVIDIA driver version in use.
## Upgrading Talos and enabling the NVIDIA fabricmanager system extension
In addition to the patch defined in the [NVIDIA drivers]({{< relref "nvidia-gpu" >}}#upgrading-talos-and-enabling-the-nvidia-modules-and-the-system-extension) guide, we need to add the `nvidia-fabricmanager` system extension to the patch yaml `gpu-worker-patch.yaml`:
```yaml
- op: add
  path: /machine/install/extensions
  value:
    - image: ghcr.io/siderolabs/nvidia-open-gpu-kernel-modules:{{< nvidia_driver_release >}}-{{< release >}}
    - image: ghcr.io/siderolabs/nvidia-container-toolkit:{{< nvidia_driver_release >}}-{{< nvidia_container_toolkit_release >}}
    - image: ghcr.io/siderolabs/nvidia-fabricmanager:{{< nvidia_driver_release >}}
- op: add
  path: /machine/kernel
  value:
    modules:
      - name: nvidia
      - name: nvidia_uvm
      - name: nvidia_drm
      - name: nvidia_modeset
- op: add
  path: /machine/sysctls
  value:
    net.core.bpf_jit_harden: 1
```
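Having updated `gpu-worker-patch.yaml`, the apply and upgrade steps are the same as in the [NVIDIA drivers]({{< relref "nvidia-gpu" >}}) guide; a minimal sketch of those commands, using the stock installer image as in that guide:
```bash
# Apply the machine config patch to the GPU worker nodes.
talosctl patch mc --patch @gpu-worker-patch.yaml
# Upgrade Talos to the same version so the system extensions get installed.
talosctl upgrade --image=ghcr.io/siderolabs/installer:{{< release >}}
```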


@@ -0,0 +1,220 @@
---
title: "NVIDIA GPU (Proprietary drivers)"
description: "In this guide we'll follow the procedure to support NVIDIA GPU using proprietary drivers on Talos."
aliases:
- ../../guides/nvidia-gpu-proprietary
---
> Enabling NVIDIA GPU support on Talos is bound by [NVIDIA EULA](https://www.nvidia.com/en-us/drivers/nvidia-license/).
> Talos GPU support has been promoted to **beta**.
These are the steps to enable NVIDIA support in Talos:
- Talos pre-installed on a node with an NVIDIA GPU installed.
- Building a custom Talos installer image with the NVIDIA modules.
- Upgrading Talos with the custom installer and enabling the NVIDIA modules and the system extension.
This requires that the user build and maintain their own Talos installer image.
## Prerequisites
This guide assumes the user has access to a container registry with `push` permissions, Docker installed on the build machine, and that the Talos host has `pull` access to the container registry.
Set the local registry and username environment variables:
```bash
export USERNAME=<username>
export REGISTRY=<registry>
```
For example:
```bash
export USERNAME=talos-user
export REGISTRY=ghcr.io
```
> The examples below will use the sample variables set above.
Modify accordingly for your environment.
## Building the installer image
Start by cloning the [pkgs](https://github.com/siderolabs/pkgs) repository.
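For example (a plain clone of the default branch; you may want to check out the branch or tag matching your Talos release):
```bash
# Clone the Sidero Labs pkgs repository and switch into it.
git clone https://github.com/siderolabs/pkgs.git
cd pkgs
```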
Now run the following command to build and push the custom Talos kernel image and an NVIDIA image with the NVIDIA kernel modules signed by the kernel built alongside it.
```bash
make kernel nonfree-kmod-nvidia PLATFORM=linux/amd64 PUSH=true
```
> Replace the platform with `linux/arm64` if building for ARM64
Now we need to create a custom Talos installer image.
Start by creating a `Dockerfile` with the following content:
```Dockerfile
FROM scratch as customization
COPY --from=ghcr.io/talos-user/nonfree-kmod-nvidia:{{< release >}}-nvidia /lib/modules /lib/modules
FROM ghcr.io/siderolabs/installer:{{< release >}}
COPY --from=ghcr.io/talos-user/kernel:{{< release >}}-nvidia /boot/vmlinuz /usr/install/${TARGETARCH}/vmlinuz
```
Now build the image and push it to the registry.
```bash
DOCKER_BUILDKIT=0 docker build --squash --build-arg RM="/lib/modules" -t ghcr.io/talos-user/installer:{{< release >}}-nvidia .
docker push ghcr.io/talos-user/installer:{{< release >}}-nvidia
```
> Note: BuildKit has a bug ([#816](https://github.com/moby/buildkit/issues/816)); to disable it, use `DOCKER_BUILDKIT=0`.
> Replace the platform with `linux/arm64` if building for ARM64
## Upgrading Talos and enabling the NVIDIA modules and the system extension
> Make sure to use `talosctl` version {{< release >}} or later
First create a patch yaml `gpu-worker-patch.yaml` to update the machine config similar to below:
```yaml
- op: add
  path: /machine/install/extensions
  value:
    - image: ghcr.io/siderolabs/nvidia-container-toolkit:{{< nvidia_driver_release >}}-{{< nvidia_container_toolkit_release >}}
- op: add
  path: /machine/kernel
  value:
    modules:
      - name: nvidia
      - name: nvidia_uvm
      - name: nvidia_drm
      - name: nvidia_modeset
- op: add
  path: /machine/sysctls
  value:
    net.core.bpf_jit_harden: 1
```
Now apply the patch to all Talos nodes in the cluster that have NVIDIA GPUs installed:
```bash
talosctl patch mc --patch @gpu-worker-patch.yaml
```
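If the current `talosctl` context does not already point at the GPU nodes, the global `--nodes` flag can be used to target them explicitly; the address below is just a placeholder:
```bash
# Hypothetical node address; replace with the IP of a GPU worker node.
talosctl patch mc --nodes 172.31.41.27 --patch @gpu-worker-patch.yaml
```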
Now we can proceed to upgrading Talos with the installer built previously:
```bash
talosctl upgrade --image=ghcr.io/talos-user/installer:{{< release >}}-nvidia
```
Once the node reboots, the NVIDIA modules should be loaded and the system extension should be installed.
This can be confirmed by running:
```bash
talosctl read /proc/modules
```
which should produce an output similar to below:
```text
nvidia_uvm 1146880 - - Live 0xffffffffc2733000 (PO)
nvidia_drm 69632 - - Live 0xffffffffc2721000 (PO)
nvidia_modeset 1142784 - - Live 0xffffffffc25ea000 (PO)
nvidia 39047168 - - Live 0xffffffffc00ac000 (PO)
```
```bash
talosctl get extensions
```
which should produce an output similar to below:
```text
NODE           NAMESPACE   TYPE              ID                                                              VERSION   NAME                       VERSION
172.31.41.27   runtime     ExtensionStatus   000.ghcr.io-frezbo-nvidia-container-toolkit-510.60.02-v1.9.0   1         nvidia-container-toolkit   510.60.02-v1.9.0
```
```bash
talosctl read /proc/driver/nvidia/version
```
which should produce an output similar to below:
```text
NVRM version: NVIDIA UNIX x86_64 Kernel Module 510.60.02 Wed Mar 16 11:24:05 UTC 2022
GCC version: gcc version 11.2.0 (GCC)
```
## Deploying NVIDIA device plugin
First, we need to create the `RuntimeClass`.
Apply the following manifest to create a runtime class that uses the extension:
```yaml
---
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: nvidia
handler: nvidia
```
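One way to apply it inline, mirroring the heredoc pattern used for the test pod later in this guide:
```bash
cat <<EOF | kubectl apply -f -
---
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: nvidia
handler: nvidia
EOF
```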
Install the NVIDIA device plugin:
```bash
helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
helm repo update
helm install nvidia-device-plugin nvdp/nvidia-device-plugin --version=0.11.0 --set=runtimeClassName=nvidia
```
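To confirm the device plugin has advertised the GPU to the kubelet, one option is to check the node's capacity and allocatable resources (the node name below is a placeholder):
```bash
# Replace <node-name> with the name of a GPU worker node.
kubectl describe node <node-name> | grep -i 'nvidia.com/gpu'
```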
Apply the following manifest to run a CUDA test pod via the `nvidia` runtime:
```bash
cat <<EOF | kubectl apply -f -
---
apiVersion: v1
kind: Pod
metadata:
  name: gpu-operator-test
spec:
  restartPolicy: OnFailure
  runtimeClassName: nvidia
  containers:
    - name: cuda-vector-add
      image: "nvidia/samples:vectoradd-cuda11.6.0"
      resources:
        limits:
          nvidia.com/gpu: 1
EOF
```
The status can be viewed by running:
```bash
kubectl get pods
```
which should produce an output similar to below:
```text
NAME                READY   STATUS      RESTARTS   AGE
gpu-operator-test   0/1     Completed   0          13s
```
```bash
kubectl logs gpu-operator-test
```
which should produce an output similar to below:
```text
[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done
```


@@ -1,87 +1,19 @@
---
title: "NVIDIA GPU"
description: "In this guide we'll follow the procedure to support NVIDIA GPU on Talos."
title: "NVIDIA GPU (OSS drivers)"
description: "In this guide we'll follow the procedure to support NVIDIA GPU using OSS drivers on Talos."
aliases:
- ../../guides/nvidia-gpu
---
> Enabling NVIDIA GPU support on Talos is bound by [NVIDIA EULA](https://www.nvidia.com/en-us/drivers/nvidia-license/)
> Talos GPU support is an **alpha** feature.
> Enabling NVIDIA GPU support on Talos is bound by [NVIDIA EULA](https://www.nvidia.com/en-us/drivers/nvidia-license/).
> Talos GPU support has been promoted to **beta**.
> The Talos-published NVIDIA OSS drivers are bound to a specific Talos release.
> The extension versions also need to be updated when upgrading Talos.
These are the steps to enabling NVIDIA support in Talos.
The published versions of the NVIDIA system extensions can be found here:
- Talos pre-installed on a node with NVIDIA GPU installed.
- Building a custom Talos installer image with NVIDIA modules
- Building NVIDIA container toolkit system extension which allows to register a custom runtime with containerd
- Upgrading Talos with the custom installer and enabling NVIDIA modules and the system extension
Both these components require that the user build and maintain their own Talos installer image and the NVIDIA container toolkit [Talos System Extension]({{< relref "system-extensions" >}}).
## Prerequisites
This guide assumes the user has access to a container registry with `push` permissions, Docker installed on the build machine, and that the Talos host has `pull` access to the container registry.
Set the local registry and username environment variables:
```bash
export USERNAME=<username>
export REGISTRY=<registry>
```
For example:
```bash
export USERNAME=talos-user
export REGISTRY=ghcr.io
```
> The examples below will use the sample variables set above.
Modify accordingly for your environment.
## Building the installer image
Start by cloning the [pkgs](https://github.com/siderolabs/pkgs) repository.
Now run the following command to build and push the custom Talos kernel image and an NVIDIA image with the NVIDIA kernel modules signed by the kernel built alongside it.
```bash
make kernel nonfree-kmod-nvidia PLATFORM=linux/amd64 PUSH=true
```
> Replace the platform with `linux/arm64` if building for ARM64
Now we need to create a custom Talos installer image.
Start by creating a `Dockerfile` with the following content:
```Dockerfile
FROM scratch as customization
COPY --from=ghcr.io/talos-user/nonfree-kmod-nvidia:{{< release >}}-nvidia /lib/modules /lib/modules
FROM ghcr.io/siderolabs/installer:{{< release >}}
COPY --from=ghcr.io/talos-user/kernel:{{< release >}}-nvidia /boot/vmlinuz /usr/install/${TARGETARCH}/vmlinuz
```
Now build the image and push it to the registry.
```bash
DOCKER_BUILDKIT=0 docker build --squash --build-arg RM="/lib/modules" -t ghcr.io/talos-user/installer:{{< release >}}-nvidia .
docker push ghcr.io/talos-user/installer:{{< release >}}-nvidia
```
> Note: BuildKit has a bug ([#816](https://github.com/moby/buildkit/issues/816)); to disable it, use `DOCKER_BUILDKIT=0`.
## Building the system extension
Start by cloning the [extensions](https://github.com/siderolabs/extensions) repository.
Now run the following command to build and push the system extension.
```bash
make nvidia-container-toolkit PLATFORM=linux/amd64 PUSH=true TAG=510.60.02-v1.9.0
```
> Replace the platform with `linux/arm64` if building for ARM64
- [nvidia-open-gpu-kernel-modules](https://github.com/siderolabs/extensions/pkgs/container/nvidia-open-gpu-kernel-modules)
- [nvidia-container-toolkit](https://github.com/siderolabs/extensions/pkgs/container/nvidia-container-toolkit)
## Upgrading Talos and enabling the NVIDIA modules and the system extension
@@ -93,7 +25,8 @@ First create a patch yaml `gpu-worker-patch.yaml` to update the machine config s
- op: add
  path: /machine/install/extensions
  value:
    - image: ghcr.io/talos-user/nvidia-container-toolkit:510.60.02-v1.9.0
    - image: ghcr.io/siderolabs/nvidia-open-gpu-kernel-modules:{{< nvidia_driver_release >}}-{{< release >}}
    - image: ghcr.io/siderolabs/nvidia-container-toolkit:{{< nvidia_driver_release >}}-{{< nvidia_container_toolkit_release >}}
- op: add
  path: /machine/kernel
  value:
@@ -108,16 +41,20 @@ First create a patch yaml `gpu-worker-patch.yaml` to update the machine config s
    net.core.bpf_jit_harden: 1
```
> Update the driver version and Talos release in the above patch yaml from the published versions if there is a newer one available.
> Make sure the driver version matches for both the `nvidia-open-gpu-kernel-modules` and `nvidia-container-toolkit` extensions.
> The `nvidia-open-gpu-kernel-modules` extension is versioned as `<nvidia-driver-version>-<talos-release-version>` and the `nvidia-container-toolkit` extension is versioned as `<nvidia-driver-version>-<nvidia-container-toolkit-version>`.
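For example, with the driver, Talos, and container toolkit versions shown in the extension status output later in this guide, the image references expand to:
```yaml
- image: ghcr.io/siderolabs/nvidia-open-gpu-kernel-modules:515.65.01-v1.2.0
- image: ghcr.io/siderolabs/nvidia-container-toolkit:515.65.01-v1.10.0
```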
Now apply the patch to all Talos nodes in the cluster that have NVIDIA GPUs installed:
```bash
talosctl patch mc --patch @gpu-worker-patch.yaml
```
Now we can proceed to upgrading Talos with the installer built previously:
Now we can proceed to upgrading Talos to the same version to enable the system extension:
```bash
talosctl upgrade --image=ghcr.io/talos-user/installer:{{< release >}}-nvidia
talosctl upgrade --image=ghcr.io/siderolabs/installer:{{< release >}}
```
Once the node reboots, the NVIDIA modules should be loaded and the system extension should be installed.
@@ -144,8 +81,9 @@ talosctl get extensions
which should produce an output similar to below:
```text
NODE           NAMESPACE   TYPE              ID                                                                        VERSION   NAME                             VERSION
172.31.41.27   runtime     ExtensionStatus   000.ghcr.io-frezbo-nvidia-container-toolkit-510.60.02-v1.9.0             1         nvidia-container-toolkit         510.60.02-v1.9.0
NODE           NAMESPACE   TYPE              ID                                                                        VERSION   NAME                             VERSION
172.31.41.27   runtime     ExtensionStatus   000.ghcr.io-siderolabs-nvidia-container-toolkit-515.65.01-v1.10.0        1         nvidia-container-toolkit         515.65.01-v1.10.0
172.31.41.27   runtime     ExtensionStatus   000.ghcr.io-siderolabs-nvidia-open-gpu-kernel-modules-515.65.01-v1.2.0   1         nvidia-open-gpu-kernel-modules   515.65.01-v1.2.0
```
```bash
@@ -155,8 +93,8 @@ talosctl read /proc/driver/nvidia/version
which should produce an output similar to below:
```text
NVRM version: NVIDIA UNIX x86_64 Kernel Module 510.60.02 Wed Mar 16 11:24:05 UTC 2022
GCC version: gcc version 11.2.0 (GCC)
NVRM version: NVIDIA UNIX x86_64 Kernel Module 515.65.01 Wed Mar 16 11:24:05 UTC 2022
GCC version: gcc version 12.2.0 (GCC)
```
## Deploying NVIDIA device plugin


@@ -0,0 +1 @@
{{ .Page.FirstSection.Params.nvidiaContainerToolkitRelease -}}


@@ -0,0 +1 @@
{{ .Page.FirstSection.Params.nvidiaDriverRelease -}}
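Assuming these templates back the `nvidia_driver_release` and `nvidia_container_toolkit_release` shortcodes used in the guides above, they resolve to the values set in the documentation front matter, for example:
```text
{{< nvidia_driver_release >}}            -> 515.65.01
{{< nvidia_container_toolkit_release >}} -> v1.10.0
```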