mirror of
https://github.com/siderolabs/talos.git
synced 2025-10-01 10:41:30 +02:00
docs: nvidia oss drivers
Add docs on using NVIDIA OSS drivers

Part of #6127

Signed-off-by: Noel Georgi <git@frezbo.dev>
parent
2f2d97b6b5
commit
cfe6c2bc2d
@ -5,10 +5,12 @@ linkTitle: "Documentation"
|
||||
cascade:
|
||||
type: docs
|
||||
preRelease: true
|
||||
lastRelease: v1.2.0-alpha.1
|
||||
lastRelease: v1.2.0-beta.0
|
||||
kubernetesRelease: "v1.25.0-rc.1"
|
||||
prevKubernetesRelease: "1.24.3"
|
||||
theilaRelease: "v0.2.1"
|
||||
nvidiaContainerToolkitRelease: "v1.10.0"
|
||||
nvidiaDriverRelease: "515.65.01"
|
||||
---
|
||||
|
||||
## Welcome
|
||||
|
@ -0,0 +1,38 @@
|
||||
---
|
||||
title: "NVIDIA Fabric Manager"
|
||||
description: "In this guide we'll follow the procedure to enable NVIDIA Fabric Manager."
|
||||
aliases:
|
||||
- ../../guides/nvidia-fabricmanager
|
||||
---
|
||||
|
||||
NVIDIA GPUs with NVLink support (for example, the A100) also need the [nvidia-fabricmanager](https://github.com/siderolabs/extensions/pkgs/container/nvidia-fabricmanager) system extension enabled, in addition to the [NVIDIA drivers]({{< relref "nvidia-gpu" >}}).
|
||||
For more information on Fabric Manager, refer to https://docs.nvidia.com/datacenter/tesla/fabric-manager-user-guide/index.html
|
||||
|
||||
The published versions of the NVIDIA fabricmanager system extension are available [here](https://github.com/siderolabs/extensions/pkgs/container/nvidia-fabricmanager).
|
||||
|
||||
> The `nvidia-fabricmanager` extension version must match the NVIDIA driver version in use.
|
||||
|
||||
## Upgrading Talos and enabling the NVIDIA fabricmanager system extension
|
||||
|
||||
In addition to the patch defined in the [NVIDIA drivers]({{< relref "nvidia-gpu" >}}#upgrading-talos-and-enabling-the-nvidia-modules-and-the-system-extension) guide, we need to add the `nvidia-fabricmanager` system extension to the patch yaml `gpu-worker-patch.yaml`:
|
||||
|
||||
```yaml
|
||||
- op: add
|
||||
path: /machine/install/extensions
|
||||
value:
|
||||
- image: ghcr.io/siderolabs/nvidia-open-gpu-kernel-modules:{{< nvidia_driver_release >}}-{{< release >}}
|
||||
- image: ghcr.io/siderolabs/nvidia-container-toolkit:{{< nvidia_driver_release >}}-{{< nvidia_container_toolkit_release >}}
|
||||
- image: ghcr.io/siderolabs/nvidia-fabricmanager:{{< nvidia_driver_release >}}
|
||||
- op: add
|
||||
path: /machine/kernel
|
||||
value:
|
||||
modules:
|
||||
- name: nvidia
|
||||
- name: nvidia_uvm
|
||||
- name: nvidia_drm
|
||||
- name: nvidia_modeset
|
||||
- op: add
|
||||
path: /machine/sysctls
|
||||
value:
|
||||
net.core.bpf_jit_harden: 1
|
||||
```
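Because the Fabric Manager version has to match the driver version, it can help to check the image tags before applying the patch. The following is a minimal sketch, not part of Talos: the helper names `driver_version_of` and `check_driver_versions` are hypothetical, and the tags (`515.65.01`, `v1.2.0`, `v1.10.0`) are illustrative values standing in for the rendered shortcodes.

```bash
#!/usr/bin/env bash
# Sketch: verify that every NVIDIA extension image tag starts with the
# same driver version. Helper names and image tags are illustrative.

# driver_version_of prints the part of the image tag before the first "-".
driver_version_of() {
  local tag="${1##*:}"        # keep only the text after the last ":"
  printf '%s\n' "${tag%%-*}"  # drop everything from the first "-" onwards
}

# check_driver_versions returns 0 when all given images share one driver version.
check_driver_versions() {
  local expected="" image version
  for image in "$@"; do
    version="$(driver_version_of "$image")"
    if [ -z "$expected" ]; then
      expected="$version"
    elif [ "$version" != "$expected" ]; then
      printf 'mismatch: %s has %s, expected %s\n' "$image" "$version" "$expected" >&2
      return 1
    fi
  done
  printf 'all extensions use driver %s\n' "$expected"
}

check_driver_versions \
  "ghcr.io/siderolabs/nvidia-open-gpu-kernel-modules:515.65.01-v1.2.0" \
  "ghcr.io/siderolabs/nvidia-container-toolkit:515.65.01-v1.10.0" \
  "ghcr.io/siderolabs/nvidia-fabricmanager:515.65.01"
```

The check only compares the leading driver version of each tag; it does not confirm that the images exist in the registry.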
|
@ -0,0 +1,220 @@
|
||||
---
|
||||
title: "NVIDIA GPU (Proprietary drivers)"
|
||||
description: "In this guide we'll follow the procedure to support NVIDIA GPU using proprietary drivers on Talos."
|
||||
aliases:
|
||||
- ../../guides/nvidia-gpu-proprietary
|
||||
---
|
||||
|
||||
> Enabling NVIDIA GPU support on Talos is bound by [NVIDIA EULA](https://www.nvidia.com/en-us/drivers/nvidia-license/).
|
||||
> Talos GPU support has been promoted to **beta**.
|
||||
|
||||
Enabling NVIDIA support in Talos involves the following steps:
|
||||
|
||||
- Talos pre-installed on a node with NVIDIA GPU installed.
|
||||
- Building a custom Talos installer image with NVIDIA modules
|
||||
- Upgrading Talos with the custom installer and enabling NVIDIA modules and the system extension
|
||||
|
||||
This requires that the user build and maintain their own Talos installer image.
|
||||
|
||||
## Prerequisites
|
||||
|
||||
This guide assumes that the user has `push` access to a container registry, that Docker is installed on the build machine, and that the Talos host has `pull` access to the same registry.
|
||||
|
||||
Set the local registry and username environment variables:
|
||||
|
||||
```bash
|
||||
export USERNAME=<username>
|
||||
export REGISTRY=<registry>
|
||||
```
|
||||
|
||||
For example:
|
||||
|
||||
```bash
|
||||
export USERNAME=talos-user
|
||||
export REGISTRY=ghcr.io
|
||||
```
|
||||
|
||||
> The examples below will use the sample variables set above.
|
||||
Modify accordingly for your environment.
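The image references in the rest of this guide follow the pattern `<registry>/<username>/<image>:<tag>`. As a rough sketch of how the two variables combine, the hypothetical helper `image_ref` below builds such a reference; the tag `v1.2.0-nvidia` is only a placeholder for your Talos release.

```bash
# Hypothetical helper, not part of Talos: build an image reference from
# the REGISTRY and USERNAME variables set above.
image_ref() {
  # $1 = image name, $2 = tag
  printf '%s/%s/%s:%s\n' \
    "${REGISTRY:?set REGISTRY first}" \
    "${USERNAME:?set USERNAME first}" \
    "$1" "$2"
}

export USERNAME=talos-user
export REGISTRY=ghcr.io

# Placeholder tag; substitute the Talos release you are building for.
image_ref installer v1.2.0-nvidia
```

With the sample values, the last command prints `ghcr.io/talos-user/installer:v1.2.0-nvidia`.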
|
||||
|
||||
## Building the installer image
|
||||
|
||||
Start by cloning the [pkgs](https://github.com/siderolabs/pkgs) repository.
|
||||
|
||||
Now run the following command to build and push the custom Talos kernel image and the NVIDIA image, which contains the NVIDIA kernel modules signed by the kernel built along with it.
|
||||
|
||||
```bash
|
||||
make kernel nonfree-kmod-nvidia PLATFORM=linux/amd64 PUSH=true
|
||||
```
|
||||
|
||||
> Replace the platform with `linux/arm64` if building for ARM64
|
||||
|
||||
Now we need to create a custom Talos installer image.
|
||||
|
||||
Start by creating a `Dockerfile` with the following content:
|
||||
|
||||
```Dockerfile
|
||||
FROM scratch as customization
|
||||
COPY --from=ghcr.io/talos-user/nonfree-kmod-nvidia:{{< release >}}-nvidia /lib/modules /lib/modules
|
||||
|
||||
FROM ghcr.io/siderolabs/installer:{{< release >}}
|
||||
COPY --from=ghcr.io/talos-user/kernel:{{< release >}}-nvidia /boot/vmlinuz /usr/install/${TARGETARCH}/vmlinuz
|
||||
```
|
||||
|
||||
Now build the image and push it to the registry.
|
||||
|
||||
```bash
|
||||
DOCKER_BUILDKIT=0 docker build --squash --build-arg RM="/lib/modules" -t ghcr.io/talos-user/installer:{{< release >}}-nvidia .
|
||||
docker push ghcr.io/talos-user/installer:{{< release >}}-nvidia
|
||||
```
|
||||
|
||||
> Note: BuildKit has a bug ([#816](https://github.com/moby/buildkit/issues/816)); set `DOCKER_BUILDKIT=0` to disable BuildKit for this build.
|
||||
> Replace the platform with `linux/arm64` if building for ARM64
|
||||
|
||||
## Upgrading Talos and enabling the NVIDIA modules and the system extension
|
||||
|
||||
> Make sure to use `talosctl` version {{< release >}} or later
|
||||
|
||||
First, create a patch file `gpu-worker-patch.yaml` to update the machine config, similar to the one below:
|
||||
|
||||
```yaml
|
||||
- op: add
|
||||
path: /machine/install/extensions
|
||||
value:
|
||||
- image: ghcr.io/siderolabs/nvidia-container-toolkit:{{< nvidia_driver_release >}}-{{< nvidia_container_toolkit_release >}}
|
||||
- op: add
|
||||
path: /machine/kernel
|
||||
value:
|
||||
modules:
|
||||
- name: nvidia
|
||||
- name: nvidia_uvm
|
||||
- name: nvidia_drm
|
||||
- name: nvidia_modeset
|
||||
- op: add
|
||||
path: /machine/sysctls
|
||||
value:
|
||||
net.core.bpf_jit_harden: 1
|
||||
```
|
||||
|
||||
Now apply the patch to all Talos nodes in the cluster that have NVIDIA GPUs installed:
|
||||
|
||||
```bash
|
||||
talosctl patch mc --patch @gpu-worker-patch.yaml
|
||||
```
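When several nodes have GPUs, the same patch has to be applied to each of them. The sketch below only prints one `talosctl` invocation per node so the commands can be reviewed first; the node list and the helper name `print_patch_commands` are assumptions for illustration (the second IP is made up).

```bash
# Hypothetical node list: replace with the IPs of your GPU nodes.
GPU_NODES="172.31.41.27 172.31.41.28"

# print_patch_commands echoes one talosctl invocation per node (a dry run);
# nothing is executed against the cluster.
print_patch_commands() {
  local node
  for node in $GPU_NODES; do
    printf 'talosctl --nodes %s patch mc --patch @gpu-worker-patch.yaml\n' "$node"
  done
}

print_patch_commands
```

Once the printed commands look right, they can be run one by one or piped to `sh`.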
|
||||
|
||||
Now we can proceed to upgrading Talos with the installer built previously:
|
||||
|
||||
```bash
|
||||
talosctl upgrade --image=ghcr.io/talos-user/installer:{{< release >}}-nvidia
|
||||
```
|
||||
|
||||
Once the node reboots, the NVIDIA modules should be loaded and the system extension should be installed.
|
||||
|
||||
This can be confirmed by running:
|
||||
|
||||
```bash
|
||||
talosctl read /proc/modules
|
||||
```
|
||||
|
||||
which should produce output similar to the following:
|
||||
|
||||
```text
|
||||
nvidia_uvm 1146880 - - Live 0xffffffffc2733000 (PO)
|
||||
nvidia_drm 69632 - - Live 0xffffffffc2721000 (PO)
|
||||
nvidia_modeset 1142784 - - Live 0xffffffffc25ea000 (PO)
|
||||
nvidia 39047168 - - Live 0xffffffffc00ac000 (PO)
|
||||
```
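Rather than reading the list by eye, the check can be scripted. This is a minimal sketch under the assumption that the four modules from the patch are the ones expected; `missing_modules` is a hypothetical helper that reads `/proc/modules` content on standard input.

```bash
# Modules enabled by the machine config patch above.
REQUIRED_MODULES="nvidia nvidia_uvm nvidia_drm nvidia_modeset"

# missing_modules reads /proc/modules content on stdin and prints each
# required module that is absent (the first field of a line is the name).
missing_modules() {
  local loaded module
  loaded="$(awk '{print $1}')"
  for module in $REQUIRED_MODULES; do
    printf '%s\n' "$loaded" | grep -qx "$module" || printf '%s\n' "$module"
  done
}

# Sample input based on the output above, with nvidia_drm removed
# to show what a failure looks like.
missing_modules <<'EOF'
nvidia_uvm 1146880 - - Live 0xffffffffc2733000 (PO)
nvidia_modeset 1142784 - - Live 0xffffffffc25ea000 (PO)
nvidia 39047168 - - Live 0xffffffffc00ac000 (PO)
EOF
```

Against a real node, the input would come from `talosctl read /proc/modules | missing_modules`; empty output means all four modules are loaded.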
|
||||
|
||||
```bash
|
||||
talosctl get extensions
|
||||
```
|
||||
|
||||
which should produce output similar to the following:
|
||||
|
||||
```text
|
||||
NODE NAMESPACE TYPE ID VERSION NAME VERSION
|
||||
172.31.41.27 runtime ExtensionStatus 000.ghcr.io-frezbo-nvidia-container-toolkit-510.60.02-v1.9.0 1 nvidia-container-toolkit 510.60.02-v1.9.0
|
||||
```
|
||||
|
||||
```bash
|
||||
talosctl read /proc/driver/nvidia/version
|
||||
```
|
||||
|
||||
which should produce output similar to the following:
|
||||
|
||||
```text
|
||||
NVRM version: NVIDIA UNIX x86_64 Kernel Module 510.60.02 Wed Mar 16 11:24:05 UTC 2022
|
||||
GCC version: gcc version 11.2.0 (GCC)
|
||||
```
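To compare the loaded driver against the version you expect, the version number can be pulled out of the `NVRM version` line. The sketch below assumes the line format shown above; `loaded_driver_version` and `EXPECTED` are hypothetical names.

```bash
# loaded_driver_version reads /proc/driver/nvidia/version content on stdin
# and prints the first x.y.z number found on the "NVRM version" line.
loaded_driver_version() {
  grep '^NVRM version:' | grep -oE '[0-9]+\.[0-9]+\.[0-9]+' | head -n 1
}

# Assumption: the driver version you built or installed.
EXPECTED="510.60.02"

# Sample content taken from the output above.
sample='NVRM version: NVIDIA UNIX x86_64 Kernel Module 510.60.02 Wed Mar 16 11:24:05 UTC 2022
GCC version: gcc version 11.2.0 (GCC)'

actual="$(printf '%s\n' "$sample" | loaded_driver_version)"
if [ "$actual" = "$EXPECTED" ]; then
  echo "driver $actual loaded"
else
  echo "expected $EXPECTED, got ${actual:-nothing}" >&2
fi
```

On a real node, replace the sample with `talosctl read /proc/driver/nvidia/version | loaded_driver_version`.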
|
||||
|
||||
## Deploying NVIDIA device plugin
|
||||
|
||||
First, we need to create the `RuntimeClass`.
|
||||
|
||||
Apply the following manifest to create a runtime class that uses the extension:
|
||||
|
||||
```yaml
|
||||
---
|
||||
apiVersion: node.k8s.io/v1
|
||||
kind: RuntimeClass
|
||||
metadata:
|
||||
name: nvidia
|
||||
handler: nvidia
|
||||
```
|
||||
|
||||
Install the NVIDIA device plugin:
|
||||
|
||||
```bash
|
||||
helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
|
||||
helm repo update
|
||||
helm install nvidia-device-plugin nvdp/nvidia-device-plugin --version=0.11.0 --set=runtimeClassName=nvidia
|
||||
```
|
||||
|
||||
Apply the following manifest to run a CUDA pod via the `nvidia` runtime:
|
||||
|
||||
```bash
|
||||
cat <<EOF | kubectl apply -f -
|
||||
---
|
||||
apiVersion: v1
|
||||
kind: Pod
|
||||
metadata:
|
||||
name: gpu-operator-test
|
||||
spec:
|
||||
restartPolicy: OnFailure
|
||||
runtimeClassName: nvidia
|
||||
containers:
|
||||
- name: cuda-vector-add
|
||||
image: "nvidia/samples:vectoradd-cuda11.6.0"
|
||||
resources:
|
||||
limits:
|
||||
nvidia.com/gpu: 1
|
||||
EOF
|
||||
```
|
||||
|
||||
The status can be viewed by running:
|
||||
|
||||
```bash
|
||||
kubectl get pods
|
||||
```
|
||||
|
||||
which should produce output similar to the following:
|
||||
|
||||
```text
|
||||
NAME READY STATUS RESTARTS AGE
|
||||
gpu-operator-test 0/1 Completed 0 13s
|
||||
```
|
||||
|
||||
```bash
|
||||
kubectl logs gpu-operator-test
|
||||
```
|
||||
|
||||
which should produce output similar to the following:
|
||||
|
||||
```text
|
||||
[Vector addition of 50000 elements]
|
||||
Copy input data from the host memory to the CUDA device
|
||||
CUDA kernel launch with 196 blocks of 256 threads
|
||||
Copy output data from the CUDA device to the host memory
|
||||
Test PASSED
|
||||
Done
|
||||
```
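For automation, the same check can be reduced to an exit status. This is a small sketch, assuming the sample image prints the line `Test PASSED` on success as shown above; `cuda_test_passed` is a hypothetical helper.

```bash
# cuda_test_passed reads pod logs on stdin and succeeds only if the
# sample reported a line that is exactly "Test PASSED".
cuda_test_passed() {
  grep -q '^Test PASSED$'
}

# Sample logs copied from the output above.
sample_logs='[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done'

if printf '%s\n' "$sample_logs" | cuda_test_passed; then
  echo "GPU workload succeeded"
else
  echo "GPU workload did not report success" >&2
fi
```

In a cluster, the logs would come from `kubectl logs gpu-operator-test | cuda_test_passed`.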
|
@ -1,87 +1,19 @@
|
||||
---
|
||||
title: "NVIDIA GPU"
|
||||
description: "In this guide we'll follow the procedure to support NVIDIA GPU on Talos."
|
||||
title: "NVIDIA GPU (OSS drivers)"
|
||||
description: "In this guide we'll follow the procedure to support NVIDIA GPU using OSS drivers on Talos."
|
||||
aliases:
|
||||
- ../../guides/nvidia-gpu
|
||||
---
|
||||
|
||||
> Enabling NVIDIA GPU support on Talos is bound by [NVIDIA EULA](https://www.nvidia.com/en-us/drivers/nvidia-license/)
|
||||
> Talos GPU support is an **alpha** feature.
|
||||
> Enabling NVIDIA GPU support on Talos is bound by [NVIDIA EULA](https://www.nvidia.com/en-us/drivers/nvidia-license/).
|
||||
> Talos GPU support has been promoted to **beta**.
|
||||
> The Talos published NVIDIA OSS drivers are bound to a specific Talos release.
|
||||
> The extension versions also need to be updated when upgrading Talos.
|
||||
|
||||
These are the steps to enabling NVIDIA support in Talos.
|
||||
The published versions of the NVIDIA system extensions can be found here:
|
||||
|
||||
- Talos pre-installed on a node with NVIDIA GPU installed.
|
||||
- Building a custom Talos installer image with NVIDIA modules
|
||||
- Building NVIDIA container toolkit system extension which allows to register a custom runtime with containerd
|
||||
- Upgrading Talos with the custom installer and enabling NVIDIA modules and the system extension
|
||||
|
||||
Both these components require that the user build and maintain their own Talos installer image and the NVIDIA container toolkit [Talos System Extension]({{< relref "system-extensions" >}}).
|
||||
|
||||
## Prerequisites
|
||||
|
||||
This guide assumes that the user has `push` access to a container registry, that Docker is installed on the build machine, and that the Talos host has `pull` access to the same registry.
|
||||
|
||||
Set the local registry and username environment variables:
|
||||
|
||||
```bash
|
||||
export USERNAME=<username>
|
||||
export REGISTRY=<registry>
|
||||
```
|
||||
|
||||
For example:
|
||||
|
||||
```bash
|
||||
export USERNAME=talos-user
|
||||
export REGISTRY=ghcr.io
|
||||
```
|
||||
|
||||
> The examples below will use the sample variables set above.
|
||||
Modify accordingly for your environment.
|
||||
|
||||
## Building the installer image
|
||||
|
||||
Start by cloning the [pkgs](https://github.com/siderolabs/pkgs) repository.
|
||||
|
||||
Now run the following command to build and push the custom Talos kernel image and the NVIDIA image, which contains the NVIDIA kernel modules signed by the kernel built along with it.
|
||||
|
||||
```bash
|
||||
make kernel nonfree-kmod-nvidia PLATFORM=linux/amd64 PUSH=true
|
||||
```
|
||||
|
||||
> Replace the platform with `linux/arm64` if building for ARM64
|
||||
|
||||
Now we need to create a custom Talos installer image.
|
||||
|
||||
Start by creating a `Dockerfile` with the following content:
|
||||
|
||||
```Dockerfile
|
||||
FROM scratch as customization
|
||||
COPY --from=ghcr.io/talos-user/nonfree-kmod-nvidia:{{< release >}}-nvidia /lib/modules /lib/modules
|
||||
|
||||
FROM ghcr.io/siderolabs/installer:{{< release >}}
|
||||
COPY --from=ghcr.io/talos-user/kernel:{{< release >}}-nvidia /boot/vmlinuz /usr/install/${TARGETARCH}/vmlinuz
|
||||
```
|
||||
|
||||
Now build the image and push it to the registry.
|
||||
|
||||
```bash
|
||||
DOCKER_BUILDKIT=0 docker build --squash --build-arg RM="/lib/modules" -t ghcr.io/talos-user/installer:{{< release >}}-nvidia .
|
||||
docker push ghcr.io/talos-user/installer:{{< release >}}-nvidia
|
||||
```
|
||||
|
||||
> Note: BuildKit has a bug ([#816](https://github.com/moby/buildkit/issues/816)); set `DOCKER_BUILDKIT=0` to disable BuildKit for this build.
|
||||
|
||||
## Building the system extension
|
||||
|
||||
Start by cloning the [extensions](https://github.com/siderolabs/extensions) repository.
|
||||
|
||||
Now run the following command to build and push the system extension.
|
||||
|
||||
```bash
|
||||
make nvidia-container-toolkit PLATFORM=linux/amd64 PUSH=true TAG=510.60.02-v1.9.0
|
||||
```
|
||||
|
||||
> Replace the platform with `linux/arm64` if building for ARM64
|
||||
- [nvidia-open-gpu-kernel-modules](https://github.com/siderolabs/extensions/pkgs/container/nvidia-open-gpu-kernel-modules)
|
||||
- [nvidia-container-toolkit](https://github.com/siderolabs/extensions/pkgs/container/nvidia-container-toolkit)
|
||||
|
||||
## Upgrading Talos and enabling the NVIDIA modules and the system extension
|
||||
|
||||
@ -93,7 +25,8 @@ First create a patch yaml `gpu-worker-patch.yaml` to update the machine config s
|
||||
- op: add
|
||||
path: /machine/install/extensions
|
||||
value:
|
||||
- image: ghcr.io/talos-user/nvidia-container-toolkit:510.60.02-v1.9.0
|
||||
- image: ghcr.io/siderolabs/nvidia-open-gpu-kernel-modules:{{< nvidia_driver_release >}}-{{< release >}}
|
||||
- image: ghcr.io/siderolabs/nvidia-container-toolkit:{{< nvidia_driver_release >}}-{{< nvidia_container_toolkit_release >}}
|
||||
- op: add
|
||||
path: /machine/kernel
|
||||
value:
|
||||
@ -108,16 +41,20 @@ First create a patch yaml `gpu-worker-patch.yaml` to update the machine config s
|
||||
net.core.bpf_jit_harden: 1
|
||||
```
|
||||
|
||||
> Update the driver version and Talos release in the above patch yaml from the published versions if there is a newer one available.
|
||||
> Make sure the driver version matches for both the `nvidia-open-gpu-kernel-modules` and `nvidia-container-toolkit` extensions.
|
||||
> The `nvidia-open-gpu-kernel-modules` extension is versioned as `<nvidia-driver-version>-<talos-release-version>` and the `nvidia-container-toolkit` extension is versioned as `<nvidia-driver-version>-<nvidia-container-toolkit-version>`.
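The versioning scheme described above can be made concrete with a short sketch that assembles both image references from their parts. The variable and function names are hypothetical, and the version values are illustrative; check the published versions for the ones that apply to your cluster.

```bash
# Illustrative values; substitute the versions published for your release.
DRIVER_VERSION="515.65.01"
TALOS_RELEASE="v1.2.0"
TOOLKIT_VERSION="v1.10.0"

# <nvidia-driver-version>-<talos-release-version>
kernel_modules_image() {
  printf 'ghcr.io/siderolabs/nvidia-open-gpu-kernel-modules:%s-%s\n' \
    "$DRIVER_VERSION" "$TALOS_RELEASE"
}

# <nvidia-driver-version>-<nvidia-container-toolkit-version>
container_toolkit_image() {
  printf 'ghcr.io/siderolabs/nvidia-container-toolkit:%s-%s\n' \
    "$DRIVER_VERSION" "$TOOLKIT_VERSION"
}

kernel_modules_image
container_toolkit_image
```

Deriving both tags from a single `DRIVER_VERSION` variable keeps the driver version identical across the two extensions.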
|
||||
|
||||
Now apply the patch to all Talos nodes in the cluster that have NVIDIA GPUs installed:
|
||||
|
||||
```bash
|
||||
talosctl patch mc --patch @gpu-worker-patch.yaml
|
||||
```
|
||||
|
||||
Now we can proceed to upgrading Talos with the installer built previously:
|
||||
Now we can proceed to upgrading Talos to the same version to enable the system extension:
|
||||
|
||||
```bash
|
||||
talosctl upgrade --image=ghcr.io/talos-user/installer:{{< release >}}-nvidia
|
||||
talosctl upgrade --image=ghcr.io/siderolabs/installer:{{< release >}}
|
||||
```
|
||||
|
||||
Once the node reboots, the NVIDIA modules should be loaded and the system extension should be installed.
|
||||
@ -144,8 +81,9 @@ talosctl get extensions
|
||||
which should produce an output similar to below:
|
||||
|
||||
```text
|
||||
NODE NAMESPACE TYPE ID VERSION NAME VERSION
|
||||
172.31.41.27 runtime ExtensionStatus 000.ghcr.io-frezbo-nvidia-container-toolkit-510.60.02-v1.9.0 1 nvidia-container-toolkit 510.60.02-v1.9.0
|
||||
NODE NAMESPACE TYPE ID VERSION NAME VERSION
|
||||
172.31.41.27 runtime ExtensionStatus 000.ghcr.io-siderolabs-nvidia-container-toolkit-515.65.01-v1.10.0 1 nvidia-container-toolkit 515.65.01-v1.10.0
|
||||
172.31.41.27 runtime ExtensionStatus 000.ghcr.io-siderolabs-nvidia-open-gpu-kernel-modules-515.65.01-v1.2.0 1 nvidia-open-gpu-kernel-modules 515.65.01-v1.2.0
|
||||
```
|
||||
|
||||
```bash
|
||||
@ -155,8 +93,8 @@ talosctl read /proc/driver/nvidia/version
|
||||
which should produce an output similar to below:
|
||||
|
||||
```text
|
||||
NVRM version: NVIDIA UNIX x86_64 Kernel Module 510.60.02 Wed Mar 16 11:24:05 UTC 2022
|
||||
GCC version: gcc version 11.2.0 (GCC)
|
||||
NVRM version: NVIDIA UNIX x86_64 Kernel Module 515.65.01 Wed Mar 16 11:24:05 UTC 2022
|
||||
GCC version: gcc version 12.2.0 (GCC)
|
||||
```
|
||||
|
||||
## Deploying NVIDIA device plugin
|
||||
|
@ -0,0 +1 @@
|
||||
{{ .Page.FirstSection.Params.nvidiaContainerToolkitRelease -}}
|
1
website/layouts/shortcodes/nvidia_driver_release.html
Normal file
@ -0,0 +1 @@
|
||||
{{ .Page.FirstSection.Params.nvidiaDriverRelease -}}
|