diff --git a/website/content/v1.2/_index.md b/website/content/v1.2/_index.md
index 24b82f371..85877d477 100644
--- a/website/content/v1.2/_index.md
+++ b/website/content/v1.2/_index.md
@@ -5,10 +5,12 @@ linkTitle: "Documentation"
 cascade:
   type: docs
 preRelease: true
-lastRelease: v1.2.0-alpha.1
+lastRelease: v1.2.0-beta.0
 kubernetesRelease: "v1.25.0-rc.1"
 prevKubernetesRelease: "1.24.3"
 theilaRelease: "v0.2.1"
+nvidiaContainerToolkitRelease: "v1.10.0"
+nvidiaDriverRelease: "515.65.01"
 ---

 ## Welcome
diff --git a/website/content/v1.2/talos-guides/configuration/nvidia-fabricmanager.md b/website/content/v1.2/talos-guides/configuration/nvidia-fabricmanager.md
new file mode 100644
index 000000000..3243022bc
--- /dev/null
+++ b/website/content/v1.2/talos-guides/configuration/nvidia-fabricmanager.md
@@ -0,0 +1,38 @@
+---
+title: "NVIDIA Fabric Manager"
+description: "In this guide we'll follow the procedure to enable NVIDIA Fabric Manager."
+aliases:
+  - ../../guides/nvidia-fabricmanager
+---
+
+NVIDIA GPUs with NVLink support (e.g. the A100) need the [nvidia-fabricmanager](https://github.com/siderolabs/extensions/pkgs/container/nvidia-fabricmanager) system extension enabled in addition to the [NVIDIA drivers]({{< relref "nvidia-gpu" >}}).
+For more information on Fabric Manager, refer to <https://docs.nvidia.com/datacenter/tesla/fabric-manager-user-guide/index.html>.
+
+The published versions of the NVIDIA Fabric Manager system extension are available [here](https://github.com/siderolabs/extensions/pkgs/container/nvidia-fabricmanager).
+
+> The `nvidia-fabricmanager` extension version has to match the NVIDIA driver version in use.
+
+## Upgrading Talos and enabling the NVIDIA fabricmanager system extension
+
+In addition to the patch defined in the [NVIDIA drivers]({{< relref "nvidia-gpu" >}}#upgrading-talos-and-enabling-the-nvidia-modules-and-the-system-extension) guide, we need to add the `nvidia-fabricmanager` system extension to the patch yaml `gpu-worker-patch.yaml`:
+
+```yaml
+- op: add
+  path: /machine/install/extensions
+  value:
+    - image: ghcr.io/siderolabs/nvidia-open-gpu-kernel-modules:{{< nvidia_driver_release >}}-{{< release >}}
+    - image: ghcr.io/siderolabs/nvidia-container-toolkit:{{< nvidia_driver_release >}}-{{< nvidia_container_toolkit_release >}}
+    - image: ghcr.io/siderolabs/nvidia-fabricmanager:{{< nvidia_driver_release >}}
+- op: add
+  path: /machine/kernel
+  value:
+    modules:
+      - name: nvidia
+      - name: nvidia_uvm
+      - name: nvidia_drm
+      - name: nvidia_modeset
+- op: add
+  path: /machine/sysctls
+  value:
+    net.core.bpf_jit_harden: 1
+```
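+
+Once saved, the patch can be applied and Talos upgraded exactly as described in the [NVIDIA drivers]({{< relref "nvidia-gpu" >}}) guide; both commands below are the same as in that guide, for example:
+
+```bash
+talosctl patch mc --patch @gpu-worker-patch.yaml
+talosctl upgrade --image=ghcr.io/siderolabs/installer:{{< release >}}
+```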
diff --git a/website/content/v1.2/talos-guides/configuration/nvidia-gpu-proprietary.md b/website/content/v1.2/talos-guides/configuration/nvidia-gpu-proprietary.md
new file mode 100644
index 000000000..c1a4cd951
--- /dev/null
+++ b/website/content/v1.2/talos-guides/configuration/nvidia-gpu-proprietary.md
@@ -0,0 +1,220 @@
+---
+title: "NVIDIA GPU (Proprietary drivers)"
+description: "In this guide we'll follow the procedure to support NVIDIA GPU using proprietary drivers on Talos."
+aliases:
+  - ../../guides/nvidia-gpu-proprietary
+---
+
+> Enabling NVIDIA GPU support on Talos is bound by the [NVIDIA EULA](https://www.nvidia.com/en-us/drivers/nvidia-license/).
+> Talos GPU support has been promoted to **beta**.
+
+These are the steps to enable NVIDIA support in Talos:
+
+- Talos pre-installed on a node with an NVIDIA GPU
+- Building a custom Talos installer image with the NVIDIA modules
+- Upgrading Talos with the custom installer and enabling the NVIDIA modules and the system extension
+
+This requires that the user build and maintain their own Talos installer image.
+
+## Prerequisites
+
+This guide assumes the user has access to a container registry with `push` permissions, Docker installed on the build machine, and that the Talos host has `pull` access to the container registry.
+
+Set the local registry and username environment variables:
+
+```bash
+export USERNAME=<USERNAME>
+export REGISTRY=<REGISTRY>
+```
+
+For example:
+
+```bash
+export USERNAME=talos-user
+export REGISTRY=ghcr.io
+```
+
+> The examples below will use the sample variables set above.
+> Modify accordingly for your environment.
+
+## Building the installer image
+
+Start by cloning the [pkgs](https://github.com/siderolabs/pkgs) repository.
+
+Now run the following command to build and push the custom Talos kernel image and the NVIDIA image with the NVIDIA kernel modules signed by the kernel built along with it:
+
+```bash
+make kernel nonfree-kmod-nvidia PLATFORM=linux/amd64 PUSH=true
+```
+
+> Replace the platform with `linux/arm64` if building for ARM64.
+
+Now we need to create a custom Talos installer image.
+
+Start by creating a `Dockerfile` with the following content:
+
+```Dockerfile
+FROM scratch as customization
+COPY --from=ghcr.io/talos-user/nonfree-kmod-nvidia:{{< release >}}-nvidia /lib/modules /lib/modules

+FROM ghcr.io/siderolabs/installer:{{< release >}}
+COPY --from=ghcr.io/talos-user/kernel:{{< release >}}-nvidia /boot/vmlinuz /usr/install/${TARGETARCH}/vmlinuz
+```
+
+Now build the image and push it to the registry:
+
+```bash
+DOCKER_BUILDKIT=0 docker build --squash --build-arg RM="/lib/modules" -t ghcr.io/talos-user/installer:{{< release >}}-nvidia .
+docker push ghcr.io/talos-user/installer:{{< release >}}-nvidia
+```
+
+> Note: BuildKit has a bug ([#816](https://github.com/moby/buildkit/issues/816)); to disable it, use `DOCKER_BUILDKIT=0`.
+> Replace the platform with `linux/arm64` if building for ARM64.
+
+## Upgrading Talos and enabling the NVIDIA modules and the system extension
+
+> Make sure to use `talosctl` version {{< release >}} or later.
+
+First create a patch yaml `gpu-worker-patch.yaml` to update the machine config, similar to the example below:
+
+```yaml
+- op: add
+  path: /machine/install/extensions
+  value:
+    - image: ghcr.io/siderolabs/nvidia-container-toolkit:{{< nvidia_driver_release >}}-{{< nvidia_container_toolkit_release >}}
+- op: add
+  path: /machine/kernel
+  value:
+    modules:
+      - name: nvidia
+      - name: nvidia_uvm
+      - name: nvidia_drm
+      - name: nvidia_modeset
+- op: add
+  path: /machine/sysctls
+  value:
+    net.core.bpf_jit_harden: 1
+```
+
+Now apply the patch to all Talos nodes in the cluster that have NVIDIA GPUs installed:
+
+```bash
+talosctl patch mc --patch @gpu-worker-patch.yaml
+```
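+
+To target specific GPU worker nodes explicitly, the `--nodes` flag can be passed to `talosctl`, for example (substitute the IP of your worker node):
+
+```bash
+talosctl patch mc --nodes 172.31.41.27 --patch @gpu-worker-patch.yaml
+```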
+
+Now we can proceed to upgrading Talos with the installer built previously:
+
+```bash
+talosctl upgrade --image=ghcr.io/talos-user/installer:{{< release >}}-nvidia
+```
+
+Once the node reboots, the NVIDIA modules should be loaded and the system extension should be installed.
+
+This can be confirmed by running:
+
+```bash
+talosctl read /proc/modules
+```
+
+which should produce output similar to the following:
+
+```text
+nvidia_uvm 1146880 - - Live 0xffffffffc2733000 (PO)
+nvidia_drm 69632 - - Live 0xffffffffc2721000 (PO)
+nvidia_modeset 1142784 - - Live 0xffffffffc25ea000 (PO)
+nvidia 39047168 - - Live 0xffffffffc00ac000 (PO)
+```
+
+```bash
+talosctl get extensions
+```
+
+which should produce output similar to the following:
+
+```text
+NODE           NAMESPACE   TYPE              ID                                                                   VERSION   NAME                       VERSION
+172.31.41.27   runtime     ExtensionStatus   000.ghcr.io-siderolabs-nvidia-container-toolkit-515.65.01-v1.10.0   1         nvidia-container-toolkit   515.65.01-v1.10.0
+```
+
+```bash
+talosctl read /proc/driver/nvidia/version
+```
+
+which should produce output similar to the following:
+
+```text
+NVRM version: NVIDIA UNIX x86_64 Kernel Module  515.65.01  Wed Mar 16 11:24:05 UTC 2022
+GCC version:  gcc version 12.2.0 (GCC)
+```
+
+## Deploying NVIDIA device plugin
+
+First we need to create the `RuntimeClass`.
+
+Apply the following manifest to create a runtime class that uses the extension:
+
+```yaml
+---
+apiVersion: node.k8s.io/v1
+kind: RuntimeClass
+metadata:
+  name: nvidia
+handler: nvidia
+```
+
+Install the NVIDIA device plugin:
+
+```bash
+helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
+helm repo update
+helm install nvidia-device-plugin nvdp/nvidia-device-plugin --version=0.11.0 --set=runtimeClassName=nvidia
+```
+
+Apply the following manifest to run a CUDA pod via the nvidia runtime:
+
+```bash
+cat <<EOF | kubectl apply -f -
+---
+apiVersion: v1
+kind: Pod
+metadata:
+  name: gpu-operator-test
+spec:
+  restartPolicy: OnFailure
+  runtimeClassName: nvidia
+  containers:
+    - name: cuda-vector-add
+      image: "nvidia/samples:vectoradd-cuda10.2"
+      resources:
+        limits:
+          nvidia.com/gpu: 1
+EOF
+```
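+
+Once the pod completes, its logs can be checked; a successful run of the vector-add sample should print a line like `Test PASSED` (the exact output may vary with the sample image version):
+
+```bash
+kubectl logs gpu-operator-test
+```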
diff --git a/website/content/v1.2/talos-guides/configuration/nvidia-gpu.md b/website/content/v1.2/talos-guides/configuration/nvidia-gpu.md
--- a/website/content/v1.2/talos-guides/configuration/nvidia-gpu.md
+++ b/website/content/v1.2/talos-guides/configuration/nvidia-gpu.md
@@ -8,80 +8,12 @@
-> Enabling NVIDIA GPU support on Talos is bound by [NVIDIA EULA](https://www.nvidia.com/en-us/drivers/nvidia-license/)
-> Talos GPU support is an **alpha** feature.
+> Enabling NVIDIA GPU support on Talos is bound by the [NVIDIA EULA](https://www.nvidia.com/en-us/drivers/nvidia-license/).
+> Talos GPU support has been promoted to **beta**.
+> The Talos-published NVIDIA OSS drivers are bound to a specific Talos release.
+> The extension versions also need to be updated when upgrading Talos.

-These are the steps to enabling NVIDIA support in Talos.
+The published versions of the NVIDIA system extensions can be found here:

-- Talos pre-installed on a node with NVIDIA GPU installed.
-- Building a custom Talos installer image with NVIDIA modules
-- Building NVIDIA container toolkit system extension which allows to register a custom runtime with containerd
-- Upgrading Talos with the custom installer and enabling NVIDIA modules and the system extension
-
-Both these components require that the user build and maintain their own Talos installer image and the NVIDIA container toolkit [Talos System Extension]({{< relref "system-extensions" >}}).
-
-## Prerequisites
-
-This guide assumes the user has access to a container registry with `push` permissions, docker installed on the build machine and the Talos host has `pull` access to the container registry.
-
-Set the local registry and username environment variables:
-
-```bash
-export USERNAME=<USERNAME>
-export REGISTRY=<REGISTRY>
-```
-
-For eg:
-
-```bash
-export USERNAME=talos-user
-export REGISTRY=ghcr.io
-```
-
-> The examples below will use the sample variables set above.
-Modify accordingly for your environment.
-
-## Building the installer image
-
-Start by cloning the [pkgs](https://github.com/siderolabs/pkgs) repository.
-
-Now run the following command to build and push custom Talos kernel image and the NVIDIA image with the NVIDIA kernel modules signed by the kernel built along with it.
-
-```bash
-make kernel nonfree-kmod-nvidia PLATFORM=linux/amd64 PUSH=true
-```
-
-> Replace the platform with `linux/arm64` if building for ARM64
-
-Now we need to create a custom Talos installer image.
-
-Start by creating a `Dockerfile` with the following content:
-
-```Dockerfile
-FROM scratch as customization
-COPY --from=ghcr.io/talos-user/nonfree-kmod-nvidia:{{< release >}}-nvidia /lib/modules /lib/modules
-
-FROM ghcr.io/siderolabs/installer:{{< release >}}
-COPY --from=ghcr.io/talos-user/kernel:{{< release >}}-nvidia /boot/vmlinuz /usr/install/${TARGETARCH}/vmlinuz
-```
-
-Now build the image and push it to the registry.
-
-```bash
-DOCKER_BUILDKIT=0 docker build --squash --build-arg RM="/lib/modules" -t ghcr.io/talos-user/installer:{{< release >}}-nvidia .
-docker push ghcr.io/talos-user/installer:{{< release >}}-nvidia
-```
-
-> Note: buildkit has a bug [#816](https://github.com/moby/buildkit/issues/816), to disable it use DOCKER_BUILDKIT=0
-
-## Building the system extension
-
-Start by cloning the [extensions](https://github.com/siderolabs/extensions) repository.
-
-Now run the following command to build and push the system extension.
-
-```bash
-make nvidia-container-toolkit PLATFORM=linux/amd64 PUSH=true TAG=510.60.02-v1.9.0
-```
-
-> Replace the platform with `linux/arm64` if building for ARM64
+- [nvidia-open-gpu-kernel-modules](https://github.com/siderolabs/extensions/pkgs/container/nvidia-open-gpu-kernel-modules)
+- [nvidia-container-toolkit](https://github.com/siderolabs/extensions/pkgs/container/nvidia-container-toolkit)

 ## Upgrading Talos and enabling the NVIDIA modules and the system extension

@@ -93,7 +25,8 @@ First create a patch yaml `gpu-worker-patch.yaml` to update the machine config s
 - op: add
   path: /machine/install/extensions
   value:
-    - image: ghcr.io/talos-user/nvidia-container-toolkit:510.60.02-v1.9.0
+    - image: ghcr.io/siderolabs/nvidia-open-gpu-kernel-modules:{{< nvidia_driver_release >}}-{{< release >}}
+    - image: ghcr.io/siderolabs/nvidia-container-toolkit:{{< nvidia_driver_release >}}-{{< nvidia_container_toolkit_release >}}
 - op: add
   path: /machine/kernel
   value:
@@ -108,16 +41,20 @@ First create a patch yaml `gpu-worker-patch.yaml` to update the machine config s
     net.core.bpf_jit_harden: 1
 ```

+> Update the driver version and Talos release in the above patch yaml if newer published versions are available.
+> Make sure the driver version matches for both the `nvidia-open-gpu-kernel-modules` and `nvidia-container-toolkit` extensions.
+> The `nvidia-open-gpu-kernel-modules` extension is versioned as `<nvidia-driver-version>-<talos-release-version>` and the `nvidia-container-toolkit` extension is versioned as `<nvidia-driver-version>-<nvidia-container-toolkit-version>`.
+
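+For example, with the versions pinned for this release (driver `515.65.01`, container toolkit `v1.10.0`, Talos `{{< release >}}`), the image references resolve to the versions shown in the `talosctl get extensions` output below:
+
+```yaml
+    - image: ghcr.io/siderolabs/nvidia-open-gpu-kernel-modules:515.65.01-v1.2.0
+    - image: ghcr.io/siderolabs/nvidia-container-toolkit:515.65.01-v1.10.0
+```
+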
 Now apply the patch to all Talos nodes in the cluster having NVIDIA GPU's installed:

 ```bash
 talosctl patch mc --patch @gpu-worker-patch.yaml
 ```

-Now we can proceed to upgrading Talos with the installer built previously:
+Now we can proceed to upgrading Talos to the same version to enable the system extension:

 ```bash
-talosctl upgrade --image=ghcr.io/talos-user/installer:{{< release >}}-nvidia
+talosctl upgrade --image=ghcr.io/siderolabs/installer:{{< release >}}
 ```

 Once the node reboots, the NVIDIA modules should be loaded and the system extension should be installed.
@@ -144,8 +81,9 @@ talosctl get extensions
 which should produce an output similar to below:

 ```text
-NODE           NAMESPACE   TYPE              ID                                                              VERSION   NAME                       VERSION
-172.31.41.27   runtime     ExtensionStatus   000.ghcr.io-frezbo-nvidia-container-toolkit-510.60.02-v1.9.0   1         nvidia-container-toolkit   510.60.02-v1.9.0
+NODE           NAMESPACE   TYPE              ID                                                                       VERSION   NAME                             VERSION
+172.31.41.27   runtime     ExtensionStatus   000.ghcr.io-siderolabs-nvidia-container-toolkit-515.65.01-v1.10.0        1         nvidia-container-toolkit         515.65.01-v1.10.0
+172.31.41.27   runtime     ExtensionStatus   000.ghcr.io-siderolabs-nvidia-open-gpu-kernel-modules-515.65.01-v1.2.0   1         nvidia-open-gpu-kernel-modules   515.65.01-v1.2.0
 ```

 ```bash
@@ -155,8 +93,8 @@ talosctl read /proc/driver/nvidia/version
 which should produce an output similar to below:

 ```text
-NVRM version: NVIDIA UNIX x86_64 Kernel Module  510.60.02  Wed Mar 16 11:24:05 UTC 2022
-GCC version:  gcc version 11.2.0 (GCC)
+NVRM version: NVIDIA UNIX x86_64 Kernel Module  515.65.01  Wed Mar 16 11:24:05 UTC 2022
+GCC version:  gcc version 12.2.0 (GCC)
 ```

 ## Deploying NVIDIA device plugin
diff --git a/website/layouts/shortcodes/nvidia_container_toolkit_release.html b/website/layouts/shortcodes/nvidia_container_toolkit_release.html
new file mode 100644
index 000000000..15d676f99
--- /dev/null
+++ b/website/layouts/shortcodes/nvidia_container_toolkit_release.html
@@ -0,0 +1 @@
+{{ .Page.FirstSection.Params.nvidiaContainerToolkitRelease -}}
diff --git a/website/layouts/shortcodes/nvidia_driver_release.html b/website/layouts/shortcodes/nvidia_driver_release.html
new file mode 100644
index 000000000..61c760b7a
--- /dev/null
+++ b/website/layouts/shortcodes/nvidia_driver_release.html
@@ -0,0 +1 @@
+{{ .Page.FirstSection.Params.nvidiaDriverRelease -}}