Mirror of https://github.com/siderolabs/extensions.git, synced 2025-08-15 02:37:14 +02:00
Remove deprecated `.machine.install.extensions`, point to Talos documentation. Once Image Factory is live, we can point to it. Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>
92 lines
2.0 KiB
Markdown
# NVIDIA Container toolkit extension
## Installation
See [Installing Extensions](https://github.com/siderolabs/extensions#installing-extensions).
## Usage
The following NVIDIA kernel modules need to be loaded, so add this to the Talos machine config:
```yaml
machine:
  kernel:
    modules:
      - name: nvidia
      - name: nvidia_uvm
      - name: nvidia_drm
      - name: nvidia_modeset
```
`nvidia-container-cli` loads BPF programs and requires a relaxed KSPP setting for [bpf_jit_harden](https://sysctl-explorer.net/net/core/bpf_jit_harden/), so the Talos default setting should be overridden:
```yaml
machine:
  sysctls:
    net.core.bpf_jit_harden: 1
```
> Warning! This disables a [KSPP best practices](https://kernsec.org/wiki/index.php/Kernel_Self_Protection_Project/Recommended_Settings#sysctls) setting.
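For convenience, the two configuration snippets above can be combined into a single machine config patch. The file name `nvidia-patch.yaml` below is just an example; a patch like this can be applied with, e.g., `talosctl patch machineconfig --patch @nvidia-patch.yaml`:

```yaml
# nvidia-patch.yaml (example name): loads the NVIDIA kernel modules
# and relaxes bpf_jit_harden for nvidia-container-cli in one patch
machine:
  kernel:
    modules:
      - name: nvidia
      - name: nvidia_uvm
      - name: nvidia_drm
      - name: nvidia_modeset
  sysctls:
    net.core.bpf_jit_harden: 1
```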
## Testing
Apply the following manifest to create a runtime class that uses the extension:
```yaml
---
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: nvidia
handler: nvidia
```
Install the NVIDIA device plugin:
```bash
helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
helm repo update
helm install nvidia-device-plugin nvdp/nvidia-device-plugin --version=0.14.1 --set=runtimeClassName=nvidia
```
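Equivalently, the `--set` flag can be expressed as a Helm values file (the name `values.yaml` here is just an example) passed with `helm install ... -f values.yaml`:

```yaml
# values.yaml (example name): sets the runtime class used by the
# device plugin pods, equivalent to --set=runtimeClassName=nvidia
runtimeClassName: nvidia
```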
Apply the following manifest to run a CUDA pod via the nvidia runtime:
```yaml
---
apiVersion: v1
kind: Pod
metadata:
  name: gpu-operator-test
spec:
  restartPolicy: OnFailure
  runtimeClassName: nvidia
  containers:
    - name: cuda-vector-add
      image: "nvidia/samples:vectoradd-cuda11.6.0"
      resources:
        limits:
          nvidia.com/gpu: 1
```
The status can be viewed by running:
```bash
❯ kubectl get pods
NAME                READY   STATUS      RESTARTS   AGE
gpu-operator-test   0/1     Completed   0          13s
```
```bash
❯ kubectl logs gpu-operator-test
[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done
```