docs: cuda guide: include files directly and use local references
parent bf3f630d1b
commit a2305bd87a
@ -3,4 +3,5 @@ mkdocs-material
pymdown-extensions
mkdocs-git-revision-date-localized-plugin
mkdocs-awesome-pages-plugin
mdx_truly_sane_lists
mdx_truly_sane_lists
mkdocs-include-markdown-plugin # https://github.com/mondeja/mkdocs-include-markdown-plugin
@ -1,135 +1,32 @@
|
||||
# Running CUDA workloads
|
||||
|
||||
If you want to run CUDA workloads on the K3S container you need to customize the container.
|
||||
If you want to run CUDA workloads on the K3s container you need to customize the container.
|
||||
CUDA workloads require the NVIDIA Container Runtime, so containerd needs to be configured to use this runtime.
|
||||
The K3S container itself also needs to run with this runtime.
|
||||
The K3s container itself also needs to run with this runtime.
|
||||
If you are using Docker you can install the [NVIDIA Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html).
|
||||
|
||||
## Building a customized K3S image
|
||||
## Building a customized K3s image
|
||||
|
||||
To get the NVIDIA container runtime in the K3S image you need to build your own K3S image.
|
||||
The native K3S image is based on Alpine but the NVIDIA container runtime is not supported on Alpine yet.
|
||||
To get the NVIDIA container runtime in the K3s image you need to build your own K3s image.
|
||||
The native K3s image is based on Alpine but the NVIDIA container runtime is not supported on Alpine yet.
|
||||
To get around this we need to build the image with a supported base image.
|
||||
|
||||
### Dockerfiles:
|
||||
### Dockerfiles
|
||||
|
||||
Dockerfile.base:
|
||||
[Dockerfile.base](cuda/Dockerfile.base):
|
||||
|
||||
```Dockerfile
|
||||
FROM nvidia/cuda:11.2.0-base-ubuntu18.04
|
||||
|
||||
ENV DEBIAN_FRONTEND noninteractive
|
||||
|
||||
ARG DOCKER_VERSION
|
||||
ENV DOCKER_VERSION=$DOCKER_VERSION
|
||||
|
||||
RUN set -x && \
|
||||
apt-get update && \
|
||||
apt-get install -y \
|
||||
apt-transport-https \
|
||||
ca-certificates \
|
||||
curl \
|
||||
wget \
|
||||
tar \
|
||||
zstd \
|
||||
gnupg \
|
||||
lsb-release \
|
||||
git \
|
||||
software-properties-common \
|
||||
build-essential && \
|
||||
rm -rf /var/lib/apt/lists/*
|
||||
|
||||
RUN set -x && \
|
||||
curl -fsSL https://download.docker.com/linux/$(lsb_release -is | tr '[:upper:]' '[:lower:]')/gpg | gpg --dearmor -o /usr/share/keyrings/docker-archive-keyring.gpg && \
|
||||
echo "deb [arch=amd64 signed-by=/usr/share/keyrings/docker-archive-keyring.gpg] https://download.docker.com/linux/$(lsb_release -is | tr '[:upper:]' '[:lower:]') $(lsb_release -cs) stable" | tee /etc/apt/sources.list.d/docker.list > /dev/null && \
|
||||
apt-get update && \
|
||||
apt-get install -y \
|
||||
containerd.io \
|
||||
docker-ce=5:$DOCKER_VERSION~3-0~$(lsb_release -is | tr '[:upper:]' '[:lower:]')-$(lsb_release -cs) \
|
||||
docker-ce-cli=5:$DOCKER_VERSION~3-0~$(lsb_release -is | tr '[:upper:]' '[:lower:]')-$(lsb_release -cs) && \
|
||||
rm -rf /var/lib/apt/lists/*
|
||||
{% include "cuda/Dockerfile.base" %}
|
||||
|
||||
```
|
||||
|
||||
|
||||
|
||||
Dockerfile.k3d-gpu:
|
||||
[Dockerfile.k3d-gpu](cuda/Dockerfile.k3d-gpu):
|
||||
|
||||
```Dockerfile
|
||||
FROM nvidia/cuda:11.2.0-base-ubuntu18.04 as base
|
||||
|
||||
RUN set -x && \
|
||||
apt-get update && \
|
||||
apt-get install -y ca-certificates zstd
|
||||
|
||||
COPY k3s/build/out/data.tar.zst /
|
||||
|
||||
RUN set -x && \
|
||||
mkdir -p /image/etc/ssl/certs /image/run /image/var/run /image/tmp /image/lib/modules /image/lib/firmware && \
|
||||
tar -I zstd -xf /data.tar.zst -C /image && \
|
||||
cp /etc/ssl/certs/ca-certificates.crt /image/etc/ssl/certs/ca-certificates.crt
|
||||
|
||||
RUN set -x && \
|
||||
cd image/bin && \
|
||||
rm -f k3s && \
|
||||
ln -s k3s-server k3s
|
||||
|
||||
FROM nvidia/cuda:11.2.0-base-ubuntu18.04
|
||||
|
||||
ARG NVIDIA_CONTAINER_RUNTIME_VERSION
|
||||
ENV NVIDIA_CONTAINER_RUNTIME_VERSION=$NVIDIA_CONTAINER_RUNTIME_VERSION
|
||||
|
||||
RUN set -x && \
|
||||
echo 'debconf debconf/frontend select Noninteractive' | debconf-set-selections
|
||||
|
||||
RUN set -x && \
|
||||
apt-get update && \
|
||||
apt-get -y install gnupg2 curl
|
||||
|
||||
# Install NVIDIA Container Runtime
|
||||
RUN set -x && \
|
||||
curl -s -L https://nvidia.github.io/nvidia-container-runtime/gpgkey | apt-key add -
|
||||
|
||||
RUN set -x && \
|
||||
curl -s -L https://nvidia.github.io/nvidia-container-runtime/ubuntu18.04/nvidia-container-runtime.list | tee /etc/apt/sources.list.d/nvidia-container-runtime.list
|
||||
|
||||
RUN set -x && \
|
||||
apt-get update && \
|
||||
apt-get -y install nvidia-container-runtime=${NVIDIA_CONTAINER_RUNTIME_VERSION}
|
||||
|
||||
|
||||
COPY --from=base /image /
|
||||
|
||||
RUN set -x && \
|
||||
mkdir -p /etc && \
|
||||
echo 'hosts: files dns' > /etc/nsswitch.conf
|
||||
|
||||
RUN set -x && \
|
||||
chmod 1777 /tmp
|
||||
|
||||
# Provide custom containerd configuration to configure the nvidia-container-runtime
|
||||
RUN set -x && \
|
||||
mkdir -p /var/lib/rancher/k3s/agent/etc/containerd/
|
||||
|
||||
COPY config.toml.tmpl /var/lib/rancher/k3s/agent/etc/containerd/config.toml.tmpl
|
||||
|
||||
# Deploy the nvidia driver plugin on startup
|
||||
RUN set -x && \
|
||||
mkdir -p /var/lib/rancher/k3s/server/manifests
|
||||
|
||||
COPY gpu.yaml /var/lib/rancher/k3s/server/manifests/gpu.yaml
|
||||
|
||||
VOLUME /var/lib/kubelet
|
||||
VOLUME /var/lib/rancher/k3s
|
||||
VOLUME /var/lib/cni
|
||||
VOLUME /var/log
|
||||
|
||||
ENV PATH="$PATH:/bin/aux"
|
||||
|
||||
ENTRYPOINT ["/bin/k3s"]
|
||||
CMD ["agent"]
|
||||
{% include "cuda/Dockerfile.k3d-gpu" %}
|
||||
```
|
||||
|
||||
These Dockerfiles [Dockerfile.base](https://github.com/vainkop/k3d/blob/main/docs/usage/guides/cuda/Dockerfile.base) + [Dockerfile.k3d-gpu](https://github.com/vainkop/k3d/blob/main/docs/usage/guides/cuda/Dockerfile.k3d-gpu) are based on the [K3s Dockerfile](https://github.com/rancher/k3s/blob/master/package/Dockerfile)
|
||||
These Dockerfiles are based on the [K3s Dockerfile](https://github.com/rancher/k3s/blob/master/package/Dockerfile).
|
||||
The following changes are applied:
|
||||
|
||||
1. Change the base images to nvidia/cuda:11.2.0-base-ubuntu18.04 so the NVIDIA Container Runtime can be installed. The version of `cuda:xx.x.x` must match the one you're planning to use.
|
||||
@ -141,61 +38,7 @@ The following changes are applied:
|
||||
We need to configure containerd to use the NVIDIA Container Runtime. We need to customize the config.toml that is used at startup. K3s provides a way to do this using a [config.toml.tmpl](cuda/config.toml.tmpl) file. More information can be found on the [K3s site](https://rancher.com/docs/k3s/latest/en/advanced/#configuring-containerd).
|
||||
|
||||
```go
|
||||
[plugins.opt]
|
||||
path = "{{ .NodeConfig.Containerd.Opt }}"
|
||||
|
||||
[plugins.cri]
|
||||
stream_server_address = "127.0.0.1"
|
||||
stream_server_port = "10010"
|
||||
|
||||
{{- if .IsRunningInUserNS }}
|
||||
disable_cgroup = true
|
||||
disable_apparmor = true
|
||||
restrict_oom_score_adj = true
|
||||
{{end}}
|
||||
|
||||
{{- if .NodeConfig.AgentConfig.PauseImage }}
|
||||
sandbox_image = "{{ .NodeConfig.AgentConfig.PauseImage }}"
|
||||
{{end}}
|
||||
|
||||
{{- if not .NodeConfig.NoFlannel }}
|
||||
[plugins.cri.cni]
|
||||
bin_dir = "{{ .NodeConfig.AgentConfig.CNIBinDir }}"
|
||||
conf_dir = "{{ .NodeConfig.AgentConfig.CNIConfDir }}"
|
||||
{{end}}
|
||||
|
||||
[plugins.cri.containerd.runtimes.runc]
|
||||
# ---- changed from 'io.containerd.runc.v2' for GPU support
|
||||
runtime_type = "io.containerd.runtime.v1.linux"
|
||||
|
||||
# ---- added for GPU support
|
||||
[plugins.linux]
|
||||
runtime = "nvidia-container-runtime"
|
||||
|
||||
{{ if .PrivateRegistryConfig }}
|
||||
{{ if .PrivateRegistryConfig.Mirrors }}
|
||||
[plugins.cri.registry.mirrors]{{end}}
|
||||
{{range $k, $v := .PrivateRegistryConfig.Mirrors }}
|
||||
[plugins.cri.registry.mirrors."{{$k}}"]
|
||||
endpoint = [{{range $i, $j := $v.Endpoints}}{{if $i}}, {{end}}{{printf "%q" .}}{{end}}]
|
||||
{{end}}
|
||||
|
||||
{{range $k, $v := .PrivateRegistryConfig.Configs }}
|
||||
{{ if $v.Auth }}
|
||||
[plugins.cri.registry.configs."{{$k}}".auth]
|
||||
{{ if $v.Auth.Username }}username = "{{ $v.Auth.Username }}"{{end}}
|
||||
{{ if $v.Auth.Password }}password = "{{ $v.Auth.Password }}"{{end}}
|
||||
{{ if $v.Auth.Auth }}auth = "{{ $v.Auth.Auth }}"{{end}}
|
||||
{{ if $v.Auth.IdentityToken }}identitytoken = "{{ $v.Auth.IdentityToken }}"{{end}}
|
||||
{{end}}
|
||||
{{ if $v.TLS }}
|
||||
[plugins.cri.registry.configs."{{$k}}".tls]
|
||||
{{ if $v.TLS.CAFile }}ca_file = "{{ $v.TLS.CAFile }}"{{end}}
|
||||
{{ if $v.TLS.CertFile }}cert_file = "{{ $v.TLS.CertFile }}"{{end}}
|
||||
{{ if $v.TLS.KeyFile }}key_file = "{{ $v.TLS.KeyFile }}"{{end}}
|
||||
{{end}}
|
||||
{{end}}
|
||||
{{end}}
|
||||
{% include "cuda/config.toml.tmpl" %}
|
||||
```
|
||||
|
||||
### The NVIDIA device plugin
|
||||
@ -207,102 +50,34 @@ To enable NVIDIA GPU support on Kubernetes you also need to install the [NVIDIA
|
||||
* Run GPU enabled containers in your Kubernetes cluster.
|
||||
|
||||
```yaml
|
||||
apiVersion: apps/v1
|
||||
kind: DaemonSet
|
||||
metadata:
|
||||
name: nvidia-device-plugin-daemonset
|
||||
namespace: kube-system
|
||||
spec:
|
||||
selector:
|
||||
matchLabels:
|
||||
name: nvidia-device-plugin-ds
|
||||
template:
|
||||
metadata:
|
||||
# Mark this pod as a critical add-on; when enabled, the critical add-on scheduler
|
||||
# reserves resources for critical add-on pods so that they can be rescheduled after
|
||||
# a failure. This annotation works in tandem with the toleration below.
|
||||
annotations:
|
||||
scheduler.alpha.kubernetes.io/critical-pod: ""
|
||||
labels:
|
||||
name: nvidia-device-plugin-ds
|
||||
spec:
|
||||
tolerations:
|
||||
# Allow this pod to be rescheduled while the node is in "critical add-ons only" mode.
|
||||
# This, along with the annotation above marks this pod as a critical add-on.
|
||||
- key: CriticalAddonsOnly
|
||||
operator: Exists
|
||||
containers:
|
||||
- env:
|
||||
- name: DP_DISABLE_HEALTHCHECKS
|
||||
value: xids
|
||||
image: nvidia/k8s-device-plugin:1.11
|
||||
name: nvidia-device-plugin-ctr
|
||||
securityContext:
|
||||
allowPrivilegeEscalation: true
|
||||
capabilities:
|
||||
drop: ["ALL"]
|
||||
volumeMounts:
|
||||
- name: device-plugin
|
||||
mountPath: /var/lib/kubelet/device-plugins
|
||||
volumes:
|
||||
- name: device-plugin
|
||||
hostPath:
|
||||
path: /var/lib/kubelet/device-plugins
|
||||
{% include "cuda/gpu.yaml" %}
|
||||
```
|
||||
|
||||
### Build the K3S image
|
||||
### Build the K3s image
|
||||
|
||||
To build the custom image we need to build K3S because we need the generated output.
|
||||
To build the custom image we need to build K3s because we need the generated output.
|
||||
|
||||
Put the following files in a directory:
|
||||
* [Dockerfile.base](https://github.com/vainkop/k3d/blob/main/docs/usage/guides/cuda/Dockerfile.base)
|
||||
* [Dockerfile.k3d-gpu](https://github.com/vainkop/k3d/blob/main/docs/usage/guides/cuda/Dockerfile.k3d-gpu)
|
||||
|
||||
* [Dockerfile.base](cuda/Dockerfile.base)
|
||||
* [Dockerfile.k3d-gpu](cuda/Dockerfile.k3d-gpu)
|
||||
* [config.toml.tmpl](cuda/config.toml.tmpl)
|
||||
* [gpu.yaml](https://github.com/vainkop/k3d/blob/main/docs/usage/guides/cuda/gpu.yaml)
|
||||
* [build.sh](https://github.com/vainkop/k3d/blob/main/docs/usage/guides/cuda/build.sh)
|
||||
* [cuda-vector-add.yaml](https://github.com/vainkop/k3d/blob/main/docs/usage/guides/cuda/cuda-vector-add.yaml)
|
||||
* [gpu.yaml](cuda/gpu.yaml)
|
||||
* [build.sh](cuda/build.sh)
|
||||
* [cuda-vector-add.yaml](cuda/cuda-vector-add.yaml)
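
Before starting the build, it can help to confirm that all six files listed above are present in the directory. The following is a minimal sketch, not part of the guide's `build.sh`; the function name `missing_build_files` is hypothetical, and it assumes the files keep the names used in the list.

```bash
#!/bin/bash

# Hypothetical helper (not part of build.sh): print the name of every
# required build file that is missing from the given directory.
# Defaults to the current directory when no argument is passed.
missing_build_files() {
  dir="${1:-.}"
  for f in Dockerfile.base Dockerfile.k3d-gpu config.toml.tmpl \
           gpu.yaml build.sh cuda-vector-add.yaml; do
    # Report the file only if it does not exist as a regular file.
    [ -f "$dir/$f" ] || printf '%s\n' "$f"
  done
}

# Example: list whatever is missing from the current directory.
missing_build_files .
```

An empty result means the directory is complete; any printed name is a file that still has to be copied in before running `build.sh`.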
|
||||
|
||||
The `build.sh` script is configured through exported environment variables and defaults to K3s `v1.21.2+k3s1`. Set `CI_REGISTRY_IMAGE` to your own registry before running it. The script performs the following steps:
|
||||
|
||||
* pulls K3S
|
||||
* builds K3S
|
||||
* pulls K3s
|
||||
* builds K3s
|
||||
* builds the custom k3d Docker image
|
||||
|
||||
The resulting image is tagged as `k3s-gpu:<version tag>`. The version tag is the git tag but the '+' sign is replaced with a '-'.
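
The tag rewrite described above is needed because Docker image tags may not contain a `+`. As a minimal sketch, the image tag can be derived from the K3s git tag instead of being maintained by hand; the function name `k3s_image_tag` is hypothetical and does not appear in `build.sh`, which sets `IMAGE_TAG` explicitly.

```bash
#!/bin/bash

# Hypothetical helper (not part of build.sh): turn a K3s git tag into a
# Docker-compatible image tag by replacing every '+' with '-'.
k3s_image_tag() {
  printf '%s\n' "$1" | tr '+' '-'
}

K3S_TAG="v1.21.2+k3s1"
IMAGE_TAG="$(k3s_image_tag "$K3S_TAG")"
echo "$IMAGE_TAG"   # prints v1.21.2-k3s1
```

Deriving `IMAGE_TAG` this way keeps it in step with `K3S_TAG` when the K3s version is bumped.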
|
||||
|
||||
[build.sh](https://github.com/vainkop/k3d/blob/main/docs/usage/guides/cuda/build.sh):
|
||||
[build.sh](cuda/build.sh):
|
||||
|
||||
```bash
|
||||
#!/bin/bash
|
||||
|
||||
export CI_REGISTRY_IMAGE="YOUR_REGISTRY_IMAGE_URL"
|
||||
export VERSION="1.0"
|
||||
export K3S_TAG="v1.21.2+k3s1"
|
||||
export DOCKER_VERSION="20.10.7"
|
||||
export IMAGE_TAG="v1.21.2-k3s1"
|
||||
export NVIDIA_CONTAINER_RUNTIME_VERSION="3.5.0-1"
|
||||
|
||||
docker build -f Dockerfile.base --build-arg DOCKER_VERSION=$DOCKER_VERSION -t $CI_REGISTRY_IMAGE/base:$VERSION . && \
|
||||
docker push $CI_REGISTRY_IMAGE/base:$VERSION
|
||||
|
||||
rm -rf ./k3s && \
|
||||
git clone --depth 1 https://github.com/rancher/k3s.git -b "$K3S_TAG" && \
|
||||
docker run -ti -v ${PWD}/k3s:/k3s -v /var/run/docker.sock:/var/run/docker.sock $CI_REGISTRY_IMAGE/base:$VERSION sh -c "cd /k3s && make" && \
|
||||
ls -al k3s/build/out/data.tar.zst
|
||||
|
||||
if [ -f k3s/build/out/data.tar.zst ]; then
|
||||
echo "File exists! Building!"
|
||||
docker build -f Dockerfile.k3d-gpu \
|
||||
--build-arg NVIDIA_CONTAINER_RUNTIME_VERSION=$NVIDIA_CONTAINER_RUNTIME_VERSION \
|
||||
-t $CI_REGISTRY_IMAGE:$IMAGE_TAG . && \
|
||||
docker push $CI_REGISTRY_IMAGE:$IMAGE_TAG
|
||||
echo "Done!"
|
||||
else
|
||||
echo "Error, file does not exist!"
|
||||
exit 1
|
||||
fi
|
||||
|
||||
docker build -t $CI_REGISTRY_IMAGE:$IMAGE_TAG .
|
||||
{% include "cuda/build.sh" %}
|
||||
```
|
||||
|
||||
## Run and test the custom image with Docker
|
||||
@ -313,7 +88,7 @@ You can run a container based on the new image with Docker:
|
||||
docker run --name k3s-gpu -d --privileged --gpus all $CI_REGISTRY_IMAGE:$IMAGE_TAG
|
||||
```
|
||||
|
||||
Deploy a [test pod](https://github.com/vainkop/k3d/blob/main/docs/usage/guides/cuda/cuda-vector-add.yaml):
|
||||
Deploy a [test pod](cuda/cuda-vector-add.yaml):
|
||||
|
||||
```bash
|
||||
docker cp cuda-vector-add.yaml k3s-gpu:/cuda-vector-add.yaml
|
||||
@ -329,7 +104,7 @@ You can use the image with k3d:
|
||||
k3d cluster create local --image=$CI_REGISTRY_IMAGE:$IMAGE_TAG --gpus=1
|
||||
```
|
||||
|
||||
Deploy a [test pod](https://github.com/vainkop/k3d/blob/main/docs/usage/guides/cuda/cuda-vector-add.yaml):
|
||||
Deploy a [test pod](cuda/cuda-vector-add.yaml):
|
||||
|
||||
```bash
|
||||
kubectl apply -f cuda-vector-add.yaml
|
||||
@ -346,10 +121,11 @@ Most of the information in this article was obtained from various sources:
|
||||
|
||||
* [Add NVIDIA GPU support to k3s with containerd](https://dev.to/mweibel/add-nvidia-gpu-support-to-k3s-with-containerd-4j17)
|
||||
* [microk8s](https://github.com/ubuntu/microk8s)
|
||||
* [K3S](https://github.com/rancher/k3s)
|
||||
* [K3s](https://github.com/rancher/k3s)
|
||||
* [k3s-gpu](https://gitlab.com/vainkop1/k3s-gpu)
|
||||
|
||||
## Authors
|
||||
|
||||
- [@markrexwinkel](https://github.com/markrexwinkel)
|
||||
- [@vainkop](https://github.com/vainkop)
|
||||
* [@markrexwinkel](https://github.com/markrexwinkel)
|
||||
* [@vainkop](https://github.com/vainkop)
|
||||
* [@iwilltry42](https://github.com/iwilltry42)
|
||||
|
@ -70,6 +70,7 @@ plugins:
  - git-revision-date-localized: # https://squidfunk.github.io/mkdocs-material/plugins/revision-date/
      type: date
  - awesome-pages # https://squidfunk.github.io/mkdocs-material/plugins/awesome-pages/
  - include-markdown # https://github.com/mondeja/mkdocs-include-markdown-plugin

# Other Settings
strict: true # halt processing when a warning is raised