Files

Ignoring revisions in .git-blame-ignore-revs. Click here to bypass and see the normal blame view.

257 lines
6.4 KiB
Markdown
Raw Permalink Normal View History

2025-10-09 14:15:47 +02:00
# Nvidia GPU Support
> Note: this article assumes `services.k3s.enable = true;` is already set
## Enable the Nvidia driver
```
hardware.nvidia = {
open = true;
package = config.boot.kernelPackages.nvidiaPackages.stable; # change to match your kernel
nvidiaSettings = true;
};
# Hack for getting the nvidia driver recognized
services.xserver = {
enable = false;
videoDrivers = [ "nvidia" ];
};
nixpkgs.config.allowUnfreePredicate = pkg: builtins.elem (lib.getName pkg) [
"nvidia-x11"
"nvidia-settings"
];
```
Also, enable the Nvidia container toolkit:
```
hardware.nvidia-container-toolkit.enable = true;
hardware.nvidia-container-toolkit.mount-nvidia-executables = true;
environment.systemPackages = with pkgs; [
nvidia-container-toolkit
];
```
Rebuild your NixOS configuration.
### Verify that the GPU is accessible
Use the following command to ensure the GPU is accessible:
```
nvidia-smi
```
If there is an error in the output, a reboot may be required for the driver to be assigned to the GPU.
Additionally, `lspci -k` can be used to ensure the driver has been assigned to the GPU:
```
# lspci -k | grep -i nvidia
01:00.0 VGA compatible controller: NVIDIA Corporation TU106 [GeForce RTX 2060 Rev. A] (rev a1)
Kernel driver in use: nvidia
Kernel modules: nvidiafb, nouveau, nvidia_drm, nvidia
```
## Configure k3s
You now need to create a new file in `/var/lib/rancher/k3s/agent/etc/containerd/config.toml.tmpl` with the following
```
{{ template "base" . }}
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
privileged_without_host_devices = false
runtime_engine = ""
runtime_root = ""
runtime_type = "io.containerd.runc.v2"
```
Now apply the following runtime class to k3s cluster:
```yaml
apiVersion: node.k8s.io/v1
handler: nvidia
kind: RuntimeClass
metadata:
labels:
app.kubernetes.io/component: gpu-operator
name: nvidia
```
Restart k3s:
```
systemctl restart k3s.service
```
Ensure that the Nvidia runtime is detected by k3s:
```
grep nvidia /var/lib/rancher/k3s/agent/etc/containerd/config.toml
```
Apply the DaemonSet in the [generic-cdi-plugin README](https://github.com/OlfillasOdikno/generic-cdi-plugin):
```
apiVersion: v1
kind: Namespace
metadata:
name: generic-cdi-plugin
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
name: generic-cdi-plugin-daemonset
namespace: generic-cdi-plugin
spec:
selector:
matchLabels:
name: generic-cdi-plugin
template:
metadata:
labels:
name: generic-cdi-plugin
app.kubernetes.io/component: generic-cdi-plugin
app.kubernetes.io/name: generic-cdi-plugin
spec:
containers:
- image: ghcr.io/olfillasodikno/generic-cdi-plugin:main
name: generic-cdi-plugin
command:
- /generic-cdi-plugin
- /var/run/cdi/nvidia-container-toolkit.json
imagePullPolicy: Always
securityContext:
privileged: true
tty: true
volumeMounts:
- name: kubelet
mountPath: /var/lib/kubelet
- name: nvidia-container-toolkit
mountPath: /var/run/cdi/nvidia-container-toolkit.json
volumes:
- name: kubelet
hostPath:
path: /var/lib/kubelet
- name: nvidia-container-toolkit
hostPath:
path: /var/run/cdi/nvidia-container-toolkit.json
affinity:
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
- key: "nixos-nvidia-cdi"
operator: In
values:
- "enabled"
```
Apply the following node label (replace `#CHANGEME` with your node name):
```
kind: Node
apiVersion: v1
metadata:
name: #CHANGEME
labels:
nixos-nvidia-cdi: enabled
```
Now, GPU-enabled pods can be run with this configuration:
```
spec:
runtimeClassName: nvidia
containers:
resources:
requests:
nvidia.com/gpu-all: "1"
limits:
nvidia.com/gpu-all: "1"
```
### Test pod
This is a complete pod configuration for reference/testing:
```
---
apiVersion: v1
kind: Pod
metadata:
name: gpu-test
namespace: default
spec:
runtimeClassName: nvidia # <- THIS FOR GPU
containers:
- name: gpu-test
image: nvidia/cuda:12.6.3-base-ubuntu22.04
command: [ "/bin/bash", "-c", "--" ]
args: [ "while true; do sleep 30; done;" ]
env:
- name: NVIDIA_VISIBLE_DEVICES
value: all
- name: NVIDIA_DRIVER_CAPABILITIES
value: all
resources: # <- THIS FOR GPU
requests:
nvidia.com/gpu-all: "1"
limits:
nvidia.com/gpu-all: "1"
```
Once the pod is running, use the following command to test that the GPU was detected:
```
kubectl exec -n default -it pod/gpu-test -- nvidia-smi
```
If successful, the output will look like the following:
```
Thu Sep 25 04:17:42 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.82.09 Driver Version: 580.82.09 CUDA Version: 13.0 |
+-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GeForce RTX 2060 Off | 00000000:01:00.0 On | N/A |
| 0% 36C P8 10W / 190W | 104MiB / 6144MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+
```