Installing the NVIDIA GPU Operator - v23.9.1

GPU OPERATOR

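In this post I install NVIDIA GPU Operator v23.9.1, the latest release as of February 14, 2024, on Kubernetes (K8s) using Helm.
The K8s cluster is laid out as follows (k in the command below is a shell alias for kubectl).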

ubuntu@master1:~$ k get nodes -o wide
NAME      STATUS   ROLES           AGE    VERSION   INTERNAL-IP      EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION      CONTAINER-RUNTIME
master1   Ready    control-plane   137m   v1.26.5   192.168.10.102   <none>        Ubuntu 22.04.3 LTS   5.15.0-94-generic   containerd://1.6.4
worker1   Ready    worker          136m   v1.26.5   192.168.10.103   <none>        Ubuntu 22.04.3 LTS   5.15.0-94-generic   containerd://1.6.4
worker2   Ready    worker          136m   v1.26.5   192.168.10.104   <none>        Ubuntu 22.04.3 LTS   5.15.0-94-generic   containerd://1.6.4

Checking the GPU

First, identify the node that has a GPU installed. In my case, worker2 has an NVIDIA Quadro RTX 8000.

ubuntu@worker2:~$ lspci | grep -i nvidia
00:10.0 VGA compatible controller: NVIDIA Corporation TU102GL [Quadro RTX 6000/8000] (rev a1)
00:10.1 Audio device: NVIDIA Corporation TU102 High Definition Audio Controller (rev a1)
00:10.2 USB controller: NVIDIA Corporation TU102 USB 3.1 Host Controller (rev a1)
00:10.3 Serial bus controller: NVIDIA Corporation TU102 USB Type-C UCSI Controller (rev a1)
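
The lspci output above lists four functions for a single card, so a plain line count overstates the number of GPUs. The following is a minimal sketch, not part of the original procedure, that counts only the display functions. The helper name count_nvidia_gpus is hypothetical, and the sketch assumes the GPU shows up as a "VGA compatible controller" or "3D controller".

```shell
# Count NVIDIA display functions in lspci-style output read from stdin.
# Only "VGA compatible controller" and "3D controller" lines are counted,
# so the audio and USB functions of the same card are ignored.
# count_nvidia_gpus is a hypothetical helper name used for illustration.
count_nvidia_gpus() {
  grep -i 'nvidia' | grep -Eci 'VGA compatible controller|3D controller' || true
}

# Run against the real hardware only when lspci is available.
if command -v lspci >/dev/null 2>&1; then
  echo "NVIDIA GPUs on ${HOSTNAME:-this host}: $(lspci | count_nvidia_gpus)"
fi
```

Run on worker2, the lspci listing shown above would be counted as one GPU.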

Installing the GPU Driver

Install the GPU driver on the GPU node before deploying the NVIDIA GPU Operator.

ubuntu@worker2:~$ sudo apt install nvidia-driver-535 -y
ubuntu@worker2:~$ sudo nvidia-smi
Wed Feb 14 05:01:26 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.154.05             Driver Version: 535.154.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Quadro RTX 8000                Off | 00000000:00:10.0 Off |                    0 |
| 32%   52C    P2              49W / 260W |      0MiB / 46080MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+
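
Before moving on to Helm, it can help to confirm from a script that the driver is actually loaded. The sketch below is an optional check under the assumption that nvidia-smi is on the PATH of the GPU node. The helper name driver_major is hypothetical. The --query-gpu and --format options are standard nvidia-smi flags.

```shell
# Extract the driver branch (e.g. 535) from a version string such as 535.154.05.
# driver_major is a hypothetical helper name used for illustration.
driver_major() {
  printf '%s\n' "${1%%.*}"
}

if command -v nvidia-smi >/dev/null 2>&1; then
  # Ask nvidia-smi for the driver version only, one line per GPU.
  version="$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1)"
  echo "Driver ${version} (branch $(driver_major "$version")) is loaded"
else
  echo "nvidia-smi not found: the driver is not installed on this host" >&2
fi
```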

Installing the NVIDIA GPU Operator

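Install the NVIDIA GPU Operator with Helm. Because I already installed the GPU driver on the host, I exclude it from the Helm options (driver.enabled=false). I also taint worker2 first so that only Pods that tolerate nvidia.com/gpu are scheduled on the GPU node.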

ubuntu@master1:~$ kubectl taint node worker2 nvidia.com/gpu=present:NoSchedule
node/worker2 tainted
ubuntu@master1:~$ helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
"nvidia" has been added to your repositories
ubuntu@master1:~$ helm repo update nvidia
Hang tight while we grab the latest from your chart repositories...
...Successfully got an update from the "nvidia" chart repository
Update Complete. ⎈Happy Helming!⎈
ubuntu@master1:~$ helm upgrade gpu-operator nvidia/gpu-operator \
  --install \
  --namespace gpu-operator \
  --create-namespace \
  --version v23.9.1 \
  --set driver.enabled=false \
  --set toolkit.enabled=true \
  --set node-feature-discovery.enableNodeFeatureApi=true
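
Without the --wait flag, helm returns as soon as the resources are created, so the operator Pods may still be starting at this point. The sketch below is one way to check the result, assuming kubectl is configured for this cluster. The function names has_allocatable_gpu and check_gpu_node are hypothetical. The jsonpath expression reads the nvidia.com/gpu entry from the node's allocatable resources.

```shell
# Succeed when the given allocatable value is an integer of at least 1.
# has_allocatable_gpu and check_gpu_node are hypothetical helper names.
has_allocatable_gpu() {
  case "$1" in
    ''|*[!0-9]*) return 1 ;;   # empty or non-numeric: resource not advertised
  esac
  [ "$1" -ge 1 ]
}

# List the operator Pods and report whether the node advertises nvidia.com/gpu.
check_gpu_node() {
  node="${1:-worker2}"   # worker2 is the GPU node used in this post
  kubectl get pods -n gpu-operator
  count="$(kubectl get node "$node" -o jsonpath='{.status.allocatable.nvidia\.com/gpu}')"
  if has_allocatable_gpu "$count"; then
    echo "$node advertises $count nvidia.com/gpu"
  else
    echo "$node does not advertise nvidia.com/gpu yet" >&2
    return 1
  fi
}
```

Calling check_gpu_node worker2 on master1 should report at least one GPU once the operator Pods are ready. If it does not, waiting a little and running it again is a reasonable first step.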

Testing the NVIDIA GPU Operator

To check that the GPU Operator works correctly, deploy a test Pod that requests one GPU.

ubuntu@master1:~$ cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  restartPolicy: Never
  containers:
    - name: cuda-container
      image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda10.2
      resources:
        limits:
          nvidia.com/gpu: 1 # requesting 1 GPU
  tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule
EOF
pod/gpu-pod created
ubuntu@master1:~$ kubectl logs gpu-pod
[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done

If "Test PASSED" appears in the Pod log as shown above, the GPU Operator is installed correctly.
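
The log check can also be scripted. The sketch below waits for the test Pod to finish, looks for the "Test PASSED" line, and deletes the Pod afterwards. It assumes kubectl v1.23 or later, since kubectl wait --for=jsonpath is not available in earlier versions. The function names log_passed and verify_gpu_pod are hypothetical, and the 120-second timeout is an arbitrary choice.

```shell
# Succeed when the log text on stdin contains the success marker of the sample.
# log_passed and verify_gpu_pod are hypothetical helper names.
log_passed() {
  grep -q 'Test PASSED'
}

# Wait for the Pod to complete, check its log, then clean it up.
verify_gpu_pod() {
  pod="${1:-gpu-pod}"   # Pod name from the manifest in this post
  kubectl wait --for=jsonpath='{.status.phase}'=Succeeded "pod/$pod" --timeout=120s || return 1
  if kubectl logs "$pod" | log_passed; then
    echo "GPU test passed"
    kubectl delete pod "$pod"
  else
    echo "GPU test failed; keeping $pod for inspection" >&2
    return 1
  fi
}
```

Running verify_gpu_pod gpu-pod on master1 after applying the manifest covers the whole test step in one call.
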
Thank you.

[베이넥스] DX Business Headquarters, Solutions Division - 김규현