Installing NVIDIA GPU Operator - v23.9.1

In this post we install NVIDIA GPU Operator v23.9.1, the latest version as of February 14, 2024, on Kubernetes using Helm.
The Kubernetes cluster is configured as follows.
ubuntu@master1:~$ k get nodes -o wide
NAME      STATUS   ROLES           AGE    VERSION   INTERNAL-IP      EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION      CONTAINER-RUNTIME
master1   Ready    control-plane   137m   v1.26.5   192.168.10.102   <none>        Ubuntu 22.04.3 LTS   5.15.0-94-generic   containerd://1.6.4
worker1   Ready    worker          136m   v1.26.5   192.168.10.103   <none>        Ubuntu 22.04.3 LTS   5.15.0-94-generic   containerd://1.6.4
worker2   Ready    worker          136m   v1.26.5   192.168.10.104   <none>        Ubuntu 22.04.3 LTS   5.15.0-94-generic   containerd://1.6.4
Checking the GPU
First, identify the node that has a GPU installed. In my case, worker2 has an NVIDIA Quadro RTX 8000.
ubuntu@worker2:~$ lspci | grep -i nvidia
00:10.0 VGA compatible controller: NVIDIA Corporation TU102GL [Quadro RTX 6000/8000] (rev a1)
00:10.1 Audio device: NVIDIA Corporation TU102 High Definition Audio Controller (rev a1)
00:10.2 USB controller: NVIDIA Corporation TU102 USB 3.1 Host Controller (rev a1)
00:10.3 Serial bus controller: NVIDIA Corporation TU102 USB Type-C UCSI Controller (rev a1)
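When there are several nodes to check, reading lspci output by eye gets tedious. The check above could be wrapped in a small helper like the one below. This is only a minimal sketch: the function name count_nvidia_gpus is hypothetical, and it assumes a GPU shows up as a "VGA compatible controller" or "3D controller" line, as in the output above.

```shell
# Count NVIDIA display devices in lspci output read from stdin.
# count_nvidia_gpus is a hypothetical helper name, not part of any tool.
count_nvidia_gpus() {
  # Keep only display-class lines so that the audio, USB and UCSI
  # functions of the same card are not counted as extra GPUs.
  grep -i 'nvidia' | grep -Eci 'VGA compatible controller|3D controller'
}

# Usage on a node:
#   lspci | count_nvidia_gpus
```

Note that grep -c exits with a non-zero status when the count is 0, so a script running under "set -e" should account for that.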
Installing the GPU Driver
Install the GPU driver on the node before deploying NVIDIA GPU Operator.
ubuntu@worker2:~$ sudo apt install nvidia-driver-535 -y
ubuntu@worker2:~$ sudo nvidia-smi
Wed Feb 14 05:01:26 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.154.05             Driver Version: 535.154.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Quadro RTX 8000                Off | 00000000:00:10.0 Off |                    0 |
| 32%   52C    P2              49W / 260W |      0MiB / 46080MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+
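For a scripted check, nvidia-smi can also print only the driver version instead of the full table. The sketch below extracts the major version so it can be compared with the 535 branch installed above. The helper names driver_major and driver_at_least are hypothetical, and the threshold passed to driver_at_least is whatever you choose, not a requirement stated by NVIDIA.

```shell
# Print the major driver version from version text on stdin,
# for example "535.154.05" becomes "535".
driver_major() {
  head -n 1 | cut -d. -f1
}

# Succeed when the major version read from stdin is at least $1.
# An empty input is treated as version 0, so the check fails.
driver_at_least() {
  min="$1"
  major=$(driver_major)
  [ "${major:-0}" -ge "$min" ]
}

# Usage on the GPU node:
#   nvidia-smi --query-gpu=driver_version --format=csv,noheader | driver_major
#   nvidia-smi --query-gpu=driver_version --format=csv,noheader | driver_at_least 535
```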
Installing NVIDIA GPU Operator
Install NVIDIA GPU Operator with Helm. Because I already installed the GPU driver on the node, I exclude the driver from the Helm options (driver.enabled=false).
ubuntu@master1:~$ kubectl taint node worker2 nvidia.com/gpu=present:NoSchedule
node/worker2 tainted
ubuntu@master1:~$ helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
"nvidia" has been added to your repositories
ubuntu@master1:~$ helm repo update nvidia
Hang tight while we grab the latest from your chart repositories...
...Successfully got an update from the "nvidia" chart repository
Update Complete. ⎈Happy Helming!⎈
ubuntu@master1:~$ helm upgrade gpu-operator nvidia/gpu-operator \
    --install \
    --namespace gpu-operator \
    --create-namespace \
    --version v23.9.1 \
    --set driver.enabled=false \
    --set toolkit.enabled=true \
    --set node-feature-discovery.enableNodeFeatureApi=true
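Before running a workload, it can help to confirm that the operator pods are up and that the GPU node actually advertises the nvidia.com/gpu resource. The following is a rough sketch rather than an official procedure: the helper names gpu_allocatable and operator_ready are hypothetical, and the node name worker2 and the gpu-operator namespace come from my setup above.

```shell
# Print the number of allocatable nvidia.com/gpu resources on node $1.
# Prints nothing if the node does not advertise the resource yet.
gpu_allocatable() {
  kubectl get node "$1" -o jsonpath='{.status.allocatable.nvidia\.com/gpu}'
}

# Succeed when node $1 advertises at least one GPU.
operator_ready() {
  count=$(gpu_allocatable "$1")
  [ "${count:-0}" -ge 1 ]
}

# Usage from the control-plane node:
#   kubectl -n gpu-operator get pods
#   operator_ready worker2 && echo "worker2 advertises a GPU"
```

The resource usually appears only after the device plugin pod on the node is running, so an empty result right after the Helm install may just mean the pods are still starting.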
Testing NVIDIA GPU Operator
To check that GPU Operator works, deploy a test Pod that requests one GPU.
ubuntu@master1:~$ cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  restartPolicy: Never
  containers:
    - name: cuda-container
      image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda10.2
      resources:
        limits:
          nvidia.com/gpu: 1 # requesting 1 GPU
  tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule
EOF
pod/gpu-pod created
ubuntu@master1:~$ kubectl logs gpu-pod
[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done
If "Test PASSED" appears as shown above, the installation succeeded.
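The same check can be scripted so that it waits for the Pod to finish before reading the log. This is a minimal sketch under a few assumptions: gpu_test_passed is a hypothetical helper name, the 120s timeout is an arbitrary value I picked, and kubectl wait with a jsonpath condition requires kubectl v1.23 or newer.

```shell
# Wait for pod $1 to reach the Succeeded phase, then look for
# "Test PASSED" in its log. Returns non-zero on timeout or failure.
gpu_test_passed() {
  pod="$1"
  # 120s is an assumed timeout; the first run may need longer
  # because the sample image has to be pulled.
  kubectl wait --for=jsonpath='{.status.phase}'=Succeeded \
    "pod/$pod" --timeout=120s >/dev/null || return 1
  kubectl logs "$pod" | grep -q 'Test PASSED'
}

# Usage: verify the result, then clean up the test Pod.
#   gpu_test_passed gpu-pod && kubectl delete pod gpu-pod
```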
Thank you.
[베이넥스] DX Business Headquarters, Solution Business Division - 김규현