
kubernetes 1.31 DRA Nvidia GPU Test

· 4 min read
Software Developer

We use an Nvidia GPU to validate the DRA-related capabilities.

Original post: [kubernetes 1.31 DRA Nvidia GPU Test](https://mp.weixin.qq.com/s/PZ6FYLlr9eveqL0M0q4rmg)

OS: Ubuntu 22.04

Runtime: Docker

Install Docker

  1. Install Docker with the convenience script
$ curl -fsSL https://get.docker.com -o get-docker.sh
$ sudo sh get-docker.sh

Install kind

# For AMD64 / x86_64
[ $(uname -m) = x86_64 ] && curl -Lo ./kind https://kind.sigs.k8s.io/dl/v0.25.0/kind-linux-amd64
chmod +x ./kind
sudo mv ./kind /usr/local/bin/kind

Install the NVIDIA Container Toolkit via apt

  1. Configure the repository
$ curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
&& curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
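The `sed` in the pipeline above rewrites each `deb` line so apt verifies packages against the keyring downloaded in the previous command. A minimal sketch of the transformation, run on a sample repository line (the exact contents of the upstream `.list` file are an assumption here):

```shell
# Sample line in the shape of nvidia-container-toolkit.list (assumed)
line='deb https://nvidia.github.io/libnvidia-container/stable/deb/$(ARCH) /'

# Same substitution as in the install step above
echo "$line" | sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g'
# -> deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://nvidia.github.io/libnvidia-container/stable/deb/$(ARCH) /
```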
  1. Update the apt package index
$ sudo apt-get update
  1. Install the NVIDIA Container Toolkit
$ sudo apt-get install -y nvidia-container-toolkit

Configure the Docker runtime

  1. Configure the NVIDIA container runtime as the default Docker runtime
$ sudo nvidia-ctk runtime configure --runtime=docker --set-as-default
  1. Restart Docker to apply the change
$ sudo systemctl restart docker
  1. Configure the NVIDIA container runtime to use volume mounts to select which devices to inject into a container
# /etc/nvidia-container-runtime/config.toml
sudo nvidia-ctk config --in-place --set accept-nvidia-visible-devices-as-volume-mounts=true
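After this step, the setting should appear as a top-level key in `/etc/nvidia-container-runtime/config.toml` (sketch below; all other keys omitted). With this option on, devices are selected by mounting paths under `/var/run/nvidia-container-devices/` into the container instead of via the `NVIDIA_VISIBLE_DEVICES` environment variable, which is the mechanism the kind-based DRA demo relies on.

```toml
# /etc/nvidia-container-runtime/config.toml (excerpt)
accept-nvidia-visible-devices-as-volume-mounts = true
```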

Install the k8s cluster and the Nvidia DRA driver

  1. Clone the repository and enter it
$ git clone https://github.com/NVIDIA/k8s-dra-driver.git
$ cd k8s-dra-driver
  1. Create a cluster with kind to run the demo
$ ./demo/clusters/kind/create-cluster.sh
  1. Install kubectl and helm
# Install kubectl
$ curl -LO "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl"
$ sudo install -o root -g root -m 0755 kubectl /usr/local/bin/kubectl
# Install helm
$ curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3
$ chmod 700 get_helm.sh
$ ./get_helm.sh

  1. Build the Nvidia DRA driver image
$ ./demo/clusters/kind/build-dra-driver.sh
# If the command above fails, run this first, then retry
$ make build-image
# Load the built image into the kind cluster
$ kind load docker-image nvcr.io/nvidia/cloud-native/k8s-dra-driver:v0.1.0-ubuntu20.04 --name k8s-dra-driver-cluster
  1. Install the DRA driver into the cluster
$ ./demo/clusters/kind/install-dra-driver.sh

After a successful installation, you should see the two pods of the driver running in the nvidia-dra-driver namespace:

$ kubectl get pods -n nvidia-dra-driver
NAME                                         READY   STATUS    RESTARTS   AGE
nvidia-k8s-dra-driver-kubelet-plugin-t5qgz   1/1     Running   0          44s

Run the demos

Case 1: two containers in the same Pod share one device

$ kubectl apply --filename=demo/specs/quickstart/gpu-test2.yaml

Case 2: two Pods share the same GPU

$ kubectl apply --filename=demo/specs/quickstart/gpu-test3.yaml

Case 3: two Pods share a GPU of the same type, selecting a specific model

---
apiVersion: v1
kind: Namespace
metadata:
  name: gpu-test3

---
apiVersion: resource.k8s.io/v1alpha3
kind: ResourceClaim
metadata:
  namespace: gpu-test3
  name: single-gpu
spec:
  devices:
    requests:
    - name: gpu
      deviceClassName: gpu.nvidia.com
      selectors:
      - cel:
          expression: |
            device.attributes['gpu.nvidia.com'].productName == 'Tesla T4'

---
apiVersion: v1
kind: Pod
metadata:
  namespace: gpu-test3
  name: pod1
  labels:
    app: pod
spec:
  containers:
  - name: ctr
    image: ubuntu:22.04
    command: ["bash", "-c"]
    args: ["nvidia-smi -L; trap 'exit 0' TERM; sleep 9999 & wait"]
    resources:
      claims:
      - name: shared-gpu
  resourceClaims:
  - name: shared-gpu
    resourceClaimName: single-gpu

---
apiVersion: v1
kind: Pod
metadata:
  namespace: gpu-test3
  name: pod2
  labels:
    app: pod
spec:
  containers:
  - name: ctr
    image: ubuntu:22.04
    command: ["bash", "-c"]
    args: ["nvidia-smi -L; trap 'exit 0' TERM; sleep 9999 & wait"]
    resources:
      claims:
      - name: shared-gpu
  resourceClaims:
  - name: shared-gpu
    resourceClaimName: single-gpu
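The container command in both Pod specs uses a common keep-alive pattern: `sleep` is put in the background and the shell blocks in `wait`, so a `TERM` trap runs as soon as the kubelet sends the signal. A foreground `sleep 9999` would delay signal handling until the sleep finished. A standalone sketch of the pattern (a shorter sleep, and `nvidia-smi` omitted since it needs a GPU; killing the background sleep in the trap is an addition here to avoid leaving a stray process):

```shell
#!/bin/sh
# Reproduce the Pod's keep-alive pattern in a subshell, then send it
# TERM from the parent, as the kubelet would on `kubectl delete pod`.
(
  sleep 30 &
  child=$!
  trap 'kill "$child"; exit 0' TERM   # handle TERM promptly
  wait "$child"
) &
pid=$!
sleep 1
kill -TERM "$pid"
wait "$pid"
echo "exit code: $?"   # prints "exit code: 0" - graceful shutdown
```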