Update Using NVIDIA GPU Resources.md and add NVIDIA GPU Operator install guide and update docs-repo structure
leoho0722 committed Nov 30, 2024
1 parent 87449e8 commit 7c0a9d4
Showing 9 changed files with 162 additions and 23 deletions.
14 changes: 9 additions & 5 deletions docs/Writerside/hi.tree
@@ -7,17 +7,21 @@
start-page="Home.md">

<toc-element topic="Home.md">
<toc-element topic="airflow-mnist-example.md">
<toc-element topic="airflow-mnist-example-Install-on-Ubuntu.md"/>
<toc-element topic="GitHub-Repository-Example-Guide.md">
<toc-element topic="airflow-mnist-example.md">
<toc-element topic="airflow-mnist-example-Install-on-Ubuntu.md"/>
</toc-element>
<toc-element topic="mnist-openfaas-example.md">
<toc-element topic="mnist-openfaas-example-Install-Install-on-Ubuntu.md"/>
</toc-element>
</toc-element>
<toc-element topic="mnist-openfaas-example.md">
<toc-element topic="mnist-openfaas-example-Install-Install-on-Ubuntu.md"/>
<toc-element topic="NVIDIA-GPU.md">
<toc-element topic="Using-NVIDIA-GPU-Resources.md"/>
</toc-element>
<toc-element topic="kubernetes.md">
<toc-element topic="Kubeflow.md">
<toc-element topic="Install-Kubeflow-1-6-1.md"/>
</toc-element>
<toc-element topic="Using-NVIDIA-GPU-Resources-on-Kubernetes.md"/>
</toc-element>
<toc-element topic="it-ironman.md">
<toc-element topic="it16th.md">
8 changes: 8 additions & 0 deletions docs/Writerside/topics/GitHub-Repository-Example-Guide.md
@@ -0,0 +1,8 @@
# GitHub Repository Example Guide

## Contents

* [airflow-mnist-example](airflow-mnist-example.md)
* [GitHub Repository](https://github.com/leoho0722/airflow-mnist-example)
* [mnist-openfaas-example](mnist-openfaas-example.md)
* [GitHub Repository](https://github.com/leoho0722/mnist-openfaas-example)
6 changes: 2 additions & 4 deletions docs/Writerside/topics/Home.md
@@ -2,9 +2,7 @@

## Contents

* [airflow-mnist-example](airflow-mnist-example.md)
* [GitHub Repository](https://github.com/leoho0722/airflow-mnist-example)
* [mnist-openfaas-example](mnist-openfaas-example.md)
* [GitHub Repository](https://github.com/leoho0722/mnist-openfaas-example)
* [GitHub Repository Example Guide](GitHub-Repository-Example-Guide.md)
* [NVIDIA GPU](NVIDIA-GPU.md)
* [Kubernetes](kubernetes.md)
* [IT 邦鐵人賽](it-ironman.md)
Original file line number Diff line number Diff line change
@@ -1,4 +1,4 @@
# Using NVIDIA GPU Resources on Kubernetes
# Using NVIDIA GPU Resources

**Table of Contents**

@@ -10,12 +10,15 @@
- [Check NVIDIA CUDA](#check-nvidia-cuda)
- [Install NVIDIA cuDNN](#install-nvidia-cudnn)
- [Install DKMS](#install-dkms)
- [Install NVIDIA Container Toolkit](#install-nvidia-container-toolkit)
- [Installing with Apt](#installing-with-apt)
- [Configuration](#configuration)
- [Configuring Docker](#configuring-docker)
- [Configuring containerd (for Kubernetes)](#configuring-containerd-for-kubernetes)
- [Install Kubernetes NVIDIA Device Plugin](#install-kubernetes-nvidia-device-plugin)
- [Using NVIDIA GPU resources on Docker](#using-nvidia-gpu-resources-on-docker)
- [Install NVIDIA Container Toolkit](#install-nvidia-container-toolkit)
- [Installing with Apt](#installing-with-apt)
- [Configuration](#configuration)
- [Configuring Docker](#configuring-docker)
- [Configuring containerd (for Kubernetes)](#configuring-containerd-for-kubernetes)
- [Using NVIDIA GPU resources on Kubernetes](#using-nvidia-gpu-resources-on-kubernetes)
- [Method 1: Install NVIDIA GPU Operator](#method-1-install-nvidia-gpu-operator)
- [Method 2: Install Kubernetes NVIDIA Device Plugin](#method-2-install-kubernetes-nvidia-device-plugin)
- [Check Pod can run GPU Jobs or not](#check-pod-can-run-gpu-jobs-or-not)
- [Check node can use GPU resource or not](#check-node-can-use-gpu-resource-or-not)

@@ -156,11 +159,13 @@ sudo apt install -y dkms
sudo dkms install -m nvidia -v <NVIDIA Driver Version>
```
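
The `<NVIDIA Driver Version>` placeholder above has to match a driver source tree that DKMS can find. The following is a minimal sketch, assuming the driver sources follow the conventional DKMS layout `/usr/src/nvidia-<version>`; the helper name `list_nvidia_dkms_versions` is hypothetical and not part of the original guide.

```Shell
# Hypothetical helper: list driver versions that have a DKMS source tree.
# Assumes the conventional layout <dir>/nvidia-<version>; defaults to /usr/src.
list_nvidia_dkms_versions() {
  src_dir="${1:-/usr/src}"
  for path in "$src_dir"/nvidia-*; do
    [ -d "$path" ] || continue            # skip when the glob matches nothing
    name="${path##*/}"                    # e.g. nvidia-550.127.05
    version="${name#nvidia-}"             # e.g. 550.127.05
    case "$version" in
      [0-9]*) printf '%s\n' "$version" ;; # keep only numeric versions
    esac
  done
}
```

Running `list_nvidia_dkms_versions` with no argument prints one candidate version per line, and any of them can be passed to `sudo dkms install -m nvidia -v`. `dkms status` is another way to see which modules DKMS already knows about.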

## Install NVIDIA Container Toolkit
## Using NVIDIA GPU resources on Docker

### Install NVIDIA Container Toolkit

[NVIDIA Container Toolkit Official Installation Guide](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html)

### Installing with Apt
#### Installing with Apt

1. Configure the production repository

@@ -185,9 +190,9 @@ sudo apt-get install -y nvidia-container-toolkit

![Screenshot 2024-04-12 17.47.34.png](截圖_2024-04-12_17.47.34.png)

### Configuration
#### Configuration

#### Configuring Docker
##### Configuring Docker

```Shell
sudo nvidia-ctk runtime configure --runtime=docker
@@ -228,7 +233,7 @@ sudo systemctl daemon-reload
sudo systemctl restart docker
```

#### Configuring containerd (for Kubernetes)
##### Configuring containerd (for Kubernetes)

Before running the NVIDIA containerd configuration command for Kubernetes, first copy the original containerd `config.toml` file (in `/etc/containerd`) to the current directory.

@@ -302,7 +307,90 @@ Finally, restart containerd service
sudo systemctl restart containerd
```
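
After restarting the service, it can be useful to confirm that the configuration file really references the NVIDIA runtime. This is a minimal sketch under the assumption that `nvidia-ctk runtime configure` writes a runtime entry pointing at the `nvidia-container-runtime` binary; the helper name `config_mentions_nvidia_runtime` is hypothetical.

```Shell
# Hypothetical helper: report whether a runtime config file references
# the NVIDIA container runtime binary.
# Returns 0 if found, 1 if not found, 2 if the file cannot be read.
config_mentions_nvidia_runtime() {
  config_file="$1"
  [ -r "$config_file" ] || return 2
  grep -q 'nvidia-container-runtime' "$config_file"
}
```

For example, `config_mentions_nvidia_runtime /etc/containerd/config.toml` checks the containerd setup, and the same call with `/etc/docker/daemon.json` checks the Docker setup.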

## Install Kubernetes NVIDIA Device Plugin
## Using NVIDIA GPU resources on Kubernetes

### Method 1: Install NVIDIA GPU Operator {collapsible="true"}

[NVIDIA GPU Operator Official Documentation](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/getting-started.html)

#### Prerequisites: Install Helm

```Shell
curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/master/scripts/get-helm-3 \
&& chmod 700 get_helm.sh \
&& ./get_helm.sh
```
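
Before moving on, it may help to confirm that the tools used in the following steps are actually on `PATH`. A minimal sketch, where `require_cmd` is a hypothetical helper name and not part of Helm or the original guide:

```Shell
# Hypothetical helper: succeed only if the named command is on PATH.
require_cmd() {
  if command -v "$1" >/dev/null 2>&1; then
    return 0
  fi
  printf 'missing required command: %s\n' "$1" >&2
  return 1
}

# Example (not run here):
#   require_cmd helm && require_cmd kubectl && helm version --short
```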

#### Add the NVIDIA Helm repository

```Shell
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia \
&& helm repo update
```

#### Install the GPU Operator

##### Option 1: Install the Operator with the default configuration {collapsible="true"}

```Shell
helm install --wait --generate-name \
-n gpu-operator --create-namespace \
nvidia/gpu-operator \
--version=v24.9.0
```

##### Option 2: Install the Operator with a specific version {collapsible="true"}

###### GPU Operator versions and the corresponding versions of related components

| GPU Operator Version | CUDA Version | Driver Version | Container Toolkit Version | Device Plugin Version |
|----------------------|--------------|----------------|---------------------------|-----------------------|
| v24.9.0 | 12.6.2 | 550.127.05 | 1.17.0 | 0.17.0 |
| v24.6.2 | 12.6.1 | 550.90.07 | 1.16.2 | 0.16.2 |
| v24.6.1 | 12.5.1 | 550.90.07 | 1.16.1 | 0.16.2 |
| v24.6.0 | 12.5.1 | 550.90.07 | 1.16.1 | 0.16.1 |
| v24.3.0 | 12.4.1 | 550.54.15 | 1.15.0 | 0.15.0 |
| v23.9.2 | 12.3.2 | 550.54.14 | 1.14.6 | 0.14.5 |
| v23.9.1 | 12.3.1 | 535.129.03 | 1.14.3 | 0.14.3 |
| v23.9.0 | 12.2.2 | 535.104.12 | 1.14.3 | 0.14.2 |
| v23.6.2 | 12.3.1 | 535.104.05 | 1.13.4 | 0.14.1 |
| v23.6.1 | 12.2.0 | 535.104.05 | 1.13.4 | 0.14.1 |
| v23.6.0 | 12.2.0 | 535.86.10 | 1.13.4 | 0.14.1 |
| v23.3.2 | 12.1.1 | 525.105.17 | 1.13.0 | 0.14.0 |

```Shell
export GPU_OPERATOR_VERSION=v24.9.0
```

```Shell
helm install --wait --generate-name \
-n gpu-operator --create-namespace \
nvidia/gpu-operator \
--version=$GPU_OPERATOR_VERSION
```
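
Because the version is passed through an environment variable, a typo only surfaces when Helm fails to find the chart. The sketch below checks the value against the `vMAJOR.MINOR.PATCH` shape used in the table above before installing; the helper name `is_valid_operator_version` is hypothetical, and the check only validates the format, not whether that release exists.

```Shell
# Hypothetical helper: accept only tags shaped like v24.9.0.
is_valid_operator_version() {
  printf '%s\n' "$1" | grep -Eq '^v[0-9]+\.[0-9]+\.[0-9]+$'
}

# Example (not run here):
#   is_valid_operator_version "$GPU_OPERATOR_VERSION" \
#     || echo "unexpected version format: $GPU_OPERATOR_VERSION" >&2
```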

##### Option 3: Pre-Installed NVIDIA GPU Drivers {collapsible="true"}

```Shell
helm install --wait --generate-name \
-n gpu-operator --create-namespace \
nvidia/gpu-operator \
--version=v24.9.0 \
--set driver.enabled=false
```

##### Option 4: Pre-Installed NVIDIA GPU Drivers and NVIDIA Container Toolkit {collapsible="true"}

```Shell
helm install --wait --generate-name \
-n gpu-operator --create-namespace \
nvidia/gpu-operator \
--version=v24.9.0 \
--set driver.enabled=false \
--set toolkit.enabled=false
```
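
Options 1, 3, and 4 differ only in which `--set` flags are passed. The following sketch builds that flag list from two yes/no answers. The helper name `operator_flags` is hypothetical, and the answers are supplied by you after checking the GPU nodes; the function does not inspect the cluster itself.

```Shell
# Hypothetical helper: print the --set flags for components that are
# already installed on the GPU nodes. Both arguments are "yes" or "no".
operator_flags() {
  driver_installed="$1"
  toolkit_installed="$2"
  flags=""
  if [ "$driver_installed" = "yes" ]; then
    flags="--set driver.enabled=false"
  fi
  if [ "$toolkit_installed" = "yes" ]; then
    # Append with a separating space only when flags is non-empty.
    flags="${flags:+$flags }--set toolkit.enabled=false"
  fi
  printf '%s\n' "$flags"
}
```

`operator_flags yes no` yields the flag used in Option 3, `operator_flags yes yes` yields the flags used in Option 4, and `operator_flags no no` yields an empty line, which corresponds to the default installation.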

### Method 2: Install Kubernetes NVIDIA Device Plugin {collapsible="true"}

[Kubernetes NVIDIA Device Plugin Official GitHub Repo](https://github.com/NVIDIA/k8s-device-plugin)

@@ -314,6 +402,24 @@ kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.

### Check Pod can run GPU Jobs or not

#### Using NVIDIA GPU Operator {collapsible="true"}

List the Pods in the `gpu-operator` namespace across all worker nodes

```Shell
kubectl get pod -n gpu-operator -o wide
```

![Get the Pods of gpu-operator namespace in all Worker nodes](check_pod_on_gpu_operator_namespace_using_gpu_operator.png)

View the log output of the `nvidia-cuda-validator` Pod deployed on each worker node

![View the log output of Pod nvidia-cuda-validator deployed in all Worker nodes](view_pod_log_output_check_can_use_gpu_resources_using_gpu_operator.png)

If the log contains ```cuda workload validation is successful```, the Pod can use GPU resources successfully.
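
The log check can also be scripted instead of read from a screenshot. A minimal sketch that reads log text on standard input and looks for the success message quoted above; the helper name `cuda_validation_succeeded` is hypothetical, and `<pod-name>` in the example is a placeholder for the validator Pod name shown by `kubectl get pod`.

```Shell
# Hypothetical helper: read validator log text on stdin and succeed
# when the success message from this guide is present.
cuda_validation_succeeded() {
  grep -q 'cuda workload validation is successful'
}

# Example (not run here; <pod-name> is a placeholder):
#   kubectl logs -n gpu-operator <pod-name> | cuda_validation_succeeded && echo OK
```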

#### Using Kubernetes NVIDIA Device Plugin {collapsible="true"}

```Shell
cat <<EOF | kubectl apply -f -
apiVersion: v1
@@ -345,6 +451,24 @@ Outputting ```Test PASSED``` means that GPU resources are successfully used in t

### Check node can use GPU resource or not

#### Using NVIDIA GPU Operator {collapsible="true" id="using-nvidia-gpu-operator_1"}

```Shell
kubectl get nodes -o wide

kubectl describe node <node name> | grep nvidia.com

# Example
kubectl describe node ubuntu-d830mt | grep nvidia.com
kubectl describe node ubuntu-ms-7d98 | grep nvidia.com
```

Check whether the node carries the following labels

![Check if the node is labeled with the following labels](check_node_is_labeled_following_labels_using_gpu_operator.png)
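
The label check can be automated by filtering the `kubectl describe node` output. This sketch assumes the node is labeled `nvidia.com/gpu.present=true`, which is a label the GPU Operator stack commonly applies; the screenshot content is not reproduced here, so treat the label name as an assumption. The helper name `has_gpu_present_label` is hypothetical.

```Shell
# Hypothetical helper: read `kubectl describe node` output on stdin and
# succeed when the assumed GPU label is present.
has_gpu_present_label() {
  grep -q 'nvidia\.com/gpu\.present=true'
}

# Example (not run here; <node name> is a placeholder):
#   kubectl describe node <node name> | has_gpu_present_label && echo "GPU node"
```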

#### Using Kubernetes NVIDIA Device Plugin {collapsible="true" id="using-kubernetes-nvidia-device-plugin_1"}

Check whether ```nvidia.com/gpu``` is listed under both ```Capacity``` and ```Allocatable```

```Shell
5 changes: 5 additions & 0 deletions docs/Writerside/topics/NVIDIA-GPU.md
@@ -0,0 +1,5 @@
# NVIDIA GPU

## Contents

* [Using NVIDIA GPU Resources](Using-NVIDIA-GPU-Resources.md)
2 changes: 1 addition & 1 deletion docs/Writerside/topics/kubernetes.md
@@ -2,5 +2,5 @@

## Contents

* [Using NVIDIA GPU Resource on Kubernetes](Using-NVIDIA-GPU-Resources-on-Kubernetes.md)
* [Using NVIDIA GPU Resource on Kubernetes](Using-NVIDIA-GPU-Resources.md)
* [Kubeflow](Kubeflow.md)
