
GPU role (first part) #670

Merged
34 commits merged on Jan 4, 2023
Changes from 27 commits
Commits
e48a3fd
Nibbler: added 8xGPU vars
scimerman Oct 17, 2022
ab03944
Added GPU role
scimerman Oct 17, 2022
e2258fc
GPU: added to single goups cluster_part1
scimerman Oct 17, 2022
950d0a1
GPU: update
scimerman Oct 20, 2022
c8881e8
Merge branch 'develop' of https://github.com/rug-cit-hpc/league-of-ro…
scimerman Nov 4, 2022
9d8c7af
GPU update
scimerman Nov 4, 2022
f2ff955
GPU: update
scimerman Nov 4, 2022
bc06f59
GPU: limit hosts
scimerman Nov 4, 2022
7b92958
GPU: fixes
scimerman Nov 14, 2022
f388a18
GPU role update
scimerman Nov 18, 2022
33ad9c5
GPU role: removed unneeded files and functions
scimerman Nov 18, 2022
487f753
Updated README.md for Python dependency issue on macOS. Disabled use_…
pneerincx Nov 7, 2022
b58b38c
Patch original docker.service file for systemd as opposed to the syml…
pneerincx Nov 7, 2022
7924110
Fixed complaint from yaml linter.
pneerincx Nov 7, 2022
621389f
GPU role: removed uneeded commands
scimerman Nov 18, 2022
b2480e7
GPU update
scimerman Nov 18, 2022
0f1f46f
GPU: reinstated nvidia persistenced
scimerman Nov 21, 2022
8da330a
merge fix
scimerman Nov 21, 2022
0b19414
GPU: update
scimerman Nov 21, 2022
ba0290e
GPU: update
scimerman Nov 21, 2022
ea595cb
GPU: ansible-lint fix
scimerman Nov 21, 2022
6fd1bc6
GPU: readme
scimerman Nov 21, 2022
e8a4357
GPU: added nvidia service
scimerman Nov 22, 2022
a3efc33
Merge branch 'develop' of https://github.com/rug-cit-hpc/league-of-ro…
scimerman Nov 24, 2022
544d65e
GPU
scimerman Nov 24, 2022
032a596
GPU: pr update
scimerman Nov 24, 2022
d36765e
GPU: refractured
scimerman Nov 24, 2022
3542c16
GPU: added user services and removed gpu
scimerman Nov 25, 2022
e670846
GPU: services renamed to configuration
scimerman Nov 25, 2022
08b539b
GPU: updated readme
scimerman Nov 25, 2022
0beed5c
GPU: removed stale file
scimerman Nov 25, 2022
0b07cf1
gpu: added node 05
scimerman Dec 8, 2022
a3e2e89
GPU: gpu node 05, added ip addresses
scimerman Dec 8, 2022
5543bd8
Merge branch 'develop' into gpu
pneerincx Jan 4, 2023
2 changes: 2 additions & 0 deletions roles/cluster/defaults/main.yml
@@ -22,6 +22,7 @@ cluster_common_packages:
- ncurses-static
- net-tools
- openssl
- pciutils
- openssl11 # Required for openldap-ltb RPMs.
- qt5-qtbase
- qt5-qtxmlpatterns
@@ -37,4 +38,5 @@ cluster_common_packages:
- urw-base35-fonts
- vim
- wget
- yum-utils
...
67 changes: 67 additions & 0 deletions roles/gpu/README.md
@@ -0,0 +1,67 @@
# NVIDIA GPU installation role for CentOS 7

This role follows the installation instructions for the newest available version of
the drivers, as documented in the [NVIDIA CUDA Installation Guide for
Linux](https://docs.nvidia.com/cuda/pdf/CUDA_Installation_Guide_Linux.pdf).

The driver can be installed from a yum repository, but limiting and controlling the
driver version that way is quite hard to implement. The driver is therefore installed
by downloading and running the CUDA `.run` file.
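
As an illustration, the download-and-run approach boils down to tasks like the
following. This is only a minimal sketch: the real logic (package prerequisites,
reboot handling and cleanup) lives in `roles/gpu/tasks/driver.yml`, and
`gpu_url_dir` / `gpu_runfile` are defined in `roles/gpu/defaults/main.yml`.

```yaml
# Minimal sketch only; see roles/gpu/tasks/driver.yml for the actual tasks.
- name: Download the CUDA .run installer from the NVIDIA website
  ansible.builtin.get_url:
    url: '{{ gpu_url_dir }}/{{ gpu_runfile }}'
    dest: '/root/{{ gpu_runfile }}'
    mode: '0700'
  become: true

- name: Install only the driver component from the .run file
  ansible.builtin.command: '/root/{{ gpu_runfile }} --silent --driver'
  become: true
```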

The driver features Dynamic Kernel Module Support (DKMS) and will be recompiled
automatically when a new kernel is installed.


## Role outline

- it expects the `gpu_count` variable to be defined per individual machine (see the
  inventory sketch after this list), and then
- it gathers the GPU device status by running the `nvidia-smi` command
- it detects the installed NVIDIA driver version
- it executes the GPU driver installation tasks:
  - checks whether the machine needs to be rebooted and reboots it, if needed
  - installs via yum the packages needed to install and compile the driver
  - installs via yum the `kernel-devel` package matching the running kernel
    (which is correct only after a reboot)
  - downloads the CUDA `.run` driver file from the NVIDIA website (version defined
    in the role defaults)
  - installs and compiles the Dynamic Kernel Module Support (DKMS) driver
- user and service tasks are deployed on all machines with `gpu_count` defined:
  - creates a local nvidia group (default GID 601)
  - creates a local nvidia user (default UID 601)
  - blacklists the nouveau driver
  - installs the `nvidia-persistenced.service` unit, which runs as the nvidia user
  - reboots the machine
- it checks that the number of GPU devices reported by `nvidia-smi` matches `gpu_count`
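
`gpu_count` is set in the static inventory. A minimal sketch of such a host entry
(layout abbreviated; the real entry is in the `static_inventories/nibbler_cluster.yml`
change further below):

```yaml
# Abbreviated inventory sketch; not the literal file layout.
compute_vm:
  hosts:
    nb-vcompute04:
      cloud_flavor: gpu.A40_8
      gpu_count: 8   # number of GPUs the role must find up and running after installation
```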

## Solved issues - described

`gpu_count` is needed to install the driver, since every other "automatic" detection
fails sooner or later. To list a few examples:

- `lspci` found one NVIDIA device when there were 8,
- `nvidia-smi` reported no devices found when it actually should have found some,
- and `nvidia-smi` showed 3 GPUs up and running when it should have been 8.

These cases showed up just while testing, so more can be expected.

`gpu_count` instead defines the expected "truth", which the role can test against to
verify that all the GPUs are actually working correctly.
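
That check is essentially a comparison of the `nvidia-smi -L` output against
`gpu_count`. A condensed version of the final task in `roles/gpu/tasks/gpu.yml`:

```yaml
# Condensed from the role; the full task also runs with become: false, because
# running nvidia-smi as root stops the persistence service.
- name: Final check to confirm all devices are working
  ansible.builtin.command: 'nvidia-smi -L'
  register: smi
  changed_when: false
  failed_when: (smi.rc != 0) or
               (smi.stdout | default('') | lower | regex_findall('nvidia') | length != gpu_count)
```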

The nvidia-persistenced service unit was adjusted through trial and error, but is
taken mostly from the sample files that ship with the driver installation and can be
found in

/usr/share/doc/NVIDIA_GLX-1.0/samples/nvidia-persistenced-init.tar.bz2

## Other comments

- A smaller NVIDIA driver-only `.run` installation file is also available, but with it
  a number of commands and options (for example `nvidia-smi`) are missing from the system.
- The long-term availability of the `.run` file on the NVIDIA website is not a concern:
  as of 2022 the CUDA archive website still contains old versions going back to 2007.
- Driver installation is also possible via a yum repository, but it is harder to
  implement for two reasons:
  - the version needs to be pinned for the nvidia-driver RPM and 15 (!) other packages, and
  - it seems that not all old versions are available in the repository, only 'recent' ones.
- NVIDIA advises against relying on persistence mode, as it is being slowly deprecated,
  and instead recommends using the persistence daemon (`nvidia-persistenced`).

[cuda archive website](https://developer.nvidia.com/cuda-toolkit-archive)
11 changes: 11 additions & 0 deletions roles/gpu/defaults/main.yml
@@ -0,0 +1,11 @@
---
gpu_cuda_version: '11.7.1'
gpu_driver_version: '515.65.01'
gpu_url_dir: 'https://developer.download.nvidia.com/compute/cuda/{{ gpu_cuda_version }}/local_installers/'
gpu_runfile: 'cuda_{{ gpu_cuda_version }}_{{ gpu_driver_version }}_linux.run'

nvidia_user: nvidia
nvidia_uid: 601 # a regular user with UID >500 and <1000, but no login
nvidia_group: nvidia
nvidia_gid: 601
...
2 changes: 2 additions & 0 deletions roles/gpu/files/blacklist-nouveau.conf
@@ -0,0 +1,2 @@
blacklist nouveau
options nouveau modeset=0
10 changes: 10 additions & 0 deletions roles/gpu/files/nvidia-persistenced.service
@@ -0,0 +1,10 @@
[Unit]
Description=Initialize GPU at the startup of the system

[Service]
ExecStart=/usr/bin/nvidia-persistenced --verbose
RestartSec=15
Restart=always

[Install]
WantedBy=multi-user.target
16 changes: 16 additions & 0 deletions roles/gpu/handlers/main.yml
@@ -0,0 +1,16 @@
---
- name: Enable / restart nvidia-persistenced service
ansible.builtin.systemd:
name: nvidia-persistenced.service
state: restarted
enabled: true
daemon_reload: true
become: true
listen: 'nvidia_service'

- name: Restart server
ansible.builtin.reboot:
msg: "Reboot initiated by Ansible"
listen: 'reboot_server'
become: true
...
56 changes: 56 additions & 0 deletions roles/gpu/tasks/driver.yml
@@ -0,0 +1,56 @@
---
- name: Check if system needs to be restarted
ansible.builtin.command: '/bin/needs-restarting -r'
register: needs_restarting
failed_when: 'needs_restarting.rc > 1'
changed_when: 'needs_restarting.rc == 1'
become: true
notify: reboot_server

- name: Reboot system if needed
ansible.builtin.meta: flush_handlers

- name: Gather facts to get the latest kernel version
ansible.builtin.setup:
become: true

- name: Install yum requirements for gpu driver installation
ansible.builtin.yum:
state: 'installed'
update_cache: true
name:
- 'kernel-devel-{{ ansible_kernel }}'
- tar
- bzip2
- make
- automake
- gcc
- gcc-c++
- pciutils
- elfutils-libelf-devel
- libglvnd-devel
- bind-utils
- wget
become: true

- name: Download a driver installation file from NVidia
ansible.builtin.get_url:
url: '{{ gpu_url_dir }}/{{ gpu_runfile }}'
dest: '/root/{{ gpu_runfile }}'
mode: '0700'
become: true

- name: Install driver from .run file
ansible.builtin.command: '/root/{{ gpu_runfile }} --silent --driver'
register: install_result
failed_when: install_result.rc != 0
when: true
become: true

- name: Remove installation file
ansible.builtin.file:
path: '/root/{{ gpu_runfile }}'
state: absent
become: true

...
99 changes: 99 additions & 0 deletions roles/gpu/tasks/gpu.yml
@@ -0,0 +1,99 @@
---
- name: Install yum requirements for gpu driver installation
ansible.builtin.yum:
state: 'installed'
update_cache: true
name:
- tar
- bzip2
- make
- automake
- gcc
- gcc-c++
- pciutils
- elfutils-libelf-devel
- libglvnd-devel
- bind-utils
- wget
become: true

- name: Gather facts to get the latest kernel version
ansible.builtin.setup:
become: true

- name: Install kernel developement package matching running kernel version
ansible.builtin.yum:
name: 'kernel-devel-{{ ansible_kernel }}'
register: yum_result
failed_when: yum_result.rc != 0
when: true
become: true

- name: Download a driver installation file from NVidia
ansible.builtin.get_url:
url: '{{ gpu_url_dir }}/{{ gpu_runfile }}'
dest: '/root/{{ gpu_runfile }}'
mode: '0700'
become: true

- name: Install driver from .run file
ansible.builtin.command: '/root/{{ gpu_runfile }} --silent --driver'
register: install_result
failed_when: install_result.rc != 0
when: true
become: true

- name: Remove installation file
ansible.builtin.file:
path: '/root/{{ gpu_runfile }}'
state: absent
become: true

- name: 'Add nvidia group.'
ansible.builtin.group:
name: '{{ nvidia_group }}'
gid: '{{ nvidia_gid }}'
become: true

- name: 'Add nvidia user.'
ansible.builtin.user:
name: '{{ nvidia_user }}'
uid: '{{ nvidia_uid }}'
group: '{{ nvidia_group }}'
system: true
shell: /sbin/nologin
create_home: false
become: true

- name: Install NVidia persistence service
ansible.builtin.template:
src: nvidia-persistenced.service
dest: /etc/systemd/system/nvidia-persistenced.service
owner: root
group: root
mode: '0644'
become: true
notify: 'nvidia_service'

- name: Copy blacklist-nouveau.conf file into modprobe.d to disable Nouveau drivers
ansible.builtin.copy:
src: blacklist-nouveau.conf
dest: /etc/modprobe.d/blacklist-nouveau.conf
owner: root
group: root
mode: '0644'
become: true
notify: 'reboot_server'

- name: Enforce reboot, so that we can check if drivers are correctly installed
ansible.builtin.meta: flush_handlers

- name: Final check to confirm all devices are working
ansible.builtin.command: 'nvidia-smi -L'
register: smi
when: true
changed_when: false
failed_when: ( smi.rc != 0) or
( smi.stdout|default([])|lower|regex_findall('nvidia')|length != gpu_count )
become: false # running nvidia-smi as root stops the service
...
25 changes: 25 additions & 0 deletions roles/gpu/tasks/main.yml
@@ -0,0 +1,25 @@
---
- name: Check how many NVidia devices is up and running (might take some time)
ansible.builtin.command: 'nvidia-smi -L'
register: smi
when: gpu_count|default(0) >= 1
changed_when: false
failed_when: false

- name: Check driver version
ansible.builtin.command: '/usr/sbin/modinfo nvidia'
register: modinfo
changed_when: false
failed_when: false
when: gpu_count|default(0) >= 1

- name: Install GPU driver if not all GPU devices are present and working
ansible.builtin.include_tasks: driver.yml
when: gpu_count|default(0) >= 1 and
(( smi.stdout|default([])|lower|regex_findall('nvidia')|length != gpu_count ) or
gpu_driver_version not in modinfo.stdout|default("")|regex_search("version:.*"))

- name: Configure user and services
ansible.builtin.include_tasks: user_services.yml
Reviewer comment (Contributor):
user_services.yml does not exist in this PR and gpu.yml does exist but is no longer used anywhere as far as I can tell. Looks like one or more commits are missing...

when: gpu_count|default(0) >= 1
...
14 changes: 14 additions & 0 deletions roles/gpu/templates/nvidia-persistenced.service
@@ -0,0 +1,14 @@
[Unit]
Description=Initialize GPU at the startup of the system
Before=slurmd.service

[Service]
ExecStart=/usr/bin/nvidia-persistenced --verbose --user {{ nvidia_user }}
ExecStopPost=/bin/rm -rf /var/run/nvidia-persistenced
Type=forking
PIDFile=/var/run/nvidia-persistenced/nvidia-persistenced.pid
RestartSec=15
Restart=always

[Install]
WantedBy=multi-user.target
1 change: 1 addition & 0 deletions single_group_playbooks/cluster_part1.yml
@@ -23,6 +23,7 @@
- figlet_motd
- node_exporter
- cluster
- gpu # needs to run after role 'cluster'
- resolver
- coredumps
...
7 changes: 7 additions & 0 deletions single_role_playbooks/gpu.yml
@@ -0,0 +1,7 @@
---
- name: GPU installation role
hosts:
- compute_vm
roles:
- gpu
...
7 changes: 4 additions & 3 deletions static_inventories/nibbler_cluster.yml
@@ -90,8 +90,8 @@ all:
deploy_admin_interface:
hosts:
nb-dai:
cloud_flavor: m1.small
local_volume_size_extra: 200
cloud_flavor: m1.large
local_volume_size_extra: 3000
user_interface:
hosts:
nibbler:
@@ -127,7 +127,8 @@ all:
hosts:
nb-vcompute04:
vars:
cloud_flavor: gpu.A40
cloud_flavor: gpu.A40_8
gpu_count: 8
local_volume_size_extra: 1
slurm_sockets: 32
slurm_cores_per_socket: 1