GPU role (first part) #670

Merged
merged 34 commits into from
Jan 4, 2023
Changes from 8 commits
Commits
e48a3fd
Nibbler: added 8xGPU vars
scimerman Oct 17, 2022
ab03944
Added GPU role
scimerman Oct 17, 2022
e2258fc
GPU: added to single goups cluster_part1
scimerman Oct 17, 2022
950d0a1
GPU: update
scimerman Oct 20, 2022
c8881e8
Merge branch 'develop' of https://github.com/rug-cit-hpc/league-of-ro…
scimerman Nov 4, 2022
9d8c7af
GPU update
scimerman Nov 4, 2022
f2ff955
GPU: update
scimerman Nov 4, 2022
bc06f59
GPU: limit hosts
scimerman Nov 4, 2022
7b92958
GPU: fixes
scimerman Nov 14, 2022
f388a18
GPU role update
scimerman Nov 18, 2022
33ad9c5
GPU role: removed unneeded files and functions
scimerman Nov 18, 2022
487f753
Updated README.md for Python dependency issue on macOS. Disabled use_…
pneerincx Nov 7, 2022
b58b38c
Patch original docker.service file for systemd as opposed to the syml…
pneerincx Nov 7, 2022
7924110
Fixed complaint from yaml linter.
pneerincx Nov 7, 2022
621389f
GPU role: removed uneeded commands
scimerman Nov 18, 2022
b2480e7
GPU update
scimerman Nov 18, 2022
0f1f46f
GPU: reinstated nvidia persistenced
scimerman Nov 21, 2022
8da330a
merge fix
scimerman Nov 21, 2022
0b19414
GPU: update
scimerman Nov 21, 2022
ba0290e
GPU: update
scimerman Nov 21, 2022
ea595cb
GPU: ansible-lint fix
scimerman Nov 21, 2022
6fd1bc6
GPU: readme
scimerman Nov 21, 2022
e8a4357
GPU: added nvidia service
scimerman Nov 22, 2022
a3efc33
Merge branch 'develop' of https://github.com/rug-cit-hpc/league-of-ro…
scimerman Nov 24, 2022
544d65e
GPU
scimerman Nov 24, 2022
032a596
GPU: pr update
scimerman Nov 24, 2022
d36765e
GPU: refractured
scimerman Nov 24, 2022
3542c16
GPU: added user services and removed gpu
scimerman Nov 25, 2022
e670846
GPU: services renamed to configuration
scimerman Nov 25, 2022
08b539b
GPU: updated readme
scimerman Nov 25, 2022
0beed5c
GPU: removed stale file
scimerman Nov 25, 2022
0b07cf1
gpu: added node 05
scimerman Dec 8, 2022
a3e2e89
GPU: gpu node 05, added ip addresses
scimerman Dec 8, 2022
5543bd8
Merge branch 'develop' into gpu
pneerincx Jan 4, 2023
2 changes: 2 additions & 0 deletions roles/cluster/defaults/main.yml
@@ -20,6 +20,7 @@ cluster_common_packages:
- ncurses-static
- net-tools
- openssl
- pciutils
- qt5-qtbase
- qt5-qtxmlpatterns
- readline-static
@@ -36,4 +37,5 @@ cluster_common_packages:
- urw-base35-fonts
- vim
- wget
- yum-utils
...
20 changes: 20 additions & 0 deletions roles/gpu/README.md
@@ -0,0 +1,20 @@
# NVIDIA GPU installation role for CentOS 7

This role follows the instructions for the latest available driver version,
as documented in the [NVIDIA CUDA Installation Guide for
Linux](https://docs.nvidia.com/cuda/pdf/CUDA_Installation_Guide_Linux.pdf).

## Role outline

- installs the `pciutils` tools
- checks whether an NVIDIA PCI device is present and, if so:
  - installs the yum packages needed to build the driver
  - downloads the `.run` driver installer from NVIDIA (the driver version is defined in the role defaults)
  - installs the driver and compiles the kernel module
- blacklists nouveau
- installs a systemd service file that loads the driver automatically at system
  boot and reloads it if it stops working

## TO-DO
- extensive testing and benchmarking
- role for development software installation
6 changes: 6 additions & 0 deletions roles/gpu/defaults/main.yml
@@ -0,0 +1,6 @@
---
gpu_cuda_version: '11.7.1'
gpu_driver_version: '515.65.01'
gpu_url_directory: 'https://developer.download.nvidia.com/compute/cuda/{{ gpu_cuda_version }}/local_installers/'
gpu_runfile: 'cuda_{{ gpu_cuda_version }}_{{ gpu_driver_version }}_linux.run'
...
2 changes: 2 additions & 0 deletions roles/gpu/files/blacklist-nouveau.conf
@@ -0,0 +1,2 @@
blacklist nouveau
options nouveau modeset=0
10 changes: 10 additions & 0 deletions roles/gpu/files/nvidia-persistenced.service
@@ -0,0 +1,10 @@
[Unit]
Description=Initialize GPU at the startup of the system

[Service]
ExecStart=/usr/bin/nvidia-persistenced --verbose
RestartSec=15
Restart=always

[Install]
WantedBy=multi-user.target
7 changes: 7 additions & 0 deletions roles/gpu/handlers/main.yml
@@ -0,0 +1,7 @@
---
- name: Restart server because of the pending updates
  ansible.builtin.reboot:
    msg: "Reboot initiated by Ansible, because of the pending updates"
  listen: "reboot_server"
  become: true
...
23 changes: 23 additions & 0 deletions roles/gpu/tasks/configuration.yml
@@ -0,0 +1,23 @@
---
- name: Copy blacklist-nouveau.conf file into modprobe.d to disable Nouveau drivers
  ansible.builtin.copy:
    src: blacklist-nouveau.conf
    dest: /etc/modprobe.d/blacklist-nouveau.conf
    owner: root
    group: root
    mode: '0644'
  become: true

- name: Install NVidia persistence service
  ansible.builtin.copy:
    src: nvidia-persistenced.service
    dest: /etc/systemd/system/nvidia-persistenced.service
  become: true

- name: Enable a nvidia-persistenced service
  ansible.builtin.systemd:
    name: nvidia-persistenced.service
    state: started
    enabled: true
  become: true
...
67 changes: 67 additions & 0 deletions roles/gpu/tasks/gpu.yml
@@ -0,0 +1,67 @@
---
- name: Install yum requirements for gpu driver installation
  ansible.builtin.yum:
    state: 'installed'
    update_cache: true
    name:
      - tar
      - bzip2
      - make
      - automake
      - gcc
      - gcc-c++
      - pciutils
      - elfutils-libelf-devel
      - libglvnd-devel
      - bind-utils
      - wget
  become: true

# ansible_kernel variable is not working, as after reboot, still holds old kernel
- name: Get current kernel version
  ansible.builtin.command: '/usr/bin/uname -r'
  register: uname_output
  failed_when: uname_output.rc != 0
  when: true
  become: true

- name: Set kernel version fact
  ansible.builtin.set_fact:
    kernel_version: "{{ uname_output.stdout }}"

- name: Install kernel developement package matching running kernel version
  ansible.builtin.yum:
    name: 'kernel-devel-{{ kernel_version }}'
  register: yum_result
  failed_when: yum_result.rc != 0
  when: true
  become: true

- name: Download a driver installation file from NVidia
  ansible.builtin.get_url:
    url: '{{ gpu_url_directory }}/{{ gpu_runfile }}'
    dest: '/root/{{ gpu_runfile }}'
    mode: '0700'
  become: true

- name: "Check if driver downloaded"
  ansible.builtin.stat:
    path: '/root/{{ gpu_runfile }}'
  when: true
  register: driver_downloaded
  become: true

- name: Install driver from .run file
  ansible.builtin.command: '/root/{{ gpu_runfile }} --silent --driver'
  register: install_result
  failed_when: install_result.rc != 0
  when: driver_downloaded.stat.exists
  notify: reboot_server
  become: true

- name: Remove installation file
  ansible.builtin.file:
    path: '/root/{{ gpu_runfile }}'
    state: absent
  become: true
...
32 changes: 32 additions & 0 deletions roles/gpu/tasks/main.yml
@@ -0,0 +1,32 @@
---
- name: Check if system needs to be restarted
  ansible.builtin.command: '/bin/needs-restarting -r'
  register: needs_restarting
  failed_when: 'needs_restarting.rc > 1'
  changed_when: 'needs_restarting.rc == 1'
  become: true
  notify: reboot_server

- name: Flush handlers
  ansible.builtin.meta: flush_handlers

- name: Check if we have CUDA capable system
  ansible.builtin.command: 'lspci'
  register: lspci_nv
  when: true
  become: true

- name: Check if we have already configured NVidia devices
  ansible.builtin.command: 'nvidia-smi'
  register: detect_devices
  failed_when: false
  when: true
  become: true

- name: Run GPU driver installation role
  ansible.builtin.include: gpu.yml
  when: ('"nvidia" in lspci_nv.stdout | lower') and (detect_devices.rc != 0)

- name: Set configuration files for service and modprobe
  ansible.builtin.include: configuration.yml
...
1 change: 1 addition & 0 deletions single_group_playbooks/cluster_part1.yml
@@ -22,6 +22,7 @@
- figlet_motd
- node_exporter
- cluster
- gpu # needs to run after role 'cluster'
- resolver
- coredumps
...
6 changes: 6 additions & 0 deletions single_role_playbooks/gpu.yml
Original file line number Diff line number Diff line change
@@ -0,0 +1,6 @@
---
- hosts:
    - compute_vm
  roles:
    - gpu
...
2 changes: 1 addition & 1 deletion static_inventories/nibbler_cluster.yml
@@ -127,7 +127,7 @@ all:
hosts:
nb-vcompute04:
vars:
- cloud_flavor: gpu.A40
+ cloud_flavor: gpu.A40_8
local_volume_size_extra: 1
slurm_sockets: 32
slurm_cores_per_socket: 1