
GPU role (first part) #670

Merged: 34 commits merged into rug-cit-hpc:develop on Jan 4, 2023

Conversation

scimerman (Contributor):

This is the first part of the role: driver installation on GPU compute nodes.
I did not have time to test it extensively, but so far the GPUs have worked. It was tested on a single-GPU machine and a multi-GPU machine, plus on a few non-GPU machines.
The first part of the README is provided.
The software stack (libraries and compilers) has not yet been implemented.
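
For context, a minimal sketch of how such a role is typically wired into a playbook; the play name, host group and gpu_count value below are illustrative assumptions, not taken from this PR:

# Hypothetical usage sketch; play name, host group and gpu_count value are assumptions.
- name: Install NVIDIA driver on GPU compute nodes.
  hosts: compute_gpu_nodes
  vars:
    gpu_count: 1  # Expected number of working GPUs on these machines (assumed value).
  roles:
    - gpu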

roles/gpu/tasks/gpu.yml: 4 outdated review threads (resolved)
pneerincx (Contributor) left a comment:

See inline questions/comments.

scimerman requested a review from pneerincx on November 23, 2022, 14:16
roles/gpu/README.md: 4 outdated review threads (resolved)
roles/gpu/tasks/gpu.yml: 4 outdated review threads (resolved)
Comment on lines 2 to 23
- name: Check if system needs to be restarted
  ansible.builtin.command: '/bin/needs-restarting -r'
  register: needs_restarting
  failed_when: 'needs_restarting.rc > 1'
  changed_when: 'needs_restarting.rc == 1'
  become: true
  notify: reboot_server

- name: Reboot system if needed
  ansible.builtin.meta: flush_handlers

- name: Check how many NVidia devices are up and running (might take some time)
  ansible.builtin.command: 'nvidia-smi -L'
  register: smi
  changed_when: false
  failed_when: false
  become: false  # Running nvidia-smi as root stops the service.

- name: Install GPU driver if not all GPU devices are present and working
  ansible.builtin.include_tasks: gpu.yml
  when: ( gpu_count is defined ) and
        ( smi.stdout | default([]) | lower | regex_findall('nvidia') | length != gpu_count )
Contributor:

Suggested change: replace the block quoted above with

- name: Install GPU driver if one or more GPUs were specified for this machine.
  ansible.builtin.include_tasks: gpu.yml
  when: gpu_count is defined and gpu_count | int >= 1

scimerman (Contributor Author):

I do not like this being moved into gpu.yml.
gpu_count on its own does not define anything about the status of the system.
nvidia-smi is the only way to reliably detect the status of the GPU devices.
Therefore we need to run nvidia-smi and, based on its result, decide whether we need to (re)install the GPU driver.

If nvidia-smi is executed on a non-GPU machine, that is fine: it will detect (together with gpu_count) that this machine needs to be skipped.
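
For illustration, gpu_count is the operator-defined expected state and would typically be set per machine or group in the inventory; the file path and value below are assumptions, not part of this PR:

# Hypothetical group_vars entry, e.g. group_vars/compute_gpu_nodes/vars.yml (path and value assumed).
# Expected number of working NVIDIA GPUs in this machine; gpu.yml is (re)run when
# nvidia-smi reports fewer working devices than this.
gpu_count: 4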

scimerman (Contributor Author):

We could move the "Check if system needs to be restarted" task into gpu.yml.
But then again, it is only run when it is needed, so I see no harm in running it; perhaps quite the opposite.
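
For reference, a minimal sketch of what the reboot_server handler notified by that task could look like; only the handler name comes from the tasks quoted above, the body is an assumption:

# Hypothetical handler sketch in handlers/main.yml; only the name 'reboot_server' is from this PR.
- name: reboot_server
  ansible.builtin.reboot:
    reboot_timeout: 600
  become: true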

Contributor:

gpu_count on its own does not define anything about the status of the system.
nvidia-smi is the only way to reliably detect the status of the GPU devices.

This contradicts what is written in the README.md:

gpu_count is needed to install the driver, since any other automatic detection is failing sooner or later.

This includes nvidia-smi reporting wrong values. Either nvidia-smi (or some other tool) automagically detects whether GPUs are present and the driver needs to be installed, or we define the expected state in a variable like gpu_count and rely on that.

Even if nvidia-smi reported the correct number of GPUs, we should not skip gpu.yml. The driver may be installed, but various other tasks need to re-run to ensure idempotency:

  • Check if the correct version of the driver was installed.
  • Create the nvidia group and user with the correct UID and GID.
  • Install blacklist-nouveau.conf.
  • Etc. (see the sketch below)
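
A sketch of the kind of idempotency tasks meant here; the UID/GID values, file contents and task wording are assumptions for illustration, not the actual gpu.yml from this PR:

# Hypothetical sketch; UID/GID values and blacklist file contents are assumed.
- name: Create nvidia group.
  ansible.builtin.group:
    name: nvidia
    gid: 601
    state: present
  become: true

- name: Create nvidia user.
  ansible.builtin.user:
    name: nvidia
    uid: 601
    group: nvidia
    system: true
  become: true

- name: Blacklist the nouveau driver.
  ansible.builtin.copy:
    content: |
      blacklist nouveau
      options nouveau modeset=0
    dest: /etc/modprobe.d/blacklist-nouveau.conf
    owner: root
    group: root
    mode: '0644'
  become: true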

scimerman (Contributor Author):

This contradicts what is written in the README.md:

And all of that is correct. No changes to the README are needed; perhaps a word or two could be added if it is not clear.

This includes nvidia-smi reporting wrong values. Either nvidia-smi (or some other tool) automagically detects whether GPUs are present and the driver needs to be installed, or we define the expected state in a variable like gpu_count and rely on that.

No. These are two different things:

  • gpu_count specifies the expected number of working GPUs; this cannot be reliably detected automatically.
  • nvidia-smi detects how many working GPUs there actually are (if everything works as it should).

=========================================================
Long story short, the main question is:

  • Is it OK to skip the configuration part when all GPUs are up and running?

I advocate for yes.

=========================================================
Long story long:

Even if nvidia-smi reported the correct number of GPUs, we should not skip gpu.yml.

I think that if everything works, we should skip the whole thing. That was the entire purpose of detecting with nvidia-smi. If it works, then there is nothing to configure: the service is up and running, blacklisting of the non-NVIDIA driver was done, the nvidia user was created, and so on.

Check if the correct version of the driver was installed.

I don't see how the driver version would change on its own. If everything is working, why do we need to check the driver version?

Create the nvidia group and user with the correct UID and GID.

Without the nvidia user and nvidia group, nothing works, so gpu.yml gets triggered.

pneerincx (Contributor) left a comment:

See inline comments. The gpu role would always restart machines after testing the needs-restarting command and would always try to run the nvidia-smi command, even if gpu_count was not specified for a machine. I tried to refactor that to only install things if GPUs were specified for a machine and, if so, to make sure all steps of this role are re-run so the whole thing is idempotent.

pneerincx (Contributor) left a comment:

See inline comment.

gpu_driver_version not in modinfo.stdout|default("")|regex_search("version:.*"))

- name: Configure user and services
  ansible.builtin.include_tasks: user_services.yml
Contributor:

user_services.yml does not exist in this PR and gpu.yml does exist but is no longer used anywhere as far as I can tell. Looks like one or more commits are missing...
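
For context, the condition fragment quoted above suggests a modinfo-based check of the installed driver version. A hedged sketch of how such a check might be wired up; the task names, the extra default() guard and the included file are assumptions, only the gpu_driver_version comparison comes from the quoted line:

# Hypothetical sketch; only the gpu_driver_version/modinfo comparison is taken from this PR.
- name: Check which NVIDIA driver version is currently installed.
  ansible.builtin.command: 'modinfo nvidia'
  register: modinfo
  changed_when: false
  failed_when: false

- name: (Re)install the GPU driver when the installed version does not match gpu_driver_version.
  ansible.builtin.include_tasks: gpu.yml  # Assumed include; the refactored file layout is not shown here.
  when: gpu_driver_version not in (modinfo.stdout | default('') | regex_search('version:.*') | default('', true))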

scimerman (Contributor Author):

@pneerincx I have addressed the issues you raised and deployed the playbook several times on the GPU nodes. The playbook fails on the missing storage, which seems to have been 'coming soon ...' for too long now. I would recommend that we merge what we have done so far.
Another PR is coming for the ansible-pipelines: added EasyBuild PyTorch and TensorFlow.

pneerincx merged commit d34e0a9 into rug-cit-hpc:develop on Jan 4, 2023