
GPU role (first part) #670

Merged: 34 commits merged into rug-cit-hpc:develop on Jan 4, 2023

Conversation

scimerman (Contributor):

This is the first part of the role: driver installation on GPU compute nodes.
I did not have time to test it extensively, but so far the GPUs have worked. It was tested on a single-GPU machine and a multi-GPU machine, plus on a few non-GPU machines.
The first part of the README is provided.
The software stack (libraries and compilers) has not yet been implemented.
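
For context, a minimal sketch of how such a role is typically wired into a playbook; the play name, host group and gpu_count value below are illustrative assumptions, not taken from this PR:

# Hypothetical usage sketch; play name, host group and gpu_count value are assumptions.
- name: Install NVIDIA driver on GPU compute nodes.
  hosts: compute_gpu_nodes
  vars:
    gpu_count: 1  # Expected number of working GPUs on these machines (assumed value).
  roles:
    - gpu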

roles/gpu/tasks/gpu.yml: 4 outdated review threads (resolved)
pneerincx (Contributor) left a comment:

See inline questions/comments.

scimerman requested a review from pneerincx on November 23, 2022, 14:16
roles/gpu/README.md: 4 outdated review threads (resolved)
roles/gpu/tasks/gpu.yml: 4 outdated review threads (resolved)
Comment on lines 2 to 23
- name: Check if system needs to be restarted
  ansible.builtin.command: '/bin/needs-restarting -r'
  register: needs_restarting
  failed_when: 'needs_restarting.rc > 1'
  changed_when: 'needs_restarting.rc == 1'
  become: true
  notify: reboot_server

- name: Reboot system if needed
  ansible.builtin.meta: flush_handlers

- name: Check how many NVidia devices are up and running (might take some time)
  ansible.builtin.command: 'nvidia-smi -L'
  register: smi
  changed_when: false
  failed_when: false
  become: false  # Running nvidia-smi as root stops the service.

- name: Install GPU driver if not all GPU devices are present and working
  ansible.builtin.include_tasks: gpu.yml
  when: ( gpu_count is defined ) and
        ( smi.stdout | default([]) | lower | regex_findall('nvidia') | length != gpu_count )
Contributor:

Suggested change: replace the block quoted above with

- name: Install GPU driver if one or more GPUs were specified for this machine.
  ansible.builtin.include_tasks: gpu.yml
  when: gpu_count is defined and gpu_count | int >= 1

scimerman (Contributor Author):

I do not like this being moved into gpu.yml.
gpu_count on its own does not define anything about the status of the system.
nvidia-smi is the only way to reliably detect the status of the GPU devices.
Therefore we need to run nvidia-smi and, based on its result, decide whether we need to (re)install the GPU driver.

If nvidia-smi is executed on a non-GPU machine, that is fine: it will detect (together with gpu_count) that this machine needs to be skipped.
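
For illustration, gpu_count is the operator-defined expected state and would typically be set per machine or group in the inventory; the file path and value below are assumptions, not part of this PR:

# Hypothetical group_vars entry, e.g. group_vars/compute_gpu_nodes/vars.yml (path and value assumed).
# Expected number of working NVIDIA GPUs in this machine; gpu.yml is (re)run when
# nvidia-smi reports fewer working devices than this.
gpu_count: 4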

scimerman (Contributor Author):

We could move the "Check if system needs to be restarted" task into gpu.yml.
But then again, it is only run when it is needed, so I see no harm in running it; perhaps quite the opposite.
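
For reference, a minimal sketch of what the reboot_server handler notified by that task could look like; only the handler name comes from the tasks quoted above, the body is an assumption:

# Hypothetical handler sketch in handlers/main.yml; only the name 'reboot_server' is from this PR.
- name: reboot_server
  ansible.builtin.reboot:
    reboot_timeout: 600
  become: true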

Contributor:

gpu_count on its own does not define anything about the status of the system.
nvidia-smi is the only way to reliably detect the status of the GPU devices.

This contradicts what is written in the README.md:

gpu_count is needed to install the driver, since any other automatic detection is failing sooner or later.

This includes nvidia-smi reporting wrong values. Either nvidia-smi (or some other tool) automagically detects whether GPUs are present and the driver needs to be installed, or we define the expected state in a variable like gpu_count and rely on that.

Even if nvidia-smi reported the correct number of GPUs, we should not skip gpu.yml. The driver may be installed, but various other tasks need to re-run to ensure idempotency:

  • Check if the correct version of the driver was installed.
  • Create the nvidia group and user with the correct UID and GID.
  • Install blacklist-nouveau.conf.
  • Etc. (see the sketch below)
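
A sketch of the kind of idempotency tasks meant here; the UID/GID values, file contents and task wording are assumptions for illustration, not the actual gpu.yml from this PR:

# Hypothetical sketch; UID/GID values and blacklist file contents are assumed.
- name: Create nvidia group.
  ansible.builtin.group:
    name: nvidia
    gid: 601
    state: present
  become: true

- name: Create nvidia user.
  ansible.builtin.user:
    name: nvidia
    uid: 601
    group: nvidia
    system: true
  become: true

- name: Blacklist the nouveau driver.
  ansible.builtin.copy:
    content: |
      blacklist nouveau
      options nouveau modeset=0
    dest: /etc/modprobe.d/blacklist-nouveau.conf
    owner: root
    group: root
    mode: '0644'
  become: true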

scimerman (Contributor Author):

This contradicts what is written in the README.md:

And all of that is correct. No changes to the README are needed; perhaps a word or two could be added if it is not clear.

This includes nvidia-smi reporting wrong values. Either nvidia-smi (or some other tool) automagically detects whether GPUs are present and the driver needs to be installed, or we define the expected state in a variable like gpu_count and rely on that.

No. These are two different things:

  • gpu_count specifies the expected number of working GPUs; this cannot be reliably detected automatically.
  • nvidia-smi detects how many working GPUs there actually are (if everything works as it should).

=========================================================
Long story short, the main question is:

  • Is it OK to skip the configuration part when all GPUs are up and running?

I advocate for yes.

=========================================================
Long story long:

Even if nvidia-smi reported the correct number of GPUs, we should not skip gpu.yml.

I think that if everything works, we should skip the whole thing. That was the entire purpose of detecting with nvidia-smi. If it works, then there is nothing to configure: the service is up and running, blacklisting of the non-NVIDIA driver was done, the nvidia user was created, and so on.

Check if the correct version of the driver was installed.

I don't see how the driver version would change on its own. If everything is working, why do we need to check the driver version?

Create the nvidia group and user with the correct UID and GID.

Without the nvidia user and nvidia group, nothing works, so gpu.yml gets triggered.

pneerincx (Contributor) left a comment:

See inline comments. The gpu role would always restart machines after testing the needs-restarting command and would always try to run the nvidia-smi command, even if gpu_count was not specified for a machine. I tried to refactor that to only install things if GPUs were specified for a machine and, if so, to make sure all steps of this role are re-run so the whole thing is idempotent.

pneerincx (Contributor) left a comment:

See inline comment.

gpu_driver_version not in modinfo.stdout|default("")|regex_search("version:.*"))

- name: Configure user and services
  ansible.builtin.include_tasks: user_services.yml
Contributor:

user_services.yml does not exist in this PR and gpu.yml does exist but is no longer used anywhere as far as I can tell. Looks like one or more commits are missing...
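
For context, the condition fragment quoted above suggests a modinfo-based check of the installed driver version. A hedged sketch of how such a check might be wired up; the task names, the extra default() guard and the included file are assumptions, only the gpu_driver_version comparison comes from the quoted line:

# Hypothetical sketch; only the gpu_driver_version/modinfo comparison is taken from this PR.
- name: Check which NVIDIA driver version is currently installed.
  ansible.builtin.command: 'modinfo nvidia'
  register: modinfo
  changed_when: false
  failed_when: false

- name: (Re)install the GPU driver when the installed version does not match gpu_driver_version.
  ansible.builtin.include_tasks: gpu.yml  # Assumed include; the refactored file layout is not shown here.
  when: gpu_driver_version not in (modinfo.stdout | default('') | regex_search('version:.*') | default('', true))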

scimerman (Contributor Author):

@pneerincx I have addressed the issues you raised and deployed the playbook several times on the GPU nodes. The playbook fails on the missing storage, which seems to have been 'coming soon ...' for too long now. I would recommend that we merge what we have done so far.
Another PR is coming for the ansible-pipelines: added EasyBuild PyTorch and TensorFlow.

pneerincx merged commit d34e0a9 into rug-cit-hpc:develop on Jan 4, 2023