GPU role (first part) #670
Conversation
See inline questions/comments.
…ssh_args for synchronize tasks due to incompatibility issue with latest mitogen release.
…ink, which is only present when the service is enabled.
roles/gpu/tasks/main.yml (outdated):

- name: Check if system needs to be restarted
  ansible.builtin.command: '/bin/needs-restarting -r'
  register: needs_restarting
  failed_when: 'needs_restarting.rc > 1'
  changed_when: 'needs_restarting.rc == 1'
  become: true
  notify: reboot_server

- name: Reboot system if needed
  ansible.builtin.meta: flush_handlers

- name: Check how many NVidia devices is up and running (might take some time)
  ansible.builtin.command: 'nvidia-smi -L'
  register: smi
  changed_when: false
  failed_when: false
  become: false # running nvidia-smi as root stops the service

- name: Install GPU driver if not all GPU devices are present and working
  ansible.builtin.include_tasks: gpu.yml
  when: ( gpu_count is defined ) and
        ( smi.stdout|default([])|lower|regex_findall('nvidia')|length != gpu_count )
Suggested change (replace the four tasks above with):

- name: Install GPU driver if one or more GPUs were specified for this machine.
  ansible.builtin.include_tasks: gpu.yml
  when: gpu_count is defined and gpu_count | length >= 1
I do not like this being moved into gpu.yml. gpu_count on its own does not define anything about the status of the system. nvidia-smi is the only way to reliably detect the status of the GPU devices. Therefore we need to run nvidia-smi and, based on its result, decide whether we need to (re)install the GPU driver. If nvidia-smi is executed on a non-GPU machine - fine. It will detect (together with gpu_count) that this machine needs to be skipped.
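A minimal sketch of that decision, reusing the 'smi' register from the tasks shown earlier; the working_gpu_count fact is an illustrative name and not part of this PR:

# Hypothetical sketch: derive the number of working GPUs from the registered
# 'nvidia-smi -L' output and compare it with the expected gpu_count;
# on machines where gpu_count is undefined the include is skipped.
- name: Count working GPUs reported by nvidia-smi
  ansible.builtin.set_fact:
    working_gpu_count: "{{ smi.stdout | default('') | lower | regex_findall('nvidia') | length }}"

- name: (Re)install GPU driver when not all expected GPUs are working
  ansible.builtin.include_tasks: gpu.yml
  when:
    - gpu_count is defined
    - working_gpu_count | int != gpu_count | int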
We could put the "Check if system needs to be restarted" task into gpu.yml. But then again, it is only run when it is needed, so I see no harm in running it - perhaps quite the opposite.
"gpu_count on its own does not define anything about the status of the system. nvidia-smi is the only way to reliably detect the status of the GPU devices."

This contradicts what was written in the README.md:

"gpu_count is needed to install the driver, since any other automatic detection is failing sooner or later."

This includes nvidia-smi reporting wrong values. Either nvidia-smi or some other tool automagically detects whether GPUs are present and the driver needs to be installed, or we define the expected state in a variable like gpu_count and rely on that.

Even if nvidia-smi would report the correct number of GPUs, we should not skip gpu.yml. The driver may be installed, but there are various other tasks that need to re-run to ensure idempotency (a sketch of such tasks follows this list):
- Check if the correct version of the driver was installed.
- Create the nvidia group and user with the correct UID and GID.
- Install blacklist-nouveau.conf.
- etc.
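For illustration, a minimal sketch of such idempotent housekeeping tasks; the UID/GID values, blacklist content and file path are placeholder assumptions, not values taken from this role:

# Hypothetical examples of housekeeping tasks that must be re-run to keep
# the node in the desired state; names, IDs and content are placeholders.
- name: Create nvidia group with a fixed GID
  ansible.builtin.group:
    name: nvidia
    gid: 601            # placeholder GID
    state: present
  become: true

- name: Create nvidia user with a fixed UID
  ansible.builtin.user:
    name: nvidia
    uid: 601            # placeholder UID
    group: nvidia
    system: true
  become: true

- name: Blacklist the nouveau driver
  ansible.builtin.copy:
    dest: /etc/modprobe.d/blacklist-nouveau.conf
    content: |
      blacklist nouveau
      options nouveau modeset=0
    owner: root
    group: root
    mode: '0644'
  become: true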
"This contradicts what was written in the README.md:"

And all of that is correct. No changes to the README are needed. Perhaps a word or two could be added, if it is not clear.

"This includes nvidia-smi reporting wrong values. Either nvidia-smi or some other tool automagically detects whether GPUs are present and the driver needs to be installed, or we define the expected state in a variable like gpu_count and rely on that."

No. These are two different things:
- gpu_count states the expected number of working GPUs. This cannot be detected automatically - reliably - (if all is working as it should).
- nvidia-smi detects how many working GPUs there actually are.
=========================================================
Long story short; the main question is:
- Is it OK to skip the configuration part when all GPUs are up and running?
I advocate for yes.
=========================================================
Long story long:

"Even if nvidia-smi would report the correct number of GPUs, we should not skip gpu.yml."

I think if everything works, we should skip the whole thing. That was the entire purpose of detecting with nvidia-smi. If it works, then there is nothing to be configured: the service is up and running, blacklisting of the non-NVIDIA driver was done, the nvidia user was created ...

"Check if the correct version of the driver was installed."

I don't see how the driver version would change on its own. If everything is working, why do we need to check the driver version?

"Create the nvidia group and user with the correct UID and GID."

Without the nvidia user and nvidia group, nothing works -> gpu.yml gets triggered.
See inline comments. The gpu role would always restart machines after testing the needs-restarting command and would always try to run the nvidia-smi command, even if gpu_count was not specified for a machine. I tried to refactor that to only install stuff if GPUs were specified for a machine and, if so, to make sure all steps of this role are re-run so the whole thing is idempotent.
See inline comment.
roles/gpu/tasks/main.yml (outdated):

        gpu_driver_version not in modinfo.stdout|default("")|regex_search("version:.*"))

- name: Configure user and services
  ansible.builtin.include_tasks: user_services.yml
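The first line of this snippet is only the tail of a when: condition. A hedged reconstruction of the surrounding context, assuming a 'modinfo' probe feeds that condition (task names and the exact expression are illustrative, not code from this PR):

# Hypothetical sketch: query the installed NVIDIA kernel module and
# re-run the driver installation when the expected version is not found.
- name: Check which NVIDIA kernel module version is installed
  ansible.builtin.command: 'modinfo nvidia'
  register: modinfo
  changed_when: false
  failed_when: false

- name: (Re)install GPU driver when the expected version is not installed
  ansible.builtin.include_tasks: gpu.yml
  when: gpu_driver_version not in (modinfo.stdout | default('') | regex_search('version:.*') | default('', true))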
user_services.yml does not exist in this PR, and gpu.yml does exist but is no longer used anywhere as far as I can tell. Looks like one or more commits are missing...
@pneerincx I have addressed the issues you raised and deployed the playbook several times on GPU machines. The playbook fails on the missing storage, which seems to have been 'coming soon ...' for too long now. I would recommend that we merge what we have done so far.
This is the first part of the role: driver installation on GPU compute nodes.
Did not have time to test it extensively, but so far the GPUs worked. Tested on a single-GPU machine and a multi-GPU machine, plus on a few non-GPU machines.
First part of the README provided.
The software (libraries and compilers) has not yet been implemented.