Added cluster_info #752
Conversation
Tested and updated call methods, which are now working as intended.
Co-authored-by: Jeff Ohrstrom <johrstrom@osc.edu>
lib/ood_core/job/adapters/slurm.rb
Outdated
sinfo_out2 = call("sinfo", "-Nhao %n/%G/%T").lines.uniq
gpu_total = sinfo_out2.grep(/gpu:/).count
gpu_free = sinfo_out2.grep(/gpu:/).grep(/idle/).count
sinfo -Nhao %n/%G/%T
I think this is just node status. I don't know if allocated means the GRES was also allocated. Meaning, I could have a job on a node that has a GPU even though I didn't request or allocate the GPU.
I think we actually want something more like sinfo -Nha -O 'GresUsed', which indicates whether the GRES is in use or not (as opposed to the node being in use or not).
@treydock please advise: is that the right way to get the total GPUs in use vs. GPUs total?
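To make the GresUsed suggestion concrete, here is a rough sketch of summing used GPUs from that style of output. The sample line formats and the parsing are assumptions (the exact GresUsed format varies by Slurm version and gres.conf), not the adapter's code:

```ruby
# Hypothetical sketch: sum used GPUs from `sinfo -Nha -O 'GresUsed'` output.
# Line formats below (e.g. "gpu:v100:2(IDX:0-1)", "gpu:0") are assumed.
def gpus_used_from_gres_used(lines)
  lines.sum do |line|
    # Capture the count sitting right before "(IDX:...)", a comma, or the
    # end of the line, for each gpu: entry.
    line.scan(/gpu:[^(,\s]*?(\d+)(?:\(|,|$)/).flatten.map(&:to_i).sum
  end
end

sample = ["gpu:v100:2(IDX:0-1)", "gpu:0", "(null)"]
gpus_used_from_gres_used(sample)  # => 2
```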
Oh I see, we're returning gpu_nodes_active. No, I think the better information is GPUs active; there are multiple GPUs on a given node. I think reporting the # of GPU nodes is secondary to reporting the # of GPUs themselves.
Collecting GRES used with sinfo is actually a complicated process; I ran into this with the Slurm exporter. Information about used GRES is only available via the --Format flag, so the printf-style %G can't show what's used, just what's available.
First you must get the longest GRES line (they can get very long). You can likely cache this, since GRES definitions only change when slurm.conf and gres.conf are changed and slurmctld is restarted:
sinfo -o '%G'
Then get the GRES data:
sinfo -a -h --Node --Format=nodehost,gres:%d,gresused:%d
where you replace %d with the length of the longest GRES line, plus a few characters if you want a buffer.
In case it's not clear: with the sinfo -o '%G' output, you have to iterate over all lines to find the longest one and use that length with --Format.
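As a sketch of the two-step approach described above: the field names (nodehost, gres, gresused) are real sinfo --Format fields, but the sample GRES lines, the nodehost width, and the +2 buffer are illustrative assumptions so the snippet is self-contained:

```ruby
# Step 1 would normally shell out to `sinfo -o '%G'`; sample output is
# hard-coded here so the sketch runs without a Slurm cluster.
gres_lines = ["(null)", "gpu:v100-32g:2", "gpu:a100:4"]
width = gres_lines.map(&:length).max + 2  # longest GRES line plus a buffer

format_spec = "nodehost:30,gres:#{width},gresused:#{width}"
# Step 2 (not executed here) would then be:
#   sinfo -a -h --Node --Format='#{format_spec}'
# with each fixed-width line parsed into host / gres / gresused columns.
puts format_spec  # => "nodehost:30,gres:16,gresused:16"
```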
lib/ood_core/job/adapters/slurm.rb
Outdated
def gpus_from_gres(gres)
  gres.to_s.scan(/gpu:[^,]*(\d+)/).flatten.map(&:to_i).sum
end
I don't think you need this redefinition here anymore.
If that is kept, I think that regex is wrong. Where it might cause issues:
gres/gpu:v100-32g=2
That's a GRES definition on Pitzer; the numbers are in the subtype. Depending on where you query the GRES, i.e. which command, you could also split on = and take the last element.
Also, less complicated clusters will have something like gres/gpu:a100=4 (for Ascend). This is why I think a regex is going to cause more problems than simply splitting on the known delimiters.
The regex still works on these cases: #754. The regex captures the last run of digits before a comma or the end of the string, provided the entry is preceded by gpu:, so these cases hold.
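For reference, the regex from the diff can be checked directly against the GRES strings mentioned in this thread (the Pitzer-style subtype with "=" and the simpler typed GRES):

```ruby
# The gpus_from_gres from the diff above, exercised on the examples
# raised in this review thread.
def gpus_from_gres(gres)
  gres.to_s.scan(/gpu:[^,]*(\d+)/).flatten.map(&:to_i).sum
end

gpus_from_gres("gres/gpu:v100-32g=2")  # => 2
gpus_from_gres("gres/gpu:a100=4")      # => 4
gpus_from_gres("gpu:2,mps:200")        # => 2
```

The `[^,]*` keeps the match inside one comma-delimited GRES entry, so the trailing count is captured even when the subtype itself contains digits.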
I don't think you need this redefinition here anymore.
gpus_from_gres isn't defined here, since it's defined in Slurm while get_cluster_info is defined in Batch, which is nested in Slurm. Should Batch extend Slurm?
Lol that's hilarious. This'll do just fine in a pinch.
def gpus_from_gres
  @slurm.gpus_from_gres
end
This'll do just fine in a pinch.
def gpus_from_gres; @slurm.gpus_from_gres; end
That doesn't work either:
undefined method `gpus_from_gres' for nil:NilClass
Right, because I've got it backwards. Yeah, I'd say let's move that function into Batch, then just reference it through @slurm.gpus_from_gres. Other adapters have a helper class, which is where it'd go, but alas, we did not use that here.
Right because I've got it backwards. Yea I'd say let's move that function into Batch then just reference it through @slurm.gpus_from_gres. Other adapters have a helper class which is where it'd go, but alas, we did not use that here.
That doesn't work either: Slurm.parse_job_info can't load @batch.gpus_from_gres because it can't access batch. To fix this, I defined gpus_from_gres just above the Batch class definition, so that it is accessible everywhere in both classes. 7eca11b
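A minimal, hypothetical illustration of the scoping issue and this kind of fix. The class and method names loosely mirror the adapter, but this is not the actual ood_core code; defining the helper as a class method on the outer class, above the nested Batch class, is one way to make it reachable from both:

```ruby
# Minimal illustration only; names loosely mirror the adapter.
class Slurm
  # Class method on the outer class: both Slurm instances and the nested
  # Batch class can reach it.
  def self.gpus_from_gres(gres)
    gres.to_s.scan(/gpu:[^,]*(\d+)/).flatten.map(&:to_i).sum
  end

  class Batch
    def get_cluster_info(gres)
      # The nested class resolves the Slurm constant lexically.
      Slurm.gpus_from_gres(gres)
    end
  end

  def parse_job_info(gres)
    self.class.gpus_from_gres(gres)
  end
end

Slurm::Batch.new.get_cluster_info("gpu:a100=4")  # => 4
Slurm.new.parse_job_info("gpu:2")                # => 2
```

An instance method defined before the nested class would behave similarly for Slurm itself, but a class method avoids needing an instance of either class to share the helper.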
Co-authored-by: Jeff Ohrstrom <johrstrom@osc.edu>
Can't test the current version because it uses the call method, which requires loading the Slurm config at /etc/slurm/slurm.conf, but it was tested and works with regular backtick calls. I also would like to get feedback about my comments.