Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for Slurm heterogeneous jobs #346

Merged
merged 20 commits into from
Aug 25, 2023

Conversation

al-rigazzi
Copy link
Collaborator

@al-rigazzi al-rigazzi commented Aug 23, 2023

This PR adds (limited) support for Slurm heterogeneous jobs. Basically, het jobs overload the syntax used by MPMD workloads, and this can be the cause of problems when running on Slurm. This PR adds several checks to see if the allocation is heterogeneous and prevents the user from running MPMD models (or single_cmd orchestrators).

Still missing:

  • Test for full_wlm section

@al-rigazzi al-rigazzi requested review from ashao and removed request for ashao August 23, 2023 18:08
@al-rigazzi al-rigazzi added area: settings Issues related to Batch or Run settings area: launcher Issues related to any of the launchers within SmartSim area: workload manager Issues specific to workload managers labels Aug 23, 2023
@al-rigazzi al-rigazzi self-assigned this Aug 23, 2023
Copy link
Member

@ashao ashao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for putting this in and for working through and around Slurm

@@ -270,6 +274,17 @@ def set_walltime(self, walltime: str) -> None:
"""
self.run_args["time"] = str(walltime)

def set_het_group(self, het_group: int) -> None:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One other thought that I just had. Thrown an error here and also in make_mpmd in case someone tries to set a het_group and also mpmd at the same time.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep!

@codecov
Copy link

codecov bot commented Aug 23, 2023

Codecov Report

Merging #346 (32f38fa) into develop (d7a6b60) will not change coverage.
The diff coverage is 100.00%.

Additional details and impacted files

Impacted file tree graph

@@           Coverage Diff            @@
##           develop     #346   +/-   ##
========================================
  Coverage    87.04%   87.04%           
========================================
  Files           59       59           
  Lines         3551     3551           
========================================
  Hits          3091     3091           
  Misses         460      460           
Files Changed Coverage Δ
smartsim/database/orchestrator.py 83.91% <100.00%> (ø)

Copy link
Member

@MattToast MattToast left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great!! I have a couple of small knit picks, and some potential follow on work that may or may not be relevant. Lmk what you think!!

smartsim/database/orchestrator.py Outdated Show resolved Hide resolved
smartsim/settings/slurmSettings.py Outdated Show resolved Hide resolved
smartsim/_core/launcher/slurm/slurmLauncher.py Outdated Show resolved Hide resolved
smartsim/settings/slurmSettings.py Outdated Show resolved Hide resolved
smartsim/settings/slurmSettings.py Outdated Show resolved Hide resolved
tests/full_wlm/test_het_job.py Outdated Show resolved Hide resolved
tests/full_wlm/test_het_job.py Outdated Show resolved Hide resolved
tests/full_wlm/test_het_job.py Outdated Show resolved Hide resolved
tests/full_wlm/test_het_job.py Outdated Show resolved Hide resolved
Copy link
Member

@MattToast MattToast left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for taking care of those knit picks!! Everything here LGTM!!

@al-rigazzi al-rigazzi merged commit 1d4d7a9 into CrayLabs:develop Aug 25, 2023
10 checks passed
@al-rigazzi al-rigazzi deleted the slurm-het-jobs branch August 25, 2023 14:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area: launcher Issues related to any of the launchers within SmartSim area: settings Issues related to Batch or Run settings area: workload manager Issues specific to workload managers
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants