Mapped missing Slurm job state codes #2

Merged (1 commit) on Jul 24, 2021

@stuvet commented Jul 24, 2021

Expected Behaviour

  • The default makeClusterFunctionsSlurm function should map every job state code returned by squeue to a reasonable general-purpose default.

Problem

  • Unmapped Slurm job state codes in makeClusterFunctionsSlurm resulted in getStatusTable returning an NA status, which triggered errors downstream or left running jobs orphaned by batchtools.

Mapping Strategy

  • The full list of Slurm job state codes is available in the Slurm squeue documentation.

Queued

  • Job is awaiting resources, the infrastructure is being configured/booted, or the job has been requeued.
    • PD,CF,RF,RH,RQ,SE

Running

  • Job is running, suspended, completing or otherwise retaining CPU resources, including resizing, being signalled, staging outfiles or in the 'stopped' state.
    • R,S,CG,RS,SI,SO,ST

Expired

  • RD (RESV_DEL_HOLD) was initially mapped to queued, but querying squeue by status=RD throws an error on Slurm v20.11.4, so it is left unhandled and therefore results in an expired status.
  • Job is not anticipated to require resources in the future, including failure of infrastructure, exit code, cancellation, completion, out of memory, preemption, & timeout.
    • BF,CA,CD,DL,F,NF,OOM,PR,RV,TO,RD
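
The three groups above can be pictured as a simple lookup. The following is a minimal, language-neutral sketch in Python, not the batchtools implementation (which is R); the function name `classify_state` and the fall-through rule for unlisted codes are assumptions made for illustration.

```python
# Hypothetical sketch of the mapping described in this PR; names are illustrative.
QUEUED = {"PD", "CF", "RF", "RH", "RQ", "SE"}
RUNNING = {"R", "S", "CG", "RS", "SI", "SO", "ST"}
# Listed for reference only: in this sketch, any code that is neither queued
# nor running falls through to "expired" (an assumption, mirroring how RD is
# described as "left unhandled").
EXPIRED = {"BF", "CA", "CD", "DL", "F", "NF", "OOM", "PR", "RV", "TO", "RD"}


def classify_state(code: str) -> str:
    """Return 'queued', 'running' or 'expired' for a squeue state code."""
    code = code.strip().upper()
    if code in QUEUED:
        return "queued"
    if code in RUNNING:
        return "running"
    return "expired"
```

With this grouping a completing job (CG) still counts as running, so it is not reported as expired while it retains resources.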

Custom Mapping

  • This commit will solve the majority of errors caused by running squeue at the wrong moment, when an unmapped job state code for a running job would trigger batchtools to report an incorrect expired status.
  • This commit will not solve all infrastructure-specific issues, for instance where Slurm requeues jobs after preemption. Users who need finer control over the mapping could try the default makeClusterFunctions.
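
As an illustration of what a site-specific mapping might look like, here is a hedged Python sketch (again not batchtools code). It assumes a hypothetical cluster where preempted jobs are requeued, so PR is treated as queued instead of expired; `DEFAULT_MAPPING` and `build_mapping` are invented names.

```python
from typing import Dict, Optional

VALID_STATUSES = {"queued", "running", "expired"}

# Defaults follow the grouping in this PR (RD included under expired).
DEFAULT_MAPPING: Dict[str, str] = {
    **{c: "queued" for c in ("PD", "CF", "RF", "RH", "RQ", "SE")},
    **{c: "running" for c in ("R", "S", "CG", "RS", "SI", "SO", "ST")},
    **{c: "expired" for c in ("BF", "CA", "CD", "DL", "F", "NF",
                              "OOM", "PR", "RV", "TO", "RD")},
}


def build_mapping(overrides: Optional[Dict[str, str]] = None) -> Dict[str, str]:
    """Return a copy of the defaults with site-specific overrides applied."""
    mapping = dict(DEFAULT_MAPPING)
    for code, status in (overrides or {}).items():
        if status not in VALID_STATUSES:
            raise ValueError(f"unknown status {status!r} for code {code!r}")
        mapping[code.upper()] = status
    return mapping


# Hypothetical site policy: preempted jobs are requeued, so keep tracking them.
site_mapping = build_mapping({"PR": "queued"})
```

Copying the defaults before applying overrides keeps the general-purpose mapping intact for other callers.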

@stuvet stuvet merged commit a3aeafe into master Jul 24, 2021