Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Nexus: Add --user $USER to squeue command #3796

Merged
merged 5 commits into from
Feb 3, 2022

Conversation

mcbennet
Copy link
Contributor

@mcbennet mcbennet commented Feb 1, 2022

Proposed changes

Adding flag --me to squeue command in machines.py. This prevents users from being forced to provide their usernames to avoid strange errors when other users format their job names in weird ways. This also resolves #2380

What type(s) of changes does this code introduce?

  • Other (please describe): Add stability

Does this introduce a breaking change?

  • No. All nxs-test tests pass

What systems has this change been tested on?

Laptop. Fedora Linux

Checklist

  • Yes. This PR is up to date with current the current state of 'develop'
  • N/A. Code added or changed in the PR has been clang-formatted
  • N/A. This PR adds tests to cover any new code, or to catch a bug that is being fixed
  • N/A. Documentation has been added (if appropriate)

@mcbennet mcbennet changed the title Add --me to squeue command Nexus: Add --me to squeue command Feb 1, 2022
@jtkrogel
Copy link
Contributor

jtkrogel commented Feb 2, 2022

What Slurm versions support the --me flag?

@mcbennet
Copy link
Contributor Author

mcbennet commented Feb 2, 2022

I found this running list of updates (https://github.com/SchedMD/slurm/blob/master/NEWSupdates). It does not mention --me being added to squeue all the way back to v2.1.0 -- so I assume it was supported all the back to this point(?).

CADES uses v20.02.4 and supports the --me flag
Cori uses v20.11.08 and supports the --me flag

Perhaps someone at Sandia could let us know if squeue --me is supported on those machines? @camelto2

@prckent
Copy link
Contributor

prckent commented Feb 2, 2022

It looks to have been added to the longoptions of slurm in SchedMD/slurm@8483780

@mcbennet
Copy link
Contributor Author

mcbennet commented Feb 2, 2022

Thanks Paul, but I am correct that the commit there relates to "auto completion" functionality? And it looks to be commited in v21, however, Cori and CADES support the flag and use v20

@mcbennet
Copy link
Contributor Author

mcbennet commented Feb 2, 2022

It was added to the man page 5 months ago it appears

Line 1362 of
https://github.com/SchedMD/slurm/blame/master/doc/man/man1/squeue.1

@ye-luo
Copy link
Contributor

ye-luo commented Feb 2, 2022

Can you just use os.getenv('USER') and then -u $USER? This should work in all job management I have ever seen.

@mcbennet
Copy link
Contributor Author

mcbennet commented Feb 2, 2022

Correction to my previous comment, it appears to show up in man page for the first time on Feb 2020

https://github.com/SchedMD/slurm/blob/cc6a1be5ee2cf0e1ac5a7c14a6df97023701e8e8/doc/man/man1/squeue.1

@ye-luo This might be okay, but will let Jaron weigh in

@prckent
Copy link
Contributor

prckent commented Feb 2, 2022

Wearing my sometimes sysadmin hat, updating slurm is one of the last things i would want to do on a running system. So we can't rely on this --me feature being available or slurm being highly up to date => this can only be a nice to have and not overly important feature. I think Jaron is the decider here. Catching any errors would be wise.

@mcbennet
Copy link
Contributor Author

mcbennet commented Feb 2, 2022

Just throwing out one other alternative that is worth chewing on. We could consider formatting squeue's output such that it only gives what nexus needs (jobid and status only?). Something like

For instance,

squeue -o "%.18i %.2t" 

Or something similar. The parsing done by Nexus will need a minor change as well

@jtkrogel
Copy link
Contributor

jtkrogel commented Feb 2, 2022

I like the simplicity offered here, but I have to agree with Paul that the chance of breaking based on the local environment is too high.

For the feature to be introduced, I think the current behavior with user should be preserved and a new flag used instead to invoke querying with --me (perhaps invoked as user_me=True with a default of False).

@mcbennet
Copy link
Contributor Author

mcbennet commented Feb 2, 2022

The underlying "issue" is not resolved unless we change the default behavior in some way. What users can run into (on Cori for instance) is some strange error like job state jackm is unrecognized -- this is due to squeue not providing a consistent format across output lines then Nexus is not able to parse correctly. IMHO, the solution offered by @ye-luo or the solution to reformat the squeue output via -o are clean ways to resolve this and do not require recent slurm versions

@jtkrogel
Copy link
Contributor

jtkrogel commented Feb 2, 2022

Perhaps squeue --user $USER can be the default (test it at a few sites please). There needs to be a way to turn this behavior off, perhaps by passing user=False to settings.

@mcbennet
Copy link
Contributor Author

mcbennet commented Feb 3, 2022

The following command has been run on multiple machines at multiple sites

execute('squeue --user=$USER')

The tested machines and their results are:

  • OLCF/Andes (SUCCESS)
  • NERSC/Cori (SUCCESS)
  • Sandia machines (SUCCESS)
  • ORNL/CADES (SUCCESS)

The SLURM machines that Nexus supports:

Edison (RETIRED)
Cori
Chama
Uno
Eclipse
Attaway
Skybridge
Solo
Stampede2
CadesSlurm
Rhea
Andes

@jtkrogel jtkrogel changed the title Nexus: Add --me to squeue command Nexus: Add --user $USER to squeue command Feb 3, 2022
@jtkrogel
Copy link
Contributor

jtkrogel commented Feb 3, 2022

Good. I think we can proceed with a change to using --user=$USER.

Perhaps the logic should be:

if isinstance(self.user,bool) and self.user==False:
    extra = ''
elif self.user is not None:
    extra = ' -u {}'.format(self.user)
else:
    extra = ' --user=$USER'
#end if

@mcbennet
Copy link
Contributor Author

mcbennet commented Feb 3, 2022

Thank you @jtkrogel . pushed

@jtkrogel
Copy link
Contributor

jtkrogel commented Feb 3, 2022

LGTM. Tests pass?

@mcbennet
Copy link
Contributor Author

mcbennet commented Feb 3, 2022

Yes. 100% of nxs-test tests pass

@jtkrogel jtkrogel self-requested a review February 3, 2022 18:49
Copy link
Contributor

@jtkrogel jtkrogel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great

@ye-luo
Copy link
Contributor

ye-luo commented Feb 3, 2022

Test this please

@ye-luo ye-luo enabled auto-merge February 3, 2022 19:04
@ye-luo ye-luo merged commit 8a6fe41 into QMCPACK:develop Feb 3, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

How to contine the running using NEXUS
4 participants