Running as part of global-workflow fails in exgrid2grid_step2.sh with srun: no record for task id 1 #133

Closed
DWesl opened this issue Aug 29, 2024 · 2 comments

Comments

@DWesl
Contributor

DWesl commented Aug 29, 2024

A C768 run as part of global-workflow produces a job specification with nodes=1, ppn=4, and tpp=1.
Running through ush/run_verif_global_in_global_workflow.sh produces a job with nproc=${npe_node_metp_gfs}=1.
On HERA, scripts/exgrid2grid_step1.sh launches the METplus job with srun --multi-prog /path/to/task-file, where task-file has nproc lines, one per command to execute. srun then fails because the task file describes fewer tasks than srun expects; I think it is defaulting to four tasks.

Changing scripts/exgrid2grid_step1.sh to specify --ntasks ${nproc} as part of the srun command allows the process to finish. A better solution probably involves changing how ush/run_verif_global_in_global_workflow.sh determines nproc: man sbatch suggests SLURM_NTASKS, but global-workflow probably has a variable to specify the number of threads that would be less closely tied to the job manager.
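
A minimal sketch of that workaround, assuming a hypothetical task file in srun's --multi-prog format (one line per task rank). The file contents here are illustrative and are not the commands verif-global generates, and the srun command is only printed rather than executed so the sketch runs without Slurm:

```shell
# Assumption: nproc comes from the calling environment; default to 1 here.
nproc=${nproc:-1}

# Hypothetical task file in srun --multi-prog format: "<task rank> <command>".
task_file=$(mktemp)
i=0
while [ "${i}" -lt "${nproc}" ]; do
    echo "${i} echo task ${i}" >> "${task_file}"
    i=$((i + 1))
done

# Pass --ntasks explicitly so srun is told to start exactly nproc tasks,
# matching the number of lines in the task file, instead of inferring a
# larger count from the job allocation (suspected to be 4 in this report).
srun_cmd="srun --ntasks ${nproc} --multi-prog ${task_file}"
echo "${srun_cmd}"
```

The point of the design is only that the task count passed to srun and the number of ranks described in the task file come from the same variable.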

@DavidHuber-NOAA
Contributor

@DWesl This was recently fixed in the global-workflow as part of an overhaul of the resource configuration system. The job now runs with a single task by default. See NOAA-EMC/global-workflow#2804 and let me know if updating your global-workflow resolves the issue.

@DWesl
Contributor Author

DWesl commented Aug 30, 2024

The new setting in verif-global (should have checked this earlier) references nproc and defaults to one:

## Run settings for machines
export MPMD="YES"
export nproc=${nproc:-1}

and global-workflow sets nproc in config.metp, which should solve the problem more generally.
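
For reference, the ${nproc:-1} form is standard shell parameter expansion: it keeps a value already exported by the caller and falls back to 1 only when nproc is unset or empty. A small sketch, where the value 4 is just an arbitrary stand-in for whatever config.metp would export:

```shell
# Case 1: nproc is not set by the caller, so the default of 1 applies.
unset nproc
export nproc=${nproc:-1}
default_value=${nproc}
echo "default: ${default_value}"

# Case 2: the caller exports nproc first (4 is an arbitrary example value),
# so the same line leaves it untouched.
export nproc=4
export nproc=${nproc:-1}
caller_value=${nproc}
echo "from caller: ${caller_value}"
```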

@DWesl DWesl closed this as completed Aug 30, 2024