Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bug in rt.sh jet section when ACCNR is not set #1014

Closed
climbfuji opened this issue Jan 20, 2022 · 11 comments
Closed

Bug in rt.sh jet section when ACCNR is not set #1014

climbfuji opened this issue Jan 20, 2022 · 11 comments
Assignees
Labels
bug Something isn't working

Comments

@climbfuji
Copy link
Collaborator

climbfuji commented Jan 20, 2022

Description

A recent PR (#981) updated the jet section in rt.sh to allow setting ACCNR instead of using the hardcoded h-nems. A fallback/default value h-nems was intended in this part:

ACCNR=${ACCNR:-h-nems}
PARTITION=xjet
DISKNM=/lfs4/HFIP/h-nems/emc.nemspara/RT
dprefix=${dprefix:-/lfs4/HFIP/$ACCNR/$USER}

The problem is that bash gets confused by the - in the h-nems default value, which results in

dprefix=/lfs4/HFIP/nems/$USER

and causes rt.sh abort with

+ ACCNR=nems
+ PARTITION=xjet
+ DISKNM=/lfs4/HFIP/h-nems/emc.nemspara/RT
+ dprefix=/lfs4/HFIP/nems/emc.nemspara
+ STMP=/lfs4/HFIP/nems/emc.nemspara/RT_BASELINE
+ PTMP=/lfs4/HFIP/nems/emc.nemspara/RT_RUNDIRS
+ SCHEDULER=slurm
+ cp fv3_conf/fv3_slurm.IN_jet fv3_conf/fv3_slurm.IN
+ cp fv3_conf/compile_slurm.IN_jet fv3_conf/compile_slurm.IN
+ mkdir -p /lfs4/HFIP/nems/emc.nemspara/RT_BASELINE/emc.nemspara
mkdir: cannot create directory '/lfs4/HFIP/nems': Permission denied
++ echo 'rt.sh error on line 395'

To Reproduce:

Try running rt.sh on jet without setting ACCNR. Then, oops, look at the bug.

@climbfuji climbfuji added the bug Something isn't working label Jan 20, 2022
@junwang-noaa
Copy link
Collaborator

I got the error too, so I was setting the ACCNR to "h-nems". I think the ACCNR was set to nems by default.sh. Maybe we need to add a line in jet section in rt.sh to reset ACCNR to "h-nems" if it's "nems".

@climbfuji
Copy link
Collaborator Author

I got the error too, so I was setting the ACCNR to "h-nems". I think the ACCNR was set to nems by default.sh. Maybe we need to add a line in jet section in rt.sh to reset ACCNR to "h-nems" if it's "nems".

Your explanation makes a lot more sense than my assumption that the syntax could be messed up by a dash in the default value. I don't think there should be an entry in default_vars.sh at all, since the ACCNR depends on the HPC.

@SamuelTrahanNOAA
Copy link
Collaborator

If the ACCNR default should depend on the platform, then default_vars.sh should not set it. A special case of overriding nems with h-nems (even when it is set manually to nems by the user) seem sloppy.

Also, ACCR is the wrong variable, you should set ACCNR=h-nems in your script before running rt.sh. If you set ACCR, then the error is yours. Was that a typo in your title? Or an actual error in your execution of rt.sh?

@climbfuji climbfuji changed the title Bug in rt.sh jet section when ACCR is not set Bug in rt.sh jet section when ACCNR is not set Jan 20, 2022
@climbfuji
Copy link
Collaborator Author

Typo in the tile, fixed it.

@SamuelTrahanNOAA
Copy link
Collaborator

I will test a fix for this (move the default to rt.sh) and issue a PR tomorrow-ish. It'll have to be tested on all platforms, in both baseline generation and regression test mode. That means we should merge it with a PR that changes the results.

@climbfuji
Copy link
Collaborator Author

I will test a fix for this (move the default to rt.sh) and issue a PR tomorrow-ish. It'll have to be tested on all platforms, in both baseline generation and regression test mode. That means we should merge it with a PR that changes the results.

Thanks a lot for doing this, Sam! Yes, agreed, should be part of a baseline-changing PR. And needs to be tested manually as well, because auto-rt is always setting ACCNR (that's why the error didn't show up).

@junwang-noaa
Copy link
Collaborator

Thanks, Sam. According to current commit queue, it will be with PR#982 planned on 1/25.

@SamuelTrahanNOAA
Copy link
Collaborator

Barring extraordinary events, that should be doable.

@SamuelTrahanNOAA
Copy link
Collaborator

The fix is in #1016 but I only did minimal testing to avoid wasting resources. I tested enough to ensure ACCNR was passed correctly. Please give me another PR to merge into #1016 as soon as possible so I can do full testing.

There may be an unwanted side-effect: If someone was setting ACCNR incorrectly before running rt.sh, that incorrect value will now be passed through to the batch system. Make sure your automated test suites are setting the right ACCNR value, or none at all!

@SamuelTrahanNOAA
Copy link
Collaborator

@climbfuji When PR #1009 was merged, it should have fixed your problem. Can you confirm and close the ticket?

@junwang-noaa
Copy link
Collaborator

The autoRT runs fine on jet now. I will close the issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
3 participants