-
Notifications
You must be signed in to change notification settings - Fork 176
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Series of small WCOSS2 updates - round 5 #546
Series of small WCOSS2 updates - round 5 #546
Conversation
- adjust natstr on WCOSS2 to not add vscatter for archive jobs - correct a bug with setting nth_fcst_gfs; the gdas nth_fcst was being used for the gfs and nth_fv3[fcst]_gfs was being ignored; need to set threads differently between gdas and gfs fcst jobs Refs: NOAA-EMC#399
- update C384 and C768 settings in config.fv3 based on tests of gdasfcst, gfsfcst, and gdasefcs jobs - gfsfcst still runs long, further optimization will occur Refs: NOAA-EMC#399
- resource updates for most jobs after tests on WCOSS2 - add memory settings for many jobs (not previously defined) Refs: NOAA-EMC#399
- change FIX_DIR path on WCOSS2 in link_fv3gfs.sh to point to more official emc.global space - new FIX_DIR=/lfs/h2/emc/global/save/emc.global/FIX/fix_nco_gfsv16 Refs: NOAA-EMC#399
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The resource difference for some of these jobs is remarkable. When support is added to develop, we may need to change how resources are computed so that it is by-machine.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changes look good.
I have some minor comments about the memory request to 500GB
. Why not request the full memory of 512GB
rather than the arbitrary 500GB
.
Also I spotted a bug in workflow_utils.py
that probably got carried from python2 to python3 conversion.
@@ -133,6 +134,7 @@ elif [ $step = "anal" ]; then | |||
export npe_node_anal=$(echo "$npe_node_max / $nth_anal" | bc) | |||
export nth_cycle=$npe_node_max | |||
export npe_node_cycle=$(echo "$npe_node_max / $nth_cycle" | bc) | |||
if [[ "$machine" = "WCOSS2" ]]; then export memory_anal="500GB"; fi |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this needed?
The total memory on the WCOSS2 node is 512GB. If this resource request is not passed, does it explicitly request the full memory or does it do some calculation based on the number of cores requested?
@GeorgeVandenberghe-NOAA might know the answer.
@@ -152,17 +154,17 @@ elif [ $step = "analdiag" ]; then | |||
export nth_analdiag=1 | |||
export npe_node_analdiag=$npe_node_max | |||
if [[ "$machine" == "WCOSS_C" ]]; then export memory_analdiag="3072M"; fi | |||
if [[ "$machine" = "WCOSS2" ]]; then export memory_analdiag="500GB"; fi |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same question as above.
Seeing this is a single node job, may be request max memory instead of something that is arbitrarily smaller than the max.
ush/rocoto/workflow_utils.py
Outdated
if cdump in ['gfs'] and f'nth_{task}_gfs' in cfg.keys(): | ||
threads = cfg[f'nth_{ltask}_gfs'] | ||
else: | ||
threads = cfg[f'nth_{ltask}'] | ||
except KeyError: | ||
threads = cfg["',)nth_epos"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a bug from the python2 to python3 conversion PR.
I don't know what the ',)
is doing here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have now removed the errant ',)
text from threads = cfg["',)nth_epos"]
. Committed @ 6239494. Thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think that was errant. I remember fixing something with that at one point in develop. Let me see if I can track down that commit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @WalterKolczynski-NOAA , I have further cleaned up that section of workflow_utils.py based on your change @ 68385d9. Committed to my branch @ 521bad7.
Yes please. Note: I haven't changed anything in config.fv3.emc.dyn nor config.resources.emc.dyn since some more thought and testing on the R&D machines will be needed to update them. I've focused on the ops versions of those configs for now. |
- removed errant "',)" in thread setting Refs: NOAA-EMC#399
- found other instance of errant "',)" text in workflow_utils.py - searched for more, did not uncover any further instances Refs: NOAA-EMC#399
- remove "try"s from tasks and threads sections of get_resources function - remove "except KeyError" checks in tasks/threads sections Refs: NOAA-EMC#399
This PR contains the following WCOSS2 updates:
Refs: #399