Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Series of small WCOSS2 updates - round 5 #546

Conversation

KateFriedman-NOAA
Copy link
Member

@KateFriedman-NOAA KateFriedman-NOAA commented Jan 5, 2022

This PR contains the following WCOSS2 updates:

  1. update WAFS tag to gfs_wafs.v6.2.4
  2. update workflow_utils.py to a) not add vscatter native setting on WCOSS2 for archive jobs and b) correct bug that ignored nth_*_gfs variables (needed to set threads differently between gdasfcst and gfsfcst jobs)
  3. update config.fv3.nco.static and config.resources.nco.static with resource settings based on WCOSS2 testing (further optimization still to come)
  4. update FIX_DIR path in link_fv3gfs.sh to more official emc.global space (/lfs/h2/emc/global/save/emc.global/FIX/fix_nco_gfsv16)

Refs: #399

- adjust natstr on WCOSS2 to not add vscatter for archive jobs
- correct a bug with setting nth_fcst_gfs; the gdas nth_fcst was being
used for the gfs and nth_fv3[fcst]_gfs was being ignored; need to set
threads differently between gdas and gfs fcst jobs

Refs: NOAA-EMC#399
- update C384 and C768 settings in config.fv3 based on tests of
gdasfcst, gfsfcst, and gdasefcs jobs
- gfsfcst still runs long, further optimization will occur

Refs: NOAA-EMC#399
- resource updates for most jobs after tests on WCOSS2
- add memory settings for many jobs (not previously defined)

Refs: NOAA-EMC#399
- change FIX_DIR path on WCOSS2 in link_fv3gfs.sh to point to more
official emc.global space
- new FIX_DIR=/lfs/h2/emc/global/save/emc.global/FIX/fix_nco_gfsv16

Refs: NOAA-EMC#399
@KateFriedman-NOAA KateFriedman-NOAA added the port Porting the system to a new platform label Jan 5, 2022
@KateFriedman-NOAA KateFriedman-NOAA self-assigned this Jan 5, 2022
Copy link
Contributor

@WalterKolczynski-NOAA WalterKolczynski-NOAA left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The resource difference for some of these jobs is remarkable. When support is added to develop, we may need to change how resources are computed so that it is by-machine.

Copy link
Contributor

@aerorahul aerorahul left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changes look good.
I have some minor comments about the memory request to 500GB. Why not request the full memory of 512GB rather than the arbitrary 500GB.

Also I spotted a bug in workflow_utils.py that probably got carried from python2 to python3 conversion.

@@ -133,6 +134,7 @@ elif [ $step = "anal" ]; then
export npe_node_anal=$(echo "$npe_node_max / $nth_anal" | bc)
export nth_cycle=$npe_node_max
export npe_node_cycle=$(echo "$npe_node_max / $nth_cycle" | bc)
if [[ "$machine" = "WCOSS2" ]]; then export memory_anal="500GB"; fi
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this needed?
The total memory on the WCOSS2 node is 512GB. If this resource request is not passed, does it explicitly request the full memory or does it do some calculation based on the number of cores requested?

@GeorgeVandenberghe-NOAA might know the answer.

@@ -152,17 +154,17 @@ elif [ $step = "analdiag" ]; then
export nth_analdiag=1
export npe_node_analdiag=$npe_node_max
if [[ "$machine" == "WCOSS_C" ]]; then export memory_analdiag="3072M"; fi
if [[ "$machine" = "WCOSS2" ]]; then export memory_analdiag="500GB"; fi
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same question as above.
Seeing this is a single node job, may be request max memory instead of something that is arbitrarily smaller than the max.

if cdump in ['gfs'] and f'nth_{task}_gfs' in cfg.keys():
threads = cfg[f'nth_{ltask}_gfs']
else:
threads = cfg[f'nth_{ltask}']
except KeyError:
threads = cfg["',)nth_epos"]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a bug from the python2 to python3 conversion PR.
I don't know what the ',) is doing here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have now removed the errant ',) text from threads = cfg["',)nth_epos"]. Committed @ 6239494. Thanks!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think that was errant. I remember fixing something with that at one point in develop. Let me see if I can track down that commit.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Found it. It was fixed in develop as part of PR #500: 68385d9

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @WalterKolczynski-NOAA , I have further cleaned up that section of workflow_utils.py based on your change @ 68385d9. Committed to my branch @ 521bad7.

@KateFriedman-NOAA
Copy link
Member Author

When support is added to develop, we may need to change how resources are computed so that it is by-machine.

Yes please. Note: I haven't changed anything in config.fv3.emc.dyn nor config.resources.emc.dyn since some more thought and testing on the R&D machines will be needed to update them. I've focused on the ops versions of those configs for now.

- removed errant "',)" in thread setting

Refs: NOAA-EMC#399
- found other instance of errant "',)" text in workflow_utils.py
- searched for more, did not uncover any further instances

Refs: NOAA-EMC#399
- remove "try"s from tasks and threads sections of get_resources
function
- remove "except KeyError" checks in tasks/threads sections

Refs: NOAA-EMC#399
@KateFriedman-NOAA KateFriedman-NOAA merged commit 5c060c9 into NOAA-EMC:feature/ops-wcoss2 Jan 5, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
port Porting the system to a new platform
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants