Update module files to build gsi on Gaea-C5 #746

Merged
RussTreadon-NOAA merged 7 commits into NOAA-EMC:develop from DavidBurrows-NCO:gaea_build on May 20, 2024

Conversation

DavidBurrows-NCO
Contributor

@DavidBurrows-NCO DavidBurrows-NCO commented May 9, 2024

Description
Update modulefiles/gsi_gaea.intel.lua and ush/module-setup.sh to build GSI on Gaea-C5. The new module file is minimal and follows gsi_hera.intel.lua.

Refs #696
Refs NOAA-EMC/global-workflow/issues/2535

Type of change

  • New feature (non-breaking change which adds functionality)

How Has This Been Tested?
Cloned and compiled on Gaea and Hera

Checklist

  • My code follows the style guidelines of this project
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • New and existing tests pass with my changes
  • Any dependent changes have been merged and published

@RussTreadon-NOAA RussTreadon-NOAA linked an issue May 9, 2024 that may be closed by this pull request
@RussTreadon-NOAA
Contributor

@DavidHuber-NOAA , if you have time would you please review this PR?

@DavidBurrows-NCO - any Gaea users we should ask to review this PR?

@DavidHuber-NOAA
Collaborator

I would suggest @jswhit.

@DavidBurrows-NCO
Contributor Author

@RussTreadon-NOAA just to clone and build?
Thanks @DavidHuber-NOAA

@RussTreadon-NOAA
Contributor

It would be good to get verification from users that they can clone, build, and execute gsi.x and enkf.x on Gaea.

@DavidHuber-NOAA , do you know if we have the GSI and EnKF ctests CASES directory on Gaea? If this directory is on Gaea, ctests could be run to ensure gsi.x and enkf.x are functional.

@DavidBurrows-NCO
Contributor Author

@RussTreadon-NOAA It looks like ctests are run from the regression directory. Is that correct? I want to dig into the ctest workflow

@RussTreadon-NOAA
Contributor

GSI Wiki GSI Ctests (regression tests) provides a brief overview of GSI ctests. These are not ctests in the sense of unit tests. The CMake ctest capability is used to sequentially submit scripts in regression for various gsi and enkf configurations.

I do not recall running the GSI ctests on Gaea. I logged into Gaea this morning. It appears my last login dates back to March 2023. I do not use Gaea so my user support ability is limited.
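As a rough sketch of how one of these regression cases might be driven from the build directory, the helper below assembles a `ctest` command for a single test name. The function name `ctest_cmd` and the echo-only behavior are illustrative assumptions; the `ctest` options `-R` and `--output-on-failure` are standard CMake options, and the test name is taken from this thread.

```shell
# Hypothetical helper: print the ctest command for one regression case.
# Nothing is submitted; the command is only echoed so it can be inspected first.
ctest_cmd() {
  local name="$1"
  # -R takes a regular expression; anchoring it avoids matching similarly named tests.
  echo "ctest -R ^${name}\$ --output-on-failure"
}

ctest_cmd global_4denvar
```

Echoing the command rather than running it keeps the sketch safe to try on a login node, since each GSI ctest submits batch jobs.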

@DavidBurrows-NCO
Contributor Author

Thanks! The mechanics to run on Gaea are already there in regression/; they just need to be updated for Gaea-C5. I'll collect some information...

@DavidHuber-NOAA
Collaborator

@DavidBurrows-NCO @RussTreadon-NOAA There is some old regression test data in /gpfs/f5/epic/world-shared/GSI_data/CASES/regtest that should be updated. Once that's complete, you will likely need to update the regression_var.sh and sub_gaea scripts. This commit worked for me to run the regression tests, though some failed/crashed: DavidHuber-NOAA@0282b2d.

@DavidHuber-NOAA
Collaborator

I should note that that commit is ~3 months old, so there may be some additional changes needed.

@DavidBurrows-NCO
Contributor Author

Excellent. Thanks @DavidHuber-NOAA

@DavidHuber-NOAA
Collaborator

I see now that I also had to make these changes: 04a737d. I believe that's it.

@@ -190,7 +186,7 @@ export savdir="$ptmp"
export JCAP="62"

# Case Study analysis dates
-export global_adate="2024022300"
+export global_adate="2022110900"
Collaborator

I think this should remain at 2024022300, though that may require updated RT data.

Contributor

Agreed. /gpfs/f5/epic/world-shared/GSI_data/CASES/regtest/gfs/prod/ should be populated with enkfgdas.202402* and gdas.202402* directories and files from Hera /scratch1/NCEPDEV/da/Russ.Treadon/CASES/regtest/gfs/prod
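A minimal sketch of how the directories to be copied could be enumerated before any transfer is attempted. The helper name `list_case_dirs` is hypothetical, and the sketch only lists matches under a local source root; how the data actually moves between Hera and Gaea is not covered here.

```shell
# Hypothetical helper: list the case directories under a source root that
# match the two name patterns mentioned above (enkfgdas.202402* and gdas.202402*).
list_case_dirs() {
  local src="$1"
  local d
  for d in "$src"/enkfgdas.202402* "$src"/gdas.202402*; do
    # An unmatched glob stays literal, so skip anything that is not a directory.
    [ -d "$d" ] && basename "$d"
  done
  return 0
}

list_case_dirs /scratch1/NCEPDEV/da/Russ.Treadon/CASES/regtest/gfs/prod
```

Listing first makes it easy to confirm that only the 202402 cases are picked up and older dates are left behind.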

Contributor Author

@DavidHuber-NOAA @RussTreadon-NOAA Yep, will do. I was just trying to get a run going before copying files, but that's exactly what I'll do. Revert that global_adate change and copy those dirs from Hera to Gaea

Contributor Author

@DavidHuber-NOAA @RussTreadon-NOAA Looks like I don't have write permissions at

ls -l /gpfs/f5/epic/world-shared/
drwxr-sr-x 4 role.epic epic 4.0K Jan 22 10:00 GSI_data

unless I request a role.epic account. Should I do that?
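The listing above shows mode `drwxr-sr-x`, meaning only the owner (`role.epic`) has write permission; group and other users can read and traverse the directory but not create files in it. A small sketch of checking this from a script, with the helper name `can_write_dir` being an illustrative assumption:

```shell
# Hypothetical helper: report whether the current user can create files in a directory.
can_write_dir() {
  local dir="$1"
  if [ -d "$dir" ] && [ -w "$dir" ]; then
    echo "writable"
  else
    echo "not writable"
  fi
}

can_write_dir /gpfs/f5/epic/world-shared/GSI_data
```

The `-w` test reflects the effective permissions of the user running the script, so the same call can give different answers for a personal account and a role account.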

Contributor Author

@DavidHuber-NOAA @RussTreadon-NOAA I reverted this change, began populating /gpfs/f5/ufs-ard/world-shared/GSI_data/ with regression test data from Hera, and updated GSI_BINARY_SOURCE_DIR. Let me know if you still have issues cloning on Gaea.

unload("cray-mpich")
unload("cray-python")
unload("darshan")
prepend_path("MODULEPATH", "/ncrc/proj/epic/spack-stack/spack-stack-1.6.0/envs/unified-env/install/modulefiles/Core")
Contributor

is this really needed? (I think gsi-addon-dev provides everything)

Contributor Author

@jswhit I tested with just gsi-addon-dev load and don't see any issues. I committed an updated module file.

@RussTreadon-NOAA RussTreadon-NOAA removed the request for review from jswhit2 May 10, 2024 18:30
@RussTreadon-NOAA
Contributor

Clone DavidBurrows-NCO:gaea_build on Gaea in /gpfs/f5/nggps_emc/scratch/Russ.Treadon/git/gsi/pr746.

The build aborts while trying to copy GSI binary fix files from GSI_BINARY_SOURCE_DIR=/gpfs/f5/epic/proj-shared/global/glopara/data/fix/gsi/20240208. My account, Russ.Treadon, does not belong to the epic group.
Permission restrictions on /gpfs/f5/epic/proj-shared prevent me from accessing the contents of GSI_BINARY_SOURCE

@jswhit
Contributor

jswhit commented May 10, 2024

Clone DavidBurrows-NCO:gaea_build on Gaea in /gpfs/f5/nggps_emc/scratch/Russ.Treadon/git/gsi/pr746.

The build aborts while trying to copy GSI binary fix files from GSI_BINARY_SOURCE_DIR=/gpfs/f5/epic/proj-shared/global/glopara/data/fix/gsi/20240208. My account, Russ.Treadon, does not belong to the epic group. Permission restrictions on /gpfs/f5/epic/proj-shared prevent me from accessing the contents of GSI_BINARY_SOURCE

Looks like that directory needs to be copied to /gpfs/f5/epic/world-shared/GSI_data/fix

@RussTreadon-NOAA RussTreadon-NOAA self-assigned this May 13, 2024
@DavidBurrows-NCO
Contributor Author

Morning... @RussTreadon-NOAA I am seeing similar errors in all 7 jobs, except that my global_4denvar now fails with "Failed to create directory".
@DavidHuber-NOAA I reran all 7 tests this AM with options -VV --debug --output-on-failure to try and elicit more failure feedback. I also changed the permissions in my scratch directory, so you and everyone should have access to /gpfs/f5/ufs-ard/scratch/David.Burrows/David.Burrows/gsi_tmp/ptmp_20240514save where I moved ptmp to.
I have some meetings this AM but will look more closely this afternoon at these failures. They seem system related but we'll see...Thanks!

@RussTreadon-NOAA
Contributor

Thank you @DavidBurrows-NCO for confirming that you also are encountering problems. I am most familiar with the global gsi.x configuration. I set lrun_subdirs=.false., in the global_4denvar section of regression/regression_namelists.sh. This got the run further. The code hangs while reading the input netcdf atmospheric background fields.

@DavidHuber-NOAA
Collaborator

@DavidBurrows-NCO Looking into the global_4denvar failure (to create subdirectories), it seems the directory it is trying to make already exists (that's the meaning of ierror=17). And based on the duplicate messages of

INIT_DIRECTORIES:  ***ERROR** Failed to create directory dir.0000 for PE 
 0000     ierror=           17
 ****STOP2****  ABORTING EXECUTION w/code=         678
 INIT_DIRECTORIES:  ***ERROR** Failed to create directory dir.0000 for PE 
 0000     ierror=           17
 ****STOP2****  ABORTING EXECUTION w/code=         678

it seems to me that multiple PEs have a mype of 0. I'm not sure how that happens, but it does seem like an issue with mpich or the way the MPI module is being invoked. Can I take a look at the build log?
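On Linux, error number 17 is `EEXIST`, which matches the reading above that the directory already exists. One way to check a job's stdout for the pattern described here (the same per-PE directory reported as failing more than once) is sketched below. The helper name `duplicate_pe_dirs` is hypothetical, and the message text it searches for is copied from the log excerpt in this comment.

```shell
# Hypothetical helper: read a job's stdout on standard input and print every
# per-PE directory name that appears in more than one "Failed to create" message.
duplicate_pe_dirs() {
  grep -o 'Failed to create directory dir\.[0-9][0-9]*' \
    | sort \
    | uniq -c \
    | awk '$1 > 1 { print $NF }'
}

duplicate_pe_dirs <<'EOF'
INIT_DIRECTORIES:  ***ERROR** Failed to create directory dir.0000 for PE
INIT_DIRECTORIES:  ***ERROR** Failed to create directory dir.0000 for PE
EOF
```

A directory name printed by this check suggests that more than one task believed it held that rank, which is consistent with a launcher or MPI mismatch rather than a file system problem.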

@RussTreadon-NOAA
Contributor

Removed --mpi=pmi2 from APRUN in the gaea section of regression/regression_param.sh. With that change, global_4denvar ran through both outer loops. Task 0 reaches the w3tage call and prints the message

     ENDING DATE-TIME    MAY 14,2024  09:44:14.705  135  TUE   2460445
     PROGRAM GSI_ANL HAS ENDED.
* . * . * . * . * . * . * . * . * . * . * . * . * . * . * . * . * . * . * . * .

However, gsi.x continues running up to the specified 10 minute wall clock limit. It's hanging somewhere ... maybe on mpi_finalize in gsimod.F90?

@RussTreadon-NOAA
Contributor

Prints added to gsimod.F90 before and after mpi_finalize confirm that all 96 tasks in loproc global_4denvar gsi.x reach mpi_finalize. No tasks return from mpi_finalize.

@DavidBurrows-NCO
Contributor Author

@RussTreadon-NOAA I retried the runs after removing --mpi=pmi2. I placed the output here: /gpfs/f5/ufs-ard/scratch/David.Burrows/David.Burrows/gsi_tmp/ptmp_no_pmi2. All the jobs except global_enkf_loproc reach PROGRAM GSI_ANL HAS ENDED then die a few minutes later with rc=137 (memory issue???)
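For context on rc=137: by shell convention an exit status above 128 means the process was terminated by a signal, and 137 is 128 + 9, that is, SIGKILL. That status alone does not say who sent the signal; a scheduler enforcing a wall clock limit and an out-of-memory kill would both look like this. A minimal sketch of decoding such a status, with the helper name `describe_rc` being an illustrative assumption:

```shell
# Hypothetical helper: describe a shell exit status.
describe_rc() {
  local rc="$1"
  if [ "$rc" -gt 128 ]; then
    # By shell convention, 128 + N means the process was terminated by signal N.
    echo "terminated by signal $(kill -l $((rc - 128)))"
  else
    echo "exited with status $rc"
  fi
}

describe_rc 137
```

Checking the scheduler's accounting record for the job would be the way to tell a time-limit kill from a memory kill.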
@DavidHuber-NOAA I put the build output here: /gpfs/f5/ufs-ard/world-shared/for_dave.huber/build.out

@RussTreadon-NOAA
Contributor

stdout in /gpfs/f5/ufs-ard/scratch/David.Burrows/David.Burrows/gsi_tmp/ptmp_no_pmi2/tmpreg_global_4denvar/global_4denvar_loproc_updat/ indicates that gsi.x failed in a manner similar to what I observe in my execution of global_4denvar. The run appears to hang in mpi_finalize.

global_enkf dies with forrtl: error (76): Abort trap signal

@DavidBurrows-NCO
Contributor Author

Morning @RussTreadon-NOAA @DavidHuber-NOAA I made some progress on the ctests. In the original module file, there were unload("darshan") and unload("cray-python") statements. I initially left those 2 statements in as unload("darshan-runtime") and unload("cray-python"). cray-python isn't loaded after module reset, so I removed that. I also removed the darshan-runtime unload and mpi_finalize appears to be completing now. I committed those changes.

@DavidBurrows-NCO
Contributor Author

@RussTreadon-NOAA I made one more update to the gsi_gaea module file, unload("cray-libsci"), which allows global_enkf to run now.
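Read together, the two module changes in this thread suggest a run-time environment in which darshan-runtime stays loaded and cray-libsci is unloaded. The sketch below checks a list of loaded modules for that state. It is only one reading of the thread, not the contents of the final module file: the helper name `check_gaea_modules` and the version strings in the example are hypothetical, and the input is assumed to be one module name per line (on Lmod systems something like `module -t list 2>&1` may produce that).

```shell
# Hypothetical helper: read one loaded-module name per line on standard input
# and report deviations from the state this thread ended up with.
check_gaea_modules() {
  local list status
  list="$(cat)"
  status=ok
  if ! printf '%s\n' "$list" | grep -Eq '^darshan-runtime(/|$)'; then
    echo "darshan-runtime is not loaded"
    status=bad
  fi
  if printf '%s\n' "$list" | grep -Eq '^cray-libsci(/|$)'; then
    echo "cray-libsci is still loaded"
    status=bad
  fi
  [ "$status" = ok ] && echo "ok"
  return 0
}

# Example input with made-up version numbers.
printf 'darshan-runtime/3.4.0\ncray-libsci/23.02.1.1\n' | check_gaea_modules
```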

@RussTreadon-NOAA
Contributor

@DavidBurrows-NCO , thank you for the recent updates to DavidBurrows-NCO:gaea_build. I installed gaea_build at 5981b57 on Gaea and ran ctests with the following results

    Start 1: global_4denvar
    Start 2: rtma
    Start 3: rrfs_3denvar_glbens
    Start 4: netcdf_fv3_regional
    Start 5: hafs_4denvar_glbens
    Start 6: hafs_3denvar_hybens
    Start 7: global_enkf
1/7 Test #7: global_enkf ......................   Passed  1331.55 sec
2/7 Test #4: netcdf_fv3_regional ..............   Passed  1390.11 sec
3/7 Test #3: rrfs_3denvar_glbens ..............   Passed  1391.40 sec
4/7 Test #1: global_4denvar ...................   Passed  2052.15 sec
5/7 Test #5: hafs_4denvar_glbens ..............***Failed  2054.17 sec
6/7 Test #2: rtma .............................   Passed  3186.70 sec
7/7 Test #6: hafs_3denvar_hybens ..............   Passed  3268.79 sec

86% tests passed, 1 tests failed out of 7

Total Test time (real) = 3268.84 sec

The hafs_4denvar_glbens failure is due to

The runtime for hafs_4denvar_glbens_loproc_updat is 314.698594 seconds.  This has exceeded maximum allowable threshold time of 310.402717 seconds, resulting in Failure time-thresh of the regression test.

A check of the gsi.x wall times shows

hafs_4denvar_glbens_hiproc_contrl/stdout:The total amount of wall time                        = 361.666237
hafs_4denvar_glbens_hiproc_updat/stdout:The total amount of wall time                        = 377.771941
hafs_4denvar_glbens_loproc_contrl/stdout:The total amount of wall time                        = 282.184289
hafs_4denvar_glbens_loproc_updat/stdout:The total amount of wall time                        = 314.698594

This being the first time I have run GSI ctests on Gaea I do not know if this is a fatal fail. Wall time variability is observed on other machines. It is usually viewed as a non-fatal fail.
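The reported threshold of 310.402717 seconds is numerically very close to 1.1 times the loproc contrl wall time (282.184289 × 1.1 ≈ 310.4027). Assuming that is how the limit is formed, the check can be sketched as below. The factor 1.1 is inferred from the numbers in this thread rather than read from regression_test.sh, and the helper name `time_check` is hypothetical.

```shell
# Hypothetical helper: compare an updat run time against a tolerance factor
# applied to the contrl run time. The default factor 1.1 is an inference from
# the numbers quoted in this thread, not a value taken from regression_test.sh.
time_check() {
  local contrl="$1" updat="$2" factor="${3:-1.1}"
  awk -v c="$contrl" -v u="$updat" -v f="$factor" \
    'BEGIN { if (u > c * f) print "fail"; else print "pass" }'
}

time_check 282.184289 314.698594   # loproc contrl vs. updat from the run above
```

Because the limit scales with a single contrl timing, ordinary run-to-run variability on a shared machine can flip the result, which fits the view that this kind of failure is usually non-fatal.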

As mentioned in other GSI issues and/or PRs, the checks in regression_test.sh and regression_test_enkf.sh should be revisited. Some checks should be removed. Other checks should be revised. Doing so is beyond the scope of this PR.

@RussTreadon-NOAA
Contributor

Reran hafs_4denvar_glbens and this time the test passed.

Test project /gpfs/f5/nggps_emc/scratch/Russ.Treadon/git/gsi/pr746/build
    Start 5: hafs_4denvar_glbens
1/1 Test #5: hafs_4denvar_glbens ..............   Passed  1529.36 sec

100% tests passed, 0 tests failed out of 1

Total Test time (real) = 1534.10 sec

This time the gsi.x wall times were

hafs_4denvar_glbens_hiproc_contrl/stdout:The total amount of wall time                        = 394.879253
hafs_4denvar_glbens_hiproc_updat/stdout:The total amount of wall time                        = 360.180398
hafs_4denvar_glbens_loproc_contrl/stdout:The total amount of wall time                        = 304.074771
hafs_4denvar_glbens_loproc_updat/stdout:The total amount of wall time                        = 250.881181

@RussTreadon-NOAA
Contributor

Thank you @DavidHuber-NOAA for filling in the blanks in my memory.

While --export=ALL works on Gaea, my preference is to explicitly specify task count, thread count, and other options on the srun command line. This is how g-w env/*.env files define APRUN commands. Of course, the GSI is not g-w. Since what we have for Gaea works, we can leave it as is.
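To illustrate the preference stated above, a launcher string with explicit task and thread counts might be assembled as in the sketch below. The helper name `build_aprun` and the thread count of 1 are assumptions for illustration; `--export=ALL`, `--ntasks`, and `--cpus-per-task` are standard `srun` options, and 96 is the loproc global_4denvar task count mentioned earlier in this thread. This is not the APRUN definition used in regression_param.sh.

```shell
# Hypothetical helper: build a launcher string with explicit task and thread counts.
build_aprun() {
  local ntasks="$1" threads="$2"
  echo "srun --export=ALL --ntasks=${ntasks} --cpus-per-task=${threads}"
}

APRUN="$(build_aprun 96 1)"
echo "$APRUN"
```

Spelling the counts out on the command line makes the job geometry visible in the log, instead of relying on values inherited from the batch environment.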

@DavidBurrows-NCO
Contributor Author

@RussTreadon-NOAA I meant to ask this earlier, but are the numbers in regression/regression_param.sh okay for loproc and hiproc runs? I used the numbers from @DavidHuber-NOAA, which look to be WCOSS2 values.

@RussTreadon-NOAA
Contributor

@DavidBurrows-NCO , I don't use Gaea so I am not familiar with the system configuration. Which of the machines on which we currently build GSI and run ctests is most similar to Gaea in terms of cores and memory per node? If Gaea configuration is similar to machine-a, I'd look at the job configuration values for machine-a as the starting point for Gaea job configuration.

@DavidBurrows-NCO
Contributor Author

@RussTreadon-NOAA Following wcoss2 makes sense then. Thanks.

@RussTreadon-NOAA
Contributor

Installed DavidBurrows-NCO:gaea_build at 5981b57 and develop at ebeaba1 on WCOSS2 (Cactus), Hera, and Orion. For Gaea, gaea_build was installed in the develop slot. The authoritative develop at ebeaba1 does not build on Gaea, hence this PR.

Ran ctests on each machine with the following results

Cactus

Test project /lfs/h2/emc/da/noscrub/russ.treadon/git/gsi/pr746/build
    Start 4: netcdf_fv3_regional
    Start 1: global_4denvar
    Start 5: hafs_4denvar_glbens
    Start 6: hafs_3denvar_hybens
    Start 2: rtma
    Start 7: global_enkf
    Start 3: rrfs_3denvar_glbens
1/7 Test #4: netcdf_fv3_regional ..............   Passed  546.76 sec
2/7 Test #3: rrfs_3denvar_glbens ..............   Passed  670.13 sec
3/7 Test #7: global_enkf ......................   Passed  911.48 sec
4/7 Test #2: rtma .............................   Passed  969.55 sec
5/7 Test #6: hafs_3denvar_hybens ..............   Passed  1213.68 sec
6/7 Test #5: hafs_4denvar_glbens ..............   Passed  1273.01 sec
7/7 Test #1: global_4denvar ...................   Passed  1742.75 sec

100% tests passed, 0 tests failed out of 7

Total Test time (real) = 1742.90 sec

Gaea

Test project /gpfs/f5/nggps_emc/scratch/Russ.Treadon/git/gsi/pr746/build
    Start 1: global_4denvar
    Start 5: hafs_4denvar_glbens
    Start 2: rtma
    Start 3: rrfs_3denvar_glbens
    Start 4: netcdf_fv3_regional
    Start 6: hafs_3denvar_hybens
    Start 7: global_enkf
1/7 Test #4: netcdf_fv3_regional ..............   Passed  549.20 sec
2/7 Test #7: global_enkf ......................   Passed  671.97 sec
3/7 Test #3: rrfs_3denvar_glbens ..............   Passed  796.45 sec
4/7 Test #2: rtma .............................   Passed  1279.37 sec
5/7 Test #6: hafs_3denvar_hybens ..............   Passed  1470.84 sec
6/7 Test #5: hafs_4denvar_glbens ..............   Passed  1712.73 sec
7/7 Test #1: global_4denvar ...................   Passed  2196.72 sec

100% tests passed, 0 tests failed out of 7

Total Test time (real) = 2196.83 sec

Hera

Test project /scratch1/NCEPDEV/da/Russ.Treadon/git/gsi/pr746/build
    Start 1: global_4denvar
    Start 2: rtma
    Start 3: rrfs_3denvar_glbens
    Start 4: netcdf_fv3_regional
    Start 5: hafs_4denvar_glbens
    Start 6: hafs_3denvar_hybens
    Start 7: global_enkf
1/7 Test #4: netcdf_fv3_regional ..............   Passed  727.50 sec
2/7 Test #3: rrfs_3denvar_glbens ..............   Passed  793.89 sec
3/7 Test #7: global_enkf ......................   Passed  1487.05 sec
4/7 Test #2: rtma .............................   Passed  1511.04 sec
5/7 Test #6: hafs_3denvar_hybens ..............   Passed  1531.51 sec
6/7 Test #5: hafs_4denvar_glbens ..............   Passed  1885.75 sec
7/7 Test #1: global_4denvar ...................   Passed  2287.29 sec

100% tests passed, 0 tests failed out of 7

Total Test time (real) = 2287.33 sec

Orion

Test project /work2/noaa/da/rtreadon/git/gsi/pr746/build
    Start 1: global_4denvar
    Start 2: rtma
    Start 3: rrfs_3denvar_glbens
    Start 4: netcdf_fv3_regional
    Start 5: hafs_4denvar_glbens
    Start 6: hafs_3denvar_hybens
    Start 7: global_enkf
1/7 Test #4: netcdf_fv3_regional ..............   Passed  602.82 sec
2/7 Test #3: rrfs_3denvar_glbens ..............   Passed  606.41 sec
3/7 Test #7: global_enkf ......................   Passed  788.65 sec
4/7 Test #2: rtma .............................   Passed  1088.10 sec
5/7 Test #6: hafs_3denvar_hybens ..............   Passed  1161.37 sec
6/7 Test #5: hafs_4denvar_glbens ..............   Passed  1461.75 sec
7/7 Test #1: global_4denvar ...................   Passed  1743.04 sec

100% tests passed, 0 tests failed out of 7

Total Test time (real) = 1743.05 sec

The Passed results are expected. This PR does not alter the build or execution on WCOSS2, Hera, or Orion.

On Gaea the Passed is also expected since both the contrl and updat are DavidBurrows-NCO:gaea_build. The key for the Gaea test is that gsi.x and enkf.x can be built and run on Gaea.

@RussTreadon-NOAA RussTreadon-NOAA self-requested a review May 17, 2024 17:29
Contributor

@RussTreadon-NOAA RussTreadon-NOAA left a comment

ctests have been run on various machines with acceptable results.

Approve.

@RussTreadon-NOAA
Contributor

@DavidHuber-NOAA , do you have any more requests or comments for this PR? ctests have been run on various platforms with acceptable results.

Collaborator

@DavidHuber-NOAA DavidHuber-NOAA left a comment

Looks good, thanks @DavidBurrows-NCO!

@DavidHuber-NOAA
Collaborator

@RussTreadon-NOAA Thanks for the nudge. Everything looks in order.

@RussTreadon-NOAA
Contributor

Thank you @DavidHuber-NOAA for the approval. I'll work with the GSI Handling Review team to schedule merger of this PR into GSI develop.

@RussTreadon-NOAA RussTreadon-NOAA merged commit 59d7578 into NOAA-EMC:develop May 20, 2024
2 of 4 checks passed
@DavidBurrows-NCO
Contributor Author

@DavidHuber-NOAA @RussTreadon-NOAA Thank you both for your help through the process!

Development

Successfully merging this pull request may close these issues.

Port to Gaea-C5
4 participants