Build ufs_weather_model on Gaea-C5 + update cubed sphere gitmodules for perturbation/increments for cold starts #2271 #2269

Merged: 26 commits into ufs-community:develop from the gaea_build branch, Jun 7, 2024

Conversation

DavidBurrows-NCO
Contributor

@DavidBurrows-NCO DavidBurrows-NCO commented May 6, 2024

Commit Queue Requirements:

  • Fill out all sections of this template.
  • All sub component pull requests have been reviewed by their code managers.
  • Run the full Intel+GNU RT suite (compared to current baselines) on either Hera/Derecho/Hercules
  • Commit 'test_changes.list' from previous step

Description:

  • Updates modulefiles/ufs_gaea.intel.lua to compile ufs_weather_model on Gaea-C5

Commit Message:

  • UFSWM - Update modulefiles/ufs_gaea.intel.lua to compile ufs_weather_model on Gaea-C5
    • FV3 -
    • atmos_cubed_sphere - move the call to read_da_inc outside the if external_ic/restart logic

Priority:

  • Normal

Git Tracking

UFSWM:

Sub component Pull Requests:

UFSWM Blocking Dependencies:

  • None

Changes

Regression Test Changes (Please commit test_changes.list):

  • No Baseline Changes.

Input data Changes:

  • None.

Library Changes/Upgrades:

  • No Updates

Testing Log:

  • RDHPCS
    • Hera
    • Orion
    • Hercules
    • Jet
    • Gaea
    • Derecho
  • WCOSS2
    • Dogwood/Cactus
    • Acorn
  • CI
  • opnReqTest (complete task if unnecessary)

@DavidBurrows-NCO DavidBurrows-NCO marked this pull request as ready for review May 6, 2024 17:47
@DavidBurrows-NCO
Contributor Author

Hello. This PR supports NOAA-EMC/global-workflow#2535. Please let me know how I should proceed with testing. Since this PR will not affect other platforms, do I need to run the RTs on Hera/Derecho/Hercules? Thanks for your time!

@jkbk2004
Collaborator

jkbk2004 commented May 6, 2024

@DavidBurrows-NCO The develop branch builds and runs OK on Gaea-C5. Can you explain why we need the code change in this PR? Is there an issue porting global-workflow to Gaea-C5?

@BrianCurtis-NOAA
Collaborator

> @DavidBurrows-NCO Develop branch builds and runs ok on gaea-c5. Can you explain why we need the code change of this PR? Any issue porting global workflow to gaea-c5?

@jkbk2004 Have you looked into @DavidBurrows-NCO's changes to see whether they work for the UFSWM on Gaea-C5?

@jkbk2004
Collaborator

jkbk2004 commented May 7, 2024

> @DavidBurrows-NCO Develop branch builds and runs ok on gaea-c5. Can you explain why we need the code change of this PR? Any issue porting global workflow to gaea-c5?

> @jkbk2004 Have you looked into @DavidBurrows-NCO changes to see if it works for the UFSWM on Gaea-C5?

@jkbk2004 jkbk2004 closed this May 7, 2024
@jkbk2004 jkbk2004 reopened this May 7, 2024
@jkbk2004
Collaborator

jkbk2004 commented May 7, 2024

Sorry for accidentally closing and reopening. Anyway, @BrianCurtis-NOAA @DavidBurrows-NCO, the Gaea modulefile in this PR is a lot cleaner. Let me throw a run on Gaea. @RatkoVasic-NOAA, can you take a look at this modulefile update for Gaea?

@BrianCurtis-NOAA
Collaborator

> Sorry for accidentally closing and reopening. Anyway, @BrianCurtis-NOAA @DavidBurrows-NCO, the Gaea modulefile in this PR is a lot cleaner. Let me throw a run on Gaea. @RatkoVasic-NOAA, can you take a look at this modulefile update for Gaea?

One thing that concerns me is the removal of stack-python, as it might cause issues with the abort_dep_tasks.py script.

@DavidBurrows-NCO
Contributor Author

Good morning. The previous modulefile didn't load cmake, so the build fails with the following error:

David.Burrows@gaea56 07:58 ufsm $ ./build.sh 
UFS MODEL DIR: /gpfs/f5/epic/scratch/David.Burrows/ufs_fails_to_build_on_gaea/ufsm
cmake: /opt/cray/pe/gcc/10.3.0/snos/lib64/libstdc++.so.6: version `GLIBCXX_3.4.30' not found (required by cmake)

I added the cmake module load and then did some cleanup as well. The modulefile for Gaea now resembles those of gfs_utils, ufs_utils, and upp.
@BrianCurtis-NOAA I was thinking the same about Python. I see a mix of python, stack-python, and no Python module in the other modulefiles. I can add it back if necessary.
Thanks everyone!
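
A minimal sketch of how the `GLIBCXX_3.4.30` error above could be checked by hand, assuming GNU grep is available. The helper name `lib_has_glibcxx` is hypothetical and not part of the UFS build scripts; it only reports whether a given libstdc++ file advertises a given version tag.

```shell
# Hypothetical helper (not part of ufs-weather-model): succeed only if the
# shared library in $1 contains the exact GLIBCXX version tag given in $2.
lib_has_glibcxx() {
  local lib="$1" version="$2"
  # -a: treat the binary as text; -o: print only the matched tags.
  # The final grep -x -F requires an exact, literal, whole-line match so that
  # GLIBCXX_3.4.3 does not accidentally match GLIBCXX_3.4.30.
  grep -aoE 'GLIBCXX_[0-9]+(\.[0-9]+)*' "$lib" | sort -u | grep -qxF -- "$version"
}

# Example call, using the library path from the error message above:
#   lib_has_glibcxx /opt/cray/pe/gcc/10.3.0/snos/lib64/libstdc++.so.6 GLIBCXX_3.4.30 \
#     || echo "this libstdc++ does not provide GLIBCXX_3.4.30"
```

If the check fails for the library named in the error, the cmake found on `PATH` was presumably built against a newer libstdc++ than the one being resolved at run time, which is consistent with loading a matching cmake module fixing the build.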

@RatkoVasic-NOAA
Collaborator

These changes worked for me. @BrianCurtis-NOAA, can you give me an example so I can try to reproduce a failure with abort_dep_tasks.py?

@jkbk2004
Collaborator

jkbk2004 commented May 7, 2024

> These changes worked for me. @BrianCurtis-NOAA, can you give me an example so I can try to reproduce a failure with abort_dep_tasks.py?

Mostly with importing the Python ecflow package in https://github.com/ufs-community/ufs-weather-model/blob/develop/tests/abort_dep_tasks.py#L3. But it sounds like it runs OK. @RatkoVasic-NOAA, a decision point on your side is whether to control Python specifically through stack-python. I think it's safe to keep moving with stack-python. @DavidBurrows-NCO, can you put the stack-python line back so we can move on?
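
A minimal sketch of checking the import concern directly, assuming `python3` is on `PATH` after the modulefile is loaded. The helper name `python_can_import` is hypothetical; it only tests whether the active interpreter can import a named module.

```shell
# Hypothetical helper (not part of ufs-weather-model): succeed only if the
# python3 currently on PATH can import the module named in $1.
python_can_import() {
  local module="$1"
  # Pass the module name as an argument rather than splicing it into code.
  python3 -c 'import importlib, sys; importlib.import_module(sys.argv[1])' "$module" \
    >/dev/null 2>&1
}

# Example call, for the package imported by tests/abort_dep_tasks.py:
#   python_can_import ecflow || echo "ecflow is not importable with this python3"
```

Running this once with and once without the stack-python module loaded would show whether the module load actually changes the outcome on a given machine.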

@DavidBurrows-NCO
Contributor Author

@BrianCurtis-NOAA @RatkoVasic-NOAA I added the stack-python load back into the modulefile and tested a build.

@BrianCurtis-NOAA
Collaborator

> These changes worked for me. @BrianCurtis-NOAA, can you give me an example so I can try to reproduce a failure with abort_dep_tasks.py?

This concern is based only on rt.sh failing when run with ecflow on Gaea. @RatkoVasic-NOAA, are you using ecflow to run the rt.sh suite?

@RatkoVasic-NOAA
Collaborator

@BrianCurtis-NOAA I didn't run the full test suite, just about 10 test cases to see whether it breaks compilation or some executions. I used rocoto. I can now try the same thing with ecflow.

@RatkoVasic-NOAA
Collaborator

The 10 selected tests worked for me using both ecflow and rocoto.

@BrianCurtis-NOAA
Collaborator

> The 10 selected tests worked for me using both ecflow and rocoto.

OK, great. @jkbk2004, please run the full RT suite on Gaea-C5 using ecflow. Once that's completed, we can merge this PR together with another non-baseline-changing PR.

@jkbk2004
Collaborator

jkbk2004 commented May 7, 2024

> > The 10 selected tests worked for me using both ecflow and rocoto.

> OK, great. @jkbk2004, please run the full RT suite on Gaea-C5 using ecflow. Once that's completed, we can merge this PR together with another non-baseline-changing PR.

Sure!

@jkbk2004
Collaborator

jkbk2004 commented May 8, 2024

@DavidBurrows-NCO The RAP clm_lake case is crashing with this PR. The experiment's err file is at /gpfs/f5/epic/scratch/Jong.Kim/RT_RUNDIRS/Jong.Kim/FV3_RT/rt_48937/rap_clm_lake_debug_intel/err. Develop runs OK. @SamuelTrahanNOAA, just FYI.

@DavidBurrows-NCO
Contributor Author

@jkbk2004 I am also getting a failure with that test. Here are some highlights from the output:


 0: PASS: fcstRUN phase 1, n_atmsteps =               11 time is        14.979133
  0:  ---isec,seconds        3600        3600
  0:   gfs diags time since last bucket empty:    1.00000000000000      hrs
  0:  write out restart at n_atmsteps=          11  seconds=        3600
  0:  integration length=  0.9166667
  0: PASS: fcstRUN phase 2, n_atmsteps =               11 time is         0.788611
  0:   aft fcst run output time=        3600 FBcount=           8 na=          12
...
  0:  liq_aero  3.4789200E+10  1.1008354E+07  1.7738830E+09
  0:  ice_aero  1.4927048E+08  1.2824743E-13   3184276.
  0:  sgs_tke   149.9711      9.9780387E-05   1.240300
  0: NOTE from PE     0: Potential error in diag_manager_end: slp NOT available, check if output interval > runlength. Netcdf fill_values are written
  0: NOTE from PE     0: Potential error in diag_manager_end: vort850 NOT available, check if output interval > runlength. Netcdf fill_values are written
  0: NOTE from PE     0: Potential error in diag_manager_end: vort200 NOT available, check if output interval > runlength. Netcdf fill_values are written
  0: NOTE from PE     0: Potential error in diag_manager_end: us NOT available, check if output interval > runlength. Netcdf fill_values are written
  0: NOTE from PE     0: Potential error in diag_manager_end: u1000 NOT available, check if output interval > runlength. Netcdf fill_values are written
...
144: Stack trace terminated abnormally.
145: forrtl: error (73): floating divide by zero

Are you getting the same error? I don't see any other hints in the working directory about the reason for this failure.

@SamuelTrahanNOAA
Collaborator

I cannot see your directories due to restrictive permissions.

What you've shown me so far are errors that have nothing to do with CLM lake. I cannot debug further until I see the stack trace.

@DavidBurrows-NCO
Contributor Author

@SamuelTrahanNOAA Can I move the working dir somewhere on Gaea for you? My Gaea groups are epic and ufs-ard.
The only stack trace I see in the working dir is:

144: Stack trace terminated abnormally.
145: forrtl: error (73): floating divide by zero

Thanks, Sam!

@SamuelTrahanNOAA
Collaborator

> Can I move the working dir somewhere on Gaea for you?

I don't know where your working directory resides.

You don't necessarily need to move it; I just need the permissions to allow world access.

This is not a stack trace:

145: forrtl: error (73): floating divide by zero

The stack trace would include a trace of the files and line numbers to the point in the stack with the divide by zero.
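
A minimal sketch of how an err file could be scanned for the two things discussed here: the fatal `forrtl` line, and evidence of an actual traceback. Both helper names are hypothetical. The traceback check assumes the Intel Fortran runtime prints its usual `Image PC Routine Line Source` column header ahead of the frames, which is an assumption about the log format rather than something shown in this thread.

```shell
# Hypothetical helper: print every fatal Intel Fortran runtime line in the
# err file given as $1, prefixed with its line number in that file.
show_forrtl_errors() {
  local errfile="$1"
  grep -n -F 'forrtl: error' "$errfile"
}

# Hypothetical helper: succeed only if the file contains what looks like a
# traceback column header (assumed format: "... Routine  Line  Source").
has_traceback_header() {
  local errfile="$1"
  grep -q -E 'Routine[[:space:]]+Line[[:space:]]+Source' "$errfile"
}

# Example calls on an err file from a failed test's run directory:
#   show_forrtl_errors err
#   has_traceback_header err || echo "no traceback frames found in err"
```

When the header is absent, as with the "Stack trace terminated abnormally" message above, the err file by itself does not identify the file and line of the divide by zero.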

@DavidBurrows-NCO
Contributor Author

@SamuelTrahanNOAA The working dir is here: /gpfs/f5/ufs-ard/scratch/David.Burrows/RT_RUNDIRS/David.Burrows/FV3_RT/rt_200821.
Permissions for the files there are -rw-r--r--, so you should be good.
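
A minimal sketch for double-checking world access, since a readable file mode is not sufficient if a parent directory cannot be entered by other users. The helper name `list_blocking_dirs` is hypothetical, and the sketch assumes GNU coreutils `stat` (the `-c '%A'` form), as found on typical Linux systems.

```shell
# Hypothetical helper: walk from the directory in $1 up to /, printing each
# directory that lacks the execute (search) bit for "other" users.
list_blocking_dirs() {
  local dir mode
  dir="$(cd "$1" && pwd -P)" || return 2
  while :; do
    # %A prints a mode string such as drwxr-x---; the tenth character is the
    # execute bit for others ('x', or 't' when the sticky bit is also set).
    mode="$(stat -c '%A' "$dir")"
    case "$mode" in
      ?????????[xt]*) ;;                 # others may enter this directory
      *) printf '%s\n' "$dir" ;;         # others are blocked here
    esac
    [ "$dir" = "/" ] && break
    dir="$(dirname "$dir")"
  done
}

# Example call on the run directory mentioned above:
#   list_blocking_dirs /gpfs/f5/ufs-ard/scratch/David.Burrows/RT_RUNDIRS/David.Burrows/FV3_RT/rt_200821
```

An empty result suggests every directory on the path can be entered by other users. Listing a directory's contents additionally needs the read bit, which this sketch does not check.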

@zach1221 zach1221 added the hercules-RT, orion-RT, and derecho-RT labels and removed the hera-RT, hercules-RT, orion-RT, and derecho-RT labels Jun 6, 2024
@BrianCurtis-NOAA
Collaborator

> @zach1221 @FernandoAndrade-NOAA @BrianCurtis-NOAA Finally, it's all updated ok. I think it's safe to re-do the tests on RDHPCS side.

HYCOM-interface has a change but no .gitmodule entry. Please fix.
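
A minimal sketch of listing what a `.gitmodules` file declares, so the entries can be compared against submodule pointers changed in a PR. The helper name `list_declared_submodules` is hypothetical; it relies only on `git config -f` reading the file, and it does not inspect the submodule hashes themselves.

```shell
# Hypothetical helper: print "<path> <url>" for every submodule declared in
# the .gitmodules-style file given as $1 (default: ./.gitmodules).
list_declared_submodules() {
  local file="${1:-.gitmodules}"
  git config -f "$file" --get-regexp '^submodule\..*\.path$' |
  while read -r key sm_path; do
    # Strip the "submodule." prefix and ".path" suffix to recover the name.
    name="${key#submodule.}"
    name="${name%.path}"
    url="$(git config -f "$file" --get "submodule.${name}.url")"
    printf '%s %s\n' "$sm_path" "$url"
  done
}

# Example call from the top of a checkout:
#   list_declared_submodules .gitmodules
```

The output can be compared by eye with the submodule paths that show up as changed in the PR diff.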

@jkbk2004
Collaborator

jkbk2004 commented Jun 6, 2024

> > @zach1221 @FernandoAndrade-NOAA @BrianCurtis-NOAA Finally, it's all updated ok. I think it's safe to re-do the tests on RDHPCS side.

> HYCOM-interface has a change but no .gitmodule entry. Please fix.

The HYCOM hash is the one already committed on emc/develop. Good to go with the hash update.

@zach1221
Collaborator

zach1221 commented Jun 7, 2024

@jkbk2004 @FernandoAndrade-NOAA I talked to Brian offline. We're good to skip acorn.

@zach1221 zach1221 merged commit a183a52 into ufs-community:develop Jun 7, 2024
3 checks passed
@DavidBurrows-NCO DavidBurrows-NCO deleted the gaea_build branch June 10, 2024 12:16
@DavidBurrows-NCO
Contributor Author

Thanks for everyone's help with this PR @jkbk2004 @zach1221 @FernandoAndrade-NOAA @BrianCurtis-NOAA @SamuelTrahanNOAA @RatkoVasic-NOAA @DusanJovic-NOAA @junwang-noaa @weihuang-jedi

Labels
No Baseline Change
Ready for Commit Queue (The PR is ready for the Commit Queue. All checkboxes in PR template have been checked.)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Compile ufs_weather_model on Gaea-C5