Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor fv3atm history & restart to reduce redundant code. Add rrfs-sd and clm lake to quilt restart. #1769

Merged

Conversation

SamuelTrahanNOAA
Copy link
Collaborator

@SamuelTrahanNOAA SamuelTrahanNOAA commented May 25, 2023

Description

This is a major refactoring of the sfc restart code. It eliminates code duplication between read vs. write and quilt vs. non-quilt. Most notably, it merges the sfc read & write grid copy loops for quilt and non-quilt into a single function. With these changes, the code will be easier to maintain, and read vs. write bugs will be easier to find. See NOAA-EMC/fv3atm#660 for full details on that.

You'll see several new quilt restart regression tests, including 32-bit physics, clm lake, and rrfs-sd. Those new tests are also run for the gnu compiler, so we'll know the quilt restart will continue to work for the non-Intel world. I've also replaced one RRFS restart test. It was run without smoke, but should have been run with smoke.

There's a fix for another bug, #1781, wherein nccmp generates many gigabytes of output when fields differ. The fix is to add -q -S. This disables printing of data and metadata differences, but enables a table of statistics comparing the data.

Also, a fix for but #1780: when rt.sh aborts due to another rt.sh running, it deletes the lock directory. That directory should only be deleted by the rt.sh that created it. This happens due to an error in the script's error handler.

Top of commit queue on: TBD

Input data additions/changes

  • No changes are expected to input data.

Anticipated changes to regression tests:

  • Changes are expected to the following tests.

I had to generate baselines for these tests on Hera:

Expand for full list.
COMPILE | 13 | intel | -DAPP=ATM -DCCPP_SUITES=FV3_RAP,FV3_RAP_sfcdiff,FV3_HRRR,FV3_HRRR_flake,FV3_RRFS_v1beta,FV3_RRFS_v1nssl -D32BIT=ON | | fv3 |
RUN | hrrr_control                                      |                            | baseline |
RUN | hrrr_control_qr                                   |                            | baseline |
RUN | hrrr_control_decomp                               |                            |          |
RUN | hrrr_control_2threads                             |                            |          |
RUN | hrrr_control_restart                              |                            |          | hrrr_control
RUN | hrrr_control_restart_qr                           |                            |          | hrrr_control_qr

RUN | rrfs_smoke_conus13km_hrrr_warm                    |                            | baseline |
RUN | rrfs_smoke_conus13km_hrrr_warm_qr                 |                            |          |
RUN | rrfs_smoke_conus13km_hrrr_warm_restart_mismatch   |                            | baseline | rrfs_smoke_conus13km_hrrr_warm
RUN | rrfs_smoke_conus13km_hrrr_warm_restart_qr_mismatch |                           |          | rrfs_smoke_conus13km_hrrr_warm_qr

RUN | rrfs_conus13km_hrrr_warm                          |                            | baseline |

COMPILE | 19 | intel | -DAPP=ATM -DCCPP_SUITES=FV3_RAP,FV3_HRRR,FV3_HRRR_flake -D32BIT=ON -DCCPP_32BIT=ON | | fv3 |
RUN | hrrr_control_dyn32_phy32                          |                            | baseline |
RUN | hrrr_control_qr_dyn32_phy32                       |                            | baseline |
RUN | hrrr_control_2threads_dyn32_phy32                 |                            |          |
RUN | hrrr_control_decomp_dyn32_phy32                   |                            |          |
RUN | hrrr_control_restart_dyn32_phy32                  |                            |          | hrrr_control_dyn32_phy32
RUN | hrrr_control_restart_qr_dyn32_phy32               |                            |          | hrrr_control_qr_dyn32_phy32

COMPILE | 39 | gnu | -DAPP=ATM -DCCPP_SUITES=FV3_RAP,FV3_RAP_sfcdiff,FV3_HRRR,FV3_RRFS_v1beta -D32BIT=ON | + hera cheyenne | fv3 |
RUN | hrrr_control                                      | + hera cheyenne            | baseline |
RUN | hrrr_control_2threads                             | + hera cheyenne            |          |
RUN | hrrr_control_decomp                               | + hera cheyenne            |          |
RUN | hrrr_control_restart                              | + hera cheyenne            |          | hrrr_control
RUN | rrfs_smoke_conus13km_hrrr_warm                    | + hera cheyenne            | baseline |
RUN | rrfs_smoke_conus13km_hrrr_warm_qr                 | + hera cheyenne            |          |
RUN | rrfs_smoke_conus13km_hrrr_warm_restart_mismatch   | + hera cheyenne            | baseline | rrfs_smoke_conus13km_hrrr_warm
RUN | rrfs_smoke_conus13km_hrrr_warm_restart_qr_mismatch | + hera cheyenne           |          | rrfs_smoke_conus13km_hrrr_warm_qr

RUN | rrfs_conus13km_hrrr_warm                          |                            | baseline |

COMPILE | 42 | gnu | -DAPP=ATM -DCCPP_SUITES=FV3_RAP,FV3_HRRR,FV3_HRRR_flake -D32BIT=ON -DCCPP_32BIT=ON | + hera cheyenne | fv3 |
RUN | hrrr_control_dyn32_phy32                          | + hera cheyenne            | baseline |
RUN | hrrr_control_qr_dyn32_phy32                       | + hera cheyenne            | baseline |
RUN | hrrr_control_2threads_dyn32_phy32                 | + hera cheyenne            |          |
RUN | hrrr_control_decomp_dyn32_phy32                   | + hera cheyenne            |          |
RUN | hrrr_control_restart_dyn32_phy32                  | + hera cheyenne            |          | hrrr_control_dyn32_phy32
RUN | hrrr_control_restart_qr_dyn32_phy32               | + hera cheyenne            |          | hrrr_control_qr_dyn32_phy32

Subcomponents involved:

  • AQM
  • CDEPS
  • CICE
  • CMEPS
  • CMakeModules
  • FV3
  • GOCART
  • HYCOM
  • MOM6
  • NOAHMP
  • WW3
  • stochastic_physics
  • none

Combined with PR's (If Applicable):

Commit Queue Checklist:

  • Link PR's from all sub-components involved
  • Confirm reviews completed in sub-component PR's
  • Add all appropriate labels to this PR.
  • Run full RT suite on either Hera/Cheyenne with both Intel/GNU compilers
  • Add list of any failed regression tests to "Anticipated changes to regression tests" section.

Linked PR's and Issues:

Component PRs:

Related issues:

Testing Day Checklist:

  • This PR is up-to-date with the top of all sub-component repositories except for those sub-components which are the subject of this PR.
  • Move new/updated input data on RDHPCS Hera and propagate input data changes to all supported systems.

Testing Log (for CM's):

  • RDHPCS
    • Intel
      • Hera
      • Orion
      • Jet
      • Gaea
      • Cheyenne
    • GNU
      • Hera
      • Cheyenne
  • WCOSS2
    • Dogwood/Cactus
    • Acorn
  • CI
    • Completed
  • opnReqTest
    • N/A
    • Log attached to comment

@SamuelTrahanNOAA SamuelTrahanNOAA changed the title Refactor FV3GFS restart to reduce redundant code. Add rrfs-sd and clm lake to quilt restart. Refactor fv3atm history & restart to reduce redundant code. Add rrfs-sd and clm lake to quilt restart. May 29, 2023
@zach1221
Copy link
Collaborator

@BrianCurtis-NOAA I think I had a similar issue on Jet, and I had to recreate the baselines again for those failing cases, then re-copy them over.

@BrianCurtis-NOAA
Copy link
Collaborator

Seemed to be OK this time. I know Acorn is a test machine, but still concerned about seeing this issue again. Can address later if we do see it again.

@zach1221
Copy link
Collaborator

Looks like all the testing is complete, so we can now begin the merge process.

@FernandoAndrade-NOAA
Copy link
Collaborator

FernandoAndrade-NOAA commented Jun 20, 2023

Ok, the FV3 subcomponent PR was merged in, please go ahead and revert gitmodules and update hash.
@SamuelTrahanNOAA

@SamuelTrahanNOAA
Copy link
Collaborator Author

I have reverted the gitmodules and updated the hash. You may proceed with the merge.

jkbk2004
jkbk2004 previously approved these changes Jun 20, 2023
@SamuelTrahanNOAA
Copy link
Collaborator Author

I just noticed the conus13km cases are not actually doing a quilt restart (even the ones with qr) because of a missing line in a model_configure file. I am testing now if that functionality works. The regional domain quilt restart was not the intent of this PR, so I think this PR can be merged anyway. However, if you want to go ahead with the merge, I suggest you disable those tests to save CPU time.

That's these two:

  • rrfs_smoke_conus13km_hrrr_warm_restart_qr_mismatch
  • rrfs_smoke_conus13km_hrrr_warm_qr

I will fix them in my next PR (32-bit physics in RRFS) if they don't already work.

@SamuelTrahanNOAA
Copy link
Collaborator Author

You can also leave rt.conf how it is, and you'll just run two tests that are duplicates of older ones.

Please let me know what you want to do.

@FernandoAndrade-NOAA
Copy link
Collaborator

@SamuelTrahanNOAA I'll go ahead and disable those two in the meantime, thanks

@SamuelTrahanNOAA
Copy link
Collaborator Author

I have disabled those four tests in rt.conf.

@SamuelTrahanNOAA
Copy link
Collaborator Author

I can happily report those two tests do pass when I add the missing line to model_configure_rrfs_conus13km.IN. I have made this change to my other PR already.

diff --git a/tests/parm/model_configure_rrfs_conus13km.IN b/tests/parm/model_configure_rrfs_conus13km.IN
index 5be098ed..6feac2c0 100644
--- a/tests/parm/model_configure_rrfs_conus13km.IN
+++ b/tests/parm/model_configure_rrfs_conus13km.IN
@@ -12,6 +12,7 @@ restart_interval:        @[RESTART_INTERVAL]
 output_1st_tstep_rst:    .false.
 
 quilting:                @[QUILTING]
+quilting_restart:        @[QUILTING_RESTART]
 write_groups:            @[WRITE_GROUP]
 write_tasks_per_group:   @[WRTTASK_PER_GROUP]
 output_history:          @[OUTPUT_HISTORY]

@FernandoAndrade-NOAA
Copy link
Collaborator

@SamuelTrahanNOAA that's great to hear, thanks for verifying that those tests are passing and good to go, I'll go ahead and finish up on merging this PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Baseline Updates Current baselines will be updated. jenkins-ci Jenkins CI: ORT build/test on docker container Ready for Commit Queue The PR is ready for the Commit Queue. All checkboxes in PR template have been checked.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants