
Add p7a test using tiled FV3 Fix files, P7a ICs and NoahMP; add open-water normalization in CMEPS (#549) #585

Merged: 75 commits into ufs-community:develop from feature/p7a_test, May 25, 2021

Conversation

@DeniseWorthen (Collaborator) commented May 19, 2021

PR Checklist

  • This PR is up-to-date with the top of all sub-component repositories except for those sub-components which are the subject of this PR. Please consult the ufs-weather-model wiki if you are unsure how to do this.

  • This PR has been tested using a branch which is up-to-date with the top of all sub-component repositories except for those sub-components which are the subject of this PR.

  • An Issue describing the work contained in this PR has been created either in the subcomponent(s) or in ufs-weather-model. The Issue should be created in the repository that is most relevant to the changes contained in the PR. The Issue and the dependent sub-component PR are specified below.

  • If new or updated input data is required by this PR, it is clearly stated in the text of the PR.

Instructions: All subsequent sections of text should be filled in as appropriate.

The information provided below allows the code managers to understand the changes relevant to this PR, whether those changes are in the ufs-weather-model repository or in a subcomponent repository. Ufs-weather-model code managers will use the information provided to add any applicable labels, assign reviewers, and place it in the Commit Queue. Once the PR is in the Commit Queue, it is the PR owner's responsibility to keep the PR up-to-date with the develop branch of ufs-weather-model.

Description

  • adds fv3_conf/cpld_bmark_tiled_run.IN to use tiled FV3 fix files
  • adds a BM7_IC directory in FV3_input_frac for the 8 initial condition dates. The input for 20140301 is used in the regression test cpld_bmarkfrac_wave_v16_noahmp; the remaining dates can be used with rt_35d.conf.
  • updates input.benchmark_v16.nml.IN to use configurable variables to implement either the original v16 (P6) settings or the v16 (P7a) settings
  • adds the cpld_bmarkfrac_wave_v16_noahmp and cpld_bmarkfrac_wave_v16_noahmp_nsst tests. The nsst test is not added to rt.conf.
  • updates the 35d tests and scripts to run P7a. The 35d tests for 2 dates were run for 1 day each, and the ICs were verified to contain the correct data.
  • adds open-water normalization in CMEPS, which uses open-water fraction normalization when mapping lwnet, sensible, latent, and momentum fluxes from ATM->OCN (sketched below)
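For intuition, the normalization is the standard fraction-weighted remap: the flux is weighted by the source cell's open-water fraction before mapping, and the mapped flux is divided by the mapped fraction afterward. Below is a minimal NumPy sketch of that idea only; the names and the dense weight matrix are illustrative, and the actual CMEPS mediator applies ESMF route handles to field bundles rather than code like this.

```python
import numpy as np

def map_flux_ofrac_norm(W, flux_atm, ofrac_atm, eps=1e-12):
    """Map an ATM flux to the OCN grid with open-water normalization.

    W         : (n_ocn, n_atm) conservative remapping weights (illustrative)
    flux_atm  : (n_atm,) flux on the ATM grid
    ofrac_atm : (n_atm,) open-water fraction of each ATM cell, in [0, 1]
    """
    num = W @ (ofrac_atm * flux_atm)   # fraction-weighted flux, mapped
    den = W @ ofrac_atm                # open-water fraction, mapped
    # Normalize; destination cells with no mapped open water get zero
    # flux instead of a land-contaminated average.
    return np.where(den > eps, num / np.maximum(den, eps), 0.0)
```

Without the normalization (a plain `W @ flux_atm`), ATM cells that are partly land dilute the flux delivered to partially ocean-covered destination cells, which is why enabling it changes all coupled baselines.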

New input data is required: FV3_fix_tiled and FV3_input_frac/BM7_IC. These have been added to the existing input-data-20210518 on Hera and synced from there to the remaining platforms.

  • FV3_fix_tiled contains the contents of the directory /scratch1/NCEPDEV/global/Helin.Wei/save/fix_sfc. The subdirectories from this directory were renamed, e.g. sfc.C96 was renamed to just C96.

  • FV3_input_frac/BM7_IC contains the contents of the directory /scratch2/NCEPDEV/stmp3/Bing.Fu/o/p7ic/com/gens/dev/merge/C384_025. The subdirectory name C384 was renamed to C384_L127 to match the current RT directory structure (the renaming step is sketched below).
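The renames amount to a simple relabeling of the staged subdirectories. A hypothetical Python sketch, assuming the fix_sfc source path above; the staging destination path is an assumption for illustration, not the actual RT data location:

```python
import shutil
from pathlib import Path

# Source is the directory named above; the destination under the
# RT input-data area is assumed here for illustration only.
src = Path("/scratch1/NCEPDEV/global/Helin.Wei/save/fix_sfc")
dst = Path("/path/to/input-data-20210518/FV3_fix_tiled")

for d in sorted(src.glob("sfc.C*")):
    # "sfc.C96" -> "C96", matching the RT directory naming convention
    shutil.copytree(d, dst / d.name.split(".", 1)[1])
```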

Issue(s) addressed

Fixes issue #582
Fixes issue #366

Testing

Without the aofrac normalization, all existing coupled baselines on hera.intel and gaea.intel passed against develop-20210513 after the additional input data was added. New baselines will be required for the new cpld_bmarkfrac_wave_v16_noahmp test.

Updating this PR to include open-water normalization in CMEPS will change all coupled baselines.

How were these changes tested? What compilers / HPCs was it tested with? Are the changes covered by regression tests? (If not, why? Do new tests need to be added?) Have regression tests and unit tests (utests) been run? On which platforms and with which compilers? (Note that unit tests can only be run on tier-1 platforms)

  • hera.intel
  • hera.gnu
  • orion.intel
  • cheyenne.intel
  • cheyenne.gnu
  • gaea.intel
  • jet.intel
  • wcoss_cray
  • wcoss_dell_p3

Dependencies

If testing this branch requires non-default branches in other repositories, list them. Those branches should have matching names (ideally).

CMEPS PR #44

DeniseWorthen and others added 30 commits beginning March 27, 2021, including a revert of commit 7b826d4.
@BrianCurtis-NOAA (Collaborator)

Machine: cheyenne
Compiler: gnu
Job: BL
Repo location: /glade/work/briancurtis/git/BrianCurtis-NOAA/ufs-weather-model/tests/auto/pr/647735813/20210524090007/ufs-weather-model
Please make changes and add the following label back:
cheyenne-gnu-BL

@DeniseWorthen (Collaborator, Author)

@BrianCurtis-NOAA I don't see the problem with the cheyenne-gnu job. Could you take a look?

@BrianCurtis-NOAA (Collaborator)

Yeah, I'll look now.

@BrianCurtis-NOAA (Collaborator)

Machine: gaea
Compiler: intel
Job: BL
Repo location: /lustre/f2/pdata/ncep/emc.nemspara/autort/pr/647735813/20210524150013/ufs-weather-model
Please manually delete: /lustre/f2/scratch/emc.nemspara/FV3_RT/rt_26151
Repo location: /lustre/f2/pdata/ncep/emc.nemspara/autort/pr/647735813/20210524162610/ufs-weather-model
Please manually delete: /lustre/f2/scratch/emc.nemspara/FV3_RT/rt_14166
Test cpld_debugfrac 030 failed in check_result failed
Test cpld_debugfrac 030 failed in run_test failed
Test cpld_debug 029 failed in check_result failed
Test cpld_debug 029 failed in run_test failed
Test fv3_wrtGlatlon_netcdf 039 failed in check_result failed
Test fv3_wrtGlatlon_netcdf 039 failed in run_test failed
Test fv3_satmedmfq 062 failed in check_result failed
Test fv3_satmedmfq 062 failed in run_test failed
Test fv3_wrtGauss_nemsio 040 failed in check_result failed
Test fv3_wrtGauss_nemsio 040 failed in run_test failed
Test fv3_control_32bit 048 failed in check_result failed
Test fv3_control_32bit 048 failed in run_test failed
Test fv3_decomp 032 failed in check_result failed
Test fv3_decomp 032 failed in run_test failed
Test fv3_2threads 033 failed in check_result failed
Test fv3_2threads 033 failed in run_test failed
Test fv3_ca 043 failed in check_result failed
Test fv3_ca 043 failed in run_test failed
Test fv3_multigases 047 failed in check_result failed
Test fv3_multigases 047 failed in run_test failed
Test fv3_wrtGauss_netcdf_esmf 036 failed in check_result failed
Test fv3_wrtGauss_netcdf_esmf 036 failed in run_test failed
Test fv3_wrtGauss_netcdf 037 failed in check_result failed
Test fv3_wrtGauss_netcdf 037 failed in run_test failed
Test fv3_stochy 042 failed in check_result failed
Test fv3_stochy 042 failed in run_test failed
Test fv3_satmedmf 061 failed in check_result failed
Test fv3_satmedmf 061 failed in run_test failed
Test fv3_control 031 failed in check_result failed
Test fv3_control 031 failed in run_test failed
Test fv3_lheatstrg 046 failed in check_result failed
Test fv3_lheatstrg 046 failed in run_test failed
Test fv3_gfdlmp_32bit 063 failed in check_result failed
Test fv3_gfdlmp_32bit 063 failed in run_test failed
Test fv3_lndp 044 failed in check_result failed
Test fv3_lndp 044 failed in run_test failed
Test fv3_gfdlmprad_32bit_post 064 failed in check_result failed
Test fv3_gfdlmprad_32bit_post 064 failed in run_test failed
Test fv3_thompson 070 failed in check_result failed
Test fv3_thompson 070 failed in run_test failed
Test fv3_rrfs_v1alpha 067 failed in check_result failed
Test fv3_rrfs_v1alpha 067 failed in run_test failed
Test fv3_thompson_no_aero 071 failed in check_result failed
Test fv3_thompson_no_aero 071 failed in run_test failed
Test fv3_rrfs_v1beta 072 failed in check_result failed
Test fv3_rrfs_v1beta 072 failed in run_test failed
Test fv3_rap 068 failed in check_result failed
Test fv3_rap 068 failed in run_test failed
Test fv3_hrrr 069 failed in check_result failed
Test fv3_hrrr 069 failed in run_test failed
Test fv3_gsd 066 failed in check_result failed
Test fv3_gsd 066 failed in run_test failed
Test cpld_decomp 006 failed in check_result failed
Test cpld_decomp 006 failed in run_test failed
Test cpld_ca 008 failed in check_result failed
Test cpld_ca 008 failed in run_test failed
Test fv3_csawmg 060 failed in check_result failed
Test fv3_csawmg 060 failed in run_test failed
Test cpld_2threads 005 failed in check_result failed
Test cpld_2threads 005 failed in run_test failed
Test fv3_wrtGauss_nemsio_c192 041 failed in check_result failed
Test fv3_wrtGauss_nemsio_c192 041 failed in run_test failed
Test fv3_stretched 049 failed in check_result failed
Test fv3_stretched 049 failed in run_test failed
Test fv3_cpt 065 failed in check_result failed
Test fv3_cpt 065 failed in run_test failed
Test fv3_stretched_nest 050 failed in check_result failed
Test fv3_stretched_nest 050 failed in run_test failed
Test cpld_controlfrac 003 failed in check_result failed
Test cpld_controlfrac 003 failed in run_test failed
Test cpld_satmedmf 007 failed in check_result failed
Test cpld_satmedmf 007 failed in run_test failed
Test cpld_control 001 failed in check_result failed
Test cpld_control 001 failed in run_test failed
Test cpld_control_wave 028 failed in check_result failed
Test cpld_control_wave 028 failed in run_test failed
Test cpld_control_c192 009 failed in check_result failed
Test cpld_control_c192 009 failed in run_test failed
Test cpld_controlfrac_c192 011 failed in check_result failed
Test cpld_controlfrac_c192 011 failed in run_test failed
Test fv3_wrtGauss_netcdf_parallel 038 failed in run_test failed
Test fv3_gfdlmp 057 failed in check_result failed
Test fv3_gfdlmp 057 failed in run_test failed
Test fv3_gfdlmprad_gwd 058 failed in check_result failed
Test fv3_gfdlmprad_gwd 058 failed in run_test failed
Test fv3_gfdlmprad_noahmp 059 failed in check_result failed
Test fv3_gfdlmprad_noahmp 059 failed in run_test failed
Test cpld_bmarkfrac 019 failed in check_result failed
Test cpld_bmarkfrac 019 failed in run_test failed
Test cpld_bmarkfrac_v16 021 failed in check_result failed
Test cpld_bmarkfrac_v16 021 failed in run_test failed
Test cpld_control_c384 013 failed in check_result failed
Test cpld_control_c384 013 failed in run_test failed
Test cpld_controlfrac_c384 015 failed in check_result failed
Test cpld_controlfrac_c384 015 failed in run_test failed
Test fv3_regional_quilt_netcdf_parallel 055 failed in check_result failed
Test fv3_regional_quilt_netcdf_parallel 055 failed in run_test failed
Test cpld_bmarkfrac_wave 025 failed in check_result failed
Test cpld_bmarkfrac_wave 025 failed in run_test failed
Test fv3_regional_quilt 053 failed in check_result failed
Test fv3_regional_quilt 053 failed in run_test failed
Test fv3_regional_quilt_hafs 054 failed in check_result failed
Test fv3_regional_quilt_hafs 054 failed in run_test failed
Test fv3_regional_control 051 failed in check_result failed
Test fv3_regional_control 051 failed in run_test failed
Test fv3_regional_quilt_RRTMGP 056 failed in check_result failed
Test fv3_regional_quilt_RRTMGP 056 failed in run_test failed
Test cpld_bmarkfrac_v16_nsst 022 failed in check_result failed
Test cpld_bmarkfrac_v16_nsst 022 failed in run_test failed
Test cpld_bmark_wave 024 failed in check_result failed
Test cpld_bmark_wave 024 failed in run_test failed
Test cpld_bmark 017 failed in check_result failed
Test cpld_bmark 017 failed in run_test failed
Test cpld_bmarkfrac_wave_v16 026 failed in check_result failed
Test cpld_bmarkfrac_wave_v16 026 failed in run_test failed
Test cpld_bmarkfrac_wave_v16_noahmp 027 failed in check_result failed
Test cpld_bmarkfrac_wave_v16_noahmp 027 failed in run_test failed
Please make changes and add the following label back:
gaea-intel-BL

@DeniseWorthen (Collaborator, Author) commented May 24, 2021

@BrianCurtis-NOAA There seems to have been a large BL creation failure which was not caught. If I look at the log here

/lustre/f2/pdata/ncep/emc.nemspara/autort/pr/647735813/20210524150013/ufs-weather-model

I can see that none of the S2S apps were actually compiled (I don't know why), but this was not detected as a failure, so none of the jobs were marked as 'failed'. The script moved the baselines into place, but only from rt_069 and higher. Unfortunately, since the test appeared to be a 'pass', the run directory (/lustre/f2/scratch/emc.nemspara/FV3_RT/rt_26151) was removed.

I can see some baselines present in the baseline directory, but when the verify step ran, a lot more jobs failed; again, I'm not sure why, but most failures seem to be because the baseline was missing.

I'm concerned that there is a disk quota issue, since there are a lot of rt_ directories (some from March) here: /lustre/f2/scratch/emc.nemspara/FV3_RT

I'm going to delete the oldest of them and try to run manually.

@BrianCurtis-NOAA (Collaborator) commented May 24, 2021

I do not currently verify that the move of files from REGRESSION_TEST_<compiler> to the baseline storage area was successful once the command is issued. This leaves the potential for a disk quota issue to go unnoticed if the baseline storage area is full, or filling up, without the code complaining about anything.

This is good to know so I can implement such a check in my autort development.
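One possible shape for such a check, sketched in Python; none of these paths or names come from the actual autort code:

```python
import shutil
from pathlib import Path

def manifest(root: Path) -> dict:
    """Map each file's path relative to root to its size in bytes."""
    return {p.relative_to(root): p.stat().st_size
            for p in root.rglob("*") if p.is_file()}

def move_baseline(src: Path, dst: Path) -> None:
    """Copy a baseline directory and verify it before deleting the source,
    so a quota-truncated copy fails loudly instead of passing silently."""
    shutil.copytree(src, dst, dirs_exist_ok=True)
    src_files, dst_files = manifest(src), manifest(dst)
    bad = [str(p) for p, size in src_files.items()
           if dst_files.get(p) != size]
    if bad:
        raise RuntimeError(
            f"baseline copy incomplete ({len(bad)} files missing or "
            f"truncated), e.g. {bad[:3]}")
    shutil.rmtree(src)  # remove the source only after verification
```

Comparing sizes is a cheap first-order check; checksums would be stricter at the cost of reading every file back.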

@DeniseWorthen (Collaborator, Author)

All RTs have completed successfully. We're ready for merge after run-ci completes and CMEPS is merged.

@DeniseWorthen DeniseWorthen merged commit c622d7d into ufs-community:develop May 25, 2021
@DeniseWorthen DeniseWorthen deleted the feature/p7a_test branch May 25, 2021 12:22
@DeniseWorthen (Collaborator, Author)

Orion initial failed tests, PR 585:

Auto-BL was successful (develop-20210524)
Auto-RT failed tests:

cpld_control_wave: err log shows

0: slurmstepd: error: *** STEP 2110073.0 ON Orion-19-19 CANCELLED AT 2021-05-23T16:52:24 ***
211: [211:Orion-19-24] unexpected DAPL connection event 0x4008 from 341
211: Fatal error in PMPI_Wait: Internal MPI error!, error stack:
211: PMPI_Wait(219)...........: MPI_Wait(request=0x17db8fa0, status=0x1) failed
211: MPIR_Wait_impl(81).......: fail failed
211: PMPIDI_CH3I_Progress(850): fail failed
211: (unknown)(): Internal MPI error!
211: In: PMI_Abort(68309776, Fatal error in PMPI_Wait: Internal MPI error!, error stack:
211: PMPI_Wait(219)...........: MPI_Wait(request=0x17db8fa0, status=0x1) failed
211: MPIR_Wait_impl(81).......: fail failed
211: PMPIDI_CH3I_Progress(850): fail failed
211: (unknown)(): Internal MPI error!)

cpld_controlfrac_c384: err log shows

0: in fcst run phase 2, na= 168
129:
129: FATAL from PE 129: NaN in input field of mpp_reproducing_sum(_2d), this indicates numerical instability

cpld_restart_bmarkfrac: err log shows

113: forrtl: severe (174): SIGSEGV, segmentation fault occurred
113: Image PC Routine Line Source
113: fv3.exe 0000000004CE2DD9 Unknown Unknown Unknown
113: libpthread-2.17.s 00002B37035B55D0 Unknown Unknown Unknown
113: fv3.exe 0000000004DB1D10 Unknown Unknown Unknown
113: fv3.exe 0000000000C34F5A _ZNSt6vectorIiSaI 372 stl_algobase.h
113: fv3.exe 0000000000C2ACEB _ZN5ESMCI2DD25set 8502 ESMCI_Array.C
113: fv3.exe 0000000000C22247 _ZN5ESMCI27sparse 2010 sparseMatMulStoreLinSeqVect.h

fv3_rrfs_v1beta: test ran, but all comparisons failed. Repeating the RT against the same baseline gave PASS.

cpld_bmarkfrac_v16: test ran, but all comparisons failed. Repeating the RT against the same baseline gave PASS.

Labels
  • Baseline Updates: Current baselines will be updated.
  • New Input Data Req'd: This PR requires new data to be synced across platforms.
  • Waiting for Reviews: The PR is waiting for reviews from associated component PRs.
Development

Successfully merging this pull request may close these issues.

  • Create fv3_conf/cpld_bmark_run.IN for new fix-file tiled input