Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[develop] Bring change of production/AQM.v7 back to develop #828

Merged
merged 20 commits into from
Jun 29, 2023

Conversation

chan-hoo
Copy link
Collaborator

@chan-hoo chan-hoo commented Jun 13, 2023

DESCRIPTION OF CHANGES:

Multiple scripts relevant to AQM are updated to meet the NCO standards and EE2 reviewer's comments based on the production/AQM.v7 branch.

  • The NCO variables such as NET, RUN, envir are renamed with the suffix _default meaning the default value. This is because they should be defined in job cards as can be seen in the NCO standards (pp.4, Table 1). In the current rocoto-based workflow, the job cards do not exist (instead, it uses the XML file). Therefore, they can be set in the configuration file and ush/machine/ file. However, for ecFlow, these variables should be replaced with those in the job cards. (The ecFlow option will be added in a separate PR later). This is done in the ush/job_preamble.sh script which is sourced in the J-job scripts.
  • NCO wants to define COMIN/COMOUT, COMINmodel, DATAROOT using the compath call available in prod_util.
  • NCO does not want to add a new directory (COMINext) to COM. The use of COMINext is removed.
  • NCO wants to use the error handling calls such as 'err_chk' and 'err_exit' in 'prod_util' which is available only on WCOSS2.
  • NCO uses the YES/NO flags while SRW App uses TRUE/FALSE. To resolve this mismatch, the boolify calls for the NCO variables are added to job_preamble.sh.
  • The 'pgmerr' function is added to POST_STEP.
  • The forecast hour in some output file names is changed to 3 digit.
  • The working directory DATA is defined and created in job_preamble.sh for the NCO mode. Therefore, if-statement for the 'community' mode is added to the J-job scripts.
  • The KEEPDATA capability does not work correctly for the shared tasks. This is fixed in this PR.
  • The input data directories are renamed to DCOMINdata.

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

TESTS CONDUCTED:

  • WE2E fundamental on Hera

  • AQM community sample run on Hera

  • AQM nco realtime sample run on WCOSS2

  • hera.intel

  • orion.intel

  • cheyenne.intel

  • cheyenne.gnu

  • gaea.intel

  • jet.intel

  • wcoss2.intel

  • NOAA Cloud (indicate which platform)

  • Jenkins

  • fundamental test suite

  • comprehensive tests (specify which if a subset was used)

ISSUE:

Fixes issue mentioned in #786

CHECKLIST

  • My code follows the style guidelines in the Contributor's Guide
  • I have performed a self-review of my own code using the Code Reviewer's Guide
  • I have commented my code, particularly in hard-to-understand areas
  • My changes need updates to the documentation. I have made corresponding changes to the documentation
  • My changes do not require updates to the documentation (explain).
  • My changes generate no new warnings
  • New and existing tests pass with my changes
  • Any dependent changes have been merged and published

Copy link
Collaborator

@MichaelLueken MichaelLueken left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@chan-hoo These changes look good to me! I noted a minor spelling issue in my review, but that is the only issue I noted. I also ran the fundamental tests on Orion and they all passed successfully. Approving now.

@@ -1004,7 +987,11 @@ nco:
#-----------------------------------------------------------------------
#
# Set variables that are only used in NCO mode (i.e. when RUN_ENVIR is
# set to "nco"). Definitions:
# set to "nco"). All variables have the suffix [_dfv] meaning the default value.
# This is bacuase they are supposed to be defined in job cards for the
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

because rather than bacuase.

Suggested change
# This is bacuase they are supposed to be defined in job cards for the
# This is because they are supposed to be defined in job cards for the

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@MichaelLueken, Oops! silly mistake! Thank you for letting me know. Fixed.

@christopherwharrop-noaa
Copy link
Collaborator

@chan-hoo - What your narrative about NCO's perspective brings to light is the fact that there is a design conflict between the extremely narrow interests of the Operational environment and what is needed for the community to do research and development. NCO's approach is not portable or usable outside of operations. And so, their requirements are causing design problems for the system because now two methods must be accommodated. This is bad software design. I don't know that there is a great way out of this problem, but it is worth noting that sometimes Operational requirements appear to dictate approaches that are clearly inferior from the larger perspective of having a single system usable for both research and operations.

@chan-hoo
Copy link
Collaborator Author

@christopherwharrop-noaa, as you may know, I don't have any power to change NCO's standards :) One of the EMC's main roles is to support NCO to run applications operationally. In order to do that, we (EMC) should follow the NCO standards and NCO's approach. Without satisfy them, NCO will not accept any new applications. You can think it is a bad software design. However, to resolve this, NCO should agree with that. It is very hard (especially from my side. I have never met NCO guys since I joined EMC :) ). Since the UFS SRW App is designed to support operation, I think we should give full consideration about this. As you know, they (NCO) are very conservative.

@christopherwharrop-noaa
Copy link
Collaborator

@chan-hoo - I am in no way suggesting we try to change or ignore NCO standards here. That is not possible. Everything you said is true. My point is that this is an ongoing problem that creates conflict between the interests of Operations and the community. Requiring an implementation that can only work in one place/system is objectively bad software design. I don't think there is any serious argument to be had on that point. But, we're stuck with it. A solution to this problem will require engagement from up the management chain, and that is outside the scope/purview of this work.

@chan-hoo
Copy link
Collaborator Author

@christinaholtNOAA, do you still have any concerns? To add the ecFlow option to develop, I hope this pr is merged as early as possible. If you have concerns relevant to ecFlow, I think we can discuss them in the PR for ecFlow later.

@chan-hoo
Copy link
Collaborator Author

@christinaholtNOAA, below is the link to my technical document. On Page 102, you can find how the NCO environment variables are updated through the workflow with ecFlow. I hope this helps you understand why [nco var]_default is necessary:
https://drive.google.com/file/d/1zyCvHmuL_LWx0Yto6B1GXF04m-n6Kbs2/view?usp=drive_link

Copy link
Collaborator

@christinaholtNOAA christinaholtNOAA left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@chan-hoo Thanks for addressing my comments. I'm sorry I haven't been available to respond for several days.

@MichaelLueken MichaelLueken added the run_we2e_coverage_tests Run the coverage set of SRW end-to-end tests label Jun 27, 2023
@MichaelLueken
Copy link
Collaborator

@chan-hoo All of the automated Jenkins tests failed. The issue is with tests/WE2E/run_WE2E_tests.py. On line 161, there is the following:

test_cfg['nco'].update({"OPSROOT": args.opsroot})

Since OPSROOT has been changed to OPSROOT_default in this PR, the automated testing is failing because OPSROOT no longer exists in config_defaults. Please change this OPSROOT to OPSROOT_default, then I can resubmit the Jenkins tests.

@chan-hoo chan-hoo added run_we2e_coverage_tests Run the coverage set of SRW end-to-end tests and removed run_we2e_coverage_tests Run the coverage set of SRW end-to-end tests labels Jun 28, 2023
Copy link
Collaborator

@MichaelLueken MichaelLueken left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@chan-hoo The Jenkins tests are now passing for Cheyenne, Jet, and Hera GNU. However, there is a WE2E test that is still failing on Hera Intel. The test that is failing is:

get_from_HPSS_ics_GDAS_lbcs_GDAS_fmt_netcdf_2022040400_ensemble_2mems

All coverage tests that are run on Hera Intel are run in NCO mode. The failure is in the make_ics/lbcs tasks:

/scratch1/NCEPDEV/stmp2/role.epic/jenkins/workspace/fs-srweather-app_pipeline_PR-828__2/scripts/exregional_make_ics.sh: line 93: /scratch1/NCEPDEV/stmp2/role.epic/jenkins/workspace/fs-srweather-app_pipeline_PR-828__2/nco_dirs/tmp/get_extrn_ics.id_1687963892_2022040400/rrfs.t00z.GDAS.ICS.extrn_mdl_var_defns.sh: No such file or directory

The ICs and LBCs are found in:

/scratch1/NCEPDEV/stmp2/role.epic/jenkins/workspace/fs-srweather-app_pipeline_PR-828__2/nco_dirs/tmp/get_extrn_ics.id_1687963892_2022040400/mem*

The location on Hera with the failed tests is:

/scratch1/NCEPDEV/stmp2/role.epic/jenkins/workspace/fs-srweather-app_pipeline_PR-828__2/expt_dirs/get_from_HPSS_ics_GDAS_lbcs_GDAS_fmt_netcdf_2022040400_ensemble_2mems

In order to make this test work, I had to add ${SLASH_ENSMEM_SUBDIR} back into exregional_make_ics.sh and exregional_make_lbcs.sh. I tried adding this variable to job_preamble.sh, but the jobs still failed with the above error messages.

Is it okay to add ${SLASH_ENSMEM_SUBDIR} back in for ensembles, or would that go against NCO standards? If it goes against NCO standards to reintroduce this variable in these two scripts, can you think of another way to ensure that ensembles can properly run in NCO mode? Thanks!

scripts/exregional_make_ics.sh Outdated Show resolved Hide resolved
scripts/exregional_make_lbcs.sh Outdated Show resolved Hide resolved
@chan-hoo
Copy link
Collaborator Author

@MichaelLueken, thanks for testing. It should be fine. I've added SLASH_ENSEM_SUBDIR to the paths. Please run the tests again.

@MichaelLueken
Copy link
Collaborator

Thanks, @chan-hoo! I have resubmitted the automated Jenkins testing.

Copy link
Collaborator

@MichaelLueken MichaelLueken left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@chan-hoo The same test failed, but due to:

/scratch1/NCEPDEV/stmp2/role.epic/jenkins/workspace/fs-srweather-app_pipeline_PR-828__2/scripts/exregional_make_ics.sh: line 87: SLASH_ENSEM_SUBDIR: unbound variable

The added variable should have been ${SLASH_ENSMEM_SUBDIR}, rather than ${SLASH_ENSEM_SUBDIR}. I'm very sorry about this.

I'll resubmit the tests one last time after this has been updated, then move forward with the merge.

scripts/exregional_make_ics.sh Outdated Show resolved Hide resolved
scripts/exregional_make_lbcs.sh Outdated Show resolved Hide resolved
@chan-hoo
Copy link
Collaborator Author

@MichaelLueken, I am sorry. I didn't double-check it. I've updated it. Thanks again for the testing.

@MichaelLueken
Copy link
Collaborator

Now that Orion is back from yesterday's maintenance, the WE2E tests were manually ran and all passed successfully:

----------------------------------------------------------------------------------------------------
Experiment name                                                  | Status    | Core hours used 
----------------------------------------------------------------------------------------------------
deactivate_tasks                                                   COMPLETE               1.04
get_from_AWS_ics_GEFS_lbcs_GEFS_fmt_grib2_2022040400_ensemble_2me  COMPLETE             751.05
grid_CONUS_3km_GFDLgrid_ics_FV3GFS_lbcs_FV3GFS_suite_RRFS_v1beta   COMPLETE             252.27
grid_RRFS_AK_13km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16_plot        COMPLETE             141.12
grid_RRFS_CONUS_25km_ics_NAM_lbcs_NAM_suite_RRFS_v1beta            COMPLETE              14.19
grid_RRFS_CONUS_25km_ics_GSMGFS_lbcs_GSMGFS_suite_GFS_2017_gfdlmp  COMPLETE              10.21
grid_RRFS_CONUS_3km_ics_FV3GFS_lbcs_FV3GFS_suite_HRRR              COMPLETE             380.56
grid_RRFS_CONUScompact_13km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16   COMPLETE              28.19
grid_RRFS_CONUScompact_3km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16    COMPLETE             279.65
grid_SUBCONUS_Ind_3km_ics_FV3GFS_lbcs_FV3GFS_suite_WoFS_v0         COMPLETE              13.12
nco                                                                COMPLETE               7.70
2020_CAD                                                           COMPLETE              31.47
----------------------------------------------------------------------------------------------------
Total                                                              COMPLETE            1910.57

Awaiting completion of automated Jenkins testing before moving forward with the merge.

@MichaelLueken
Copy link
Collaborator

@chan-hoo The automated tests on Cheyenne, Jet, and Hera Intel successfully passed. The get_from_NOMADS_ics_FV3GFS_lbcs_FV3GFS failed on Hera GNU. The get_extrn_ics task failed. Using rocotorewind/rocotoboot, the test successfully passed. I will now move forward with merging this work.

[Michael.Lueken@hfe09 get_from_NOMADS_ics_FV3GFS_lbcs_FV3GFS]$ rocotostat -w FV3LAM_wflow.xml -d FV3LAM_wflow.db -v 10 | more
       CYCLE                    TASK                       JOBID               STATE         EXIT STATUS     TRIES      DURATION
================================================================================================================================
202306270000               make_grid                    46515760           SUCCEEDED                   0         1           9.0
202306270000               make_orog                    46515788           SUCCEEDED                   0         1          58.0
202306270000          make_sfc_climo                    46515993           SUCCEEDED                   0         1          74.0
202306270000           get_extrn_ics                    46516729           SUCCEEDED                   0         1         290.0
202306270000          get_extrn_lbcs                    46516768           SUCCEEDED                   0         1         483.0
202306270000         make_ics_mem000                    46517185           SUCCEEDED                   0         1         660.0
202306270000        make_lbcs_mem000                    46517186           SUCCEEDED                   0         1        1214.0
202306270000         run_fcst_mem000                    46517792           SUCCEEDED                   0         1        1458.0
202306270000    run_post_mem000_f000                    46518001           SUCCEEDED                   0         1          13.0
202306270000    run_post_mem000_f001                    46518233           SUCCEEDED                   0         1          12.0
202306270000    run_post_mem000_f002                    46518234           SUCCEEDED                   0         1          13.0
202306270000    run_post_mem000_f003                    46518280           SUCCEEDED                   0         1          13.0
202306270000    run_post_mem000_f004                    46518448           SUCCEEDED                   0         1          13.0
202306270000    run_post_mem000_f005                    46518449           SUCCEEDED                   0         1          14.0
202306270000    run_post_mem000_f006                    46518450           SUCCEEDED                   0         1          13.0

@MichaelLueken MichaelLueken merged commit 9b9942f into ufs-community:develop Jun 29, 2023
@chan-hoo chan-hoo deleted the feature/prod_back branch July 1, 2023 10:30
MichaelLueken pushed a commit that referenced this pull request Jul 25, 2023
The renaming of the configuration variable NET to NET_default by PR #828 broke the WE2E test MET_ensemble_verification_only_vx_time_lag (see Issue #851). This PR fixes the bug and adds that WE2E test to the list of coverage tests on Hera (currently that test is only in the comprehensive tests list).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
run_we2e_coverage_tests Run the coverage set of SRW end-to-end tests
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Data structure, error handling, and COM definition should be updated to meet NCO standards
6 participants