-
Notifications
You must be signed in to change notification settings - Fork 120
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[develop] Bring change of production/AQM.v7 back to develop #828
[develop] Bring change of production/AQM.v7 back to develop #828
Conversation
… into feature/prod_back
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@chan-hoo These changes look good to me! I noted a minor spelling issue in my review, but that is the only issue I noted. I also ran the fundamental tests on Orion and they all passed successfully. Approving now.
ush/config_defaults.yaml
Outdated
@@ -1004,7 +987,11 @@ nco: | |||
#----------------------------------------------------------------------- | |||
# | |||
# Set variables that are only used in NCO mode (i.e. when RUN_ENVIR is | |||
# set to "nco"). Definitions: | |||
# set to "nco"). All variables have the suffix [_dfv] meaning the default value. | |||
# This is bacuase they are supposed to be defined in job cards for the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
because rather than bacuase.
# This is bacuase they are supposed to be defined in job cards for the | |
# This is because they are supposed to be defined in job cards for the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@MichaelLueken, Oops! silly mistake! Thank you for letting me know. Fixed.
@chan-hoo - What your narrative about NCO's perspective brings to light is the fact that there is a design conflict between the extremely narrow interests of the Operational environment and what is needed for the community to do research and development. NCO's approach is not portable or usable outside of operations. And so, their requirements are causing design problems for the system because now two methods must be accommodated. This is bad software design. I don't know that there is a great way out of this problem, but it is worth noting that sometimes Operational requirements appear to dictate approaches that are clearly inferior from the larger perspective of having a single system usable for both research and operations. |
@christopherwharrop-noaa, as you may know, I don't have any power to change NCO's standards :) One of the EMC's main roles is to support NCO to run applications operationally. In order to do that, we (EMC) should follow the NCO standards and NCO's approach. Without satisfy them, NCO will not accept any new applications. You can think it is a bad software design. However, to resolve this, NCO should agree with that. It is very hard (especially from my side. I have never met NCO guys since I joined EMC :) ). Since the UFS SRW App is designed to support operation, I think we should give full consideration about this. As you know, they (NCO) are very conservative. |
@chan-hoo - I am in no way suggesting we try to change or ignore NCO standards here. That is not possible. Everything you said is true. My point is that this is an ongoing problem that creates conflict between the interests of Operations and the community. Requiring an implementation that can only work in one place/system is objectively bad software design. I don't think there is any serious argument to be had on that point. But, we're stuck with it. A solution to this problem will require engagement from up the management chain, and that is outside the scope/purview of this work. |
… into feature/prod_back
@christinaholtNOAA, do you still have any concerns? To add the ecFlow option to develop, I hope this pr is merged as early as possible. If you have concerns relevant to ecFlow, I think we can discuss them in the PR for ecFlow later. |
@christinaholtNOAA, below is the link to my technical document. On Page 102, you can find how the NCO environment variables are updated through the workflow with ecFlow. I hope this helps you understand why |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@chan-hoo Thanks for addressing my comments. I'm sorry I haven't been available to respond for several days.
@chan-hoo All of the automated Jenkins tests failed. The issue is with
Since |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@chan-hoo The Jenkins tests are now passing for Cheyenne, Jet, and Hera GNU. However, there is a WE2E test that is still failing on Hera Intel. The test that is failing is:
get_from_HPSS_ics_GDAS_lbcs_GDAS_fmt_netcdf_2022040400_ensemble_2mems
All coverage tests that are run on Hera Intel are run in NCO mode. The failure is in the make_ics/lbcs tasks:
/scratch1/NCEPDEV/stmp2/role.epic/jenkins/workspace/fs-srweather-app_pipeline_PR-828__2/scripts/exregional_make_ics.sh: line 93: /scratch1/NCEPDEV/stmp2/role.epic/jenkins/workspace/fs-srweather-app_pipeline_PR-828__2/nco_dirs/tmp/get_extrn_ics.id_1687963892_2022040400/rrfs.t00z.GDAS.ICS.extrn_mdl_var_defns.sh: No such file or directory
The ICs and LBCs are found in:
/scratch1/NCEPDEV/stmp2/role.epic/jenkins/workspace/fs-srweather-app_pipeline_PR-828__2/nco_dirs/tmp/get_extrn_ics.id_1687963892_2022040400/mem*
The location on Hera with the failed tests is:
/scratch1/NCEPDEV/stmp2/role.epic/jenkins/workspace/fs-srweather-app_pipeline_PR-828__2/expt_dirs/get_from_HPSS_ics_GDAS_lbcs_GDAS_fmt_netcdf_2022040400_ensemble_2mems
In order to make this test work, I had to add ${SLASH_ENSMEM_SUBDIR}
back into exregional_make_ics.sh
and exregional_make_lbcs.sh
. I tried adding this variable to job_preamble.sh
, but the jobs still failed with the above error messages.
Is it okay to add ${SLASH_ENSMEM_SUBDIR} back in for ensembles, or would that go against NCO standards? If it goes against NCO standards to reintroduce this variable in these two scripts, can you think of another way to ensure that ensembles can properly run in NCO mode? Thanks!
@MichaelLueken, thanks for testing. It should be fine. I've added SLASH_ENSEM_SUBDIR to the paths. Please run the tests again. |
Thanks, @chan-hoo! I have resubmitted the automated Jenkins testing. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@chan-hoo The same test failed, but due to:
/scratch1/NCEPDEV/stmp2/role.epic/jenkins/workspace/fs-srweather-app_pipeline_PR-828__2/scripts/exregional_make_ics.sh: line 87: SLASH_ENSEM_SUBDIR: unbound variable
The added variable should have been ${SLASH_ENSMEM_SUBDIR}
, rather than ${SLASH_ENSEM_SUBDIR}
. I'm very sorry about this.
I'll resubmit the tests one last time after this has been updated, then move forward with the merge.
@MichaelLueken, I am sorry. I didn't double-check it. I've updated it. Thanks again for the testing. |
Now that Orion is back from yesterday's maintenance, the WE2E tests were manually ran and all passed successfully:
Awaiting completion of automated Jenkins testing before moving forward with the merge. |
@chan-hoo The automated tests on Cheyenne, Jet, and Hera Intel successfully passed. The
|
The renaming of the configuration variable NET to NET_default by PR #828 broke the WE2E test MET_ensemble_verification_only_vx_time_lag (see Issue #851). This PR fixes the bug and adds that WE2E test to the list of coverage tests on Hera (currently that test is only in the comprehensive tests list).
DESCRIPTION OF CHANGES:
Multiple scripts relevant to AQM are updated to meet the NCO standards and EE2 reviewer's comments based on the
production/AQM.v7
branch._default
meaning the default value. This is because they should be defined injob cards
as can be seen in the NCO standards (pp.4, Table 1). In the current rocoto-based workflow, the job cards do not exist (instead, it uses the XML file). Therefore, they can be set in the configuration file andush/machine/
file. However, for ecFlow, these variables should be replaced with those in the job cards. (The ecFlow option will be added in a separate PR later). This is done in theush/job_preamble.sh
script which is sourced in the J-job scripts.compath
call available inprod_util
.YES/NO
flags while SRW App usesTRUE/FALSE
. To resolve this mismatch, theboolify
calls for the NCO variables are added tojob_preamble.sh
.POST_STEP
.DATA
is defined and created injob_preamble.sh
for the NCO mode. Therefore, if-statement for the 'community' mode is added to the J-job scripts.KEEPDATA
capability does not work correctly for the shared tasks. This is fixed in this PR.DCOMINdata
.Type of change
TESTS CONDUCTED:
WE2E fundamental on Hera
AQM community sample run on Hera
AQM nco realtime sample run on WCOSS2
hera.intel
orion.intel
cheyenne.intel
cheyenne.gnu
gaea.intel
jet.intel
wcoss2.intel
NOAA Cloud (indicate which platform)
Jenkins
fundamental test suite
comprehensive tests (specify which if a subset was used)
ISSUE:
Fixes issue mentioned in #786
CHECKLIST