[develop] Add ufs-case-study data options #736

Merged (2 commits) on Apr 25, 2023

Conversation

clouden90
Contributor

DESCRIPTION OF CHANGES:

Added capabilities to retrieve IC/BC files for UFS-CASE-STUDIES in retrieve_data.py and associated unit tests in test_retrieve_data.py.

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

TESTS CONDUCTED:

Two Python unit tests (test_ufs_ics_from_aws and test_ufs_lbcs_from_aws) have been added and tested on Hera and Orion.

  • hera.intel
  • orion.intel
  • cheyenne.intel
  • cheyenne.gnu
  • gaea.intel
  • jet.intel
  • wcoss2.intel
  • NOAA Cloud (indicate which platform)
  • Jenkins
  • fundamental test suite
  • comprehensive tests (specify which if a subset was used)

ISSUE:

Fixes issue mentioned in #734

CHECKLIST

  • My code follows the style guidelines in the Contributor's Guide
  • I have performed a self-review of my own code using the Code Reviewer's Guide
  • I have commented my code, particularly in hard-to-understand areas
  • My changes need updates to the documentation. I have made corresponding changes to the documentation
  • My changes do not require updates to the documentation (explain).
  • My changes generate no new warnings
  • New and existing tests pass with my changes
  • Any dependent changes have been merged and published

LABELS (optional):

A Code Manager needs to add the following labels to this PR:

  • Work In Progress
  • bug
  • enhancement
  • documentation
  • release
  • high priority
  • run_ci
  • run_we2e_fundamental_tests
  • run_we2e_comprehensive_tests
  • Needs Cheyenne test
  • Needs Jet test
  • Needs Hera test
  • Needs Orion test
  • help wanted

Collaborator

@MichaelLueken left a comment

@clouden90 These changes look good to me! The unittests were approved and all successfully passed. Out of curiosity, what is the next step in this work? Approving now.

Jenkins tests will be submitted in the morning, hopefully with Gaea back in service.

@clouden90
Contributor Author

@MichaelLueken, thanks for the prompt response; much appreciated. The main purpose of this work is to fulfill the requirement of adding selected UFS-case-study test cases to the existing SRW App with minimal modifications. After reviewing the current structure of the SRW App, I think the best next step is to add an example YAML config file to the WE2E test suites, with minor modifications to the ICs/LBCs staging scripts. Again, we will try to minimize unnecessary changes while bringing the ufs-case-study cases into the SRW App. Any thoughts are welcome!

@MichaelLueken added the run_we2e_coverage_tests label (Run the coverage set of SRW end-to-end tests) on Apr 19, 2023
@MichaelLueken
Collaborator

@clouden90 There were some failures encountered in the Jenkins tests:

There was one failure on Cheyenne Intel:

grid_RRFS_CONUS_25km_ics_NAM_lbcs_NAM_suite_GFS_v16 which failed with the following error message in the run_MET_GridStat_vx_REFC_mem001 task:

ERROR: run_metplus failed: [Errno 17] File exists: '/glade/scratch/epicufsrt/jenkins/workspace/fs-srweather-app_pipeline_PR-736__2/expt_dirs/grid_RRFS_CONUS_25km_ics_NAM_lbcs_NAM_suite_GFS_v16/2021051212/mem001/metprd/GridStat'

A rerun of the failed test passed without issue.
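
For context, "[Errno 17] File exists" is the error Python raises (as FileExistsError) when a directory-creation call finds the path already present. Since the rerun passed, one plausible explanation is that two tasks tried to create the same GridStat output directory at nearly the same time; that is an assumption, not something confirmed in the logs above. A minimal sketch of race-tolerant directory creation, using a hypothetical helper name (ensure_dir) and a temporary directory as a stand-in for the real experiment path:

```python
import os
import tempfile


def ensure_dir(path):
    """Create path (and any parents) without failing if it already exists."""
    os.makedirs(path, exist_ok=True)
    return path


# Temporary stand-in for the real .../mem001/metprd/GridStat location.
base = tempfile.mkdtemp()
out_dir = os.path.join(base, "metprd", "GridStat")

ensure_dir(out_dir)
ensure_dir(out_dir)  # second call succeeds quietly instead of raising

# For comparison, a plain os.mkdir on an existing directory raises
# FileExistsError, which is reported as "[Errno 17] File exists".
try:
    os.mkdir(out_dir)
    plain_mkdir_failed = False
except FileExistsError:
    plain_mkdir_failed = True
```

This only illustrates the error class; it does not claim to reflect how run_metplus creates its output directories.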

There were two failures on Jet:

custom_GFDLgrid failed with the following error message in the make_ics/lbcs tasks:

/var/spool/slurmd/job26283328/slurm_script: line 21: /mnt/lfs4/HFIP/hfv3gfs/role.epic/jenkins/workspace/fs-srweather-app_pipeline_PR-736/ush/load_modules_run_task.sh: Input/output error

grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_RAP_suite_HRRR failed with the following error message in the make_lbcs_mem000 task:

slurmstepd: error: execve(): /mnt/lfs4/HFIP/hfv3gfs/role.epic/jenkins/workspace/fs-srweather-app_pipeline_PR-736/install_intel/exec/chgres_cube: Input/output error

Reruns of custom_GFDLgrid have successfully passed. Reruns of grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_RAP_suite_HRRR continue to fail, with different error messages every time. Will continue testing this, since your changes shouldn't be affecting the WE2E tests.

If you would like to see the pipeline, please see - https://jenkins.epic.oarcloud.noaa.gov/blue/organizations/jenkins/ufs-srweather-app%2pipeline/detail/PR-736/1/pipeline

@yichengt90

Thanks for the quick update, @MichaelLueken! I will take a look. As you mentioned, my changes should not have any impact on the existing WE2E tests, so I would appreciate it if you could continue testing. Thanks!

@MichaelLueken
Collaborator

@clouden90 No worries! As an update, using rocotorewind/rocotoboot to remove the failed task and rerun ultimately worked for the grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_RAP_suite_HRRR test. This work should be good to go.
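
For anyone repeating this, the rewind-then-boot sequence can be sketched as below. The helper name, the workflow file names, and the cycle string are placeholders assumed for illustration (the actual cycle for this test is not given in this thread); the -w, -d, -c, and -t options are the standard Rocoto workflow, database, cycle, and task arguments. The sketch only builds and prints the commands rather than running them:

```python
import shlex


def rocoto_rerun_commands(workflow_xml, database, cycle, task):
    """Return the rocotorewind and rocotoboot command lines for one task."""
    common = ["-w", workflow_xml, "-d", database, "-c", cycle, "-t", task]
    return [["rocotorewind", *common], ["rocotoboot", *common]]


# Placeholder file names and cycle; substitute the experiment's real values.
commands = rocoto_rerun_commands(
    "FV3LAM_wflow.xml", "FV3LAM_wflow.db", "YYYYMMDDHHMM", "make_lbcs_mem000"
)
for cmd in commands:
    print(shlex.join(cmd))
```

Rewinding first clears the failed task's recorded state so that the subsequent boot resubmits it cleanly.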

Collaborator

@panll left a comment

Looks good to me.

@mkavulich
Collaborator

@clouden90 Sorry to bring up discussion on an already-merged PR, but this seems like the most relevant place. Is there a way we could use grib2 (or even a newer case after the switch to netcdf data) for these test cases? The "retrieve_data.py" unit tests now take a very long time to run (more than an hour for a single file on Hera!) due to having to retrieve very large nemsio files over the internet.

@clouden90
Contributor Author

Hi @mkavulich, no worries, and thanks for bringing it up! I have also noticed this issue on some machines, such as Hera. The main goal of this work is to fulfill the NOAA-EPIC task requirement to have selected ufs-case-studies cases in the SRW App so users can experiment with them. Unfortunately, at this moment only nemsio files are provided for those cases. Below are a few alternatives:

  1. Simply skip the "ufs-case-study" unit tests based on a "CI" environment variable;
  2. Per discussion with Sylvia from the NOAA-EPIC DTO team, if the PO for the SRW release thinks it is worthwhile to have additional LBCs/ICs files for these cases in the SRW cloud buckets and approves it, then we can consider adding the associated grib2 files to the SRW S3 buckets.
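
As a rough sketch of alternative 1, assuming the CI system exports CI=true (an assumption; the exact variable value is not stated here), the slow tests could be guarded with unittest.skipIf. The class name, helper name, and placeholder test body below are illustrative only and are not the actual contents of test_retrieve_data.py:

```python
import os
import unittest


def running_in_ci(env):
    """Return True when the given environment mapping marks a CI run."""
    return env.get("CI") == "true"


class UfsCaseStudyDownloads(unittest.TestCase):
    """Illustrative stand-in for the ufs-case-study retrieval tests."""

    @unittest.skipIf(
        running_in_ci(os.environ), "Skipping large nemsio downloads in CI"
    )
    def test_ufs_ics_from_aws(self):
        # Placeholder body: the real test would retrieve files from AWS.
        self.assertTrue(True)


suite = unittest.defaultTestLoader.loadTestsFromTestCase(UfsCaseStudyDownloads)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

With this guard the test is reported as skipped when CI=true and runs normally otherwise, so the long download is still exercised on demand outside CI.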

Any thoughts are more than welcome!

@JeffBeck-NOAA
Collaborator

@clouden90, @mkavulich, I would strongly advocate for grib2 use over nemsio.

Labels
enhancement (New feature or request), run_we2e_coverage_tests (Run the coverage set of SRW end-to-end tests)
Development

Successfully merging this pull request may close these issues.

Data retrieval for UFS-CASE STUDIES
6 participants