Update develop branch for generic linux, MacOS capability #539

Merged

Conversation

mkavulich
Collaborator

DESCRIPTION OF CHANGES:

This change will add the capability to run regional_workflow (as part of the SRW app) on MacOS and generic LINUX platforms. Most of these changes are identical to those in #402 (hash bc08607) but some additional modifications needed to be made due to intervening changes in the develop branch.

There are still a few issues that may need resolving on MacOS. Currently, UFS_UTILS and post tasks occasionally fail on MacOS when using multiple processors (mpirun -np 1 always succeeds). These problems likely need to be addressed outside of regional_workflow, so I don't believe they should hold up this PR.

TESTS CONDUCTED:

Ran Graduate Student Test on new platforms:

  • my personal Mac machine (MacOS Catalina 10.15.7) with gnu 9.4.0 compilers
  • Cheyenne compute node as a faux "stand-alone" machine, intel 19.1.1 compilers

Ran suite of end-to-end tests on Cheyenne (intel/19.1.1) and Hera (intel/18.0.5.274). All passed as expected:

  • GST_release_public_v1
  • grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15p2
  • grid_RRFS_CONUS_25km_ics_HRRR_lbcs_RAP_suite_RRFS_v1alpha
  • grid_RRFS_CONUS_13km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15p2
  • grid_RRFS_CONUS_13km_ics_HRRR_lbcs_RAP_suite_RRFS_v1alpha
  • grid_RRFS_CONUS_3km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15p2 (Hera only)
  • grid_RRFS_CONUS_3km_ics_HRRR_lbcs_RAP_suite_RRFS_v1alpha (Hera only)

This has not been tested on WCOSS yet; it probably should be as some of the changed stanzas are specific to the WCOSS platform.

DEPENDENCIES:

This can be merged as-is; however, it will not work on MacOS platforms until ufs-community/UFS_UTILS#545 is merged.

DOCUMENTATION:

Documentation pending (currently in Google doc: https://docs.google.com/document/d/14jlvL3nOi85NJCWSNnhHjjBsrAsR-mr8JfGFvVIybFI/edit)

ISSUE:

Will resolve #369

@mkavulich mkavulich added the enhancement (New feature or request), Needs WCOSS test (Testing needs to be run on WCOSS), Tested on Hera (Tested successfully on Hera machine), Tested on Cheyenne (Successfully tested on NCAR Cheyenne machine), and Tested on MacOS (Tests ran successfully on MacOS system) labels Jul 8, 2021
RUN_CMD_FCST="mpirun -np \${PE_MEMBER01}"
RUN_CMD_POST="mpirun -np 1"
#
#-----------------------------------------------------------------------
Collaborator

What happens on Linux/MacOS when someone doesn't have MPI installed?

Collaborator Author

@JeffBeck-NOAA These are default commands; they should be overwritten by users in their "config.sh" file if they want to use something else.
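
The override pattern described in this reply can be sketched roughly as follows. This is only an illustration, not the workflow's actual sourcing logic: the temporary file standing in for a user's "config.sh" and the replacement value `mpiexec -n 4` are assumptions made for the example.

```shell
#!/usr/bin/env bash
# Sketch only: defaults are set first, then a user-supplied file overrides them.
# The two defaults mirror the diff above; everything else is hypothetical.
RUN_CMD_FCST="mpirun -np \${PE_MEMBER01}"
RUN_CMD_POST="mpirun -np 1"

# Stand-in for a user's config.sh (hypothetical contents).
user_config="$(mktemp)"
cat > "${user_config}" <<'EOF'
RUN_CMD_POST="mpiexec -n 4"
EOF

# Sourcing the user file after the defaults lets its values win.
. "${user_config}"
rm -f "${user_config}"

echo "RUN_CMD_FCST=${RUN_CMD_FCST}"
echo "RUN_CMD_POST=${RUN_CMD_POST}"
```

Only the variable named in the user file changes; the forecast command keeps its default.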

"LINUX")
APRUN=$RUN_CMD_POST
;;

*)
Collaborator

Just so I understand, these $RUN_* variables are required for MacOS when mpirun is called?

Collaborator Author

@JeffBeck-NOAA These are the run commands for your individual machine that go before the call to the executable. They can be "mpirun", "mpiexec", "weird_proprietary_run_command", or even blank if you're okay just running the executable directly.
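
To illustrate how a launcher prefix that may be blank behaves, here is a minimal sketch. The helper `run_with_launcher` is hypothetical and not part of regional_workflow, and `env` stands in for `mpirun` so the example runs on a machine without MPI.

```shell
#!/usr/bin/env bash
# Sketch only: prepend an optional launcher to a command.
# run_with_launcher is a hypothetical helper, not part of regional_workflow.
run_with_launcher() {
  local launcher="$1"
  shift
  # ${launcher} is left unquoted on purpose: a value such as "mpirun -np 1"
  # splits into separate words, and an empty value disappears entirely.
  ${launcher} "$@"
}

# Empty launcher: the executable is run directly.
direct_out="$(run_with_launcher "" echo "post-processing step")"

# Non-empty launcher: "env" stands in for mpirun in this sketch.
wrapped_out="$(run_with_launcher "env" echo "post-processing step")"

echo "${direct_out}"
echo "${wrapped_out}"
```

Both calls run the same command; the only difference is whether a launcher word precedes it.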

@chan-hoo
Collaborator

chan-hoo commented Jul 15, 2021

@mkavulich, I ran one WE2E test on the WCOSS Dell with your branch. Although all the tasks reported 'SUCCEEDED', the 'Workflow status' remained 'IN PROGRESS', and 'launch_FV3LAM_wflow.sh' printed the following error messages:
./launch_FV3LAM_wflow.sh: line 312: SED: unbound variable
./launch_FV3LAM_wflow.sh: line 313: SED: unbound variable
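
An "unbound variable" message of this kind is what bash prints when a script running with `set -u` references a variable that was never set. Whether that is exactly what happened at lines 312 and 313 is an assumption; the sketch below only shows the general mechanism and one possible guard, using a default expansion. The sample strings are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: guard against an "unbound variable" error under "set -u".
set -u
unset SED

# Referencing ${SED} directly at this point would abort the script.
# A default expansion falls back to plain "sed" when SED is unset or empty.
sed_cmd="${SED:-sed}"

result="$(echo "status: pending" | ${sed_cmd} 's/pending/done/')"
echo "${result}"
```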

… limited function on the old MACOS bash (readlink, ln, sed, and date) with a variable, and add a file (ush/bash_utils/define_macos_utilities.sh) to set these variables appropriately for the compatible MACOS utilities if the OS is Darwin.
…owercase with custom functions that are back-compatible with MACOS
 - Add WORKFLOW_MANAGER variable for specifying machines that support rocoto (and maybe other workflow managers in the future)
 - Add run commands for specifying mpirun commands on non-rocoto machines
…ot quit unless you manually check for non-zero exit status
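
The approach these commit messages describe (utility names held in variables and chosen by OS, plus case conversion that works on older bash) might look something like the sketch below. It is not the contents of ush/bash_utils/define_macos_utilities.sh. The helper names `pick_util` and `to_lower`, the variable `DATE_UTIL`, and the g-prefixed GNU tool names (as installed by Homebrew, for example) are assumptions for illustration.

```shell
#!/usr/bin/env bash
# Sketch only: choose utilities by OS and lowercase a string portably.
# pick_util and to_lower are hypothetical helpers.
pick_util() {
  # Print the first candidate found on PATH; fall back to the last one.
  local candidate
  for candidate in "$@"; do
    if command -v "${candidate}" >/dev/null 2>&1; then
      echo "${candidate}"
      return 0
    fi
  done
  echo "${candidate}"
}

if [ "$(uname -s)" = "Darwin" ]; then
  # Prefer GNU versions when they are installed under g-prefixed names.
  SED="$(pick_util gsed sed)"
  READLINK="$(pick_util greadlink readlink)"
  DATE_UTIL="$(pick_util gdate date)"
else
  SED="sed"
  READLINK="readlink"
  DATE_UTIL="date"
fi

# Lowercase conversion that avoids the ${var,,} expansion, which needs
# bash 4 or later; macOS ships bash 3.2.
to_lower() {
  printf '%s' "$1" | tr '[:upper:]' '[:lower:]'
}

machine="$(to_lower "MACOS")"
echo "SED=${SED} READLINK=${READLINK} DATE_UTIL=${DATE_UTIL} machine=${machine}"
```

Scripts would then call `${SED}` or `${READLINK}` instead of the bare command names, so a single definition file decides which implementation is used.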
@chan-hoo
Collaborator

@mkavulich, WE2E tests failed on the 'run_fcst' task again. I copied the log files to Hera:/scratch2/NCEPDEV/fv3-cam/Chan-hoo.Jeon/log

@climbfuji
Collaborator

@chan-hoo please try to pull mkavulich#1 into regional_workflow and try again (from ./generate_FV3LAM_wflow.sh onwards).

@chan-hoo
Collaborator

@climbfuji, it doesn't work. Its log files are copied to: /scratch2/NCEPDEV/fv3-cam/Chan-hoo.Jeon/log_dom/

…ug_fixes_dom_20210916

Update develop for macos linux bug fixes Dom 20210916
@climbfuji
Collaborator

climbfuji commented Sep 17, 2021

> @climbfuji, it doesn't work. Its log files are copied to: /scratch2/NCEPDEV/fv3-cam/Chan-hoo.Jeon/log_dom/

This is something for the regional workflow people to look at; these log files are so cryptic that I have no idea what is going on. You may have to copy over the entire experiment directory. Note that I do have access to WCOSS, so if you need me to look at something (other than regional_workflow logs) in the future, there is no need to copy it to Hera.

@mkavulich @JeffBeck-NOAA @gsketefian

@mkavulich
Collaborator Author

@chan-hoo Sorry, it looks like there is still a problem with generating the XML file with the newest changes. I am working on it.

@climbfuji
Collaborator

That at least explains why I didn't have any problems using the standalone wrappers on macOS.

@mkavulich mkavulich added the Tested on Cheyenne (Successfully tested on NCAR Cheyenne machine), Tested on Hera (Tested successfully on Hera machine), Tested on MacOS (Tests ran successfully on MacOS system), and Tested on WCOSS (Successfully tested on WCOSS machine) labels and removed the Needs WCOSS test (Testing needs to be run on WCOSS), testing on hera, and testing on cheyenne labels Sep 20, 2021
Collaborator

@chan-hoo chan-hoo left a comment

Three WE2E tests (RRFS_CONUS_25km, 13km, and 3km) completed successfully on the WCOSS Dell and Cray.

@mkavulich
Collaborator Author

Okay, with lots of help from @climbfuji and @chan-hoo, we have completed a pretty comprehensive battery of tests on these changes:

Cheyenne (intel)

  • grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_2017_gfdlmp
  • grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_2017_gfdlmp_regional
  • grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15p2
  • grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16
  • grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GSD_SAR
  • grid_RRFS_CONUS_25km_ics_GSMGFS_lbcs_GSMGFS_suite_GFS_2017_gfdlmp
  • grid_RRFS_CONUS_25km_ics_GSMGFS_lbcs_GSMGFS_suite_GFS_v15p2
  • grid_RRFS_CONUS_25km_ics_HRRR_lbcs_HRRR_suite_GSD_SAR
  • grid_RRFS_CONUS_25km_ics_HRRR_lbcs_HRRR_suite_HRRR
  • grid_RRFS_CONUS_25km_ics_HRRR_lbcs_HRRR_suite_RRFS_v1beta
  • grid_RRFS_CONUS_25km_ics_HRRR_lbcs_RAP_suite_GSD_SAR
  • grid_RRFS_CONUS_25km_ics_HRRR_lbcs_RAP_suite_GSD_v0
  • grid_RRFS_CONUS_25km_ics_HRRR_lbcs_RAP_suite_HRRR
  • grid_RRFS_CONUS_25km_ics_HRRR_lbcs_RAP_suite_RRFS_v1beta
  • MET_ensemble_verification

Hera (intel)

  • grid_CONUS_25km_GFDLgrid_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16
  • grid_RRFS_AK_13km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16
  • grid_RRFS_AK_3km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16
  • grid_RRFS_CONUS_13km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15p2
  • grid_RRFS_CONUS_13km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16
  • grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_2017_gfdlmp
  • grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_2017_gfdlmp_regional
  • grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15p2
  • grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16
  • grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GSD_SAR
  • grid_RRFS_NA_13km_ics_FV3GFS_lbcs_FV3GFS_suite_GSD_v0
  • grid_RRFS_SUBCONUS_3km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16
  • MET_ensemble_verification

Other

  • RRFS_CONUS_25km,13km,3km on WCOSS Dell and Cray
  • RRFS_CONUS_25km end-to-end test on MacOS
  • RRFS_CONUS_25km end-to-end test on RedHat Linux (gnu 9)

@mkavulich mkavulich merged commit 09ae4f1 into ufs-community:develop Sep 22, 2021
@climbfuji
Collaborator

Excellent! Now we only need an update for Externals.cfg in the ufs-srweather-app, right?

christinaholtNOAA added a commit to christinaholtNOAA/regional_workflow that referenced this pull request Nov 2, 2021
These changes are all consistent with PR ufs-community#539 (merged) and PR ufs-community#617 (under review) for the develop branch of the NOAA-EMC:regional_workflow repository.

This should enable the use of a generic Linux machine with Rocoto for all jobs that exist in the current develop branch. More work is needed to tackle those tasks that do not yet exist in the develop branch, but exist here.
SamuelTrahanNOAA pushed a commit to SamuelTrahanNOAA/regional_workflow that referenced this pull request Sep 15, 2023
Labels
enhancement (New feature or request), Tested on Cheyenne (Successfully tested on NCAR Cheyenne machine), Tested on Hera (Tested successfully on Hera machine), Tested on MacOS (Tests ran successfully on MacOS system), Tested on WCOSS (Successfully tested on WCOSS machine)
Development

Successfully merging this pull request may close these issues.

Workflow does not work on MacOS due to bash and UNIX utility differences
5 participants