Update develop branch for generic linux, MacOS capability #539
RUN_CMD_FCST="mpirun -np \${PE_MEMBER01}"
RUN_CMD_POST="mpirun -np 1"
#
#-----------------------------------------------------------------------
What happens on Linux/MacOS when someone doesn't have MPI installed?
@JeffBeck-NOAA These are default commands; they should be overwritten by users in their "config.sh" file if they want to use something else.
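As a rough illustration of the override described above, the sketch below sets the two defaults from the diff and then replaces them the way a user's config.sh might. The replacement values, the `PE_MEMBER01=4` setting, and the `eval` step used to expand the deferred variable are all assumptions for illustration; the workflow's actual expansion mechanism is not shown in this thread.

```shell
# Defaults as shown in the diff above; \$ defers expansion of PE_MEMBER01.
RUN_CMD_FCST="mpirun -np \${PE_MEMBER01}"
RUN_CMD_POST="mpirun -np 1"

# Hypothetical user overrides, standing in for lines in a user's config.sh.
RUN_CMD_FCST="mpiexec -n \${PE_MEMBER01}"   # a different launcher
RUN_CMD_POST=""                              # no launcher: run the executable directly

# One possible way to expand the deferred variable once the task count is known
# (the value 4 is made up for this example).
PE_MEMBER01=4
eval "fcst_cmd=\"${RUN_CMD_FCST}\""

echo "forecast launcher: ${fcst_cmd}"
echo "post launcher: '${RUN_CMD_POST}'"
```

Because the later assignment simply wins, the defaults never need to be edited in place.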
"LINUX")
APRUN=$RUN_CMD_POST
;;

*)
Just so I understand, these $RUN_* variables are required for MacOS when mpirun is called?
@JeffBeck-NOAA These are the run command for your individual machine that goes before the calling of the executable. They can be "mpirun", "mpiexec", "weird_proprietary_run_command", or even blank if you're okay just running the executable directly.
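To make the "prefix before the executable" idea concrete, here is a minimal sketch. The helper name `run_with_prefix` is hypothetical and does not come from regional_workflow; `echo` and `env` stand in for a real executable and a real launcher.

```shell
# Hypothetical helper: run a command behind an optional launcher prefix.
run_with_prefix() {
  local run_cmd="$1"
  shift
  # Unquoted on purpose: the prefix is split into words, and an empty
  # prefix disappears so the executable is run directly.
  ${run_cmd} "$@"
}

# Blank prefix: the command runs on its own.
run_with_prefix "" echo "ran directly"

# Non-blank prefix: the command runs behind the launcher.
run_with_prefix "env FOO=1" echo "ran behind a launcher"
```

The same call site therefore works whether the prefix is "mpirun -np 1", a site-specific launcher, or empty.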
@mkavulich, I ran one WE2E test on the WCOSS dell with your branch. Although all the tasks were 'SUCCEEDED', the 'Workflow status' remained 'IN PROGRESS' with an error message in the 'launch_FV3LAM_wflow.sh' as follows:
… limited function on the old MACOS bash (readlink, ln, sed, and date) with a variable, and add a file (ush/bash_utils/define_macos_utilities.sh) to set these variables appropriately for the compatible MACOS utilities if the OS is Darwin.
…owercase with custom functions that are back-compatible with MACOS
- Add WORKFLOW_MANAGER variable for specifying machines that support rocoto (and maybe other workflow managers in the future) - Add run commands for specifying mpirun commands on non-rocoto machines
…all to $READLINK prior to it being defined;
…ot quit unless you manually check for non-zero exit status
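The commit notes above mention replacing direct calls to readlink, ln, sed, and date with variables that are set per operating system. A minimal sketch of that pattern follows. Only `READLINK` is named in this thread; the other variable names are hypothetical, and the g-prefixed command names assume GNU versions of the tools are installed on the Mac. None of this is taken from define_macos_utilities.sh itself.

```shell
# Pick utility names by OS. Assumption: on Darwin, GNU versions of the
# tools are available under g-prefixed names.
if [ "$(uname -s)" = "Darwin" ]; then
  READLINK=greadlink
  SED=gsed         # hypothetical variable name
  DATE_UTIL=gdate  # hypothetical variable name
  LN_UTIL=gln      # hypothetical variable name
else
  READLINK=readlink
  SED=sed
  DATE_UTIL=date
  LN_UTIL=ln
fi
export READLINK SED DATE_UTIL LN_UTIL

# Callers then go through the variable instead of the bare command name.
echo "paths will be resolved with: ${READLINK}"
```

A variable used this way has to be defined before its first use, which is the ordering problem one of the commit notes above refers to.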
@mkavulich, WE2E tests failed on the 'run_fcst' task again. I copied the log files on Hera:/scratch2/NCEPDEV/fv3-cam/Chan-hoo.Jeon/log
@chan-hoo please try to pull mkavulich#1 into regional_workflow and try again (from
@climbfuji, it doesn't work. Its log files are copied to: /scratch2/NCEPDEV/fv3-cam/Chan-hoo.Jeon/log_dom/
…ug_fixes_dom_20210916 Update develop for macos linux bug fixes Dom 20210916
This is something for the regional workflow people to look at; these log files are so cryptic that I have no idea what is going on. You may have to copy over the entire experiment directory. Note that I do have access to wcoss, so if you need me to look at something (other than regional_workflow logs) in the future, no need to copy to hera.
@chan-hoo Sorry, it looks like there is still a problem with generating the XML file with the newest changes. I am working on it.
That at least explains why I didn't have any problems using the standalone wrappers on macOS.
… how these things work
…ould always be TRUE if we are in this task
Initialize `sysdir` to an empty string to avoid "unbound variable" problems
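The "unbound variable" problem in the last commit note arises when a script runs with `set -u` and tests a variable that was never assigned. A small sketch of the fix, where the fallback path is made up purely for illustration:

```shell
set -u   # treat any reference to an unset variable as an error

# Without this line, the test below would abort with "unbound variable".
sysdir=""

if [ -z "${sysdir}" ]; then
  # Hypothetical fallback value, not a path from the workflow.
  sysdir="/hypothetical/default"
fi

echo "sysdir=${sysdir}"
```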
Three WE2E tests with RRFS_CONUS_25km,13km,3km were completed successfully on the wcoss dell and cray.
Okay, with lots of help from @climbfuji and @chan-hoo, we have completed a pretty comprehensive battery of tests on these changes:
Cheyenne (intel)
Hera (intel)
Other
Excellent! Now we only need an update for
These changes are all consistent with PR ufs-community#539 (merged) and PR ufs-community#617 (under review) for the develop branch of the NOAA-EMC:regional_workflow repository. This should enable the use of a generic linux machine with Rocoto for all jobs that exist in the current develop branch. More work is needed to tackle those tasks that do not yet exist in the develop branch, but exist here.
DESCRIPTION OF CHANGES:
This change will add the capability to run regional_workflow (as part of the SRW app) on MacOS and generic LINUX platforms. Most of these changes are identical to those in #402 (hash bc08607) but some additional modifications needed to be made due to intervening changes in the develop branch.
There are still a few issues that may need resolving for MacOS platforms. Currently UFS_UTILS and post tasks will occasionally fail on MACOS when using multiple processors (mpirun -np 1 always succeeds). These problems likely need to be addressed outside of regional_workflow, so I don't believe that should hold up this PR.
TESTS CONDUCTED:
Ran Graduate Student Test on new platforms:
Ran suite of end-to-end tests on Cheyenne (intel/19.1.1) and Hera (intel/18.0.5.274). All passed as expected:
This has not been tested on WCOSS yet; it probably should be as some of the changed stanzas are specific to the WCOSS platform.
DEPENDENCIES:
This can be merged as-is; however, it will not work on MACOS platforms until ufs-community/UFS_UTILS#545 is merged
DOCUMENTATION:
Documentation pending (currently in Google doc: https://docs.google.com/document/d/14jlvL3nOi85NJCWSNnhHjjBsrAsR-mr8JfGFvVIybFI/edit)
ISSUE:
Will resolve #369