C384 P7 coupled tests fail on cheyenne #698
Comments
There is an awful number of NaNs in that stack trace. Let's try to figure it out today between my meetings. |
While testing restarts (i.e., no waves) for the upcoming RTs for the coupled model with the P7 configuration, I have been able to run the c192 coupled P7 test on Cheyenne intel (non-wave). However, the c384 P7 case (non-wave) still fails with the MPT error. |
The C384 P7 coupled cases still fail on Cheyenne for PR #765. We now have a standalone P7 test so I will make a C384 version of this to see if the error is reproducible. |
I created a |
Re-opening. |
I've been able to run the cpld_control_c384_p7 test on cheyenne.intel by turning off merra2. In input.nml: iaer = 1011 -> iaer = 5111 |
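As a rough illustration of the workaround above, the `iaer` switch can be scripted rather than edited by hand. This is only a sketch: the helper name `set_iaer`, the namelist group `&example_nml`, and the `other` entry are hypothetical, and only the `iaer = 1011 -> iaer = 5111` change comes from the comment.

```python
import re


def set_iaer(namelist_text, new_value):
    """Return namelist_text with each `iaer = <integer>` assignment set to new_value."""
    pattern = re.compile(r"^(\s*iaer\s*=\s*)\d+", flags=re.MULTILINE | re.IGNORECASE)
    # Keep the original indentation and spacing, swap only the integer value.
    return pattern.sub(lambda m: m.group(1) + str(new_value), namelist_text)


# Hypothetical namelist fragment; a real input.nml has many more entries.
sample = "&example_nml\n  iaer = 1011\n  other = 1\n/\n"
patched = set_iaer(sample, 5111)
print(patched)
```

In practice the function would be applied to the text of input.nml in a copied run directory, so the original test setup stays untouched.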
Are the merra2 aerosol climatology files in the Cheyenne input directory?
|
Yes, I have the aeroclim.m*.nc files in the run directory. |
The input directories used by the regression tests are identical. If the data comes from elsewhere, then almost certainly Cheyenne doesn't have it.
|
I went back and checked the commit history. The test first failed after we converted the v16 test to a v16_p7b test. |
We should keep in mind that the Intel compiler is significantly newer than on any of the other platforms (2021.2). Does it run with GNU?
|
I first found that turning it off w/ gnu-debug worked. I will test w/ the non-debug version. |
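I used both intel and gnu in non-debug mode. First I ran the standard cpld_control_c384_p7 test and they both failed with the MPT error. I copied the run directories and changed iaer to 5111. Both cases ran. The run directories are on cheyenne: /glade/scratch/worthen/FV3_RT/c384_test_gnu and c384_test_intel
The gnu case timed out but did run all the way to fh=6. |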
Can you copy a run directory from hera to cheyenne and change the executable/job_card/modulefile to see if it runs?
|
It appears that increasing the memory available on Cheyenne allows the I made this change in the job_card:
I'm not sure that is entirely the right way but it does complete the 6 hours for intel. |
I believe the issue is the memory footprint of the ingested Merra2 data. The values are stored in the files as float (r4), but when read in they are immediately promoted to double precision. They are then interpolated in time and space for use by the ATM. If the ingested values are kept as single precision and promoted to double precision only when they are interpolated, the model runs on Cheyenne with the default resources. I have tested the following change on both cheyenne.intel and hera.intel. On hera.intel all baselines pass against
In testing, I found that the |
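The memory argument above (keep the ingested values as r4 and promote them only at interpolation time) can be sketched in a few lines. This is an illustration, not the model's Fortran code: the field length `n`, the sample values, and the helper names are hypothetical, and the standard-library `array` module stands in for the real storage.

```python
from array import array


def stored_bytes(n_values, typecode):
    """Bytes needed to hold n_values in a packed array of the given typecode."""
    return n_values * array(typecode).itemsize


def interpolate_in_time(before, after, weight):
    """Linear interpolation between two time levels.

    Python floats are double precision, so the single-precision inputs are
    promoted here, one element at a time, and the result is stored as r8.
    """
    return array("d", (b + weight * (a - b) for b, a in zip(before, after)))


# Hypothetical field length, chosen only for illustration.
n = 1000
month_a = array("f", [1.0] * n)  # kept as r4, as stored in the file
month_b = array("f", [3.0] * n)

r4_bytes = stored_bytes(n, "f")
r8_bytes = stored_bytes(n, "d")
result = interpolate_in_time(month_a, month_b, 0.25)
print(r4_bytes, r8_bytes, result[0])
```

Holding the ingested time levels as r4 halves their storage relative to promoting them on read, while the interpolated field handed to the ATM is still double precision.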
* Workflow in python starting to work.
* Use new python_utils package structure.
* Some bug fixes.
* Use uppercase TRUE/FALSE in var_dfns
* Use config.sh by default.
* Minor bug fixes.
* Remove config.yaml
* Update to the latest develop
* Remove quotes from numbers in predef grid.
* Minor bug fix.
* Move validity checker to the bottom of setup
* Add more unit tests.
* Update with python_utils changes.
* Update to latest develop additions (Need to re-run regression test)
* Use set_namelist and fill_jinja_template as python functions.
* Replace sed regex searches with python re.
* Use python realpath.
* Construct settings as dictionary before passing to fill_jinja and set_namelist
* Use yaml for setting predefined grid parameters.
* Use xml parser for ccpp phys suite definition file.
* Remove more run_command calls.
* Simplify some func argument processing.
* Move different config format parsers to same file.
* Use os.path.join for the sake of macosx
* Remove remaining func argument processing via os.environ.
* Minor bug fix in set_extrn_mdl_params.sh
* Add suite defn in test_data.
* Minor fixes on unittest on jet.
* Simplify boolean condition checks.
* Include old in renaming of old directories
* Fix conflicting yaml !join tag for paths and strings.
* Bug fix with setting sfcperst dict.
* Imitate "readlink -m" with os.path.realpath instead of os.readlink
* Don't use /tmp as that is shared by multiple users.
* Bug fix with cron line, maintain quotes around TRUE/FALSE.
* Update to latest develop (untested)
* Bug fix with existing cron line and quotes.
* Bug fix with case-sensitive MACHINE name, and empty EXPT_DIR.
* Update to latest develop
* More updates.
* Bug fix thanks to @willmayfield! Check both starting/ending characters are brackets for shell variable to be considered an array.
* Make empty EXPT_BASEDIR workable.
* Update to latest develop
* Update in predef grid.
* Check f90nml as well.

Co-authored-by: Daniel Abdi <dabdi@Orion-login-2.HPC.MsState.Edu>
Description
This job fails to run at startup with an error:
MPT: shepherd terminated: r5i4n4.ib0.cheyenne.ucar.edu - job aborting
To Reproduce:
Cheyenne.intel
Additional context
Found during testing for PR #639. The test was turned off for cheyenne.intel until the issue can be resolved.
Since the wave model cannot be compiled in debug mode, an equivalent test was created for a non-wave bmark_p7b configuration. The test can be accessed using rt.test in this branch
When running in debug mode, the model fails with the following: