evp1d failures #623

Closed
apcraig opened this issue Aug 5, 2021 · 9 comments · Fixed by #624

Comments

@apcraig
Contributor

apcraig commented Aug 5, 2021

This is a follow up to #568 and #279.

The QC test with evp1d failed after 4 years with the following error:

(zap_snow_temperature)Tmin: -100.000000000000
(zap_snow_temperature)Tmax: 4.486434370863951E-006
(zap_snow_temperature)zqsn: -182365602.580354
(zap_snow_temperature)zap_snow_temperature: temperature out of bounds!
(zap_snow_temperature)k: 1
(zap_snow_temperature)zTsn: -103.894744835945
(zap_snow_temperature)Tmin: -100.000000000000
(zap_snow_temperature)Tmax: 5.419524611890995E-006
(zap_snow_temperature)zqsn: -182424769.766085

This needs to be further investigated.

@apcraig
Contributor Author

apcraig commented Aug 5, 2021

Some ideas carried from #568.

If we believe the 2d implementation is robust, then this suggests there may still be a bug in the 1d implementation that's possibly triggered only on timescales of months to years. All our bit-for-bit testing with debug flags has been shorter than a year. We could run QC with debug flags on with 1d and 2d and see what happens.

I guess you can't restart the QC from shortly before the crash with debug flags on, and expect it to crash again. Would that be worth a try?

The first thing I might try is to turn debug on with 1d and 2d and run 5 years, 1 year at a time with restarts, just to see if the models are bit-for-bit throughout and to see if the 1d with debug also fails. Then we might try running evp1d with optimization but threading off. This might provide insight into any OpenMP issues. Depending on what we learn there, we might create a case with a restart just before the failure and then start debugging the actual abort.
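As a rough illustration of the year-by-year bit-for-bit check described above, here is a minimal Python sketch that finds the first output file differing between two run directories. The function names and the `*.nc` file pattern are assumptions for illustration and are not part of the CICE scripts. A raw byte comparison is also stricter than a field comparison, since file metadata such as timestamps can differ even when the data agree.

```python
import hashlib
from pathlib import Path


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def first_divergence(dir_a, dir_b, pattern="*.nc"):
    """Return the first file name (in sorted order) that is missing from
    dir_b or whose bytes differ from dir_a, or None if all files match."""
    for path_a in sorted(Path(dir_a).glob(pattern)):
        path_b = Path(dir_b) / path_a.name
        if not path_b.exists() or file_digest(path_a) != file_digest(path_b):
            return path_a.name
    return None
```

Calling `first_divergence(dir_2d, dir_1d)` after each one-year segment would show in which segment, if any, the two solvers stop agreeing.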

Is this the type of error you get for CFL violations in CESM? I'm wondering if the 1D evp QC test just happens to be hitting one of these, and the 2D case barely misses it. (The "incremental" part of incremental remap assumes that ice moves no farther than 1 grid cell in 1 time step.) If restarting the 1D case with a reduced timestep runs through this point, CFL could be the culprit -- there might not be anything wrong with the 1D evp implementation at all, just unlucky.
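To make the CFL idea concrete, here is a simplified one-dimensional sketch of the "no farther than 1 grid cell in 1 time step" condition. The function names and the sample speed, time step, and grid spacing are hypothetical values chosen for illustration. They are not taken from the QC configuration, and the actual remapping check in the model may be more involved.

```python
def cfl_number(speed, dt, dx):
    """Fraction of a grid cell crossed in one time step: |u| * dt / dx."""
    return abs(speed) * dt / dx


def violates_remap_limit(speed, dt, dx):
    """True if ice would move farther than one grid cell in one time step."""
    return cfl_number(speed, dt, dx) > 1.0


# Hypothetical values: 3 m/s ice speed, 1-hour time step, 10 km grid cell.
print(violates_remap_limit(3.0, 3600.0, 10000.0))  # True  (CFL = 1.08)
# Halving the time step halves the CFL number.
print(violates_remap_limit(3.0, 1800.0, 10000.0))  # False (CFL = 0.54)
```

This is why rerunning through the failure point with a reduced time step is a useful diagnostic: if the abort disappears, a CFL violation becomes the more likely explanation.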

@apcraig
Contributor Author

apcraig commented Aug 5, 2021

One other thing to add from #568: the QC error with 1d evp was a robust failure with both 9x4 (threaded) and 36x1 (no threading), which suggests OpenMP is not the problem.

@apcraig
Contributor Author

apcraig commented Aug 5, 2021

While we're thinking about evp1d, we should also add some additional tests, like evp1d + revised evp.

@TillRasmussen
Contributor

I can run that tonight if there is room on the HPC.

@TillRasmussen
Contributor

> While we're thinking about evp1d, we should also add some additional tests, like evp1d + revised evp.

I have run all tests with revised evp for 1d and 2d. All tests pass for both intel and gnu with full debug flags on.

@apcraig
Contributor Author

apcraig commented Aug 11, 2021

The same failure occurs when testing #621 with evp1d off, so this looks like a bigger problem than evp1d. All tests fail at the end of 2008 with cycling 2005 data. I think this may be a calendar issue: the QC tests have leap years on, but we're cycling 2005 data. We probably need to turn leap years off for QC tests from now on. This problem may have come in when we cleaned up the time manager. I will do additional testing.

@apcraig
Contributor Author

apcraig commented Aug 12, 2021

I have this figured out and will be submitting a PR for it. It turns out the problem is with the forcing and model calendar being out of sync with leap years. The errors were not limited to the evp1d runs, but that didn't become clear until today.
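A minimal sketch of why a leap-year model calendar and cycled 2005 forcing can fall out of sync, assuming the run starts in 2005 (consistent with a failure after 4 years, at the end of 2008). The function and constant names are hypothetical. The sketch only compares calendar lengths and does not reproduce how the model reads its forcing.

```python
import calendar

FORCING_YEAR = 2005  # the single year of forcing data being cycled
FORCING_DAYS = 366 if calendar.isleap(FORCING_YEAR) else 365  # 365 for 2005


def first_unmatched_day(start_year, n_years):
    """Return (year, day_of_year) for the first model day that has no
    counterpart in the cycled forcing year, or None if there is none."""
    for year in range(start_year, start_year + n_years):
        model_days = 366 if calendar.isleap(year) else 365
        if model_days > FORCING_DAYS:
            return year, FORCING_DAYS + 1
    return None


print(first_unmatched_day(2005, 5))  # (2008, 366)
```

With leap years on, the first model day without a matching forcing day is day 366 of 2008, which lines up with the tests failing at the end of 2008. With leap years off, every model year has 365 days and the mismatch does not arise.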

@TillRasmussen
Contributor

Perfect.

@apcraig
Contributor Author

apcraig commented Aug 12, 2021

I reran the QC tests with evp1d and the calendar fix, and everything passes. The base uses the standard_2d evp solver and the test uses the shared_mem_1d solver. I tested with OpenMP both on and off for evp1d.

> ./cice.t-test.py /glade/scratch/tcraig/CICE_RUNS/cheyenne_intel_smoke_gx1_36x1_medium_qc.qc_evp1d_base /glade/scratch/tcraig/CICE_RUNS/cheyenne_intel_smoke_gx1_36x1_evp1d_medium_qc.qc_evp1d_test
INFO:__main__:Running QC test on the following directories:
INFO:__main__:  /glade/scratch/tcraig/CICE_RUNS/cheyenne_intel_smoke_gx1_36x1_medium_qc.qc_evp1d_base
INFO:__main__:  /glade/scratch/tcraig/CICE_RUNS/cheyenne_intel_smoke_gx1_36x1_evp1d_medium_qc.qc_evp1d_test
INFO:__main__:Number of files: 1825
INFO:__main__:2 Stage Test Passed
INFO:__main__:Quadratic Skill Test Passed for Northern Hemisphere
INFO:__main__:Quadratic Skill Test Passed for Southern Hemisphere
WARNING:__main__:Error loading necessary Python modules in plot_data function
WARNING:__main__:Error loading necessary Python modules in plot_data function
WARNING:__main__:Error loading necessary Python modules in plot_data function
INFO:__main__:
INFO:__main__:Quality Control Test PASSED

> ./cice.t-test.py /glade/scratch/tcraig/CICE_RUNS/cheyenne_intel_smoke_gx1_36x1_medium_qc.qc_evp1d_base /glade/scratch/tcraig/CICE_RUNS/cheyenne_intel_smoke_gx1_9x4_evp1d_medium_qc.qc_evp1d_test
INFO:__main__:Running QC test on the following directories:
INFO:__main__:  /glade/scratch/tcraig/CICE_RUNS/cheyenne_intel_smoke_gx1_36x1_medium_qc.qc_evp1d_base
INFO:__main__:  /glade/scratch/tcraig/CICE_RUNS/cheyenne_intel_smoke_gx1_9x4_evp1d_medium_qc.qc_evp1d_test
INFO:__main__:Number of files: 1825
INFO:__main__:2 Stage Test Passed
INFO:__main__:Quadratic Skill Test Passed for Northern Hemisphere
INFO:__main__:Quadratic Skill Test Passed for Southern Hemisphere
WARNING:__main__:Error loading necessary Python modules in plot_data function
WARNING:__main__:Error loading necessary Python modules in plot_data function
WARNING:__main__:Error loading necessary Python modules in plot_data function
INFO:__main__:
INFO:__main__:Quality Control Test PASSED

Once the fix is merged to master, we can close this issue.
