
evp kernel version 2 testing and validation #279

Closed
apcraig opened this issue Feb 2, 2019 · 13 comments · Fixed by #568

apcraig commented Feb 2, 2019

We are going to merge PR #278 and PR #252. There are several outstanding issues, basically copied from the end of #252:


Let me summarize where we are.

With evp_kernel_ver=0, results are bit-for-bit for most tests against the current master. This is running full test suites on gordon for 4 compilers. A subset of box tests are NOT bit-for-bit on 3 of the 4 compilers. Rerunning the failed box tests with the debug flag (reduced optimization and run-time checks) on both master and this PR produces bit-for-bit identical answers. It seems the changed answers in the box tests are caused by compiler optimization interacting with the code changes. This might be associated with the evp kernel changes (although @mhrib makes a case it shouldn't be) or with some of the code cleanup. We could look into this further or we could accept it. Personally, I am comfortable with this outcome as it stands. I believe we've shown the answers are roundoff-level different (see the gbox128 diff above) as a result of compiler optimization, and that we can make this bit-for-bit if we reduce compiler optimization. Based on these results, I think we could merge this PR. evp_kernel_ver=0 will be the default setting.
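For reference, the kernel is selected through the namelist in ice_in. A minimal sketch, assuming the switch sits in the dynamics_nml group alongside kdyn (the kdyn entry is shown only for context):

    &dynamics_nml
      kdyn           = 1   ! EVP dynamics
      evp_kernel_ver = 0   ! 0 = standard evp (default), 2 = 1d vectorized kernel
    /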

Separately, there is an effort to test and validate evp_kernel_ver=2. The same test suite on gordon was run with the new kernel on. Results can be found at https://github.com/CICE-Consortium/Test-Results/wiki/cice_by_hash_forks, hash aa6de33...+evpk=2. Three to four tests fail on each compiler, and they are the same tests across the compilers. Looking at the intel results, https://github.com/CICE-Consortium/Test-Results/wiki/aa6de33f19.gordon.pgi.190128.235649, there are four failures:

  • restart gbox128 4x2. This test runs but fails to restart exactly.
  • restart gx1 40x4 droundrobin medium. This test fails with "(abort_ice) error = (horizontal_remap)ERROR: bad departure points" on the first timestep.
  • restart gx3 16x2x5x10x20 drakeX2. This test fails with "(abort_ice) error = (horizontal_remap)ERROR: bad departure points" on the first timestep.
  • restart tx1 40x4 dsectrobin medium. This test fails gracefully in the evp kernel. tx1 is not supported yet.

Again, many tests passed, but these 4 failures need to be debugged. In addition, the qc test relies on the gx1 configuration, so the qc testing comparing evp_kernel_ver=2 to 0 could not be done.

So, the outstanding tasks are:

  • debug the 4 failures noted above
  • run the qc test comparing evp_kernel_ver=0 to evp_kernel_ver=2. This requires gx1 (one of the failing tests)
  • update documentation
  • change evp_kernel_ver variable to kevp_kernel
  • produce and document timing information comparing evp_kernel_ver=0 and 2.
  • add evp_kernel_ver=2 tests to the test suite
  • maybe do a little cleanup on ice_dyn_evp_1d.F90 to make the code a little more readable (breaks between subroutines and such)

apcraig commented May 15, 2019

Some more information on the test failures with evp_kernel_ver=2, running just with the intel compiler for now (other compilers have similar issues).

  • conrad_intel_restart_gbox128_4x2. This runs fine but fails exact restart. The same case with debug set runs very slowly and times out. If I run 8x1 (no openmp) instead of 4x2, it passes. If I comment out all the OMP in ice_dyn_evp_1d.F90 and run 4x2, this case passes.

  • conrad_intel_restart_gx3_16x2x5x10x20_drakeX2 fails on the first timestep with the "bad departure points error". The same case with debug set fails the same way. If I comment out all the OMP in ice_dyn_evp_1d.F90, then this case still fails. If I use 32x1x5x10x20 (no openmp), the case fails the same way.

  • conrad_intel_restart_gx1_40x4_droundrobin_medium fails on the first timestep with "bad departure points error". Other gx1 cases pass. The same case with debug set fails the same way. If I run with 160x1 (no openmp), the case fails the same way.

There are also some cases (which I haven't reported on before) that fail the bit-for-bit comparison between cases whose results should be identical. These are comparisons of the same model, just with different decompositions and so forth.

  • FAIL conrad_intel_smoke_gx3_4x1_diag1_run5day_thread bfbcomp conrad_intel_smoke_gx3_8x2_diag1_run5day different-data
  • FAIL conrad_intel_restart_gx3_1x1x50x58x4_droundrobin_thread bfbcomp conrad_intel_restart_gx3_4x2x25x29x4_dslenderX2 different-data
  • FAIL conrad_intel_restart_gx3_4x1x25x116x1_dslenderX1_thread bfbcomp conrad_intel_restart_gx3_4x2x25x29x4_dslenderX2 different-data
  • FAIL conrad_intel_restart_gx3_1x20x5x29x80_dsectrobin_short bfbcomp conrad_intel_restart_gx3_4x2x25x29x4_dslenderX2 different-data
  • FAIL conrad_intel_restart_gx3_1x4x25x29x16_droundrobin bfbcomp conrad_intel_restart_gx3_4x2x25x29x4_dslenderX2 different-data
  • FAIL conrad_intel_logbfb_gx3_1x20x5x29x80_diag1_dsectrobin_reprosum_short bfbcomp conrad_intel_logbfb_gx3_4x2x25x29x4_diag1_dslenderX2_reprosum different-data

If I run these tests with the OMP commented out in ice_dyn_evp_1d.F90, then they pass.

So OMP may explain some of it, but it does not explain all of it. I can get some of the failed tests to run if I comment out all of the OMP in ice_dyn_evp_1d.F90, but that is not going to be acceptable longer term for evp_kernel_ver=2, since the kernel relies on OpenMP alone for performance.

There are also a couple of other tests that still fail with evp_kernel_ver=2 even when OpenMP is turned off entirely (by selecting pe layouts with 1 thread per task, which does not invoke the OpenMP compiler option), while the same tests run fine with evp_kernel_ver=0.

I do not expect evp_kernel_ver=2 to get bit-for-bit results with ver=0, although maybe it should. At this point, I am not testing that. The first step is to make sure all the tests that pass with ver=0 also pass with ver=2.

Thoughts?


apcraig commented May 30, 2019

#318 addresses several of the outstanding points to some degree:

  • update documentation
  • change evp_kernel_ver variable to kevp_kernel
  • maybe do a little cleanup on ice_dyn_evp_1d.F90 to make the code a little more readable (breaks between subroutines and such)


apcraig commented May 30, 2019

#318 also documents that the current implementation is not validated and the code will abort when kevp_kernel=2. As a workaround for testing, kevp_kernel=102 will turn on version=2. Once this is validated, we will remove the abort and the workaround.
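For anyone trying this, a minimal sketch of the corresponding ice_in entry, under the same dynamics_nml assumption as above and with the renamed variable:

    &dynamics_nml
      kdyn        = 1     ! EVP dynamics
      kevp_kernel = 102   ! testing workaround, runs kernel version 2
    /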


apcraig commented Sep 5, 2019

In nag testing, a separate issue was discovered.

When I add -nan to the compile, I get the following error:

    /home/tcraig/cice-consortium/cice.nagclean/cicecore/cicedynB/dynamics/ice_dyn_evp_1d.F90: In function ‘dmi_omp_MP_domp_get_domain_rlu’:
    /home/tcraig/cice-consortium/cice.nagclean/cicecore/cicedynB/dynamics/ice_dyn_evp_1d.F90:110:58: error: ‘dmi_omp_MP_rdomp_iamTPI’ undeclared (first use in this function)
    subroutine domp_get_domain_rlu(lower,upper,d_lower,d_upper)
    ^
    /home/tcraig/cice-consortium/cice.nagclean/cicecore/cicedynB/dynamics/ice_dyn_evp_1d.F90:110:58: note: each undeclared identifier is reported only once for each function it appears in
    /home/tcraig/cice-consortium/cice.nagclean/cicecore/cicedynB/dynamics/ice_dyn_evp_1d.F90:110:57: error: ‘dmi_omp_MP_rdomp_ntTPI’ undeclared (first use in this function)
    subroutine domp_get_domain_rlu(lower,upper,d_lower,d_upper)
    ^
    /home/tcraig/cice-consortium/cice.nagclean/cicecore/cicedynB/dynamics/ice_dyn_evp_1d.F90: In function ‘dmi_omp_MP_domp_initMP__omp459’:
    /home/tcraig/cice-consortium/cice.nagclean/cicecore/cicedynB/dynamics/ice_dyn_evp_1d.F90:70:57: error: ‘dmi_omp_MP_rdomp_ntTPI’ undeclared (first use in this function)
    !$OMP PARALLEL DEFAULT(none)
    ^
    /home/tcraig/cice-consortium/cice.nagclean/cicecore/cicedynB/dynamics/ice_dyn_evp_1d.F90:70:58: error: ‘dmi_omp_MP_rdomp_iamTPI’ undeclared (first use in this function)
    !$OMP PARALLEL DEFAULT(none)
    ^
    /home/tcraig/cice-consortium/cice.nagclean/izumi_nag_smoke_gx3_8x1_debug_diag1_run1day_thread.chk02/Makefile:144: recipe for target 'ice_dyn_evp_1d.o' failed
    gmake: *** [ice_dyn_evp_1d.o] Error 1

I spent a few minutes trying to understand this. It only happens when threading is enabled and -nan is set. The error went away if I removed the threadprivate OMP definition at line 51 and changed the OMP PARALLEL DEFAULT(none) statement on line 70 to OMP PARALLEL PRIVATE(domp_iam,domp_nt,rdomp_iam,rdomp_nt).
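For clarity, here is a minimal, self-contained sketch of that workaround. The module below only illustrates the directive change; the variable names come from the error messages and the description above, and everything else is a simplified stand-in for the actual DMI code in ice_dyn_evp_1d.F90, not a copy of it:

    module dmi_omp_sketch
       ! Thread bookkeeping for the 1d evp kernel: each thread records its id
       ! and the thread count, later used to carve out its slice of the 1d
       ! index space.
       implicit none
       integer          :: domp_iam = 0, domp_nt = 1
       double precision :: rdomp_iam = 0.d0, rdomp_nt = 1.d0
       ! Original code (around line 51) marks these THREADPRIVATE; the
       ! workaround removes the directive:
       ! !$OMP THREADPRIVATE(domp_iam, domp_nt, rdomp_iam, rdomp_nt)
    contains
       subroutine domp_init()
    !$     use omp_lib, only: omp_get_thread_num, omp_get_num_threads
          ! Original directive (around line 70), which fails to build with
          ! nag when threading and -nan are both enabled:
          ! !$OMP PARALLEL DEFAULT(none)
          ! Workaround directive:
    !$OMP PARALLEL PRIVATE(domp_iam, domp_nt, rdomp_iam, rdomp_nt)
    !$     domp_iam  = omp_get_thread_num()
    !$     domp_nt   = omp_get_num_threads()
    !$     rdomp_iam = dble(domp_iam)
    !$     rdomp_nt  = dble(domp_nt)
    !$OMP END PARALLEL
          ! Note: with PRIVATE, each thread sets only its private copy, and
          ! those values are not visible after the parallel region ends; this
          ! differs from THREADPRIVATE semantics and relates to the lingering
          ! correctness concern mentioned below.
       end subroutine domp_init
    end module dmi_omp_sketch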

At this time, I think we have to assume this is a compiler bug. It doesn't make a lot of sense that the -nan compiler argument breaks a threadprivate implementation. On the other hand, it would be nice to be able to use the -nan option, and I do have some lingering concerns about whether the thread implementation in evp_1d is completely correct.


apcraig commented Sep 5, 2019

Also want to note the fix to the kind types in ice_dyn_evp_1d.F90, prompted by nag testing, that was part of #356.
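The details are in #356; purely as an illustration of the pattern such nag-driven fixes usually follow, hard-coded kind values get replaced with the kinds CICE already defines. This sketch assumes ice_kinds_mod provides int_kind and dbl_kind (it does in the consortium code); the subroutine itself is made up:

    subroutine evp_kind_sketch(n, strength, stress)
       use ice_kinds_mod, only: int_kind, dbl_kind
       implicit none
       integer (kind=int_kind), intent(in)    :: n
       real    (kind=dbl_kind), intent(in)    :: strength(n)
       real    (kind=dbl_kind), intent(inout) :: stress(n)
       integer (kind=int_kind) :: i
       do i = 1, n
          ! literals carry an explicit kind (0.5_dbl_kind) rather than a
          ! hard-coded real*8 / kind=8, which nag numbers differently by default
          stress(i) = 0.5_dbl_kind * (stress(i) + strength(i))
       end do
    end subroutine evp_kind_sketch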

apcraig reopened this Sep 5, 2019
@TillRasmussen

@apcraig The tx1 test case is set to exit within the 1d solver because the conversion from 2d to 1d and back is not tested for tripole grids.


mhrib commented Oct 26, 2020

@TillRasmussen and @apcraig: The 1d solver will not work correctly for tripole grids (it will run but gives wrong results at the "tripole boundaries"). That's why an exit with an error is implemented for tripole grids.
If it has to work for tripole grids, the subroutine calc_halo_parent will have to be modified.
Rgs
Mads


apcraig commented Oct 26, 2020

Tripole is not the issue; it's fine that it's not working for those grids. I think I knew that and did not raise it as a concern. All the documented problems are for non-tripole grids; gx1 and gx3 are not tripole grids.

@TillRasmussen

@apcraig @mhrib I agree it should work for the others (gx1 and gx3).


apcraig commented Oct 26, 2020

I just ran a full test suite with kevp_kernel = 102 set in ice_in on cheyenne with the intel compiler. I compared it to this weekend's full test suite, which passed all tests. The evp1d results are here,

https://github.com/CICE-Consortium/Test-Results/wiki/922b998005.cheyenne.intel.20-10-26.221713.0

and for comparison, the baseline results are

https://github.com/CICE-Consortium/Test-Results/wiki/922b998005.cheyenne.intel.20-10-25.025008.0

Most of the tests pass, and most produce results that are not bit-for-bit with the baseline. But some tests fail; those are:

FAIL cheyenne_intel_restart_gx1_40x4_droundrobin_medium test 
FAIL cheyenne_intel_restart_tx1_40x4_dsectrobin_medium test 
FAIL cheyenne_intel_restart_gbox128_4x2_short test 
FAIL cheyenne_intel_smoke_gx1_24x1_jra55_gx1_2008_medium_run90day test 
FAIL cheyenne_intel_restart_gx1_24x1_jra55_gx1_short test 
FAIL cheyenne_intel_restart_gx3_16x2x5x10x20_drakeX2 test 
FAIL cheyenne_intel_restart_gx3_1x1x50x58x4_droundrobin_thread bfbcomp cheyenne_intel_restart_gx3_4x2x25x29x4_dslenderX2 different-data
FAIL cheyenne_intel_restart_gx3_4x1x25x116x1_dslenderX1_thread bfbcomp cheyenne_intel_restart_gx3_4x2x25x29x4_dslenderX2 different-data
FAIL cheyenne_intel_restart_gx3_1x20x5x29x80_dsectrobin_short bfbcomp cheyenne_intel_restart_gx3_4x2x25x29x4_dslenderX2 different-data
FAIL cheyenne_intel_restart_gx3_1x4x25x29x16_droundrobin bfbcomp cheyenne_intel_restart_gx3_4x2x25x29x4_dslenderX2 different-data
FAIL cheyenne_intel_logbfb_gx3_1x20x5x29x80_diag1_dsectrobin_reprosum_short bfbcomp cheyenne_intel_logbfb_gx3_4x2x25x29x4_diag1_dslenderX2_reprosum different-data

This is largely consistent with the results from last year. I have not looked into each of the failures in any detail; if you want me to do that, I can. If you need help understanding any of the tests, let me know, and I'm happy to do some additional analysis if that would be helpful.

@eclare108213

Following up: have all of the issues raised here been addressed adequately, up to and including #568?


apcraig commented Mar 14, 2021

I think once #568 has been revised, this issue can be closed. But I believe there are a few things still to address in #568; I will comment there.

apcraig linked a pull request Jul 16, 2021 that will close this issue
apcraig mentioned this issue Aug 5, 2021

apcraig commented Aug 5, 2021

See #623 for follow-up issues.
