Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update cgridDEV to main #8d7314f (CICE6.3.1 release) #64

Merged
merged 23 commits into from
Mar 8, 2022
Merged

Conversation

apcraig
Copy link
Owner

@apcraig apcraig commented Mar 5, 2022

PR checklist

  • Short (1 sentence) summary of your PR:
    Update to main #8d7314f and fix C/CD deformations, taubx/tauby, and OpenMP
  • Developer(s):
    apcraig
  • Suggest PR reviewers from list in the column to the right.
  • Please copy the PR test results link or provide a summary of testing completed below.
    full test suites run on 3 compilers on cheyenne,
    https://github.com/CICE-Consortium/Test-Results/wiki/cice_by_hash_forks#71500b6021c002daba522d75f239ed26bfe37e1a
    Differences vs main are documented below
  • How much do the PR code changes differ from the unmodified code?
    • bit for bit, mostly
    • different at roundoff level
    • more substantial
  • Does this PR create or have dependencies on Icepack or any other models?
    • Yes
    • No
  • Does this PR add any new test cases?
    • Yes
    • No
  • Is the documentation being updated? ("Documentation" includes information on the wiki or in the .rst files from doc/source/, which are used to create the online technical docs at https://readthedocs.org/projects/cice-consortium-cice/. A test build of the technical docs will be performed as part of the PR testing.)
    • Yes
    • No, does the documentation need to be updated at a later time?
      • Yes
      • No
  • Please provide any additional information or relevant details below:

Full test suites run on cheyenne intel, gnu, pgi compared to the main #8d7314f with the following results,

7188 measured results of 7188 total results
6107 of 7188 tests PASSED
0 of 7181 tests PENDING
792 of 7181 tests MISSING data
290 of 7181 tests FAILED
  • C/CD restarts fail, known problem (12)
  • Some unittests are new, fail compare due to missing baseline (15)
  • Many box tests changed answers due to "ai" division of forcing. 40 each for intel, gnu, pgi which suggests it's the change in implementation. (120)
  • dynpicard tests fail omp testing 18x1 /= 6x4, known problem on main VP exact restart and other nonBFB problems CICE-Consortium/CICE#518 (9)
  • Several PGI results are not bit-for-bit on cgridDEV anymore. Intel and gnu do not show the same. This is likely due to a compiler optimization issue in PGI, PGI is very particular. There are cases with debug (20) and cases without debug (80). FYI, 200 PGI regression tests pass vs 100 that do not.
  • There is a problem with the following cases with omp testing. This seems to be new on the branch and needs to be debugged. Problem exists on B grid too but not with intel. (6) Note: further testing suggests boxrestore is not bit-for-bit with different decomps with OMP off, even in B case. Also, these tend to pass on main, but not 100%.
FAIL cheyenne_pgi_smoke_gbox128_24x1_boxrestore_cmplogrest_reprosum_run10day_thread bfbcomp cheyenne_pgi_smoke_gbox128_14x2_boxrestore_reprosum_run10day different-data
FAIL cheyenne_pgi_smoke_gbox128_24x1_boxrestore_cmplogrest_gridc_reprosum_run10day_thread bfbcomp cheyenne_pgi_smoke_gbox128_14x2_boxrestore_gridc_reprosum_run10day different-data
FAIL cheyenne_pgi_smoke_gbox128_24x1_boxrestore_cmplogrest_gridcd_reprosum_run10day_thread bfbcomp cheyenne_pgi_smoke_gbox128_14x2_boxrestore_gridcd_reprosum_run10day different-data
FAIL cheyenne_gnu_smoke_gbox128_24x1_boxrestore_cmplogrest_reprosum_run10day_thread bfbcomp cheyenne_gnu_smoke_gbox128_14x2_boxrestore_reprosum_run10day different-data
FAIL cheyenne_gnu_smoke_gbox128_24x1_boxrestore_cmplogrest_gridc_reprosum_run10day_thread bfbcomp cheyenne_gnu_smoke_gbox128_14x2_boxrestore_gridc_reprosum_run10day different-data
FAIL cheyenne_gnu_smoke_gbox128_24x1_boxrestore_cmplogrest_gridcd_reprosum_run10day_thread bfbcomp cheyenne_gnu_smoke_gbox128_14x2_boxrestore_gridcd_reprosum_run10day different-data
  • There is a problem with the following cases with omp testing. I suspect this is an issue with the OpenMP threading in gridc/gridcd with seabedstress. Needs more debugging. Problem doesn't exist with B grid. (6) Note: further testing confirms no problem with B grid but non bit-for-bit with different decomps with C/CD so may be decomp issue, not OMP.
FAIL cheyenne_intel_smoke_gx1_28x1_cmplogrest_gridc_reprosum_run10day_seabedprob_thread bfbcomp cheyenne_intel_smoke_gx1_15x2_gridc_reprosum_run10day_seabedprob different-data
FAIL cheyenne_intel_smoke_gx1_28x1_cmplogrest_gridcd_reprosum_run10day_seabedprob_thread bfbcomp cheyenne_intel_smoke_gx1_15x2_gridcd_reprosum_run10day_seabedprob different-data
FAIL cheyenne_pgi_smoke_gx1_28x1_cmplogrest_gridc_reprosum_run10day_seabedprob_thread bfbcomp cheyenne_pgi_smoke_gx1_15x2_gridc_reprosum_run10day_seabedprob different-data
FAIL cheyenne_pgi_smoke_gx1_28x1_cmplogrest_gridcd_reprosum_run10day_seabedprob_thread bfbcomp cheyenne_pgi_smoke_gx1_15x2_gridcd_reprosum_run10day_seabedprob different-data
FAIL cheyenne_gnu_smoke_gx1_28x1_cmplogrest_gridc_reprosum_run10day_seabedprob_thread bfbcomp cheyenne_gnu_smoke_gx1_15x2_gridc_reprosum_run10day_seabedprob different-data

  • prod tests are not entirely reliable on main and this is the case on the branch too. We need to look into this on main. Are results not reproducible over long runs ?? (6)
FAIL cheyenne_intel_restart_gx1_32x1_gx1prod complog cice.8d7314f28b.220226-095011 different-data
FAIL cheyenne_intel_restart_gx1_32x1_gx1prod compare cice.8d7314f28b.220226-095011 27.88 6.14 13.00 different-data
FAIL cheyenne_pgi_restart_gx1_32x1_gx1prod complog cice.8d7314f28b.220226-095011 different-data
FAIL cheyenne_pgi_restart_gx1_32x1_gx1prod compare cice.8d7314f28b.220226-095011 47.52 9.97 27.58 different-data
FAIL cheyenne_gnu_restart_gx1_32x1_gx1prod complog cice.8d7314f28b.220226-095011 different-data
FAIL cheyenne_gnu_restart_gx1_32x1_gx1prod compare cice.8d7314f28b.220226-095011 34.21 7.42 17.02 different-data
  • VP implementation now uses capping=1.0 instead of capping=0.0. (16)
FAIL cheyenne_intel_smoke_gx3_4x1_dynpicard complog cice.8d7314f28b.220226-095011 different-data
FAIL cheyenne_intel_smoke_gx3_4x1_dynpicard compare cice.8d7314f28b.220226-095011 3.21 0.95 1.49 different-data
26a29,30
FAIL cheyenne_intel_smoke_gx3_6x4_dynpicard_reprosum_run10day complog cice.8d7314f28b.220226-095011 different-data
FAIL cheyenne_intel_smoke_gx3_6x4_dynpicard_reprosum_run10day compare cice.8d7314f28b.220226-095011 27.66 16.47 4.12 different-data
36a41,42
FAIL cheyenne_intel_smoke_gx3_18x1_cmplogrest_dynpicard_reprosum_run10day_thread complog cice.8d7314f28b.220226-095011 different-data
FAIL cheyenne_intel_smoke_gx3_18x1_cmplogrest_dynpicard_reprosum_run10day_thread compare cice.8d7314f28b.220226-095011 20.18 9.52 4.48 different-data
152a159,160
FAIL cheyenne_pgi_smoke_gx3_6x4_dynpicard_reprosum_run10day complog cice.8d7314f28b.220226-095011 different-data
FAIL cheyenne_pgi_smoke_gx3_6x4_dynpicard_reprosum_run10day compare cice.8d7314f28b.220226-095011 62.66 44.72 11.22 different-data
162a171,172
FAIL cheyenne_pgi_smoke_gx3_18x1_cmplogrest_dynpicard_reprosum_run10day_thread complog cice.8d7314f28b.220226-095011 different-data
FAIL cheyenne_pgi_smoke_gx3_18x1_cmplogrest_dynpicard_reprosum_run10day_thread compare cice.8d7314f28b.220226-095011 28.88 14.66 8.36 different-data
233a244,245
FAIL cheyenne_gnu_smoke_gx3_4x1_dynpicard complog cice.8d7314f28b.220226-095011 different-data
FAIL cheyenne_gnu_smoke_gx3_4x1_dynpicard compare cice.8d7314f28b.220226-095011 4.19 1.49 1.88 different-data
241a254,255
FAIL cheyenne_gnu_smoke_gx3_6x4_dynpicard_reprosum_run10day complog cice.8d7314f28b.220226-095011 different-data
FAIL cheyenne_gnu_smoke_gx3_6x4_dynpicard_reprosum_run10day compare cice.8d7314f28b.220226-095011 33.13 20.40 6.51 different-data
251a266,267
FAIL cheyenne_gnu_smoke_gx3_18x1_cmplogrest_dynpicard_reprosum_run10day_thread complog cice.8d7314f28b.220226-095011 different-data
FAIL cheyenne_gnu_smoke_gx3_18x1_cmplogrest_dynpicard_reprosum_run10day_thread compare cice.8d7314f28b.220226-095011 23.47 12.54 5.31 different-data

apcraig and others added 18 commits November 15, 2021 18:07
…m#665)

* Fix some raketests for izumi

* fix some rake tests
…ICE-Consortium#667)

When 'make' is invoked on the CICE Makefile, the first thing it does is
to try to make the included dependency files (*.d) (which are in fact
Makefiles themselves) [1], in alphabetical order.

The rule to make the dep files have the dependency generator, 'makdep',
as a prerequisite, so when processing the first dep file, make notices
'makdep' does not exist and proceeds to build it. If for whatever reason
this compilation fails, make will then proceed to the second dep file,
notice that it recently tried and failed to build its dependency
'makdep', give up on the second dep file, proceed to the third, and so
on.

In the end, no dep file is produced. Make then restarts itself and
proceeds to build the code, which of course fails catastrophically
because the Fortran source files are not compiled in the right order
because the dependency files are missing.

To avoid that, add a dependency on the dep file to the rules that make
the object file out of the Fortran source files. Since old-fashioned
suffix rules cannot have their own prerequisites [2], migrate the rules
for the Fortran source files to use pattern rules [3] instead. While at
it, also migrate the rule for the C source files.

With this new dependency, the builds abort early, before trying to
compile the Fortran sources, making it easier to understand what
has gone wrong.

Since we do not use suffix rules anymore, remove the '.SUFFIXES' line
that indicates which extension to use suffix rules for (but keep the
line that eliminates all default suffix rules).

[1] https://www.gnu.org/software/make/manual/html_node/Remaking-Makefiles.html
[2] https://www.gnu.org/software/make/manual/html_node/Suffix-Rules.html
[3] https://www.gnu.org/software/make/manual/html_node/Pattern-Rules.html#Pattern-Rules
* update parsing scripts to improve robustness, fix multi-pe advection=none

* Update cice script to improve performance including
minor refactoring of parse_namelist and parse_settings
to reduce cost and ability to use already setup ice_in file
from a prior case in the suite.

Added commented out timing ability in cice.setup.

Change test default to PEND from FAIL.

* fix cice.setup for case

* add sedbak implementation to support Mac sed

* s/spend/spent
* add debug_model feature
* add required variables and calls for tr_snow
* Adding method to write erroneous namelist options

* Remove erroneous comma in abort_ice for namelist check

* Added check for zbgc_nml. I missed that namelist in this file.

* Added space and colons for namelist error output

* Added space and colons for namelist error output

Co-authored-by: David A. Hebert <dhebert@nrlssc.navy.mil>
* updated orbital calculations needed for cesm

* fixed problems in updated orbital calculations needed for cesm

* update CICE6 to support coupling with UFS

* put in changes so that both ufsatm and cesm requirements for potential temperature and density are satisfied

* update icepack submodule

* Revert "update icepack submodule"

This reverts commit e70d1ab.

* update comp_ice.backend with temporary ice_timers fix

* Fix threading problem in init_bgc

* Fix additional OMP problems

* changes for coldstart running

* Move the forapps directory

* remove cesmcoupled ifdefs

* Fix logging issues for NUOPC

* removal of many cpp-ifdefs

* fix compile errors

* fixes to get cesm working

* fixed white space issue

* Add restart_coszen namelist option

* Update NUOPC cap to work with latest CICE6 master

* nuopc,cmeps or s2s build updates

* fixes for dbug_flag

* Update nuopc2 to latest CICE master

* Fix some merge problems

* Fix dbug variable

* Manual merge of UFS changes

* fixes to get CESM B1850 compset working

* refactored ice_prescribed_mod.F90 to work with cdeps rather than the mct data models

* Fix use_restart_time

* changes for creating masks at runtime

* added ice_mesh_mod

* implemented area correction factors as option

* more cleanup

* Fix dragio

* Fix mushy bug

* updates to nuopc cap to resolve inconsistency between driver inputs and cice namelists

* changed error message

* added icepack_warnings_flush

* updates to get ice categories working

* updates to have F compset almost working with cice6 - still problems in polar regions - need to resolve 253K/cice6 versus 273K/cice5 differences

* changed tolerance of mesh/grid comparison

* added issues raised in PR

* Update CESM-CICE sync with new time manager

* Add back in latlongrid

* Add new advanced snow physics to driver

* Fix restart issue with land blocks

* Update mesh check in cap

* fix scam problems

* reintroduced imesh_eps check

* Put dragio in the namelist instead

* Remove redundant code

* Fix some indents

Co-authored-by: Mariana Vertenstein <mvertens@ucar.edu>
Co-authored-by: apcraig <anthony.p.craig@gmail.com>
Co-authored-by: Denise Worthen <denise.worthen@noaa.gov>
* Add CESM1_PIO for fill value check

* Revert PIO_FILL_DOUBLE change for now
…nsortium#677)

- Remove recent update to namelist read that traps bad lines, it conflicts
  with flexibility to read groups in random order picked up by NAG.
- change print* statements to write(nu_diag,*)
* Add narwhal intel, gnu, cray, aocc
Add perf_suite.ts

* update narwhal_cray and perf_suite
* Add narwhal intel, gnu, cray, aocc
Add perf_suite.ts

* update narwhal_cray and perf_suite

* Review and update OMP implementation

- Fix call to timers in block loops
- Fix some OMP Private variables
- Test OMP Scheduling, add SCHEDULE(runtime) to some OMP loops
- Review column and advection OMP implementation
- ADD OMP_TIMERS CPP option (temporary) to time threaded sections
- Add timer_tmp timers (temporary)
- Add omp_suite.ts test suite
- Add ability to set OMP_SCHEDULE via options (ompscheds, ompscheds1, ompschedd1)

* - Review diagnostics OMP implementation
- Add timer_stats namelist to turn on extra timer output information
- Add ICE_BFBTYPE and update bit-for-bit comparison logic in scripts
- Update qc and logbfb testing
- Remove logbfb and qchkf tests, add cmplog, cmplogrest, cmprest set_env files to set ICE_BFBTYPE
- Update documentation

* Update EVP OMP implementation

* - Refactor puny/pi scalars in eap dynamics to improve performance
- Update OMP in evp and eap

* Clean up

* Comment out temporary timers

* Update OMP env variables on Narwhal

* Update gaffney OMP_STACKSIZE

* update OMP_STACKSIZE on cori

* Update Onyx OMP_STACKSIZE
Update documentation

* Update OMP_STACKSIZE on mustang

* - Update Tsfc values on land in various places in the code, was affecting testing.  Specifically fix upwind advection.
- Comment out OMP in ice_dyn_evp_1d_kernel, was producing non bit-for-bit results with different thread counts
* refactor seabed_stress. Bit for bit

* Removed if statement from stepu. Results are binary identical, however taubx and tauby is updated on all iterations instead of just the last one. Not used within iteration

* changed capping from logical to numeric in order to remove if statement. Moved call to deformation out of loop

* clean dyn_finish, correct intent(inout) to intent(in) for Cw, resse Cb in stepu, remove if from seabed_stress_LKD

* Reolve conflicts after updating main

* modified environment for Freya to accomodate for additional OMP commands

* Requested changes after review. Only changed in seabed stress and not bit for bit if cor=0.0

added legacy comment in ice_dyn_finish

* move deformation to subcycling
- Remove gordon and conrad machines.
- Add setenv OMP_STACKSIZE commented out in env files
- Update Icepack to fc4b809
Update tauxbx, tauxby calculations on C/CD to be consistent with main B changes
Extend omp_suite to include C/CD tests
@apcraig apcraig mentioned this pull request Mar 6, 2022
16 tasks
@eclare108213
Copy link
Collaborator

  • prod tests are not entirely reliable on main and this is the case on the branch too. We need to look into this on main. Are results not reproducible over long runs ??

They should be reproducible over long runs! Definitely need to look into this.

@apcraig
Copy link
Owner Author

apcraig commented Mar 7, 2022

I've been debugging some of these issues the last couple days. It's been challenging because some issues arise due to multiple bugs. Those are always the most difficult to sort out. Have found a few issues, one an uninitialized array associated with seabed_stress. That could be the issue with reproducibility of the prod tests, but I still need to check that.

@apcraig
Copy link
Owner Author

apcraig commented Mar 7, 2022

This is ready to merge. It includes just the merge from main. All the other problems existed before on cgridDEV. I will create a new PR to fix some of the issues I've found. Anyone think we shouldn't merge this?

@eclare108213
Copy link
Collaborator

I'm reviewing it now. Have a few questions but probably nothing to stop it from being merged.

Copy link
Collaborator

@eclare108213 eclare108213 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I noted a few questions but none of them need to have responses before this PR is merged.

@@ -328,7 +328,7 @@ subroutine implicit_solver (dt)

! tcraig, tcx, threading here leads to some non-reproducbile results and failures in icepack_ice_strength
! need to do more debugging
!$TCXOMP PARALLEL DO PRIVATE(iblk,ilo,ihi,jlo,jhi,this_block)
!$TCXOMP PARALLEL DO PRIVATE(iblk,ilo,ihi,jlo,jhi,this_block,ij,i,j)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this loop still not working? If it is, then remove TCX?

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The VP still has a few known problems. One with this OMP loop, another with restarts, and possibly another with bit-for-bit with different decomps and/or OMP, CICE-Consortium#518.

trcrn(i,j,nt_Tsfc,n) = Tf(i,j) ! surface temperature
else
trcrn(i,j,nt_Tsfc,n) = c0 ! on land gridcells
endif
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be more efficient (or maybe just more elegant) to use the merge function instead of all the if conditionals for setting Tsfc, here and elsewhere?

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not a huge fan of merge syntax, but willing to start using it as it is a bit more compact. Will keep that in mind moving forward, update the current implementation, and try to use merge more.

@@ -745,7 +745,8 @@ subroutine read_restart_field(nu,nrec,work,atype,vname,ndim3,diag, &
status = pio_inq_varid(File,trim(vname),vardesc)

if (status /= PIO_noerr) then
call abort_ice(subname//"ERROR: CICE restart? Missing variable: "//trim(vname))
call abort_ice(subname// &
"ERROR: CICE restart? Missing variable: "//trim(vname))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is there a question mark in this error statement?

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure, I'll try to review that.

@apcraig apcraig merged commit 6534048 into cgridDEV Mar 8, 2022
apcraig added a commit that referenced this pull request Mar 8, 2022
address the problems with the boxrestore test errors.
- Update box2001 so it's bit-for-bit with different blocks/decomps/pe counts.
- Initialize bathymetry values at all gridcells when bathymetry_format='default' and use_bathymetry=.false.
apcraig added a commit that referenced this pull request Mar 10, 2022
…, omp_suite (#65)

* Fix several problems noted in PR #64.  This should
address the problems with the boxrestore test errors.
- Update box2001 so it's bit-for-bit with different blocks/decomps/pe counts.
- Initialize bathymetry values at all gridcells when bathymetry_format='default' and use_bathymetry=.false.

* update comparelog to exclude extraneous icepack output, help prod comparisons

* - Zero out certain fields on land on restart files.  Some fields have non-zero values
  over land by default and this causes problems with land block elimination and
  comparisons of different decompositions.  Does not affect science.
- Update omp_suite to use different block sizes in comparisons

* use c0 instead of 0.

* fix OpenMP private variables on new code

* update pio restart diagnostics format
@apcraig apcraig deleted the cgridD branch August 17, 2022 21:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants