In fv3atm: convert GFS DDTs from blocked data structures to contiguous arrays #2183

Conversation

climbfuji
Collaborator

@climbfuji climbfuji commented Mar 11, 2024

Commit Queue Requirements:

  • Fill out all sections of this template.
  • All sub component pull requests have been reviewed by their code managers.
  • Run the full Intel+GNU RT suite (compared to current baselines) on either Hera/Derecho/Hercules
  • Commit 'test_changes.list' from previous step

Description:

This PR updates the submodule pointers for fv3atm, gfdl_atmos_cubed_sphere, and ccpp-physics for the changes described in the associated PRs below: converting the internal GFS DDTs from blocked data structures to contiguous arrays. This excludes the (external) GFS_extdiag and GFS_restart DDTs.
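For readers unfamiliar with the two layouts, the following is a minimal Python sketch of the general idea only. It is not the fv3atm Fortran code; the class and function names (`BlockedField`, `ContiguousField`, `to_contiguous`, `block_slice`) are hypothetical and chosen for illustration. A blocked layout keeps one separate array per block of horizontal points, while a contiguous layout keeps a single array and addresses each block through index ranges.

```python
from dataclasses import dataclass


@dataclass
class BlockedField:
    """Blocked layout: one separate list per block of horizontal points."""
    blocks: list  # blocks[b][i] is point i of block b


@dataclass
class ContiguousField:
    """Contiguous layout: one flat list plus the start offset of each block."""
    data: list
    block_starts: list  # block b covers data[block_starts[b]:block_starts[b + 1]]


def to_contiguous(blocked):
    """Concatenate all blocks into one list and record where each block starts."""
    data = []
    starts = [0]
    for block in blocked.blocks:
        data.extend(block)
        starts.append(len(data))
    return ContiguousField(data, starts)


def block_slice(field, b):
    """Return a copy of the points that belong to block b."""
    return field.data[field.block_starts[b]:field.block_starts[b + 1]]


# Illustrative values only: two blocks of unequal size.
blocked = BlockedField(blocks=[[1.0, 2.0, 3.0], [4.0, 5.0]])
contiguous = to_contiguous(blocked)
second_block = block_slice(contiguous, 1)
```

In this sketch, code that still wants to work block by block can do so through the stored offsets, while code that wants the whole field gets it as a single array.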

Commit Message:

* UFSWM - In fv3atm and submodules, convert internal GFS DDTs from blocked data structures to contiguous arrays. This excludes the (external) `GFS_extdiag` and `GFS_restart` DDTs.
  * AQM - 
  * CDEPS - 
  * CICE - 
  * CMEPS - 
  * CMakeModules - 
  * FV3 - Convert GFS DDTs from blocked data structures to contiguous arrays (not including GFS_restart and GFS_extdiag DDTs)
    * ccpp-physics - Convert GFS DDTs from blocked data structures to contiguous arrays (affects `GFS_debug.{F90,meta}` only)
    * atmos_cubed_sphere - Convert GFS DDTs from blocked data structures to contiguous arrays and remove IPD_Data super DDT
  * GOCART - 
  * HYCOM - 
  * MOM6 - 
  * NOAHMP - 
  * WW3 - 
  * stochastic_physics - 

Priority:

  • Normal

Git Tracking

UFSWM:

Sub component Pull Requests:

UFSWM Blocking Dependencies:

n/a


Changes

Regression Test Changes (Please commit test_changes.list):

  • No Baseline Changes

Input data Changes:

  • None.

Library Changes/Upgrades:

  • No Updates

Testing Log:

  • RDHPCS
    • Hera
    • Orion
    • Hercules
    • Jet
    • Gaea
    • Derecho
  • WCOSS2
    • Dogwood/Cactus
    • Acorn
  • CI
  • opnReqTest (complete task if unnecessary)

climbfuji and others added 30 commits December 22, 2023 10:32
@zach1221 added the "Ready for Commit Queue" label (the PR is ready for the Commit Queue; all checkboxes in the PR template have been checked) and removed the "Waiting for Reviews" label (the PR is waiting for reviews from associated component PRs) on Aug 7, 2024
@zach1221 added and removed the "jenkins-ort" label (run ORT testing) on Aug 7, 2024
@FernandoAndrade-NOAA
Collaborator

Gaea failed compile_atm_debug_dyn32, rerunning.

@FernandoAndrade-NOAA
Collaborator

Gaea's compile_atm_debug_dyn32 is persistently failing; it looks like an out-of-memory issue. @jkbk2004 @zach1221 FYI
/gpfs/f5/epic/scratch/Fernando.Andrade-maldonado/RT_RUNDIRS/Fernando.Andrade-maldonado/FV3_RT/rt_88232/compile_atm_debug_dyn32_intel/err

 2616 ---------^
 2617 slurmstepd: error: Detected 1 oom_kill event in StepId=88636253.batch. Some of the step tasks have been OOM Killed.
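When triaging many compile logs, a small script can flag this kind of failure. The following is a minimal sketch, assuming the `slurmstepd` message has the wording shown in the excerpt above (it may differ between Slurm versions); the function name `find_oom_events` is hypothetical.

```python
import re

# Pattern modeled on the slurmstepd line quoted above (assumed wording).
OOM_PATTERN = re.compile(r"Detected (\d+) oom_kill events? in StepId=(\d+)\.(\w+)")


def find_oom_events(lines):
    """Return (event_count, job_id, step_name) for every OOM-kill line found."""
    events = []
    for line in lines:
        match = OOM_PATTERN.search(line)
        if match:
            events.append((int(match.group(1)), match.group(2), match.group(3)))
    return events


# The two lines from the err file excerpt above.
sample = [
    " 2616 ---------^",
    " 2617 slurmstepd: error: Detected 1 oom_kill event in StepId=88636253.batch. "
    "Some of the step tasks have been OOM Killed.",
]
events = find_oom_events(sample)
```

To scan a real file, the same function can be given an open file handle, since it only iterates over lines.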

@DusanJovic-NOAA
Collaborator

> Gaea's compile_atm_debug_dyn32 is persistently failing; it looks like an out-of-memory issue. @jkbk2004 @zach1221 FYI
> /gpfs/f5/epic/scratch/Fernando.Andrade-maldonado/RT_RUNDIRS/Fernando.Andrade-maldonado/FV3_RT/rt_88232/compile_atm_debug_dyn32_intel/err
>
>  2616 ---------^
>  2617 slurmstepd: error: Detected 1 oom_kill event in StepId=88636253.batch. Some of the step tasks have been OOM Killed.

While porting to Intel LLVM, I also found that some of the compile jobs (I do not remember exactly which ones) often fail with OOM errors on Gaea. I ended up setting the memory size explicitly to 4 GB per CPU; see:

https://github.com/ufs-community/ufs-weather-model/pull/2224/files#diff-3f224109d65db4d383892d9070d370bc8ec1361abac50bb22c7c1c4d996c2105

@BrianCurtis-NOAA
Collaborator

I have not been granted permission to access Acorn yet. We can skip it for now.

@FernandoAndrade-NOAA
Collaborator

I'll leave another update for Gaea with the suggested 4 GB per CPU fix once the rerun concludes. Jet is rerunning control_wam intel due to a timeout.

@FernandoAndrade-NOAA
Collaborator

Tests passed with the proposed 4 GB per CPU fix, thanks @DusanJovic-NOAA! Please ignore the discrepancies in the baseline dates in the Hera, Gaea, and Jet logs; they are the same.

@FernandoAndrade-NOAA
Collaborator

We should be good to continue with the merge process. I'll let ccpp and cubed sphere proceed with merging.

@DusanJovic-NOAA
Collaborator

Merged fv3atm.

@FernandoAndrade-NOAA FernandoAndrade-NOAA merged commit fcf0022 into ufs-community:develop Aug 8, 2024
3 checks passed
Labels
No Baseline Change, Ready for Commit Queue
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Convert blocked data structures in FV3atm to contiguous arrays