Error: out of memory #1759

Closed
Qingfu-Liu opened this issue May 19, 2023 · 5 comments
Labels
bug Something isn't working

Comments

@Qingfu-Liu
Collaborator

Qingfu-Liu commented May 19, 2023

Description

To Reproduce:

Additional context

Output

@Qingfu-Liu Qingfu-Liu added the bug Something isn't working label May 19, 2023
@Qingfu-Liu
Collaborator Author

Qingfu-Liu commented May 19, 2023

Not sure why the text does not show up in the description section. If you click the "edit" button, you will see the text. I am also copying the text here:

I get an "Out Of Memory" error from the workflow I checked out on May 18/19, 2023.
I checked out the workflow on Hera from https://github.com/NOAA-EMC/global-workflow,
created the workflow directory, and ran the suite FV3_GFS_v17_p8.
With the version I checked out on April 28, I am able to run through a 240-hour forecast:
/scratch1/NCEPDEV/global/Qingfu.Liu/git/gfsv17_20230428/sorc/ufs_model.fd]git branch -vva

I am running suite FV3_GFS_v17_p8 on Hera with these two versions. I am not sure which commit causes the problem.

Screenshots
: PASS: fcstRUN phase 1, n_atmsteps = 12 time is 1.417832
0: PASS: fcstRUN phase 2, n_atmsteps = 12 time is 0.321013
4608: PROCESS SURFCE done
4: ncells= 5
4: nlives= 12
4: nthresh= 18.0000000000000
4680: slurmstepd: error: Detected 1 oom-kill event(s) in StepId=45100467.0. Some of your processes may have been killed by the cgroup out-of-memory handler.
4608: PROCESS CLDRAD done
srun: error: h33m34: task 4688: Out Of Memory
srun: launch/slurm: _step_signal: Terminating StepId=45100467.0
0: slurmstepd: error: *** STEP 45100467.0 ON h1c01 CANCELLED AT 2023-05-19T16:00:47 ***
4964: forrtl: error (78): process killed (SIGTERM)

output logs
Hera: /scratch1/NCEPDEV/stmp2/Qingfu.Liu/ROTDIRS/gfsv17p8c_p9/logs/2020080100/gfsfcst.log_save

The code is here: /scratch1/NCEPDEV/global/Qingfu.Liu/git/gfsv17_20230518_org
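
For triage, a log like the excerpt above can be scanned for the Slurm out-of-memory messages to see which step, node, and task were affected. The following is only a minimal sketch: the function name `find_oom_events` and the two regular expressions are hypothetical helpers modeled on the two message formats quoted in this issue, not part of the workflow or of Slurm, and other Slurm versions may word these messages differently.

```python
import re

# Patterns modeled on the two Slurm messages quoted in this issue's log excerpt.
# They are assumptions about the message format, not an official Slurm interface.
OOM_STEP = re.compile(r"Detected (\d+) oom-kill event\(s\) in StepId=(\d+\.\d+)")
OOM_TASK = re.compile(r"srun: error: (\S+): task (\d+): Out Of Memory")


def find_oom_events(lines):
    """Return a list of dicts describing OOM messages found in the log lines."""
    events = []
    for line in lines:
        match = OOM_STEP.search(line)
        if match:
            events.append({
                "kind": "step",
                "count": int(match.group(1)),
                "step_id": match.group(2),
            })
            continue
        match = OOM_TASK.search(line)
        if match:
            events.append({
                "kind": "task",
                "node": match.group(1),
                "task": int(match.group(2)),
            })
    return events


# Sample lines copied from the log excerpt in this issue.
sample = [
    "4608: PROCESS SURFCE done",
    "4680: slurmstepd: error: Detected 1 oom-kill event(s) in StepId=45100467.0. "
    "Some of your processes may have been killed by the cgroup out-of-memory handler.",
    "srun: error: h33m34: task 4688: Out Of Memory",
]
events = find_oom_events(sample)
for event in events:
    print(event)
```

To run it against a full log, pass an open file handle instead of `sample`, for example `find_oom_events(open(path))` with `path` pointing to the forecast log.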

@DeniseWorthen
Collaborator

@Qingfu-Liu Since you're running into this problem from the workflow, I'd suggest the issue be created there.

@JessicaMeixner-NOAA
Collaborator

@Qingfu-Liu I'd suggest adding more tasks to the write component. That has helped me with memory problems in the workflow in the past. You can also see this issue, which might be relevant: NOAA-EMC/global-workflow#1572.

@Qingfu-Liu
Collaborator Author

@DeniseWorthen @JessicaMeixner-NOAA Thank you very much for the help. I will run more tests, changing the number of tasks in the workflow.

@Qingfu-Liu
Collaborator Author

Qingfu-Liu commented May 19, 2023 via email
