
Adjust prep job resources on Orion due to recent system updates #463

Closed
KateFriedman-NOAA opened this issue Oct 14, 2021 · 7 comments
Labels: bug (Something isn't working)

@KateFriedman-NOAA
Member

@AndrewEichmann-NOAA reported that his prep job on Orion was failing with an oom-kill. See the message below. This is likely related to the recent memory-related system updates on Orion. Andy reran the job with increased nodes (8 instead of 2, i.e. 4x) and it ran without error:

In config.resources, set export npe_$step=16:

if [ $step = "prep" -o $step = "prepbufr" ]; then

    eval "export wtime_$step='00:45:00'"
    eval "export npe_$step=16"
    eval "export npe_node_$step=2"
    eval "export nth_$step=1"

In the XML you now get 8 nodes instead of 2 (npe_$step=16 divided by npe_node_$step=2):

<nodes>8:ppn=2:tpp=1</nodes>

We should adjust the prep job resources on Orion but possibly leave other platforms as-is for now so additional nodes aren't wasted elsewhere. Future reworks of resource assignments will likely account for memory and thus handle this better. (A rough sketch of an Orion-only adjustment follows the quoted log below.)

I had a problem with gdasprep failing (according to rocoto) on Orion: the job appeared to finish and then hit a memory error. The end of the log is presented here:

********************************************************************
Finished executing on node  Orion-01-11
Ending time  : Wed Oct 13 11:50:10 CDT 2021
********************************************************************

++ 893s + hostname
++ 893s + date -u
+ 893s + echo ' Orion-01-11.HPC.MsState.Edu  --  Wed Oct 13 16:50:28 UTC 2021'
+ 893s + '[' -n '' ']'
+ 893s + '[' -n '' ']'
+ 893s + '[' NO '!=' YES ']'
+ 893s + cd /work/noaa/stmp/aeichman/RUNDIRS/flyer/2021060906/gdas/prepbufr
+ 893s + rm -rf /work/noaa/stmp/aeichman/RUNDIRS/flyer/2021060906/gdas/prepbufr/prep.363132
+ 893s + date -u
Wed Oct 13 16:50:29 UTC 2021
+ 894s + exit
+ status=0
+ [[ 0 -ne 0 ]]
+ exit 0
slurmstepd: error: Detected 2 oom-kill event(s) in step 3359057.batch cgroup. Some of your processes may have been
killed by the cgroup out-of-memory handler.
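
For illustration, one way the Orion-only adjustment could be sketched in config.resources. This is only a sketch: the $machine check and the non-Orion value of npe_$step=4 are assumptions here, not the actual change.

if [ $step = "prep" -o $step = "prepbufr" ]; then

    eval "export wtime_$step='00:45:00'"
    eval "export npe_node_$step=2"
    eval "export nth_$step=1"
    # Hypothetical Orion-only bump: more tasks -> more nodes -> more total memory for the job
    if [ $machine = "ORION" ]; then
        eval "export npe_$step=16"
    else
        eval "export npe_$step=4"
    fi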
@KateFriedman-NOAA KateFriedman-NOAA added bug Something isn't working uncoupled labels Oct 14, 2021
@KateFriedman-NOAA KateFriedman-NOAA self-assigned this Oct 14, 2021
@WalterKolczynski-NOAA
Contributor

If this is due to the recent changes, throwing more nodes at it is a waste of resources. We need to increase the memory request instead.

@AndrewEichmann-NOAA
Contributor

AndrewEichmann-NOAA commented Oct 15, 2021 via email

Walter, is there a way to do that in the model or experiment configuration? I'm encountering the same problem after a few successful cycles. Andy

@WalterKolczynski-NOAA
Contributor

After you generate the workflow, add a <memory> tag specification (see the rocoto documentation) to the task specification for any task that is having problems. That should solve the problem, but it hasn't been tested yet.
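
For example, something along these lines in the generated task definition; the memory value is a placeholder rather than a recommendation, and the rest of the task content is elided:

<task name="gdasprep">
  ...
  <nodes>8:ppn=2:tpp=1</nodes>
  <memory>96G</memory>
  ...
</task>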

@AndrewEichmann-NOAA
Contributor

AndrewEichmann-NOAA commented Oct 18, 2021 via email

@WalterKolczynski-NOAA
Contributor

Yes, although the important part is where it is actually added to the task description with the <memory> tags.

The 4608 MB is the default per core, but the setting in the XML is the total memory for the job. Formerly, Orion set the limit to the total memory of the node (192 GB/node). You can go back to 2 nodes and ask for 384 GB, which, if I understand things correctly, should be the same as before.
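
In other words, assuming the total-memory interpretation described above, going back to 2 nodes with the full per-node memory could look roughly like this in the task definition (illustrative only):

<nodes>2:ppn=2:tpp=1</nodes>
<memory>384G</memory>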

@AndrewEichmann-NOAA
Contributor

AndrewEichmann-NOAA commented Oct 18, 2021 via email

@WalterKolczynski-NOAA
Contributor

This was fixed as part of the mega-PR #500

kayeekayee pushed a commit to kayeekayee/global-workflow that referenced this issue May 30, 2024
*  Improve cloud fraction when using Thompson MP. See NCAR/ccpp-physics#809 for more details.