
Adjust prep job resources on Orion due to recent system updates #463

Closed
KateFriedman-NOAA opened this issue Oct 14, 2021 · 7 comments
Labels: bug (Something isn't working)

@KateFriedman-NOAA
Member

@AndrewEichmann-NOAA reported that his prep job on Orion was failing with an oom-kill. See the message below. This is likely related to the recent memory-related system updates on Orion. Andy reran the job with increased nodes (8 instead of 2, i.e. 4x) and it ran without error:

In config.resources, set export npe_$step=16:

if [ $step = "prep" -o $step = "prepbufr" ]; then

    eval "export wtime_$step='00:45:00'"
    eval "export npe_$step=16"
    eval "export npe_node_$step=2"
    eval "export nth_$step=1"

In the XML you now get 8 nodes instead of 2 (npe_$step=16 divided by npe_node_$step=2):

<nodes>8:ppn=2:tpp=1</nodes>

We should adjust the prep job resources on Orion but possibly leave other platforms as-is for now so additional nodes aren't wasted elsewhere. Future reworks of resource assignments will likely account for memory and thus handle this better. (A rough sketch of an Orion-only adjustment follows the quoted log below.)

I had a problem with gdasprep failing (according to rocoto) on Orion: the job appeared to finish and then hit a memory error. The end of the log is presented here:

********************************************************************
Finished executing on node  Orion-01-11
Ending time  : Wed Oct 13 11:50:10 CDT 2021
********************************************************************

++ 893s + hostname
++ 893s + date -u
+ 893s + echo ' Orion-01-11.HPC.MsState.Edu  --  Wed Oct 13 16:50:28 UTC 2021'
+ 893s + '[' -n '' ']'
+ 893s + '[' -n '' ']'
+ 893s + '[' NO '!=' YES ']'
+ 893s + cd /work/noaa/stmp/aeichman/RUNDIRS/flyer/2021060906/gdas/prepbufr
+ 893s + rm -rf /work/noaa/stmp/aeichman/RUNDIRS/flyer/2021060906/gdas/prepbufr/prep.363132
+ 893s + date -u
Wed Oct 13 16:50:29 UTC 2021
+ 894s + exit
+ status=0
+ [[ 0 -ne 0 ]]
+ exit 0
slurmstepd: error: Detected 2 oom-kill event(s) in step 3359057.batch cgroup. Some of your processes may have been
killed by the cgroup out-of-memory handler.
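
For illustration, one way the Orion-only adjustment could be sketched in config.resources. This is only a sketch: the $machine check and the non-Orion value of npe_$step=4 are assumptions here, not the actual change.

if [ $step = "prep" -o $step = "prepbufr" ]; then

    eval "export wtime_$step='00:45:00'"
    eval "export npe_node_$step=2"
    eval "export nth_$step=1"
    # Hypothetical Orion-only bump: more tasks -> more nodes -> more total memory for the job
    if [ $machine = "ORION" ]; then
        eval "export npe_$step=16"
    else
        eval "export npe_$step=4"
    fi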
@KateFriedman-NOAA KateFriedman-NOAA added bug Something isn't working uncoupled labels Oct 14, 2021
@KateFriedman-NOAA KateFriedman-NOAA self-assigned this Oct 14, 2021
@WalterKolczynski-NOAA
Contributor

If this is due to the recent changes, throwing more nodes at it is a waste of resources. We need to increase the memory request instead.

@AndrewEichmann-NOAA
Contributor

AndrewEichmann-NOAA commented Oct 15, 2021 via email

Walter, is there a way to do that in the model or experiment configuration? I'm encountering the same problem after a few successful cycles. Andy

@WalterKolczynski-NOAA
Contributor

After you generate the workflow, add a <memory> tag specification (see the rocoto documentation) to the task specification for any task that is having problems. That should solve the problem, but it hasn't been tested yet.
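
For example, something along these lines in the generated task definition; the memory value is a placeholder rather than a recommendation, and the rest of the task content is elided:

<task name="gdasprep">
  ...
  <nodes>8:ppn=2:tpp=1</nodes>
  <memory>96G</memory>
  ...
</task>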

@AndrewEichmann-NOAA
Contributor

AndrewEichmann-NOAA commented Oct 18, 2021 via email

@WalterKolczynski-NOAA
Contributor

Yes, although the important part is where it is actually added to the task description with the <memory> tags.

The 4608 MB is the default per core, but the setting in the XML is the total memory for the job. Formerly, Orion set the limit to the total memory of the node (192 GB/node). You can go back to 2 nodes and ask for 384 GB, which, if I understand things correctly, should be the same as before.
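
In other words, assuming the total-memory interpretation described above, going back to 2 nodes with the full per-node memory could look roughly like this in the task definition (illustrative only):

<nodes>2:ppn=2:tpp=1</nodes>
<memory>384G</memory>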

@AndrewEichmann-NOAA
Contributor

AndrewEichmann-NOAA commented Oct 18, 2021 via email

@WalterKolczynski-NOAA
Contributor

This was fixed as part of the mega-PR #500

kayeekayee pushed a commit to kayeekayee/global-workflow that referenced this issue May 30, 2024
*  Improve cloud fraction when using Thompson MP. See NCAR/ccpp-physics#809 for more details.