
Add machine defined limits on maximum processors and batch wall clock time #349

Merged
merged 4 commits into CICE-Consortium:master from the machlim branch on Aug 23, 2019

Conversation

apcraig
Contributor

@apcraig apcraig commented Aug 14, 2019

PR checklist

  • Short (1 sentence) summary of your PR:
    Add machine defined limits on maximum processors and batch wall clock time
  • Developer(s):
    apcraig
  • Suggest PR reviewers from list in the column to the right.
  • Please copy the PR test results link or provide a summary of testing completed below.
    #2253cca91c8bb7e at https://github.com/CICE-Consortium/Test-Results/wiki/cice_by_hash_forks
  • How much do the PR code changes differ from the unmodified code?
    • bit for bit
    • different at roundoff level
    • more substantial
  • Does this PR create or have dependencies on Icepack or any other models?
    • Yes
    • No
  • Does this PR add any new test cases?
    • Yes
    • No
  • Is the documentation being updated? ("Documentation" includes information on the wiki or in the .rst files from doc/source/, which are used to create the online technical docs at https://readthedocs.org/projects/cice-consortium-cice/.)
    • Yes
    • No. If not, does the documentation need to be updated at a later time?
      • Yes
      • No
  • Please provide any additional information or relevant details below:

Two new optional machine env variables were added:

ICE_MACHINE_MAXPES
ICE_MACHINE_MAXRUNLENGTH
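
For illustration only, a machine env file on a resource-limited machine might set these roughly as follows (the values and comments here are assumptions for this example, not taken from the PR):

    # hypothetical settings for a small machine; values are made up
    setenv ICE_MACHINE_MAXPES        8    # cap on total tasks*threads for a case
    setenv ICE_MACHINE_MAXRUNLENGTH  1    # cap on the requested batch wall clock time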

This is a relatively quick fix to allow test suites and cases to run on machines with limited processors or batch wall times. These constraints will be imposed on the case as follows:

  • The batch wall time will be limited to the value specified in the machine env file, but this will not be reflected directly in the cice.settings ICE_RUNLENGTH variable; it will just be adjusted in the batch submission scripts.
  • The tasks and threads will be corrected so that tasks*threads is not greater than ICE_MACHINE_MAXPES. If a user has also specified a particular decomposition with max blocks, then max blocks will automatically be increased when the tasks are decreased. This will appear in the cice.settings file and it will also change the name of the test, so a case name of conrad_intel_smoke_gx3_8x4x10x12x8.testid will have its 8x4x10x12x8 portion changed to 2x4x10x12x32 if MAXPES is 8. In that case the test name reflects the actual test being run, which is important. (A sketch of this adjustment follows this list.)
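
A minimal sketch of the PE-capping idea in csh (illustrative only; the names task, thrd, mblck, and maxtask are placeholders, and this is not the actual cice.setup code): keep the requested thread count, shrink the task count until tasks*threads fits under the cap, and grow max blocks by the same factor so the decomposition still covers the full grid.

    # illustrative sketch, not the code merged in this PR
    # task, thrd, mblck hold the requested tasks, threads, and max blocks
    if ($?ICE_MACHINE_MAXPES) then
      # largest task count that keeps tasks*threads within the machine cap
      @ maxtask = $ICE_MACHINE_MAXPES / $thrd
      if ($maxtask < 1) then
        @ maxtask = 1
      endif
      if ($task > $maxtask) then
        # scale max blocks up by the task reduction so the total block
        # count is preserved (integer arithmetic; \* avoids csh globbing)
        @ mblck = $mblck \* $task
        @ mblck = $mblck / $maxtask
        set task = $maxtask
      endif
    endif

With MAXPES set to 8, an 8x4x10x12x8 request reduces to 2 tasks, 4 threads, and 32 max blocks, which matches the renamed case above.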

A full test suite was run on conrad with two compilers. Manual testing was also done on conrad, imposing limits on the pes and the wall time, to make sure the script was working as designed.

There are some other issues that could be tackled such as

  • adjusting both tasks and threads if the MAXPES is violated
  • adding other machine constraints such as a max threads per task or max threads per node
  • updating the RUNLENGTH logic in various ways to make it more flexible; what I'm thinking of would require a significant refactor and an additional step between cice.setup and cice.build that would generate the resolved scripts. In that case, a user could create a case, change some things in cice.settings, then generate the scripts, build, and run. That extra step could also be useful for updating the case scripts after the case is created. But I view that largely as "too much"; it's easy enough to generate a new case as needed at this point.

@apcraig
Contributor Author

apcraig commented Aug 14, 2019

This addresses Issue #330

@apcraig
Contributor Author

apcraig commented Aug 14, 2019

The new documentation, largely a table, can be seen here,

https://apcraig-cice.readthedocs.io/en/machlim/user_guide/ug_running.html#machine-variables

@eclare108213
Contributor

This looks good to me. I'd like @phil-blain to review and test the changes on his (limited) machine, and I'll try to do the same on badger.

Member

@phil-blain phil-blain left a comment


I tried it on brooks and the MAXRUNLENGTH setting works correctly.

Regarding the MAXPES option, I'm not sure I understand yet how everything works with blocks, max_blocks, and so on, or how cice.setup interacts with the code at the top of cice.batch in that regard. But just looking at the mods to cice.setup in this PR, the first question that came to me was: why do we favor keeping ${thrd} as requested by the user and changing ${task}, and not the opposite? (Or should we let the user choose which of ntasks or nthreads should be overwritten if MAXPES < ntasks x nthreads?)

I guess that is what you mean by

adjusting both tasks and threads if the MAXPES is violated

Also,

adding other machine constraints such as a max threads per task or max threads per node

I think this is a very good idea but maybe not necessary in the context of this PR.

@apcraig
Contributor Author

apcraig commented Aug 15, 2019

The choice to reduce tasks and increase max blocks while keeping threads the same is largely arbitrary, but I think it's probably the simplest and most robust. In many cases, threads will be 1 so the only option is to reduce tasks. In our test suites, if threads are 2, we really do not want to reduce them to 1 as we'll be turning off threading, which is probably not what we want for that test. Also, threads can be oversubscribed without too much problem, but tasks really cannot. I suppose some additional logic could be added to reduce threads but we'd need to come up with appropriate rules to do so.
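
To make that concrete with illustrative numbers (not taken from the PR testing): with ICE_MACHINE_MAXPES set to 16, an 8x4x10x12x8 request would become 4x4x10x12x16, so threading is still exercised, whereas reducing threads instead would give 8x2 or 8x1 and quietly change what the test covers.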

Also, these checks are really designed for test suites, which use somewhat arbitrary pe counts to cover various configurations. All we really need is an adjustment to the tasks/threads for the small subset of machines that are resource limited, and for it to work across the various configurations. Reducing tasks seems to make sense for that. If someone is setting up a test or a case on their own machine, oversubscribing it, and they don't like what the scripts are doing, it's really their fault. Maybe we should have cice.setup abort and force the user to fix it in that case? Do you think that's a useful feature to add?

Anyway, that's my thinking about the current implementation. I think just reducing tasks addresses the issue, and I haven't been able to think of a case that is adversely affected by that approach. But if there are cases where just reducing tasks is a problem, it would be good to take them into account. I guess one would be if the number of tasks is reduced to 1 while threads remain high, but I am not sure that is a likely scenario. It really depends on the resources available. For machines like travis, with very limited resources, we have a special test suite. If someone is trying to run the full test suite on their laptop, that's another case we might want to think about and generate a special test suite for. For machines that may have only 16 pes or so, I think the approach taken here is reasonable. Any machine with over 32 or 64 pes is likely to have enough resources. I'm open to other suggestions though.

@phil-blain
Member

Thanks for the explanation. I agree; the current implementation solves the problem at hand. We can always add more features if they are needed in the future.

@apcraig
Contributor Author

apcraig commented Aug 22, 2019

Anyone have any objections to merging this now? I will merge tomorrow (Friday, Aug 23) unless I hear otherwise.

@apcraig apcraig merged commit 9cb297b into CICE-Consortium:master Aug 23, 2019
@apcraig apcraig deleted the machlim branch August 17, 2022 20:57