
Add machine defined limits on maximum processors and batch wall clock time #349

Merged
merged 4 commits into CICE-Consortium:master from the machlim branch on Aug 23, 2019

Conversation

apcraig
Contributor

@apcraig apcraig commented Aug 14, 2019

PR checklist

  • Short (1 sentence) summary of your PR:
    Add machine defined limits on maximum processors and batch wall clock time
  • Developer(s):
    apcraig
  • Suggest PR reviewers from list in the column to the right.
  • Please copy the PR test results link or provide a summary of testing completed below.
    #2253cca91c8bb7e at https://github.com/CICE-Consortium/Test-Results/wiki/cice_by_hash_forks
  • How much do the PR code changes differ from the unmodified code?
    • bit for bit
    • different at roundoff level
    • more substantial
  • Does this PR create or have dependencies on Icepack or any other models?
    • Yes
    • No
  • Does this PR add any new test cases?
    • Yes
    • No
  • Is the documentation being updated? ("Documentation" includes information on the wiki or in the .rst files from doc/source/, which are used to create the online technical docs at https://readthedocs.org/projects/cice-consortium-cice/.)
    • Yes
    • No. If not, does the documentation need to be updated at a later time?
      • Yes
      • No
  • Please provide any additional information or relevant details below:

Two new optional machine env variables were added:

ICE_MACHINE_MAXPES
ICE_MACHINE_MAXRUNLENGTH
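
For illustration only, a machine env file on a resource-limited machine might set these roughly as follows (the values and comments here are assumptions for this example, not taken from the PR):

    # hypothetical settings for a small machine; values are made up
    setenv ICE_MACHINE_MAXPES        8    # cap on total tasks*threads for a case
    setenv ICE_MACHINE_MAXRUNLENGTH  1    # cap on the requested batch wall clock time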

This is a relatively quick fix to allow test suites and cases to run on machines with limited processors or batch wall times. These constraints will be imposed on the case as follows:

  • The batch wall time will be limited to the value specified in the machine env file, but this will not be reflected directly in the cice.settings ICE_RUNLENGTH variable; it will just be adjusted in the batch submission scripts.
  • The tasks and threads will be corrected so that tasks*threads is not greater than ICE_MACHINE_MAXPES. If a user has also specified a particular decomposition with max blocks, then max blocks will automatically be increased when the tasks are decreased. This will appear in the cice.settings file and it will also change the name of the test, so a case name of conrad_intel_smoke_gx3_8x4x10x12x8.testid will have its 8x4x10x12x8 portion changed to 2x4x10x12x32 if MAXPES is 8. In that case the test name reflects the actual test being run, which is important. (A sketch of this adjustment follows this list.)
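
A minimal sketch of the PE-capping idea in csh (illustrative only; the names task, thrd, mblck, and maxtask are placeholders, and this is not the actual cice.setup code): keep the requested thread count, shrink the task count until tasks*threads fits under the cap, and grow max blocks by the same factor so the decomposition still covers the full grid.

    # illustrative sketch, not the code merged in this PR
    # task, thrd, mblck hold the requested tasks, threads, and max blocks
    if ($?ICE_MACHINE_MAXPES) then
      # largest task count that keeps tasks*threads within the machine cap
      @ maxtask = $ICE_MACHINE_MAXPES / $thrd
      if ($maxtask < 1) then
        @ maxtask = 1
      endif
      if ($task > $maxtask) then
        # scale max blocks up by the task reduction so the total block
        # count is preserved (integer arithmetic; \* avoids csh globbing)
        @ mblck = $mblck \* $task
        @ mblck = $mblck / $maxtask
        set task = $maxtask
      endif
    endif

With MAXPES set to 8, an 8x4x10x12x8 request reduces to 2 tasks, 4 threads, and 32 max blocks, which matches the renamed case above.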

A full test suite was run on conrad with two compilers. Manual testing was also done on conrad, imposing limits on the pes and the wall time, to make sure the script was working as designed.

There are some other issues that could be tackled such as

  • adjusting both tasks and threads if the MAXPES is violated
  • adding other machine constraints such as a max threads per task or max threads per node
  • updating the RUNLENGTH logic in various ways to make it more flexible; what I'm thinking of would require a significant refactor and an additional step between cice.setup and cice.build that would generate the resolved scripts. In that case, a user could create a case, change some things in cice.settings, then generate the scripts, build, and run. That extra step could also be useful for updating the case scripts after the case is created. But I view that largely as "too much"; it's easy enough to generate a new case as needed at this point.

@apcraig
Contributor Author

apcraig commented Aug 14, 2019

This addresses Issue #330

@apcraig
Contributor Author

apcraig commented Aug 14, 2019

The new documentation, largely a table, can be seen here,

https://apcraig-cice.readthedocs.io/en/machlim/user_guide/ug_running.html#machine-variables

@eclare108213
Contributor

This looks good to me. I'd like @phil-blain to review and test the changes on his (limited) machine, and I'll try to do the same on badger.

Member

@phil-blain phil-blain left a comment


I tried it on brooks and the MAXRUNLENGTH setting works correctly.

Regarding the MAXPES option, I'm not sure I understand yet how everything works with blocks, max_blocks, and so on, or how cice.setup interacts with the code at the top of cice.batch in that regard. But just looking at the mods to cice.setup in this PR, the first question that came to me was: why do we favor keeping ${thrd} as requested by the user and changing ${task}, and not the opposite? (Or should we let the user choose which of ntasks or nthreads should be overwritten if MAXPES < ntasks x nthreads?)

I guess that is what you mean by

adjusting both tasks and threads if the MAXPES is violated

Also,

adding other machine constraints such as a max threads per task or max threads per node

I think this is a very good idea but maybe not necessary in the context of this PR.

@apcraig
Contributor Author

apcraig commented Aug 15, 2019

The choice to reduce tasks and increase max blocks while keeping threads the same is largely arbitrary, but I think it's probably the simplest and most robust. In many cases, threads will be 1 so the only option is to reduce tasks. In our test suites, if threads are 2, we really do not want to reduce them to 1 as we'll be turning off threading, which is probably not what we want for that test. Also, threads can be oversubscribed without too much problem, but tasks really cannot. I suppose some additional logic could be added to reduce threads but we'd need to come up with appropriate rules to do so.
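
To make that concrete with illustrative numbers (not taken from the PR testing): with ICE_MACHINE_MAXPES set to 16, an 8x4x10x12x8 request would become 4x4x10x12x16, so threading is still exercised, whereas reducing threads instead would give 8x2 or 8x1 and quietly change what the test covers.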

Also, these checks are really designed for test suites, which use somewhat arbitrary pe counts to cover various configurations. All we really need is an adjustment to the tasks/threads for the small subset of machines that are resource limited, and for it to work across the various configurations. Reducing tasks seems to make sense for that. If someone is setting up a test or a case on their own machine, oversubscribing it, and they don't like what the scripts are doing, it's really their fault. Maybe we should have cice.setup abort and force the user to fix it in that case? Do you think that's a useful feature to add?

Anyway, that's my thinking about the current implementation. I think just reducing tasks addresses the issue, and I haven't been able to think of a case that is adversely affected by that approach. But if there are cases where just reducing tasks is a problem, it would be good to take them into account. I guess one would be if the number of tasks is reduced to 1 while threads remain high, but I am not sure that is a likely scenario. It really depends on the resources available. For machines like travis, with very limited resources, we have a special test suite. If someone is trying to run the full test suite on their laptop, that's another case we might want to think about and generate a special test suite for. For machines that may have only 16 pes or so, I think the approach taken here is reasonable. Any machine with over 32 or 64 pes is likely to have enough resources. I'm open to other suggestions though.

@phil-blain
Member

Thanks for the explanation. I agree; the current implementation solves the problem at hand. We can always add more features if they are needed in the future.

@apcraig
Contributor Author

apcraig commented Aug 22, 2019

Anyone have any objections to merging this now? I will merge tomorrow (Friday, Aug 23) unless I hear otherwise.

@apcraig apcraig merged commit 9cb297b into CICE-Consortium:master Aug 23, 2019
@apcraig apcraig deleted the machlim branch August 17, 2022 20:57