
Add backstop check for free memory before starting job #699

Open
bloodearnest opened this issue Jan 2, 2024 · 1 comment

Comments

@bloodearnest
Member

Currently, we can run MAX_WORKERS jobs, each with up to 128GB of memory (a global config). At time of writing MAX_WORKERS is 20, so that's a potential 2560GB. We have ~610GB on TPP, so we are ~4x overcommitted at peak. Normally this is fine, as most jobs use a lot less memory than this. Occasionally, this is not true, and while each job is below 128GB, 20 of them together exceed 610GB, and we get the bad kind of OOM behaviour (the kernel has to choose which docker process to kill, which is the OS equivalent of Undefined Behaviour).

The proper solution for this is probably to enable per-job limits. Most jobs can run with a limit of 64GB or 32GB; only some need 128GB. This would also enable size-based scheduling, which is well understood (think cloud VM scheduling).
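A minimal sketch of what per-job limits might look like, assuming jobs are started via the docker CLI and declare a size tier. The tier names and `start_job` helper are illustrative, not the existing job-runner API; the 32/64/128GB values are the figures mentioned above:

```python
import subprocess

# Illustrative memory tiers based on the figures above; the actual
# per-job limit would come from job configuration.
MEMORY_TIERS = {
    "small": "32g",
    "medium": "64g",
    "large": "128g",
}

def start_job(image, command, tier="small"):
    """Start a job container with a hard memory cap via Docker's --memory flag."""
    limit = MEMORY_TIERS[tier]
    subprocess.run(
        ["docker", "run", "--detach", f"--memory={limit}", image, *command],
        check=True,
    )
```

With per-job limits in place, the scheduler could also sum the declared limits of currently running jobs and refuse to start any job whose limit would push the total past physical memory.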

A dumber and simpler option might be to just check for a minimum amount of free memory before executing a job. This should be fairly simple to implement, and would apply dynamic backpressure without complicated scheduling algorithms. We could do the same with disk too, perhaps.
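A rough sketch of that backstop check, assuming psutil is available; the `MIN_FREE_MEMORY_BYTES` name and the threshold value are illustrative, not an agreed config:

```python
import psutil

# Illustrative threshold; the actual value would be operator-configurable.
MIN_FREE_MEMORY_BYTES = 60 * 1024**3  # 60GB

def enough_free_memory():
    """Return True if the host has enough available memory to start another job."""
    return psutil.virtual_memory().available >= MIN_FREE_MEMORY_BYTES
```

In the scheduling loop, a job that fails this check would be deferred rather than failed, so it is simply retried once memory pressure eases; that deferral is what provides the backpressure.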

@madwort
Contributor

madwort commented Mar 7, 2024

Based on the incident in https://bennettoxford.slack.com/archives/C02GL3A9THD/p1708422081557029 my suggestion is to not start any jobs when there is less than 60GB free memory. NB: if we run this in combination with #712, that only gives a job 15GB available before it's at risk of being killed.
