Many schedulers have a way to specify a memory usage limit that, when hit, kills the job. This is often nicer than just letting the node crash. If the memory limit of a machine's nodes is known, it might be nice to always set the scheduler-specific flag to some fraction of the maximum memory. To facilitate this, we would need a new property on the computer.
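A minimal sketch of what this could look like, assuming a hypothetical `max_memory_kb` property on the `Computer` (the property name and the choice of storing it via the generic `set_property` mechanism are illustrative, not an existing API):

```python
from aiida import orm

computer = orm.load_computer('my-cluster')

# Hypothetical: record the physical memory of a single node, in kB,
# so a scheduler plugin could derive a kill limit from it, e.g. 90%.
computer.set_property('max_memory_kb', 128 * 1024 * 1024)  # 128 GB

limit_kb = int(0.9 * computer.get_property('max_memory_kb'))
```

A scheduler plugin could then translate `limit_kb` into its scheduler-specific flag (e.g. SLURM's `--mem`) when writing the submit script.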
@sphuber, I assume this can be marked as completed since the feature has been introduced in #5260
[update]
OK, this issue refers to the maximum available memory, while #5260 introduces a default memory, which is not the same thing. At the same time, the existence of #3503 makes me think that people are leaning towards having such a feature.
The naming chosen in #5260 might indeed be suboptimal in hindsight. In essence, though, it did implement what this issue was asking for, so I will still close it; the discussion can continue in #3503.
Looking back at the two issues (this one and #3503), there seem to be two use cases for being able to define the amount of memory a computer has available:

1. Let the scheduler plugin automatically define the maximum memory that should be allocated for a `CalcJob`, such that the scheduler kills the job with an OOM error instead of letting the code crash hard.
2. Let utilities know how much memory a machine (node) has available, so they can compute how many nodes are required for a particular task and how to parallelize it.
PR #5260 essentially satisfied use case 1: the `default_memory_per_machine` attribute of a `Computer` now serves as the default value for the `max_memory_kb` metadata option of a `CalcJob`. So the name is a bit misleading; it might have been better to simply call it `max_memory`.
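Concretely, the flow looks something like this (a sketch, assuming the accessors introduced in #5260, which mirror those of `default_mpiprocs_per_machine`; `pw@my-cluster` is a placeholder code label):

```python
from aiida import orm

computer = orm.load_computer('my-cluster')

# Default picked up by scheduler plugins when a CalcJob does not set
# metadata.options.max_memory_kb explicitly (value in kB).
computer.set_default_memory_per_machine(128 * 1024 * 1024)  # 128 GB

# A single CalcJob can still override it:
builder = orm.load_code('pw@my-cluster').get_builder()
builder.metadata.options.max_memory_kb = 64 * 1024 * 1024  # 64 GB
```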
That leaves use case 2. It could be argued that both use cases could reuse the same `Computer` attributes: we could define `memory` and `cores` attributes holding the amount of memory and the number of cores available on a computer, respectively. This would directly satisfy use case 2, and it could also be used to automatically satisfy use case 1. In practice, I think this is what people would usually do anyway: on a machine with 64 cores and 128 GB of RAM, people would typically set `default_mpiprocs_per_machine` to 64, and telling the scheduler to allocate at most 128 GB of RAM would also make sense as a default. But I am not sure whether there are use cases where hijacking these defaults would be incorrect, in which case there would be no way to change the default.
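To make use case 2 concrete, a utility could derive a node count from such attributes (a sketch; the `memory` and `cores` attributes are the proposal here, not an existing API):

```python
import math

def nodes_required(task_memory_kb: int, memory_per_node_kb: int) -> int:
    """Smallest number of nodes whose combined memory covers the task."""
    return math.ceil(task_memory_kb / memory_per_node_kb)

# A task needing 500 GB on nodes with 128 GB each -> 4 nodes.
assert nodes_required(500 * 1024 * 1024, 128 * 1024 * 1024) == 4
```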