Shuffle tasks on hosts with overutilized memory resources #1937
Conversation
9c0b981 to 6e25b84
Overall strategy of 'fix the worst one first' seems fine to me for memory vs. CPU. Left one comment about breaking out of the loop early if we think the current set of removed tasks should satisfy things. Other than that, my only other thought is whether 'most overused' is still the best metric for sorting tasks to shuffle. We've seen some cases where a task using 10% of its CPU still gets shuffled, which seems odd, though I'm not sure of a good way around it either.
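For context, a minimal sketch of the 'fix the worst one first' idea: compare CPU and memory overusage as fractions of the host's reservation and shuffle based on whichever is further over. This is an illustration only; the class and method names here are assumptions, not the PR's actual implementation.

// Illustrative only: pick the resource that is proportionally the most over-reserved.
class MostOverusedResource {
  enum Type { CPU, MEMORY }

  final Type type;
  final double overusage; // e.g. 0.15 means 15% over the reservation

  MostOverusedResource(Type type, double overusage) {
    this.type = type;
    this.overusage = overusage;
  }

  // Compare CPU vs memory overusage as fractions of what was reserved,
  // so the two resources can be ranked on a common scale.
  static MostOverusedResource pick(double cpuUsed, double cpuReserved, long memUsedBytes, long memReservedBytes) {
    double cpuOverusage = (cpuUsed - cpuReserved) / cpuReserved;
    double memOverusage = (memUsedBytes - memReservedBytes) / (double) memReservedBytes;
    return cpuOverusage >= memOverusage
        ? new MostOverusedResource(Type.CPU, cpuOverusage)
        : new MostOverusedResource(Type.MEMORY, memOverusage);
  }
}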
-  private boolean shuffleTasksForOverloadedSlaves = false; // recommended 'true' when oversubscribing cpu for larger clusters
+  private boolean shuffleTasksForOverloadedSlaves = false; // recommended 'true' when oversubscribing resources for larger clusters

+  private double shuffleTasksWhenSlaveMemoryUtilizationPercentageExceeds = 0.82;
was there any math behind this value?
Sort of an arbitrary rule of thumb. Assuming we'd like to set a default "target" memory utilization of 85%, setting this config at 82% should give us enough time to shuffle tasks before actually hitting our target.
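To illustrate the headroom reasoning (a sketch with assumed names, not the PR's actual API; only the 0.82 value comes from the diff above): shuffling starts at 82% utilization so cleanup has about 3% of the host's memory in hand before the 85% target is reached.

// Hypothetical illustration of the 82% vs 85% headroom; names other than the
// configured 0.82 value are assumptions.
class ShuffleThresholdExample {
  static final double SHUFFLE_THRESHOLD = 0.82;   // shuffleTasksWhenSlaveMemoryUtilizationPercentageExceeds
  static final double TARGET_UTILIZATION = 0.85;  // desired steady-state memory utilization

  static boolean shouldStartShuffling(long usedMemoryBytes, long totalMemoryBytes) {
    double utilization = usedMemoryBytes / (double) totalMemoryBytes;
    // Start shuffling at 82% so there is ~3% of host memory as headroom before
    // hitting the 85% target (roughly 3.8 GB on a 128 GB host).
    return utilization > SHUFFLE_THRESHOLD;
  }
}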
     for (TaskIdWithUsage taskIdWithUsage : possibleTasksToShuffle) {
       if (requestsWithShuffledTasks.contains(taskIdWithUsage.getTaskId().getRequestId())) {
         LOG.debug("Request {} already has a shuffling task, skipping", taskIdWithUsage.getTaskId().getRequestId());
         continue;
       }
-      if (cpuOverage <= 0 || shuffledTasksOnSlave > configuration.getMaxTasksToShufflePerHost() || currentShuffleCleanupsTotal >= configuration.getMaxTasksToShuffleTotal()) {
-        LOG.debug("Not shuffling any more tasks (overage: {}, shuffledOnHost: {}, totalShuffleCleanups: {})", cpuOverage, shuffledTasksOnSlave, currentShuffleCleanupsTotal);
+      if ((mostOverusedResource.overusage <= 0) || shuffledTasksOnSlave > configuration.getMaxTasksToShufflePerHost() || currentShuffleCleanupsTotal >= configuration.getMaxTasksToShuffleTotal()) {
cpuOverage <= 0 worked as a condition for 'should this put us back in green?' because we updated it at the end of the loop. I don't currently see mostOverusedResource.overusage getting updated anywhere.
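One possible shape of the bookkeeping the reviewer is pointing at (a simplified, self-contained sketch with assumed names, not the PR's actual fix): decrement the remaining overage at the bottom of each iteration, so the overusage <= 0 check can break out early once the selected tasks should put the host back in the green.

import java.util.List;

// Hypothetical sketch only; TaskUsage is a stand-in for TaskIdWithUsage, and the
// usage value is assumed to be in the same units as the tracked overage.
class ShuffleLoopSketch {
  record TaskUsage(String requestId, double usageOfOverusedResource) {}

  static int selectTasksToShuffle(List<TaskUsage> possibleTasksToShuffle,
                                  double overusage,
                                  int maxTasksToShufflePerHost) {
    int shuffled = 0;
    for (TaskUsage task : possibleTasksToShuffle) {
      // Break early once the tasks already selected should bring the host back
      // under its threshold, or once the per-host cap is reached.
      if (overusage <= 0 || shuffled >= maxTasksToShufflePerHost) {
        break;
      }
      // ... request a cleanup/bounce for `task` here ...
      shuffled++;
      // The missing piece the comment asks about: update the remaining overage at
      // the end of each iteration, as was previously done for cpuOverage.
      overusage -= task.usageOfOverusedResource();
    }
    return shuffled;
  }
}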
🚢
🚢