-
Notifications
You must be signed in to change notification settings - Fork 188
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve Offer Loop Usage of Request Locks #2239
Conversation
…ng for request locks
return start; | ||
} else { | ||
LOG.trace( | ||
"{} - Failed to acquire lock on {} ({})", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can this be a debug or even an info? Feel like we want to be more aware of when locks are timing out
return -1; | ||
} | ||
} catch (InterruptedException e) { | ||
throw new RuntimeException(e); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Quick note that any uncaught exception that bubbles up through the offer scheduling loop will cause singularity to abort/restart. Interrupted seems like we'd already be in the process of shutdown if that happens, but worth a quick trace to see where it would be caught
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a change made to prevent uncaught exceptions for tasks from killing the entire scheduler (#2233), but this method probably should return -1 in this case as well, unable to acquire lock due to interrupt.
…tup with outdated running task
deployManager.createDeployIfNotExists( | ||
updatedRequest, | ||
deployMarker, | ||
validatedDeploy | ||
); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do this before the deploy-already-in-progress check? We would already save this data right afterwards in saveDeploy anyway
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nevermind, I see what this is doing now, pending deploy gets created slightly before the actual deploy object. without it
🚢 |
This PR attempts to reduce the consequences (ie scheduling lag) of attempting to schedule a task whose request lock is locked elsewhere for a long period of time (ie history persisters).
Changes:
Took a very quick/lazy approach to the tryRunWithLock in the offer loop, so please send feedback - should it throw an error, log whether it was able to run, etc.
Considered attempting a larger scale refactor (poll from a single priority queue of pending tasks, remove request locks, etc), but that seemed very likely to introduce new and exciting bugs. The fact that there's separate
offer-scheduler
andSingularitySchedulerPoller
threads that schedule tasks slightly differently (drainPendingQueue) is annoying, but doesn't seem to be the main source of problems.cc - @ssalinas, @pschoenfelder