Improve Offer Loop Usage of Request Locks #2239

WH77 · 2021-10-28T05:05:54Z

This PR attempts to reduce the consequences (i.e. scheduling lag) of attempting to schedule a task whose request lock is held elsewhere for a long period of time (i.e. by history persisters).

Changes:

Actually make offer scoring parallel while waiting on request locks. The offer scoring thread pool used an unbounded queue, so it never grew past its core pool size of 1 and was effectively still single threaded.
Limit the amount of time that the offer loop will wait for a request lock. This is less relevant now that the thread pool can have more than 1 thread, but would still help if the thread pool were saturated with locked requests.
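
For context on the first change, here is a minimal, standalone sketch of the `ThreadPoolExecutor` behavior described above. It is not Singularity code: the class name `PoolGrowthDemo`, the helper `largestPoolSize`, and the pool sizes are made up for illustration. With an unbounded `LinkedBlockingQueue`, extra tasks are queued rather than handed to new threads, so the pool stays at its core size; setting core size equal to max size (a fixed size pool) is one way to get real parallelism.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Hypothetical demo class, not part of Singularity.
public class PoolGrowthDemo {

  // Submits `tasks` blocking tasks and reports how many worker threads the pool ever created.
  static int largestPoolSize(int corePoolSize, int maxPoolSize, int tasks) {
    ThreadPoolExecutor executor = new ThreadPoolExecutor(
      corePoolSize,
      maxPoolSize,
      60L,
      TimeUnit.SECONDS,
      new LinkedBlockingQueue<>() // unbounded: tasks beyond the core size are always queued
    );
    CountDownLatch release = new CountDownLatch(1);
    try {
      for (int i = 0; i < tasks; i++) {
        executor.execute(() -> {
          try {
            release.await(); // stands in for waiting on a request lock held elsewhere
          } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
          }
        });
      }
      return executor.getLargestPoolSize();
    } finally {
      release.countDown(); // unblock the workers so the pool can shut down
      executor.shutdown();
    }
  }

  public static void main(String[] args) {
    // core=1, max=8: the pool never grows past the single core thread
    System.out.println("core=1, max=8 -> " + largestPoolSize(1, 8, 8));
    // core=8, max=8: a fixed size pool starts a thread per task up to 8
    System.out.println("core=8, max=8 -> " + largestPoolSize(8, 8, 8));
  }
}
```

In this sketch the first call reports 1 thread and the second reports 8, which is the difference between the old and new offer scoring behavior as described in this PR.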

Took a very quick/lazy approach to the tryRunWithLock in the offer loop, so please send feedback - should it throw an error, log whether it was able to run, etc.

Considered attempting a larger scale refactor (poll from a single priority queue of pending tasks, remove request locks, etc), but that seemed very likely to introduce new and exciting bugs. The fact that there are separate offer-scheduler and SingularitySchedulerPoller threads that schedule tasks slightly differently (drainPendingQueue) is annoying, but doesn't seem to be the main source of problems.

cc - @ssalinas, @pschoenfelder


ssalinas · 2021-10-28T13:29:38Z

SingularityService/src/main/java/com/hubspot/singularity/mesos/SingularitySchedulerLock.java

+        return start;
+      } else {
+        LOG.trace(
+          "{} - Failed to acquire lock on {} ({})",


Can this be a debug or even an info? Feels like we want to be more aware of when locks are timing out.

ssalinas · 2021-10-28T13:30:34Z

SingularityService/src/main/java/com/hubspot/singularity/mesos/SingularitySchedulerLock.java

+        return -1;
+      }
+    } catch (InterruptedException e) {
+      throw new RuntimeException(e);


Quick note that any uncaught exception that bubbles up through the offer scheduling loop will cause Singularity to abort/restart. If we get interrupted, it seems like we'd already be in the process of shutting down, but it's worth a quick trace to see where it would be caught.

There was a change made to prevent uncaught exceptions for tasks from killing the entire scheduler (#2233), but this method should probably return -1 in this case as well, since it was unable to acquire the lock due to the interrupt.
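
A rough sketch of the pattern being discussed, assuming a plain `ReentrantLock` per request. This is not the actual `SingularitySchedulerLock` code: the class `TimedLockSketch`, both method signatures, and the meaning of the returned `start` value (taken here to be a timestamp) are assumptions for illustration. It shows a timed `tryLock` that returns -1 on timeout or interrupt, and restores the interrupt flag instead of throwing.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical sketch, not the real SingularitySchedulerLock.
public class TimedLockSketch {

  // Returns the start timestamp (ms) if the lock was acquired, or -1 on timeout or interrupt.
  static long tryLock(ReentrantLock lock, long timeout, TimeUnit unit) {
    long start = System.currentTimeMillis();
    try {
      if (lock.tryLock(timeout, unit)) {
        return start;
      }
      return -1; // timed out waiting for the lock
    } catch (InterruptedException e) {
      // Re-flag the thread rather than throwing, so the caller's loop is not killed
      Thread.currentThread().interrupt();
      return -1;
    }
  }

  // Runs the action only if the lock could be acquired in time; reports whether it ran.
  static boolean tryRunWithLock(
    ReentrantLock lock,
    Runnable action,
    long timeout,
    TimeUnit unit
  ) {
    long start = tryLock(lock, timeout, unit);
    if (start < 0) {
      return false;
    }
    try {
      action.run();
      return true;
    } finally {
      lock.unlock();
    }
  }
}
```

Returning a boolean from `tryRunWithLock` is just one option here; it lets the caller decide whether to log or retry when the lock could not be acquired.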


ssalinas · 2021-11-15T17:43:52Z

SingularityService/src/main/java/com/hubspot/singularity/resources/DeployResource.java

+          deployManager.createDeployIfNotExists(
+            updatedRequest,
+            deployMarker,
+            validatedDeploy
+          );


Why do this before the deploy-already-in-progress check? We would already save this data right afterwards in saveDeploy anyway

Never mind, I see what this is doing now: the pending deploy gets created slightly before the actual deploy object. without it

ssalinas · 2021-11-15T20:43:46Z

🚢

use fixed size thread pool for offer scoring + limit time spent waiting for request locks

d750324

ssalinas reviewed Oct 28, 2021


William Hou added 4 commits October 28, 2021 10:49

log failed trylocks for request locks

35495f4

return -1 when interrupted in trylock for request locks

bbe603a

flag thread as interrupted if trylock is interrupted

ac5be79

make return managed thread pool factory fixed size thread pools if given a max size

a3a50e9

WH77 force-pushed the offer-check-lag branch from caaa2d4 to a3a50e9 Compare November 5, 2021 16:46

prevent pending request deletion of new scheduled task deploy on startup with outdated running task

22e0dea

WH77 force-pushed the offer-check-lag branch from e647e07 to 1747910 Compare November 9, 2021 18:27

prevent pending deploys from being created without corresponding saved deploy

78249b3

WH77 force-pushed the offer-check-lag branch from 1747910 to 78249b3 Compare November 9, 2021 18:28

ssalinas reviewed Nov 15, 2021


ssalinas approved these changes Nov 15, 2021


WH77 merged commit df03e6c into master Nov 16, 2021

WH77 deleted the offer-check-lag branch November 16, 2021 16:18

ssalinas added this to the 1.5.0 milestone May 4, 2022
