Reintroduce slot supplier & add many tests #2143

Sushisource · 2024-07-10T22:38:37Z

What was changed

Re-introduce slot supplier framework reverted in #2134

Why?

Bring back feature with fix for leaking local activity slots

Checklist

Closes
How was this tested:
Added many tests
Any docs updates needed?

Sushisource · 2024-07-10T22:39:19Z

temporal-sdk/src/main/java/io/temporal/internal/worker/LocalActivityWorker.java

+            permit =
+                timeLimiter.callWithTimeout(


This is the actual fix. The source of the bug was not assigning the permit when reservation was called behind this callWithTimeout function.
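For illustration only, here is a minimal sketch of the shape of that bug and its fix, using plain java.util.concurrent rather than the SDK's own time limiter. DemoPermit, TimedReservation and reserveWithTimeout are hypothetical names, not SDK types; the point is simply that the value returned from the timed call has to be kept.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

final class PermitAssignmentDemo {
  public static void main(String[] args) throws Exception {
    TimedReservation timed = new TimedReservation();
    try {
      // Buggy shape: the reservation runs, but its result is dropped, so the
      // caller has nothing to hand back on release and the slot leaks:
      //   timed.reserveWithTimeout(DemoPermit::new, 100);
      // Fixed shape: assign the result of the timed call.
      DemoPermit permit = timed.reserveWithTimeout(DemoPermit::new, 100);
      System.out.println("permit assigned: " + (permit != null));
    } finally {
      timed.shutdown();
    }
  }
}

// Hypothetical stand-in for a slot permit (not the SDK's SlotPermit).
final class DemoPermit {}

// Hypothetical helper that runs a reservation with a deadline.
final class TimedReservation {
  private final ExecutorService executor = Executors.newSingleThreadExecutor();

  /** Returns the reserved permit, or null if the reservation did not finish in time. */
  DemoPermit reserveWithTimeout(Callable<DemoPermit> reservation, long timeoutMillis)
      throws Exception {
    Future<DemoPermit> future = executor.submit(reservation);
    try {
      return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
    } catch (TimeoutException e) {
      future.cancel(true); // interrupt the pending reservation
      return null;
    }
  }

  void shutdown() {
    executor.shutdownNow();
  }
}
```

A null return on timeout matches the later review note that the permit can be null when waiting on a permit times out, so callers need a null check before releasing.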

Sushisource · 2024-07-10T22:40:49Z

temporal-sdk/src/test/java/io/temporal/internal/worker/WorkflowSlotsSmallSizeTests.java

+    testWorkflowRule.getTestEnvironment().close();
+    assertEquals(


There is some duplication among these tests in this bit, in asserting on the metrics, etc.

Not sure if it's actually worth it or not to dedupe it, though. Welcome any thoughts on that.

Quinn-With-Two-Ns · 2024-07-11T19:56:54Z

temporal-sdk/src/test/java/io/temporal/internal/worker/WorkflowSlotTests.java

@@ -49,9 +54,16 @@ public class WorkflowSlotTests {
  private final int MAX_CONCURRENT_WORKFLOW_TASK_EXECUTION_SIZE = 100;
  private final int MAX_CONCURRENT_ACTIVITY_EXECUTION_SIZE = 1000;
  private final int MAX_CONCURRENT_LOCAL_ACTIVITY_EXECUTION_SIZE = 10000;
+  private final CountingSlotSupplier<WorkflowSlotInfo> workflowTaskSlotSupplier =


I'm not sure how I feel about this because now you are no longer testing the default worker setting.

It's a fairly thin wrapper and the only reasonable way I had to actually test the slot mechanism directly instead of via metrics without exposing a bunch more stuff publicly.

If you have another suggestion I'm open for sure, but, this was seemingly the best way

Yeah I was thinking you could have a few tests that schedule more activities/workflows than you have slots so you know some had to be freed. Thoughts on a test like that? You can also keep a static count in the activity to cross check with.

Sure I can add something like that
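A rough sketch of the kind of test being suggested, assuming a plain Semaphore as a stand-in for the slot supplier and counters in place of the static count in the activity. Oversubscribe, OversubscribeResult and the numbers used are hypothetical and not taken from the PR's tests.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

final class OversubscribeDemo {
  public static void main(String[] args) throws Exception {
    // 20 tasks compete for 3 slots, so slots must be freed for all to finish.
    OversubscribeResult r = Oversubscribe.run(3, 20);
    System.out.println("completed=" + r.completed + " maxConcurrent=" + r.maxConcurrent);
  }
}

final class OversubscribeResult {
  final int completed;
  final int maxConcurrent;

  OversubscribeResult(int completed, int maxConcurrent) {
    this.completed = completed;
    this.maxConcurrent = maxConcurrent;
  }
}

final class Oversubscribe {
  /** Runs more tasks than slots and reports how many finished and the peak concurrency. */
  static OversubscribeResult run(int slots, int tasks) throws InterruptedException {
    Semaphore permits = new Semaphore(slots);
    AtomicInteger running = new AtomicInteger();
    AtomicInteger maxConcurrent = new AtomicInteger();
    AtomicInteger completed = new AtomicInteger();
    ExecutorService pool = Executors.newFixedThreadPool(tasks);
    for (int i = 0; i < tasks; i++) {
      pool.execute(
          () -> {
            try {
              permits.acquire();
            } catch (InterruptedException e) {
              Thread.currentThread().interrupt();
              return;
            }
            try {
              // Cross-check counter: how many tasks hold a slot right now.
              int now = running.incrementAndGet();
              maxConcurrent.accumulateAndGet(now, Math::max);
              Thread.sleep(5);
              completed.incrementAndGet();
            } catch (InterruptedException e) {
              Thread.currentThread().interrupt();
            } finally {
              running.decrementAndGet();
              permits.release();
            }
          });
    }
    pool.shutdown();
    if (!pool.awaitTermination(10, TimeUnit.SECONDS)) {
      throw new IllegalStateException("tasks did not finish in time");
    }
    return new OversubscribeResult(completed.get(), maxConcurrent.get());
  }
}
```

If every task completes and the peak never exceeds the slot count, slots were necessarily released and reused, which is the property the default worker settings need to uphold.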

Quinn-With-Two-Ns · 2024-07-17T22:16:15Z

temporal-sdk/src/main/java/io/temporal/worker/MetricsType.java

@@ -135,6 +135,8 @@ private MetricsType() {}
  // gauge
  public static final String WORKER_TASK_SLOTS_AVAILABLE =
      TEMPORAL_METRICS_PREFIX + "worker_task_slots_available";
+  public static final String WORKER_TASK_SLOTS_USED =


Nit: these are public, which is weird but it is what it is, so should be marked experimental.

Quinn-With-Two-Ns · 2024-07-17T22:32:39Z

temporal-sdk/src/main/java/io/temporal/internal/worker/LocalActivityWorker.java

@@ -416,14 +437,13 @@ private AttemptTaskHandlerImpl(ActivityTaskHandler handler) {

    @Override
    public void handle(LocalActivityAttemptTask attemptTask) throws Exception {


I would need to double check the Java SDK, but is handle here called per attempt of a local activity? Is there a risk we markSlotUsed multiple times for the same local activity?

Good call. Added a test & fixed.
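One way such a guard could look, as a sketch only: track used permits by identity so that marking the same permit again is a no-op. UsedSlotTracker is a hypothetical class, not the SDK's TrackingSlotSupplier, and the actual fix in the PR may differ.

```java
import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.Set;

final class MarkUsedDemo {
  public static void main(String[] args) {
    UsedSlotTracker tracker = new UsedSlotTracker();
    Object permit = new Object();
    tracker.markUsed(permit); // first attempt of the local activity
    tracker.markUsed(permit); // second attempt, same permit
    System.out.println("used slots: " + tracker.usedCount());
  }
}

// Hypothetical tracker; permits are compared by identity, not equals().
final class UsedSlotTracker {
  private final Set<Object> used = Collections.newSetFromMap(new IdentityHashMap<>());

  /** Returns true only the first time a given permit is marked used. */
  synchronized boolean markUsed(Object permit) {
    if (permit == null) {
      throw new IllegalArgumentException("permit must be set before marking used");
    }
    return used.add(permit);
  }

  synchronized void release(Object permit) {
    used.remove(permit);
  }

  synchronized int usedCount() {
    return used.size();
  }
}
```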

Looking at the original code, it seems that previously the local activity slot was not counted as "used" while the local activity was backing off. Now I think the slot will be considered "used" whether or not the activity is executing. Do you think I am correct? If so, I think that is fine since the slot is not available.

Actually, looking at the code more, it appears there is a more significant change here. Before, local activity concurrency was limited by the number of threads in the thread pool, and more than the max number of local activity task slots could be scheduled. Now we are limiting the number of local activities being scheduled to the number of local activity slots. That could have implications for users who use local activities for certain use cases like polling.

Discussed on call:

Will keep the ratelimiter which is 2x max size, or for implementations w/o max size, 2x number of currently running activities (only to be triggered if tryReserve fails).

Release slots for LAs which are backing off, prioritize retries and re-acquire then
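The second point can be sketched roughly as follows, assuming a Semaphore in place of the slot supplier and a fixed backoff. BackoffRetry and its Attempt interface are hypothetical, and retry prioritization and the 2x rate limiter are deliberately left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Semaphore;

final class BackoffRetryDemo {
  public static void main(String[] args) throws InterruptedException {
    Semaphore slots = new Semaphore(1);
    List<Integer> freeDuringBackoff = new ArrayList<>();
    // The attempt fails twice and then succeeds on the third try.
    int attempts = BackoffRetry.runWithRetries(slots, n -> n == 3, 5, 10, freeDuringBackoff);
    System.out.println("attempts=" + attempts + " freeDuringBackoff=" + freeDuringBackoff);
  }
}

final class BackoffRetry {
  interface Attempt {
    boolean run(int attemptNumber);
  }

  /**
   * Runs the attempt until it succeeds or maxAttempts is reached. The slot is held only while an
   * attempt executes; it is released before backing off and re-acquired for the next attempt.
   * Returns the number of the successful attempt, or -1 if every attempt failed.
   */
  static int runWithRetries(
      Semaphore slots,
      Attempt attempt,
      int maxAttempts,
      long backoffMillis,
      List<Integer> freeDuringBackoff)
      throws InterruptedException {
    for (int n = 1; n <= maxAttempts; n++) {
      slots.acquire();
      boolean succeeded;
      try {
        succeeded = attempt.run(n);
      } finally {
        slots.release(); // released whether the attempt passed or failed
      }
      if (succeeded) {
        return n;
      }
      if (n < maxAttempts) {
        // Record how many slots are free while this activity is backing off.
        freeDuringBackoff.add(slots.availablePermits());
        Thread.sleep(backoffMillis);
      }
    }
    return -1;
  }
}
```

With a single slot, the recorded values show the slot is free during each backoff, so another local activity could run in that window.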

Quinn-With-Two-Ns · 2024-07-18T15:29:09Z

temporal-sdk/src/main/java/io/temporal/internal/worker/LocalActivityExecutionContext.java

@@ -154,6 +161,14 @@ public boolean callback(LocalActivityResult result) {
    if (scheduleToCloseFuture != null) {
      scheduleToCloseFuture.cancel(false);
    }
+    SlotReleaseReason reason = SlotReleaseReason.taskComplete();


Is it possible that the slot was never used?

No - it's used in the sense that we tried to run the LA even if it timed out, which counts as taskComplete in this case. We might need finer-grained reasons at some point if people ask for them

Sushisource · 2024-07-18T22:32:59Z

temporal-sdk/src/main/java/io/temporal/internal/worker/LocalActivitySlotSupplierQueue.java

+      TrackingSlotSupplier<LocalActivitySlotInfo> slotSupplier,
+      Functions.Proc1<LocalActivityAttemptTask> afterReservedCallback) {
+    this.afterReservedCallback = afterReservedCallback;
+    // TODO: See if I can adjust this for dynamic ones based on current rate


This is kinda the last thing. I don't think there's really much sensible that can be done here, but maybe we want to make it an option?

Yeah I would open an issue to revisit the backoff logic. My main goal was to just maintain the current behavior for fixed slot suppliers without some more thought.

Quinn-With-Two-Ns · 2024-07-19T16:48:45Z

temporal-sdk/src/main/java/io/temporal/internal/worker/LocalActivityExecutionContext.java

+    }
+    // Permit can be null in the event of a timeout while waiting on a permit
+    if (permit != null) {
+      slotSupplier.releaseSlot(reason, permit);


Didn't we say we would move the release out of here?

Yes, good point. And move it instead to handle, which is more like the normal task types?

I guess the main issue is that this gets referenced from like a bajillion places, so it's not obvious that centralizing it is going to catch all those... but I can see what happens.

It also looks like you release the slot again in handleResult? Maybe it is being released twice. handle is the right spot for the slot acquire/release IMO; currently the SDK takes and releases slots after handle.

At a high level the slot acquire/release should be tied to the processing of the task, not the reporting of the result.

Quinn-With-Two-Ns · 2024-07-19T16:52:50Z

temporal-sdk/src/main/java/io/temporal/internal/worker/LocalActivityWorker.java

@@ -584,6 +554,8 @@ private void handleResult(
              executionContext, activityTask, activityHandlerResult.getTaskFailed().getFailure());

      if (retryDecision.doNextAttempt()) {
+        // Release slot before scheduling the next attempt
+        slotSupplier.releaseSlot(SlotReleaseReason.willRetry(), executionContext.getPermit());


Shouldn't you pass the failure here like you do for normal activities?

Normal activities don't actually set an error - it's more like when processing failed rather than the task itself failed that the error variant is used. So I think this is consistent.

Quinn-With-Two-Ns · 2024-07-22T18:35:04Z

temporal-sdk/src/main/java/io/temporal/internal/worker/LocalActivityWorker.java

+        // where scheduleToStart is already fired, but didn't report a completion yet.
+        boolean shouldDiscardTheAttempt = scheduleToStartFired || executionContext.isCompleted();
+        if (shouldDiscardTheAttempt) {
+          return;


Here shouldn't the task slot release reason be not used?

Not quite. This is a timeout, and the timeouts right now are also taskComplete. Not used is right now just "didn't even get a task"
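To summarize how the release reasons are described across these threads, here is a small sketch. DemoReleaseReason and ReleaseReasons.choose are hypothetical and only mirror the semantics stated in the replies (timeouts count as task complete, "not used" means no task was ever received, and the error variant is for processing failures rather than task failures); they are not the SDK's SlotReleaseReason API.

```java
final class ReleaseReasonDemo {
  public static void main(String[] args) {
    // A local activity that timed out still "got a task", so it is TASK_COMPLETE.
    System.out.println(ReleaseReasons.choose(true, false, false));
    // A failed attempt that will be retried releases with WILL_RETRY.
    System.out.println(ReleaseReasons.choose(true, false, true));
  }
}

// Hypothetical mirror of the reasons discussed in the review.
enum DemoReleaseReason {
  TASK_COMPLETE,
  WILL_RETRY,
  NEVER_USED,
  ERROR
}

final class ReleaseReasons {
  /**
   * @param gotTask whether a task was ever received for the reserved slot
   * @param processingThrew whether the handler itself failed while processing
   * @param willRetry whether another attempt is about to be scheduled
   */
  static DemoReleaseReason choose(boolean gotTask, boolean processingThrew, boolean willRetry) {
    if (!gotTask) {
      return DemoReleaseReason.NEVER_USED;
    }
    if (processingThrew) {
      return DemoReleaseReason.ERROR;
    }
    if (willRetry) {
      return DemoReleaseReason.WILL_RETRY;
    }
    // Covers success, task failure without retry, and timeouts.
    return DemoReleaseReason.TASK_COMPLETE;
  }
}
```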

Quinn-With-Two-Ns · 2024-07-22T18:49:47Z

Did we add a test that shows retrying local activities do not hold the slot while in a back-off stage?

Quinn-With-Two-Ns · 2024-07-22T19:00:00Z

temporal-sdk/src/main/java/io/temporal/internal/worker/TrackingSlotSupplier.java

+    publishSlotsMetric();
+  }
+
+  public void releaseSlot(SlotReleaseReason reason, SlotPermit permit) {


What happens on a double release? I think right now it will silently pass?

It will. There's not a great way to avoid that that I see, at least not without making some kind of publicly accessible isReleased field on the permit, since that's in worker rather than internal. Could be done though. Alternatively it can just be tested for with CountingSlotSupplier in the tests which is in essence doing it by making sure the acquire/releases always line up.

Yeah I was thinking it was easier to check than maybe it actually is since it probably comes down to the route slot supplier. Maybe just testing CountingSlotSupplier is enough
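As a sketch of the test-side approach discussed here, a counting wrapper can catch a double release by remembering which permits are outstanding. CountingSlots below is a hypothetical, simplified class; the PR's CountingSlotSupplier may be implemented differently.

```java
import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.Set;

final class DoubleReleaseDemo {
  public static void main(String[] args) {
    CountingSlots slots = new CountingSlots();
    Object permit = slots.reserve();
    slots.release(permit);
    try {
      slots.release(permit); // second release of the same permit
    } catch (IllegalStateException e) {
      System.out.println("caught: " + e.getMessage());
    }
  }
}

// Hypothetical test helper; permits are plain objects compared by identity.
final class CountingSlots {
  private final Set<Object> outstanding = Collections.newSetFromMap(new IdentityHashMap<>());
  private int reserved;
  private int released;

  synchronized Object reserve() {
    Object permit = new Object();
    outstanding.add(permit);
    reserved++;
    return permit;
  }

  /** Fails loudly if the permit was already released or was never reserved here. */
  synchronized void release(Object permit) {
    if (!outstanding.remove(permit)) {
      throw new IllegalStateException("permit released twice or never reserved");
    }
    released++;
  }

  synchronized int reservedCount() {
    return reserved;
  }

  synchronized int releasedCount() {
    return released;
  }

  synchronized int outstandingCount() {
    return outstanding.size();
  }
}
```

A test can then assert at shutdown that the reserved and released counts match and that nothing is outstanding, which is the "acquires and releases line up" check mentioned above.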

Sushisource · 2024-07-22T20:48:32Z

Did we add a test that shows retrying local activities do not hold the slot while in a back-off stage?

This is implicitly covered in TestLocalActivityFailsThenPasses since the acquires & releases have to line up to the number that would be expected with retrying, but I can try to add something more explicit. (I added a more explicit release count check to that test)

Quinn-With-Two-Ns

LGTM, I know @cretz reviewed the original, may want to double check if he has any thoughts, not required though.

Sushisource · 2024-07-23T22:09:51Z

Chad mentioned he's good w/ your review

Sushisource requested a review from a team as a code owner July 10, 2024 22:38

Sushisource force-pushed the reintroduce-slot-supplier branch from 9d52c23 to 747dce7 Compare July 17, 2024 22:10

Sushisource force-pushed the reintroduce-slot-supplier branch from 747dce7 to 765dfca Compare July 17, 2024 23:57

Sushisource and others added 10 commits July 19, 2024 14:04

Revert "Revert configurable slot provider (temporalio#2134)"

b45d063

This reverts commit 46b239d.

Add tests for worker slots

61a6523

Fix new tests by publishing metrics on start & setting permit properly when accept timeout exists

df96e8a

Add tests for small counts / hitting failed slot acquisition

50c0ac5

More tests

11cbb4b

Add counting slot supplier for tests

f06f231

License headers

39b419f

Ensure markUsed is called on eager activities too

cc81648

Avoid having to set metric scope on tracking supplier after construction

198d23e

Don't emit available metric unless max slots is sensible

d0006b1

Sushisource added 16 commits July 19, 2024 14:04

Short WFT timeout was unnecessary & could cause opposite problem of making test take too long

b953a32

Provide correct release reason in case of handler errors

5a1b47c

Fix flaky hang in test

26fdcb4

Add used metric

5064110

Add local activities to resource tuner test

d9e785d

Throw if permit is not set when marking used

d1ceb94

Add timeouts to interface

e411584

Change exception type

c1a9481

Fix problem with schedule-to-start LA timeouts

b8f79f1

Verify slots aren't exceeded without custom tuner

bd58c11

Change interface to overload tryReserveSlot

d9d6a59

Deal with possibility of calling markUsed multiple times

f97b2da

More closely mimic old behavior w/ backpressure

60f4ab0

Change double-mark-used check / license header

1af5890

add experimental tag to new metric

412b196

Move releasing to handle

061d066

Sushisource force-pushed the reintroduce-slot-supplier branch from 3cd6cd6 to 061d066 Compare July 19, 2024 21:04

A few last bits of review feedback

93ed3fd

Sushisource force-pushed the reintroduce-slot-supplier branch from 2b1f325 to ceb3455 Compare July 22, 2024 22:40

Make sure first poll doesn't go through until after initial slot check

0f8e0b6

Sushisource force-pushed the reintroduce-slot-supplier branch from ceb3455 to 0f8e0b6 Compare July 22, 2024 22:52

Quinn-With-Two-Ns approved these changes Jul 23, 2024

Sushisource merged commit b95322f into temporalio:master Jul 23, 2024
8 checks passed

Sushisource deleted the reintroduce-slot-supplier branch July 23, 2024 22:10
