Handle overloaded agent's heartbeat timeout #724

Andyz26 · 2024-11-07T00:37:57Z

Context

We found a case where agents with shutdown workers got leaked:

A job with multiple workers is being shut down.
Some workers receive the shutdown signal and terminate. Others fail to receive it (timeout, etc.) and stay alive. All upstream traffic is then redistributed to the surviving workers, which become very busy (CPU and network usage peak).
Retried kill signals cannot reach these workers while their host agents' heartbeat messages to the control plane are timing out, so the workers are leaked.

Improvements:

Increase the thread priority of the heartbeat connection client.
Increase the heartbeat connection timeout settings.
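
For illustration, raising the priority of the heartbeat client's thread could look roughly like the sketch below. This is not the code from this PR: `HeartbeatThreads`, `highPriorityFactory` and `newHeartbeatExecutor` are hypothetical names, and the sketch only assumes that the heartbeat client can be given its own executor.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical helper, not taken from the Mantis codebase.
final class HeartbeatThreads {
    private static final AtomicInteger COUNTER = new AtomicInteger();

    private HeartbeatThreads() {}

    // Creates daemon threads that request the highest priority the JVM allows.
    static ThreadFactory highPriorityFactory() {
        return runnable -> {
            Thread t = new Thread(runnable, "heartbeat-client-" + COUNTER.incrementAndGet());
            t.setDaemon(true);
            t.setPriority(Thread.MAX_PRIORITY);
            return t;
        };
    }

    // Dedicated single-thread executor so heartbeats do not queue behind job traffic.
    static ExecutorService newHeartbeatExecutor() {
        return Executors.newSingleThreadExecutor(highPriorityFactory());
    }
}
```

Thread priority is only a hint to the scheduler and its effect is platform dependent, which is one reason to pair it with the longer timeouts.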

Checklist

./gradlew build compiles code correctly
Added new tests where applicable
./gradlew test passes all tests
Extended README or added javadocs where applicable

sundargates · 2024-11-07T00:40:12Z

...plane/mantis-control-plane-core/src/main/java/io/mantisrx/server/core/CoreConfiguration.java

-    @Default("10000")
+    @Default("90000")
    int getAsyncHttpClientConnectionTimeoutMs();

    @Config("mantis.asyncHttpClient.requestTimeoutMs")
-    @Default("10000")
+    @Default("90000")
    int getAsyncHttpClientRequestTimeoutMs();

    @Config("mantis.asyncHttpClient.readTimeoutMs")
-    @Default("10000")
+    @Default("90000")


Should we change the defaults for everyone? Can we just change this in our codebase?

See the other comment.
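
To make the alternative in the question concrete, here is a minimal sketch of overriding a timeout per deployment while leaving the shared default alone. It uses plain `java.util.Properties` rather than the project's `@Config`/`@Default` annotations, and `TimeoutOverrides` and `resolveTimeoutMs` are hypothetical names. Only the property key comes from the diff above.

```java
import java.util.Properties;

// Hypothetical helper, not part of the Mantis codebase.
final class TimeoutOverrides {
    // Property key as it appears in the diff.
    static final String REQUEST_TIMEOUT_KEY = "mantis.asyncHttpClient.requestTimeoutMs";

    private TimeoutOverrides() {}

    // Returns the deployment-specific value if one is set, else the shared default.
    static int resolveTimeoutMs(Properties overrides, String key, int defaultMs) {
        String raw = overrides.getProperty(key);
        if (raw == null || raw.trim().isEmpty()) {
            return defaultMs;
        }
        int value = Integer.parseInt(raw.trim());
        if (value <= 0) {
            throw new IllegalArgumentException(key + " must be positive, got " + value);
        }
        return value;
    }
}
```

With this shape, a deployment that sets the key to `90000` gets the longer timeout, and everyone else keeps whatever default ships with the library.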

sundargates · 2024-11-07T00:40:25Z

mantis-runtime-loader/src/main/java/io/mantisrx/runtime/loader/config/WorkerConfiguration.java

-    @Default("5000")
+    @Default("90000")


It feels strange that we send heartbeats every 10 seconds by default but wait for 90 seconds for those heartbeats to be acknowledged. I think if you want to change this, you should also change the interval.

In the happy case, where the request completes right away, nothing changes. I think the semantics of a heartbeat every 10 seconds are good, but we want to wait longer before aborting a request, to avoid the leak here when both the client and the control plane are actually healthy (in this case, the request can complete if the connection waits long enough).
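
The distinction between the send interval and the per-request timeout can be sketched as follows. This is a simplified illustration, not the actual heartbeat client: `HeartbeatLoop`, `sendWithTimeout`, `start` and the `sender` supplier are hypothetical, and the sketch assumes the heartbeat request is exposed as a `CompletableFuture`.

```java
import java.time.Duration;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

// Hypothetical sketch, not the Mantis heartbeat client.
final class HeartbeatLoop {
    private HeartbeatLoop() {}

    // Starts one heartbeat and fails it only once requestTimeout has elapsed.
    static CompletableFuture<String> sendWithTimeout(
            Supplier<CompletableFuture<String>> sender, Duration requestTimeout) {
        return sender.get().orTimeout(requestTimeout.toMillis(), TimeUnit.MILLISECONDS);
    }

    // Sends a heartbeat every interval, independently of how long each request may take.
    static ScheduledFuture<?> start(
            ScheduledExecutorService scheduler,
            Supplier<CompletableFuture<String>> sender,
            Duration interval,
            Duration requestTimeout) {
        return scheduler.scheduleAtFixedRate(() -> {
            try {
                sendWithTimeout(sender, requestTimeout)
                        .exceptionally(err -> "heartbeat failed: " + err);
            } catch (RuntimeException e) {
                // Swallowed so that one failed send does not cancel the schedule.
            }
        }, 0, interval.toMillis(), TimeUnit.MILLISECONDS);
    }
}
```

Calling `start(scheduler, sender, Duration.ofSeconds(10), Duration.ofSeconds(90))` keeps the 10-second cadence while letting a slow request on a busy host finish instead of being aborted early. The cost is that several heartbeats can be in flight at once when responses are slow.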

github-actions · 2024-11-07T00:44:06Z

Test Results

615 tests ±0 | 605 ✅ ±0 | 8m 9s ⏱️ +6s
142 suites ±0 | 10 💤 ±0 |
142 files ±0 | 0 ❌ ±0 |

Results for commit 418a0d0. ± Comparison against base commit f9d3a5c.

♻️ This comment has been updated with latest results.

handle busy agent hb timeout

418a0d0

Andyz26 requested review from calvin681, sundargates, hmitnflx and fdc-ntflx as code owners November 7, 2024 00:37

Andyz26 had a problem deploying to Integrate Pull Request November 7, 2024 00:38 — with GitHub Actions Failure

sundargates approved these changes Nov 7, 2024

Andyz26 merged commit a5874b2 into master Nov 7, 2024
4 of 5 checks passed

Andyz26 deleted the andyz/teAgentHBTimeoutImprovement branch November 7, 2024 19:43
