Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Handle overloaded agent's heartbeat timeout #724

Merged
merged 1 commit into from
Nov 7, 2024

Conversation

Andyz26
Copy link
Collaborator

@Andyz26 Andyz26 commented Nov 7, 2024

Context

We found a case where agents with shutdown workers got leaked:

  • A job with multiple workers is being shut down.
  • Some workers receive the shutdown signals and terminate the running workers. However, some workers fail to receive the signals (timeout, etc.) and continue to be alive. Now, all upstream traffic is re-distributed to these workers, and they become very busy (peaked CPU/network usage).
  • Retried kill signals cannot reach these workers while the worker's host agent's heartbeat message to the control plane timed out, causing them to be leaked.

Improvements:

  • increase heartbeat connection client's thread priority
  • increase heartbeat connection timeout settings.

Checklist

  • ./gradlew build compiles code correctly
  • Added new tests where applicable
  • ./gradlew test passes all tests
  • Extended README or added javadocs where applicable

Comment on lines -70 to +78
@Default("10000")
@Default("90000")
int getAsyncHttpClientConnectionTimeoutMs();

@Config("mantis.asyncHttpClient.requestTimeoutMs")
@Default("10000")
@Default("90000")
int getAsyncHttpClientRequestTimeoutMs();

@Config("mantis.asyncHttpClient.readTimeoutMs")
@Default("10000")
@Default("90000")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we change the defaults for everyone? Can we just change this in our codebase?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see the other comment.

@Default("5000")
@Default("90000")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It feels strange that we send heartbeats every 10 seconds by default but wait for 90 seconds for those heartbeats to be acknowledged. I think if you want to change this, you should also change the interval.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In a happy case where the request is completed right away nothing changes. I think the semantics of hb every 10 seconds is good, but we want to wait longer before aborting the request to avoid the leak here when both client and the control-plane are actually healthy (in this case, the request can be completed if the connection waits long enough).

Copy link

github-actions bot commented Nov 7, 2024

Test Results

615 tests  ±0   605 ✅ ±0   8m 9s ⏱️ +6s
142 suites ±0    10 💤 ±0 
142 files   ±0     0 ❌ ±0 

Results for commit 418a0d0. ± Comparison against base commit f9d3a5c.

♻️ This comment has been updated with latest results.

@Andyz26 Andyz26 merged commit a5874b2 into master Nov 7, 2024
4 of 5 checks passed
@Andyz26 Andyz26 deleted the andyz/teAgentHBTimeoutImprovement branch November 7, 2024 19:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants