-
Notifications
You must be signed in to change notification settings - Fork 204
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Handle overloaded agent's heartbeat timeout #724
Conversation
@Default("10000") | ||
@Default("90000") | ||
int getAsyncHttpClientConnectionTimeoutMs(); | ||
|
||
@Config("mantis.asyncHttpClient.requestTimeoutMs") | ||
@Default("10000") | ||
@Default("90000") | ||
int getAsyncHttpClientRequestTimeoutMs(); | ||
|
||
@Config("mantis.asyncHttpClient.readTimeoutMs") | ||
@Default("10000") | ||
@Default("90000") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should we change the defaults for everyone? Can we just change this in our codebase?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
see the other comment.
@Default("5000") | ||
@Default("90000") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It feels strange that we send heartbeats every 10 seconds by default but wait for 90 seconds for those heartbeats to be acknowledged. I think if you want to change this, you should also change the interval.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In a happy case where the request is completed right away nothing changes. I think the semantics of hb every 10 seconds is good, but we want to wait longer before aborting the request to avoid the leak here when both client and the control-plane are actually healthy (in this case, the request can be completed if the connection waits long enough).
Context
We found a case where agents with shutdown workers got leaked:
Improvements:
Checklist
./gradlew build
compiles code correctly./gradlew test
passes all tests