-
Notifications
You must be signed in to change notification settings - Fork 2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
RPC retries from client can alter blocking time for Node.GetClientAllocs
until client restart
#25033
Labels
Comments
Looks like we're hitting this 😂 journalctl --unit nomad | grep 'received stale allocation information' | awk '{ print $1 " " $2 " " $3 }'
...
Feb 02 20:00:27
Feb 02 20:00:56
Feb 02 20:01:27
Feb 02 20:01:57
Feb 02 20:02:28
Feb 02 20:02:59
Feb 02 20:03:30
Feb 02 20:04:02
Feb 02 20:04:32
Feb 02 20:05:48 |
I think I've got a fix, just need to write up a test for it. Edit: #25039 |
tgross
added a commit
that referenced
this issue
Feb 6, 2025
When a blocking query on the client hits a retryable error, we change the max query time so that it falls within the `RPCHoldTimeout` timeout. But when the retry succeeds we don't reset it to the original value. Because the calls to `Node.GetClientAllocs` reuse the same request struct instead of reallocating it, any retry will cause the agent to poll at a faster frequency until the agent restarts. No other current RPC on the client has this behavior, but we'll fix this in the `rpc` method rather than in the caller so that any future users of the `rpc` method don't have to remember this detail. Fixes: #25033
6 tasks
tgross
added a commit
that referenced
this issue
Feb 6, 2025
When a blocking query on the client hits a retryable error, we change the max query time so that it falls within the `RPCHoldTimeout` timeout. But when the retry succeeds we don't reset it to the original value. Because the calls to `Node.GetClientAllocs` reuse the same request struct instead of reallocating it, any retry will cause the agent to poll at a faster frequency until the agent restarts. No other current RPC on the client has this behavior, but we'll fix this in the `rpc` method rather than in the caller so that any future users of the `rpc` method don't have to remember this detail. Fixes: #25033
6 tasks
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
While debugging the issue described in hashicorp/yamux#143 I encountered a seemingly unrelated behavior where blocking queries from the client had their blocking time altered for retries during network connectivity problems, but the blocking time was not reset. Because the request object for the client's
Node.GetClientAllocs
RPC is reused repeatedly, this caused the client to making blocking queries every 26s instead of every 5min:I suspect the problem is in
client/rpc.go#L145-L155
. If we don't reset theMaxQueryTime
after the request is successful, the request defined inclient/client.go#L2307-L2322
uses that value forever. I haven't yet had an opportunity to verify this, but wanted to write this all down before I switch tasks. 😀The text was updated successfully, but these errors were encountered: