
RPC retries from client can alter blocking time for Node.GetClientAllocs until client restart #25033

Closed
tgross opened this issue Feb 5, 2025 · 2 comments · Fixed by #25039

Comments

@tgross
Member

tgross commented Feb 5, 2025

While debugging the issue described in hashicorp/yamux#143 I encountered a seemingly unrelated behavior where blocking queries from the client had their blocking time shortened for retries during network connectivity problems, but the original blocking time was never restored. Because the request object for the client's Node.GetClientAllocs RPC is reused repeatedly, this caused the client to make blocking queries every 26s instead of every 5min:

2025-02-04T14:09:32.785-0500 [DEBUG] client: received stale allocation information; retrying: index=1 min_index=1
2025-02-04T14:09:58.116-0500 [DEBUG] client: received stale allocation information; retrying: index=1 min_index=1
2025-02-04T14:10:24.026-0500 [DEBUG] client: received stale allocation information; retrying: index=1 min_index=1

I suspect the problem is in client/rpc.go#L145-L155. If we don't reset the MaxQueryTime after the request is successful, the request defined in client/client.go#L2307-L2322 uses that value forever. I haven't yet had an opportunity to verify this, but wanted to write this all down before I switch tasks. 😀
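To make the suspected mechanism concrete, here is a minimal standalone sketch (not Nomad's actual code; the `QueryOptions` type, its fields, and the `rpcWithRetry` helper are illustrative assumptions) showing how a retry that shrinks the request's blocking time leaks into every later poll when the caller reuses the same request struct:

```go
// Standalone sketch of the suspected bug; QueryOptions, rpcWithRetry, and the
// field names are illustrative, not Nomad's real types.
package main

import (
	"fmt"
	"time"
)

type QueryOptions struct {
	MinQueryIndex uint64
	MaxQueryTime  time.Duration
}

// rpcWithRetry mimics the retry branch: on a retryable error it shrinks
// MaxQueryTime so the retried call finishes by the deadline, but nothing ever
// puts the original value back.
func rpcWithRetry(req *QueryOptions, deadline time.Time, retryable bool) {
	if retryable {
		remaining := time.Until(deadline)
		if remaining < 0 {
			remaining = 0
		}
		req.MaxQueryTime = remaining // mutated in place, never restored
	}
	// ... issue the RPC with req ...
}

func main() {
	// The caller builds the request once and reuses it for every blocking poll.
	req := &QueryOptions{MinQueryIndex: 1, MaxQueryTime: 5 * time.Minute}

	// A single network hiccup hits the retry branch with ~26s left on the clock.
	rpcWithRetry(req, time.Now().Add(26*time.Second), true)

	// Every subsequent poll now blocks for ~26s instead of 5 minutes.
	fmt.Println("next poll blocks for:", req.MaxQueryTime.Round(time.Second))
}
```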

@lattwood
Contributor

lattwood commented Feb 5, 2025

Looks like we're hitting this 😂

journalctl --unit nomad | grep 'received stale allocation information' | awk '{ print $1 " " $2 " " $3 }'
...
Feb 02 20:00:27
Feb 02 20:00:56
Feb 02 20:01:27
Feb 02 20:01:57
Feb 02 20:02:28
Feb 02 20:02:59
Feb 02 20:03:30
Feb 02 20:04:02
Feb 02 20:04:32
Feb 02 20:05:48

@jrasell jrasell moved this from Needs Triage to Needs Roadmapping in Nomad - Community Issues Triage Feb 6, 2025
@tgross
Member Author

tgross commented Feb 6, 2025

I think I've got a fix, just need to write up a test for it.

Edit: #25039

@tgross tgross self-assigned this Feb 6, 2025
@tgross tgross moved this from Needs Roadmapping to In Progress in Nomad - Community Issues Triage Feb 6, 2025
tgross added a commit that referenced this issue Feb 6, 2025
When a blocking query on the client hits a retryable error, we change the max
query time so that it falls within the `RPCHoldTimeout` timeout. But when the
retry succeeds we don't reset it to the original value.

Because the calls to `Node.GetClientAllocs` reuse the same request struct
instead of reallocating it, any retry will cause the agent to poll at a faster
frequency until the agent restarts. No other current RPC on the client has this
behavior, but we'll fix this in the `rpc` method rather than in the caller so
that any future users of the `rpc` method don't have to remember this detail.

Fixes: #25033
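As a rough illustration of the approach the commit message describes (restoring the caller's blocking time inside the `rpc` helper rather than in each caller), here is a hedged standalone sketch; the `QueryOptions` type, `doRPC`, and `isRetryable` are illustrative stubs and not Nomad's actual implementation:

```go
// Standalone sketch; QueryOptions, doRPC, and isRetryable are illustrative
// stubs, not Nomad's real code.
package main

import (
	"fmt"
	"time"
)

type QueryOptions struct {
	MinQueryIndex uint64
	MaxQueryTime  time.Duration
}

func rpc(req *QueryOptions, deadline time.Time) error {
	// Remember the caller's blocking time and put it back when we return, no
	// matter how many retries temporarily shrink it below.
	original := req.MaxQueryTime
	defer func() { req.MaxQueryTime = original }()

	for {
		err := doRPC(req)
		if err == nil || !isRetryable(err) || time.Now().After(deadline) {
			return err
		}
		// Shrink the blocking time so the retried call still finishes by the
		// deadline; the deferred restore undoes this for the caller.
		if remaining := time.Until(deadline); remaining > 0 {
			req.MaxQueryTime = remaining
		} else {
			req.MaxQueryTime = 0
		}
	}
}

// Stubs so the sketch compiles on its own.
func doRPC(req *QueryOptions) error { return nil }
func isRetryable(err error) bool    { return false }

func main() {
	req := &QueryOptions{MinQueryIndex: 1, MaxQueryTime: 5 * time.Minute}
	_ = rpc(req, time.Now().Add(5*time.Second))
	fmt.Println("blocking time after rpc returns:", req.MaxQueryTime) // still 5m0s
}
```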
@tgross tgross closed this as completed in 5d09d7a Feb 7, 2025
@github-project-automation github-project-automation bot moved this from In Progress to Done in Nomad - Community Issues Triage Feb 7, 2025