
RPC retries from client can alter blocking time for Node.GetClientAllocs until client restart #25033

Closed
tgross opened this issue Feb 5, 2025 · 2 comments · Fixed by #25039

Comments

@tgross
Member

tgross commented Feb 5, 2025

While debugging the issue described in hashicorp/yamux#143 I encountered a seemingly unrelated behavior where blocking queries from the client had their blocking time shortened for retries during network connectivity problems, but the original blocking time was never restored. Because the request object for the client's Node.GetClientAllocs RPC is reused repeatedly, this caused the client to make blocking queries every 26s instead of every 5min:

2025-02-04T14:09:32.785-0500 [DEBUG] client: received stale allocation information; retrying: index=1 min_index=1
2025-02-04T14:09:58.116-0500 [DEBUG] client: received stale allocation information; retrying: index=1 min_index=1
2025-02-04T14:10:24.026-0500 [DEBUG] client: received stale allocation information; retrying: index=1 min_index=1

I suspect the problem is in client/rpc.go#L145-L155. If we don't reset the MaxQueryTime after the request is successful, the request defined in client/client.go#L2307-L2322 uses that value forever. I haven't yet had an opportunity to verify this, but wanted to write this all down before I switch tasks. 😀
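To make the suspected mechanism concrete, here is a minimal standalone sketch (not Nomad's actual code; the `QueryOptions` type, its fields, and the `rpcWithRetry` helper are illustrative assumptions) showing how a retry that shrinks the request's blocking time leaks into every later poll when the caller reuses the same request struct:

```go
// Standalone sketch of the suspected bug; QueryOptions, rpcWithRetry, and the
// field names are illustrative, not Nomad's real types.
package main

import (
	"fmt"
	"time"
)

type QueryOptions struct {
	MinQueryIndex uint64
	MaxQueryTime  time.Duration
}

// rpcWithRetry mimics the retry branch: on a retryable error it shrinks
// MaxQueryTime so the retried call finishes by the deadline, but nothing ever
// puts the original value back.
func rpcWithRetry(req *QueryOptions, deadline time.Time, retryable bool) {
	if retryable {
		remaining := time.Until(deadline)
		if remaining < 0 {
			remaining = 0
		}
		req.MaxQueryTime = remaining // mutated in place, never restored
	}
	// ... issue the RPC with req ...
}

func main() {
	// The caller builds the request once and reuses it for every blocking poll.
	req := &QueryOptions{MinQueryIndex: 1, MaxQueryTime: 5 * time.Minute}

	// A single network hiccup hits the retry branch with ~26s left on the clock.
	rpcWithRetry(req, time.Now().Add(26*time.Second), true)

	// Every subsequent poll now blocks for ~26s instead of 5 minutes.
	fmt.Println("next poll blocks for:", req.MaxQueryTime.Round(time.Second))
}
```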

@lattwood
Contributor

lattwood commented Feb 5, 2025

Looks like we're hitting this 😂

journalctl --unit nomad | grep 'received stale allocation information' | awk '{ print $1 " " $2 " " $3 }'
...
Feb 02 20:00:27
Feb 02 20:00:56
Feb 02 20:01:27
Feb 02 20:01:57
Feb 02 20:02:28
Feb 02 20:02:59
Feb 02 20:03:30
Feb 02 20:04:02
Feb 02 20:04:32
Feb 02 20:05:48

@jrasell jrasell moved this from Needs Triage to Needs Roadmapping in Nomad - Community Issues Triage Feb 6, 2025
@tgross
Member Author

tgross commented Feb 6, 2025

I think I've got a fix, just need to write up a test for it.

Edit: #25039

@tgross tgross self-assigned this Feb 6, 2025
@tgross tgross moved this from Needs Roadmapping to In Progress in Nomad - Community Issues Triage Feb 6, 2025
tgross added a commit that referenced this issue Feb 6, 2025
When a blocking query on the client hits a retryable error, we change the max
query time so that it falls within the `RPCHoldTimeout` timeout. But when the
retry succeeds we don't reset it to the original value.

Because the calls to `Node.GetClientAllocs` reuse the same request struct
instead of reallocating it, any retry will cause the agent to poll at a faster
frequency until the agent restarts. No other current RPC on the client has this
behavior, but we'll fix this in the `rpc` method rather than in the caller so
that any future users of the `rpc` method don't have to remember this detail.

Fixes: #25033
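As a rough illustration of the approach the commit message describes (restoring the caller's blocking time inside the `rpc` helper rather than in each caller), here is a hedged standalone sketch; the `QueryOptions` type, `doRPC`, and `isRetryable` are illustrative stubs and not Nomad's actual implementation:

```go
// Standalone sketch; QueryOptions, doRPC, and isRetryable are illustrative
// stubs, not Nomad's real code.
package main

import (
	"fmt"
	"time"
)

type QueryOptions struct {
	MinQueryIndex uint64
	MaxQueryTime  time.Duration
}

func rpc(req *QueryOptions, deadline time.Time) error {
	// Remember the caller's blocking time and put it back when we return, no
	// matter how many retries temporarily shrink it below.
	original := req.MaxQueryTime
	defer func() { req.MaxQueryTime = original }()

	for {
		err := doRPC(req)
		if err == nil || !isRetryable(err) || time.Now().After(deadline) {
			return err
		}
		// Shrink the blocking time so the retried call still finishes by the
		// deadline; the deferred restore undoes this for the caller.
		if remaining := time.Until(deadline); remaining > 0 {
			req.MaxQueryTime = remaining
		} else {
			req.MaxQueryTime = 0
		}
	}
}

// Stubs so the sketch compiles on its own.
func doRPC(req *QueryOptions) error { return nil }
func isRetryable(err error) bool    { return false }

func main() {
	req := &QueryOptions{MinQueryIndex: 1, MaxQueryTime: 5 * time.Minute}
	_ = rpc(req, time.Now().Add(5*time.Second))
	fmt.Println("blocking time after rpc returns:", req.MaxQueryTime) // still 5m0s
}
```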
@tgross tgross closed this as completed in 5d09d7a Feb 7, 2025
@github-project-automation github-project-automation bot moved this from In Progress to Done in Nomad - Community Issues Triage Feb 7, 2025