When new servers join the cluster, they stream a raft snapshot from the existing servers to catch up for replication. But many other operations are spinning up concurrently, including scheduling.
Nomad scheduler workers start immediately on server start. When a scheduler dequeues an evaluation, the leader provides a minimum snapshot index to ensure the scheduler's in-memory state is at least as current as that index. But the plan applier does not re-check the index on plan submit, so a bug in the wait-for-index logic could let a scheduler submit stale plans that stop all allocs, and the plan applier would accept them because they "fit" on the current cluster. Even without bugs, this creates a window where evaluations get dequeued but can't be planned, so evaluations are delayed.
This especially impacts organizations with large clusters, where restoring the snapshot takes on the order of minutes. In #15523 we're backing off scheduling if we determine we're behind, and in #15522 we provide tunables that can help cluster administrators ensure the snapshots go smoothly. But we could tighten this behavior further by disabling scheduling entirely on the new server until it's ready to successfully do work. This is slightly complicated by bootstrapping and may need #13219 to be completed first. I'm opening this issue for further discussion among the team (and community!).
This autopilot callback can be used to detect index lag between the current Raft member and the leader, and then make whatever decisions we want. I think this could do a few things given some staleness threshold:
Disable scheduler workers entirely
Ignore stale=true and forward all RPCs
Emit the index drift as a metric for ease of monitoring
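A rough sketch of what a callback covering those three points might look like. Everything here is hypothetical (`serverHealth`, the field names, the threshold); autopilot's real callback interface differs.

```go
package main

import "fmt"

// serverHealth is a stand-in for the data autopilot exposes about a
// member; LastIndex/LeaderIndex are illustrative field names.
type serverHealth struct {
	LastIndex   uint64
	LeaderIndex uint64
}

// stalenessThreshold is a hypothetical index-lag cutoff; in practice
// this would likely be a tunable.
const stalenessThreshold = 10_000

// onHealthUpdate sketches the callback: compute the drift, emit it as
// a metric, and decide whether to pause local work (disable scheduler
// workers, stop honoring stale=true and forward RPCs instead).
func onHealthUpdate(h serverHealth, setGauge func(name string, v float64), pause func(bool)) {
	drift := int64(h.LeaderIndex) - int64(h.LastIndex)
	setGauge("raft.index_drift", float64(drift)) // monitoring
	pause(drift > stalenessThreshold)            // disable workers / forward RPCs
}

func main() {
	h := serverHealth{LastIndex: 500, LeaderIndex: 50_500}
	onHealthUpdate(h,
		func(name string, v float64) { fmt.Printf("%s=%v\n", name, v) },
		func(p bool) { fmt.Println("scheduling paused:", p) },
	)
}
```

Emitting the drift unconditionally, rather than only when past the threshold, lets operators watch a server converge during a restore instead of just seeing a binary paused/unpaused state.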