
disable scheduling until initial snapshot is restored #15560

Open
tgross opened this issue Dec 16, 2022 · 3 comments

Comments

@tgross
Member

tgross commented Dec 16, 2022

When new servers join the cluster, they stream a Raft snapshot from the existing servers to catch up on replicated state. But many other operations spin up concurrently, including scheduling.

Nomad's scheduler workers start immediately on server start. When a scheduler dequeues an evaluation, the leader provides a minimum snapshot index to ensure that the scheduler's in-memory state is at least as current as that index. But the plan applier does not re-check the index on plan submit, so if there were a bug in the logic for waiting on the scheduler, it could submit stale plans that stop all allocs, and the plan applier would accept them because they "fit" on the current cluster. Even without bugs, this creates a window where evaluations are dequeued but can't yet be planned, so the evaluations are delayed.

This especially impacts organizations with large clusters, where the snapshot can take on the order of minutes to restore completely. In #15523 we're backing off scheduling if we determine we're behind, and in #15522 we provide tunables that can help cluster administrators ensure snapshots go smoothly. But we could potentially tighten this behavior up further by disabling scheduling entirely on the new server until it's ready to do work successfully. This is slightly complicated by bootstrapping and may need #13219 to be completed first. I'm opening this issue for further discussion among the team (and community!).

@tgross
Member Author

tgross commented Aug 1, 2023

See also #18110 for more context.

@lgfa29
Contributor

lgfa29 commented Aug 21, 2023

Noting here that #18267 describes how a server lagging on Raft restore can also impact clients.

@schmichael
Member

This autopilot callback can be used to detect index lag between the current Raft member and the leader, and then make whatever decisions we want. Given some staleness threshold, I think this could do a few things:

  1. Disable scheduler workers entirely
  2. Ignore stale=true and forward all RPCs
  3. Emit the index drift as a metric for ease of monitoring

https://github.com/hashicorp/nomad/blob/v1.6.2/nomad/autopilot.go#L69-L81
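The decision logic behind those three actions could be sketched roughly as below. This is a hypothetical shape only; `evaluateLag`, `lagAction`, and the threshold value are made up for illustration, and the real hook would live in the autopilot callback linked above.

```go
package main

import "fmt"

// lagAction describes what a follower should do given its index drift
// from the leader. Field names are illustrative, not Nomad's API.
type lagAction struct {
	disableWorkers  bool   // stop scheduler workers entirely
	forwardStaleRPC bool   // ignore stale=true and forward RPCs instead
	lag             uint64 // emit as a gauge metric for monitoring
}

// evaluateLag computes the index drift and, if it exceeds the staleness
// threshold, flags the follower as too far behind to serve safely.
func evaluateLag(localIndex, leaderIndex, threshold uint64) lagAction {
	var lag uint64
	if leaderIndex > localIndex {
		lag = leaderIndex - localIndex
	}
	behind := lag > threshold
	return lagAction{
		disableWorkers:  behind,
		forwardStaleRPC: behind,
		lag:             lag,
	}
}

func main() {
	// A follower 100 indexes behind, with a staleness threshold of 50.
	a := evaluateLag(900, 1000, 50)
	fmt.Printf("lag=%d disableWorkers=%v forwardStaleRPC=%v\n",
		a.lag, a.disableWorkers, a.forwardStaleRPC)
	// → lag=100 disableWorkers=true forwardStaleRPC=true
}
```

Emitting `lag` unconditionally (item 3) gives operators visibility even when the follower is within the threshold, while the two boolean actions only trip once the threshold is crossed.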
