When new servers join the cluster, they stream a raft snapshot from the existing servers to catch up for replication. But many other operations are spinning up concurrently, including scheduling.
Nomad scheduler workers start immediately on server start. When a scheduler dequeues an evaluation, the leader provides a minimum snapshot index to ensure the scheduler's in-memory state is at least as current as that index. But the plan applier does not re-check the index on plan submit, so a bug in the wait-for-index logic could let a scheduler submit stale plans that stop all allocs, and the plan applier would accept them because they "fit" on the current cluster. Even without bugs, this creates a window where evaluations get dequeued but can't be planned, so evaluations are delayed.
This especially impacts organizations with large clusters, where restoring the snapshot takes on the order of minutes. In #15523 we're backing off scheduling if we determine we're behind, and in #15522 we provide tunables that can help cluster administrators ensure the snapshots go smoothly. But we could tighten this behavior further by disabling scheduling entirely on the new server until it's ready to successfully do work. This is slightly complicated by bootstrapping and may need #13219 to be completed first. I'm opening this issue for further discussion among the team (and community!).
This autopilot callback can be used to detect index lag between the current Raft member and the leader, and then make whatever decisions we want. I think this could do a few things given some staleness threshold:
Disable scheduler workers entirely
Ignore stale=true and forward all RPCs
Emit the index drift as a metric for ease of monitoring
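A rough sketch of what a callback covering those three points might look like. Everything here is hypothetical (`serverHealth`, the field names, the threshold); autopilot's real callback interface differs.

```go
package main

import "fmt"

// serverHealth is a stand-in for the data autopilot exposes about a
// member; LastIndex/LeaderIndex are illustrative field names.
type serverHealth struct {
	LastIndex   uint64
	LeaderIndex uint64
}

// stalenessThreshold is a hypothetical index-lag cutoff; in practice
// this would likely be a tunable.
const stalenessThreshold = 10_000

// onHealthUpdate sketches the callback: compute the drift, emit it as
// a metric, and decide whether to pause local work (disable scheduler
// workers, stop honoring stale=true and forward RPCs instead).
func onHealthUpdate(h serverHealth, setGauge func(name string, v float64), pause func(bool)) {
	drift := int64(h.LeaderIndex) - int64(h.LastIndex)
	setGauge("raft.index_drift", float64(drift)) // monitoring
	pause(drift > stalenessThreshold)            // disable workers / forward RPCs
}

func main() {
	h := serverHealth{LastIndex: 500, LeaderIndex: 50_500}
	onHealthUpdate(h,
		func(name string, v float64) { fmt.Printf("%s=%v\n", name, v) },
		func(p bool) { fmt.Println("scheduling paused:", p) },
	)
}
```

Emitting the drift unconditionally, rather than only when past the threshold, lets operators watch a server converge during a restore instead of just seeing a binary paused/unpaused state.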