Problem
We observed some of our jobs taking a while (longer than 30 seconds) to start, i.e. to go from the pending to the running state, when at least one Nomad server had been restarted.
We believe this is because Nomad workers dequeue and nack evaluations while the server's raft state is not yet fully caught up with the rest of the cluster. Because each nack delays the re-enqueuing of the evaluation, this naturally slows down job startup times.
See the Source Code and Details sections below for our findings.
Proposal
We don't think there is an easy workaround for this problem (see the Alternatives section below).
We propose adding a configuration toggle for scheduler workers to only start dequeuing work from the global evaluation queue (Eval broker) after the server's raft index has fully caught up with the rest of the cluster, or is within a configurable "distance" of the current raft state. The default value for the toggle could maintain the existing optimistic behavior.
With this in place, scheduler workers would only process evaluations once the follower's raft state is in sync with the cluster, which should avoid unnecessary job delays and improve the overall job scheduling process.
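To make the proposal concrete, here is a minimal sketch of what such a gate might look like. This is illustrative only; everything in it (the raftStats interface, waitForRaftCatchUp, maxIndexLag) is a hypothetical stand-in, not Nomad's actual code or configuration:

```go
package worker

import (
	"context"
	"time"
)

// raftStats abstracts the two raft indexes the gate needs. A real
// implementation would read these from the raft node's stats; this
// interface is a hypothetical stand-in.
type raftStats interface {
	AppliedIndex() uint64 // last log entry applied to the local state store
	LastIndex() uint64    // last log entry the local node knows about
}

// waitForRaftCatchUp blocks until the local applied index is within
// maxIndexLag entries of the last known index, polling periodically.
// maxIndexLag = 0 demands a full catch-up before workers start dequeuing;
// a larger value allows a configurable "distance" from the current state.
func waitForRaftCatchUp(ctx context.Context, stats raftStats, maxIndexLag uint64, poll time.Duration) error {
	ticker := time.NewTicker(poll)
	defer ticker.Stop()
	for {
		if stats.LastIndex()-stats.AppliedIndex() <= maxIndexLag {
			return nil // caught up (or close enough): safe to dequeue evals
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-ticker.C:
		}
	}
}
```

A worker would call this once at startup, before entering its dequeue loop; with the toggle disabled it would skip straight to dequeuing, as it does today.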
Thank you very much for considering our feature request!
Source Code
We took a look at the source code to verify our hypothesis:
- [server.go] is where the schedulers and workers are initialized
- the server's SetupRaft() call sets up a new raft store
- NewRaft() does not block and starts background async threads
- [worker.go]'s run() waits until the raft index is caught up, nacking otherwise
Because both the workers and the raft store do their work in background async threads, with no coordination between them, it is entirely possible for workers to end up waiting on a raft state that is still catching up, producing the problem described above.
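In pseudocode, the flow that produces the race looks roughly like the following. This is a paraphrase for illustration; the type and method names are hypothetical stand-ins, not the actual worker.go implementation:

```go
package worker

import (
	"log"
	"time"
)

// Hypothetical stand-ins for the pieces involved in the race.
type Eval struct {
	ID          string
	ModifyIndex uint64 // raft index at which the eval was created
}

type evalBroker interface {
	Dequeue() (Eval, string)   // blocks until an eval is available
	Ack(evalID, token string)  // marks the eval as processed
	Nack(evalID, token string) // re-enqueues the eval after a delay
}

type stateStore interface {
	// WaitForIndex blocks until the local raft state has applied the
	// given index, or returns an error on timeout.
	WaitForIndex(index uint64, timeout time.Duration) error
}

func runWorker(b evalBroker, s stateStore) {
	for {
		// 1. Dequeue an evaluation. Nothing prevents this from happening
		//    while a freshly restarted server is still replaying raft logs.
		eval, token := b.Dequeue()

		// 2. Wait for the local state store to reach the eval's index.
		if err := s.WaitForIndex(eval.ModifyIndex, 5*time.Second); err != nil {
			// 3. On a server that is thousands of indexes behind, this
			//    times out; the eval is nacked and re-enqueued after a
			//    delay, and the cycle can repeat. This is the source of
			//    the slow job startups we observed.
			log.Printf("nack %s: %v", eval.ID, err)
			b.Nack(eval.ID, token)
			continue
		}

		// ... invoke the scheduler, then ack on success ...
		b.Ack(eval.ID, token)
	}
}
```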
Alternatives
Snapshot more often
As an alternative, to reduce the likelihood of slow job processing on node startup, we can reduce the time taken to catch up by encouraging more frequent snapshotting:
raft_snapshot_threshold (int: "8192")
raft_snapshot_interval (string: "120s")
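For reference, both options live in the server block of the agent configuration. A sketch with illustrative values (assumptions of ours, not recommendations):

```hcl
server {
  enabled = true

  # Snapshot more aggressively than the defaults (8192 log entries / 120s)
  # so that a restarted server has fewer trailing logs to replay.
  # These values are illustrative only; tune them for your cluster.
  raft_snapshot_threshold = 2048
  raft_snapshot_interval  = "30s"
}
```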
However, we see this alternative as a workaround rather than a sustainable solution, for two reasons:
- It tends not to be effective for very busy clusters, where even a reasonable snapshot interval leaves the state snapshots far apart in raft indexes. Unfortunately, very busy clusters are exactly the ones that suffer the most from these delays.
- It reduces the likelihood of nacks happening, but it does not eliminate them (it is analogous to making a race condition less likely).
Implement custom configuration toggles for nack delays, nack timeouts, and raft state catch-up timeouts
If users can configure:
- reduced raft-state catch-up timeouts
- reduced nack timeouts
- reduced nack delays
then the rate of incidence will not drop, but the latency impact of each incident will: scheduler workers will nack faster, increasing the likelihood that the evaluation is processed by another, caught-up worker.
In addition to being incomplete, this alternative has other drawbacks: tightening these settings could make schedulers more susceptible to overload when some other failure occurs, and that risk is difficult to assess (it may lead to "death spirals" of cascading failures).
Details & How we observed the issue
Running Nomad v1.5.5.
We restarted one server agent, and upon restart our worker logs[0] reported the job evaluation being dequeued and then nacked because of a timeout waiting for the raft index. Note that in our use case the job was stopped after 30s, so the evaluation did not proceed after that.
We grepped for nacks around the reboot time in the logs and found another nack for a different evaluation around that time. The nack delay values encoded in the configuration (see config.go) explain why this takes longer than 30s: we saw three nacks (re-enqueue delays of 0s + 1s + 20s), in addition to waiting for the raftIndex timeout between nacks (2 x 5s).
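For concreteness, summing the delays we saw: 0s + 1s + 20s of re-enqueue delays, plus 2 x 5s of raft-index timeouts, gives 0 + 1 + 20 + 10 = 31 seconds in total, which is already past the 30-second window after which our job was stopped.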
To confirm that the raft index was still catching up, we looked at the Nomad appliedIndex metric: data was missing for roughly 3 minutes after the host came back online. Logs[1] also showed that we were ~5k indices behind, which might explain the time taken.
Hi @stephaniesac! This is definitely something we've discussed wanting to do. There are a few small complications described in #15560, but I don't think those are insurmountable.
I'm going to close this issue as a duplicate and backlink to it from #15560 for context. Thanks!