Problem
We observed some of our jobs taking a while (longer than 30 seconds) to start, i.e. to go from the pending to the running state, when at least one Nomad server had been restarted.
We believe this is because Nomad workers dequeue and nack evaluations while the server's raft state is not yet fully caught up with the rest of the cluster. Because each nack delays the re-enqueuing of the evaluation, this naturally slows down job startup times.
See the Source Code and Details sections below for our findings.
Proposal
We don't think there is an easy workaround for this problem (see the Alternatives section below).
We propose adding a configuration toggle for scheduler workers to only start dequeuing work from the global evaluation queue (Eval broker) after the server's raft index has fully caught up with the rest of the cluster, or is within a configurable "distance" of the current raft state. The default value for the toggle could maintain the existing optimistic behavior.
With this in place, scheduler workers would only process evaluations once the follower's raft state is in sync with the cluster, which should avoid unnecessary job delays and improve the overall job scheduling process.
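To make the proposal concrete, here is a minimal sketch of what such a gate might look like. This is illustrative only; everything in it (the raftStats interface, waitForRaftCatchUp, maxIndexLag) is a hypothetical stand-in, not Nomad's actual code or configuration:

```go
package worker

import (
	"context"
	"time"
)

// raftStats abstracts the two raft indexes the gate needs. A real
// implementation would read these from the raft node's stats; this
// interface is a hypothetical stand-in.
type raftStats interface {
	AppliedIndex() uint64 // last log entry applied to the local state store
	LastIndex() uint64    // last log entry the local node knows about
}

// waitForRaftCatchUp blocks until the local applied index is within
// maxIndexLag entries of the last known index, polling periodically.
// maxIndexLag = 0 demands a full catch-up before workers start dequeuing;
// a larger value allows a configurable "distance" from the current state.
func waitForRaftCatchUp(ctx context.Context, stats raftStats, maxIndexLag uint64, poll time.Duration) error {
	ticker := time.NewTicker(poll)
	defer ticker.Stop()
	for {
		if stats.LastIndex()-stats.AppliedIndex() <= maxIndexLag {
			return nil // caught up (or close enough): safe to dequeue evals
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-ticker.C:
		}
	}
}
```

A worker would call this once at startup, before entering its dequeue loop; with the toggle disabled it would skip straight to dequeuing, as it does today.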
Thank you very much for considering our feature request!
Source Code
We took a look at the source code to verify our hypothesis:
- [server.go] is where the schedulers and workers are initialized
- the server's SetupRaft() call sets up a new raft store
- NewRaft() does not block and starts background async threads
- [worker.go]'s run() waits until the raft index is caught up, nacking otherwise
Because both the workers and the raft store do their work in background async threads, with no coordination between them, it is entirely possible for workers to end up waiting on a raft state that is still catching up, producing the problem described above.
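In pseudocode, the flow that produces the race looks roughly like the following. This is a paraphrase for illustration; the type and method names are hypothetical stand-ins, not the actual worker.go implementation:

```go
package worker

import (
	"log"
	"time"
)

// Hypothetical stand-ins for the pieces involved in the race.
type Eval struct {
	ID          string
	ModifyIndex uint64 // raft index at which the eval was created
}

type evalBroker interface {
	Dequeue() (Eval, string)   // blocks until an eval is available
	Ack(evalID, token string)  // marks the eval as processed
	Nack(evalID, token string) // re-enqueues the eval after a delay
}

type stateStore interface {
	// WaitForIndex blocks until the local raft state has applied the
	// given index, or returns an error on timeout.
	WaitForIndex(index uint64, timeout time.Duration) error
}

func runWorker(b evalBroker, s stateStore) {
	for {
		// 1. Dequeue an evaluation. Nothing prevents this from happening
		//    while a freshly restarted server is still replaying raft logs.
		eval, token := b.Dequeue()

		// 2. Wait for the local state store to reach the eval's index.
		if err := s.WaitForIndex(eval.ModifyIndex, 5*time.Second); err != nil {
			// 3. On a server that is thousands of indexes behind, this
			//    times out; the eval is nacked and re-enqueued after a
			//    delay, and the cycle can repeat. This is the source of
			//    the slow job startups we observed.
			log.Printf("nack %s: %v", eval.ID, err)
			b.Nack(eval.ID, token)
			continue
		}

		// ... invoke the scheduler, then ack on success ...
		b.Ack(eval.ID, token)
	}
}
```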
Alternatives
Snapshot more often
As an alternative, to reduce the likelihood of slow job processing on node startup, we can reduce the time taken to catch up by encouraging more frequent snapshotting:
raft_snapshot_threshold (int: "8192")
raft_snapshot_interval (string: "120s")
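For reference, both options live in the server block of the agent configuration. A sketch with illustrative values (assumptions of ours, not recommendations):

```hcl
server {
  enabled = true

  # Snapshot more aggressively than the defaults (8192 log entries / 120s)
  # so that a restarted server has fewer trailing logs to replay.
  # These values are illustrative only; tune them for your cluster.
  raft_snapshot_threshold = 2048
  raft_snapshot_interval  = "30s"
}
```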
However, we see this alternative as a workaround rather than a sustainable solution, for two reasons:
- It tends not to be effective for very busy clusters, where even a reasonable snapshot interval leaves the state snapshots far apart in raft indexes. Unfortunately, very busy clusters are exactly the ones that suffer the most from these delays.
- It reduces the likelihood of nacks happening, but it does not eliminate them (it is analogous to making a race condition less likely).
Implement custom configuration toggles for nack delays, nack timeouts, and raft state catch-up timeouts
If users can configure:
- reduced raft-state catch-up timeouts
- reduced nack timeouts
- reduced nack delays
then the rate of incidence will not drop, but the latency impact of each incident will: scheduler workers will nack faster, increasing the likelihood that the evaluation is processed by another, caught-up worker.
In addition to being incomplete, this alternative has other drawbacks: tightening these settings could make schedulers more susceptible to overload when some other failure occurs, and that risk is difficult to assess (it may lead to "death spirals" of cascading failures).
Details & How we observed the issue
Running Nomad v1.5.5.
We restarted one server agent, and upon restart our worker logs[0] reported the job evaluation being dequeued and then nacked because of a timeout waiting for the raft index. Note that in our use case the job was stopped after 30s, so the evaluation did not proceed after that.
We grepped for nacks around the reboot time in the logs and found another nack for a different evaluation around that time. The nack delay values encoded in the configuration (see config.go) explain why this takes longer than 30s: we saw three nacks (re-enqueue delays of 0s + 1s + 20s), in addition to waiting for the raftIndex timeout between nacks (2 x 5s).
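For concreteness, summing the delays we saw: 0s + 1s + 20s of re-enqueue delays, plus 2 x 5s of raft-index timeouts, gives 0 + 1 + 20 + 10 = 31 seconds in total, which is already past the 30-second window after which our job was stopped.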
To confirm that the raft index was still catching up, we looked at the Nomad appliedIndex metric: data was missing for roughly 3 minutes after the host came back online. Logs[1] also showed that we were ~5k indices behind, which might explain the time taken.
Hi @stephaniesac! This is definitely something we've discussed wanting to do. There are a few small complications described in #15560, but I don't think those are insurmountable.
I'm going to close this issue as a duplicate and backlink to it from #15560 for context. Thanks!