
[Feature Request] Add config option for workers to wait until RAFT state has fully caught-up before dequeuing work #18110

Closed
stephaniesac opened this issue Aug 1, 2023 · 1 comment

Comments

stephaniesac commented Aug 1, 2023

Problem

We observed some of our jobs taking longer than 30 seconds to start (to move from the pending to the running state) after at least one Nomad server had been restarted.

We believe this is because Nomad workers dequeue and nack evaluations while the server's raft state is not yet fully caught up with the rest of the cluster. Because each nack delays the re-enqueuing of the evaluation, and the delay grows with each subsequent nack, this naturally slows down job startup times.

See the Source Code and Details sections below for our findings.

Proposal

We don’t think there is an easy workaround to avoid this problem (See Alternatives section).

We propose adding a configuration toggle so that scheduler workers only start dequeuing work from the global evaluation queue (Eval broker) after the server's raft index has fully caught up with the rest of the cluster (or is within a configurable “distance” of the current raft state). The default value for the toggle could preserve the existing optimistic behavior.

By doing this, scheduler workers would only process evaluations once the follower's raft state is in sync with the cluster, which should help avoid unnecessary job delays and improve the overall job scheduling process.
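
A minimal sketch of what that gate might look like in Go, assuming a helper on the worker side; the raftStatus interface, waitForCatchUp, and maxLag are our own illustrative names, not existing Nomad code. A maxLag of 0 would mean "wait for full catch-up", and the toggle's default could skip the gate entirely to preserve today's behavior.

// Hypothetical gate a scheduler worker could run before it starts pulling
// work from the eval broker. Names are illustrative only.
package scheduler

import (
	"context"
	"time"
)

// raftStatus stands in for whatever the server exposes about raft progress.
type raftStatus interface {
	AppliedIndex() uint64 // last index applied to this server's state store
	CommitIndex() uint64  // latest index known to be committed by the cluster
}

// waitForCatchUp blocks until this server's applied index is within maxLag
// of the cluster's commit index, or until ctx is cancelled.
func waitForCatchUp(ctx context.Context, rs raftStatus, maxLag uint64) error {
	ticker := time.NewTicker(250 * time.Millisecond)
	defer ticker.Stop()
	for {
		var lag uint64
		if ci, ai := rs.CommitIndex(), rs.AppliedIndex(); ci > ai {
			lag = ci - ai
		}
		if lag <= maxLag {
			return nil // caught up (or close enough): safe to start dequeuing
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-ticker.C:
		}
	}
}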

Thank you very much for considering our feature request!

Source Code

We took a look at the source code to verify our hypothesis:

  • [server.go] is where the schedulers and workers are initialized

  • [worker.go]’s run() dequeues an evaluation and then waits for the raft index that evaluation requires, nacking it if the wait times out

Because the workers and raft replication run asynchronously in background goroutines with no coordination between them, it is entirely possible for a worker to dequeue an evaluation, wait for the raft state to catch up, time out, and nack it, which is exactly the problem described above.
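
A simplified sketch of the pattern we believe we are seeing (the function names below are illustrative, not the actual worker.go API) shows how the delays compound: every attempt can burn the full raft-sync timeout before the evaluation is nacked and re-enqueued with a growing delay.

// Simplified model of the dequeue/wait/nack loop described above.
// dequeue, waitForIndex, nack, and schedule stand in for the real worker
// internals; raftSyncTimeout matches the timeout=5s seen in our logs.
package scheduler

import "time"

const raftSyncTimeout = 5 * time.Second

type eval struct {
	ID            string
	RequiredIndex uint64 // raft index the local state must reach before scheduling
}

func runWorker(
	dequeue func() eval,
	waitForIndex func(index uint64, timeout time.Duration) error,
	nack func(eval),
	schedule func(eval),
) {
	for {
		e := dequeue() // can happen while a restarted server is still replaying raft
		if err := waitForIndex(e.RequiredIndex, raftSyncTimeout); err != nil {
			// Local raft state is still behind: hand the eval back to the broker.
			// The broker re-enqueues it with an increasing delay, so repeated
			// nacks stack up into the 30s+ startup times we observed.
			nack(e)
			continue
		}
		schedule(e)
	}
}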

Alternatives

Snapshot more often
As an alternative, to reduce the likelihood of slow job processing when a server starts up, we can shorten the catch-up time by encouraging more frequent snapshots:
raft_snapshot_threshold (int: "8192")
raft_snapshot_interval (string: "120s")

However, we see this alternative as a workaround rather than a sustainable solution, for two reasons:

  • It tends not to be effective for very busy clusters, where even a reasonable snapshot interval leaves the latest snapshot far behind the current raft index; and very busy clusters are precisely the ones that suffer most from these delays.
  • It reduces the likelihood of nacks happening, but it does not eliminate them (it is analogous to making a race condition less likely).

Implement custom configuration toggles for nack delays, nack timeouts, and raft state catch-up timeouts
If users can configure:

  • Reduced raft-state catch-up timeouts
  • Reduced nack timeouts
  • Reduced nack delays

This would not reduce how often the problem occurs, but it would reduce the latency impact of each occurrence: scheduler workers would nack faster, increasing the likelihood that the evaluation is picked up by a worker whose raft state has already caught up.

In addition to being an incomplete solution, this alternative has other drawbacks: tuning these values down could make schedulers more susceptible to overload during some other failure, and that risk is difficult to assess (it may lead to “death spirals” of cascading failures).
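
For illustration, the values these toggles would govern might be grouped as in the sketch below. The field names are our own, and we have not verified which of them, if any, Nomad already exposes; the populated values simply echo what we observed in the logs.

// Illustrative grouping of the timeouts and delays discussed above.
// Field names are ours, not Nomad's.
package scheduler

import "time"

type schedulerTuning struct {
	// How long a worker waits for the raft index required by an eval
	// before nacking it (observed as timeout=5s in our logs).
	RaftSyncTimeout time.Duration

	// How long the eval broker waits for an ack before treating the
	// delivery as failed and re-enqueuing the eval.
	NackTimeout time.Duration

	// Delay applied before re-enqueuing an eval after each successive
	// nack (the 0s, 1s, 20s progression we observed).
	NackReenqueueDelays []time.Duration
}

// Values observed around the incident; NackTimeout is left unset because
// we did not measure it.
var observed = schedulerTuning{
	RaftSyncTimeout:     5 * time.Second,
	NackReenqueueDelays: []time.Duration{0, 1 * time.Second, 20 * time.Second},
}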

Details & How we observed the issue

Running Nomad v1.5.5.
We restarted one server agent, and upon restart our worker logs[0] showed the job's evaluation being dequeued and nacked because of a timeout waiting for the required raft index. Note that in our use case the job was stopped after 30s, so the evaluation did not proceed afterwards.

We grepped the logs for nacks around the reboot time and found another nack for a different evaluation in the same window. The nack delay values encoded in config.go explain why this takes longer than 30s: the three nacks incur re-enqueue delays of 0s + 1s + 20s = 21s, plus 2 × 5s = 10s spent waiting for the raftIndex timeout between nacks, for roughly 31s in total.

To confirm that the raft index was still catching up, we looked at the Nomad appliedIndex metric: data was missing for roughly 3 minutes after the host came back online. Logs[1] also showed we were ~5k indices behind, which might explain the time taken.

[0] Worker logs, grepping for the evaluation we care about:
2023-06-29T01:15:47.094+0800 [DEBUG] worker: dequeued evaluation: worker_id=091d978c-7f61-9548-c9f9-5a11c11d5152 eval_id=bc11e372-916c-32bd-60b6-4d6beaa96877 type=batch namespace=<NAMESPACE> job_id=<ID> node_id="" triggered_by=job-register
 
2023-06-29T01:15:52.095+0800 [WARN]  worker: timeout waiting for Raft index required by eval: worker_id=091d978c-7f61-9548-c9f9-5a11c11d5152 eval=bc11e372-916c-32bd-60b6-4d6beaa96877 index=7794843 timeout=5s
2023-06-29T01:15:52.095+0800 [DEBUG] worker: nack evaluation: worker_id=091d978c-7f61-9548-c9f9-5a11c11d5152 eval_id=bc11e372-916c-32bd-60b6-4d6beaa96877 type=batch namespace=<NAMESPACE> job_id=<ID> node_id="" triggered_by=job-register
2023-06-29T01:15:53.102+0800 [DEBUG] worker: dequeued evaluation: worker_id=a56a79a6-4d5c-8389-b7f4-91e42fa7caaf eval_id=bc11e372-916c-32bd-60b6-4d6beaa96877 type=batch namespace=<NAMESPACE> job_id=<ID> node_id="" triggered_by=job-register
 
2023-06-29T01:15:58.102+0800 [WARN]  worker: timeout waiting for Raft index required by eval: worker_id=a56a79a6-4d5c-8389-b7f4-91e42fa7caaf eval=bc11e372-916c-32bd-60b6-4d6beaa96877 index=7794843 timeout=5s
2023-06-29T01:15:58.102+0800 [DEBUG] worker: nack evaluation: worker_id=a56a79a6-4d5c-8389-b7f4-91e42fa7caaf eval_id=bc11e372-916c-32bd-60b6-4d6beaa96877 type=batch namespace=<NAMESPACE> job_id=<ID> node_id="" triggered_by=job-register
 
2023-06-29T01:16:18.106+0800 [DEBUG] worker: dequeued evaluation: worker_id=a56a79a6-4d5c-8389-b7f4-91e42fa7caaf eval_id=bc11e372-916c-32bd-60b6-4d6beaa96877 type=batch namespace=<NAMESPACE> job_id=<ID> node_id="" triggered_by=job-register
2023-06-29T01:16:18.119+0800 [DEBUG] worker.batch_sched: reconciled current state with desired state: eval_id=bc11e372-916c-32bd-60b6-4d6beaa96877 job_id=<ID> namespace=<NAMESPACE> worker_id=a56a79a6-4d5c-8389-b7f4-91e42fa7caaf
  results=
  | Total changes: (place 0) (destructive 0) (inplace 0) (stop 0) (disconnect 0) (reconnect 0)
  | Desired Changes for "<NAMESPACE>": (place 0) (inplace 0) (destructive 0) (stop 0) (migrate 0) (ignore 0) (canary 0)
 
2023-06-29T01:16:18.119+0800 [DEBUG] worker.batch_sched: setting eval status: eval_id=bc11e372-916c-32bd-60b6-4d6beaa96877 job_id=<ID> namespace=<NAMESPACE> worker_id=a56a79a6-4d5c-8389-b7f4-91e42fa7caaf status=complete
2023-06-29T01:16:18.149+0800 [DEBUG] worker: updated evaluation: worker_id=a56a79a6-4d5c-8389-b7f4-91e42fa7caaf eval="<Eval \"bc11e372-916c-32bd-60b6-4d6beaa96877\" JobID: \"v1\" Namespace: \"<NAMESPACE>\">"
 
2023-06-29T01:16:18.150+0800 [DEBUG] worker: ack evaluation: worker_id=a56a79a6-4d5c-8389-b7f4-91e42fa7caaf eval_id=bc11e372-916c-32bd-60b6-4d6beaa96877 type=batch namespace=<NAMESPACE> job_id=<ID> node_id="" triggered_by=job-register

[1] Server logs, grepping for raft:
2023-06-29T01:15:41.586+0800 [INFO]  nomad.raft: starting restore from snapshot: id=6178-7789371-1687952641027 last-index=7789371 last-term=6178 size-in-bytes=194284499
2023-06-29T01:15:44.831+0800 [INFO]  nomad.raft: snapshot restore progress: id=6178-7789371-1687952641027 last-index=7789371 last-term=6178 size-in-bytes=194284499 read-bytes=194284499 percent-complete="100.00%"
2023-06-29T01:15:44.831+0800 [INFO]  nomad.raft: restored from snapshot: id=6178-7789371-1687952641027 last-index=7789371 last-term=6178 size-in-bytes=194284499
2023-06-29T01:15:45.691+0800 [INFO]  nomad.raft: initial configuration: index=6383875 servers="[{Suffrage:Voter ID:25081771-9452-f92a-fdae-e6f6c2ebad2d Address:<host-and-port>} {Suffrage:Voter ID:e6f19b9f-3ad6-ab1b-299d-6b78d898091b Address:<host-and-port>} {Suffrage:Voter ID:bf6bb987-63dc-c758-801e-ab7660c67e49 Address:<host-and-port>}]"
2023-06-29T01:15:45.691+0800 [INFO]  nomad.raft: entering follower state: follower="Node at <host-and-port> [Follower]" leader-address= leader-id=


tgross (Member) commented Aug 1, 2023

Hi @stephaniesac! This is definitely something we've discussed wanting to do. There are a few small complications described in #15560, but I don't think those are insurmountable.

I'm going to close this issue as a duplicate and backlink to it from #15560 for context. Thanks!
