
kvserver: track per-replica and aggregate to-apply entry bytes #97044

Open

tbg opened this issue Feb 13, 2023 · 9 comments
Assignees
Labels
A-kv-replication Relating to Raft, consensus, and coordination. C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) T-kv KV Team

Comments

@tbg
Member

tbg commented Feb 13, 2023

Is your feature request related to a problem? Please describe.

In recent experiments, @andrewbaptist found^1 that because log entry appends are sequential writes whereas log entry application can cause random writes to the LSM, a node that catches up on a lot of raft log (for example after restarting following a period of downtime) can end up with an inverted LSM that results from applying a large amount of rapidly appended log entries.

This can be seen in the screenshot below. The top graph shows the inverted LSM (higher is more inverted). The middle graph shows log appends; they level off well before the LSM fully inverts. As a result, replica pausing in the bottom graph comes too late, since it only delays appending, which by that point has already completed.

Similar to how writes to the LSM don't account for compaction debt, appends to the raft log are even a step further back, because they don't even account for the writes to the LSM (which may have very different characteristics than the appends themselves, as we see here).

It's desirable to be able to change something about this behavior, but this issue does not attempt to propose a solution. Instead, we note that it would be helpful to at least have a quantity that can detect this debt before it inverts the LSM, as this will likely play a role in both short-term and long-term mitigations to this kind of overload.

[screenshot: inverted LSM (top), log appends (middle), replica pausing (bottom)]

Describe the solution you'd like

We could track the number of unapplied entry bytes per replica and also maintain a store-global aggregate. Per replica, we would have a gauge unappliedEntryBytes which tracks all bytes between the AppliedIndex and the LastIndex^2. Any change to this gauge would be reflected in a store-aggregate gauge, which could be read off quickly without a need to visit all replicas.

The gauge would need to be updated on raft log appends (taking care to do this properly on sideloading and also when the log tail is replaced), and on snapshots. Thanks to the local copy of the gauge, the global one can be "partially reset" accordingly.
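
For illustration only, a minimal Go sketch of the bookkeeping described above. All names (unappliedEntryBytes, storeUnappliedBytes, the onAppend/onApply/onSnapshot hooks) are hypothetical and not taken from the actual kvserver code; real code would hold the replica mutex around the per-replica field.

```go
package sketch

import "sync/atomic"

// Store-level aggregate, readable without visiting every replica.
var storeUnappliedBytes atomic.Int64

type Replica struct {
	// Bytes of raft log entries between AppliedIndex and LastIndex.
	// In real code this would be guarded by the replica's mutex.
	unappliedEntryBytes int64
}

// adjustUnappliedBytes applies a signed delta to the replica-local gauge and
// mirrors it into the store-wide aggregate.
func (r *Replica) adjustUnappliedBytes(delta int64) {
	r.unappliedEntryBytes += delta
	storeUnappliedBytes.Add(delta)
}

// onAppend is called when entries are appended to the raft log (including
// sideloaded payload sizes, and after accounting for a replaced log tail).
func (r *Replica) onAppend(entryBytes int64) { r.adjustUnappliedBytes(entryBytes) }

// onApply is called after entries have been applied to the state machine.
func (r *Replica) onApply(appliedBytes int64) { r.adjustUnappliedBytes(-appliedBytes) }

// onSnapshot zeroes the replica's contribution: the applied state jumps to the
// snapshot index, so the store aggregate is "partially reset" by the replica's
// previous value.
func (r *Replica) onSnapshot() { r.adjustUnappliedBytes(-r.unappliedEntryBytes) }
```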

Describe alternatives you've considered

Additional context

Jira issue: CRDB-24481

Epic CRDB-39898

Footnotes

  1. https://cockroachlabs.slack.com/archives/G01G8LK77DK/p1675457831537829?thread_ts=1675371760.534349&cid=G01G8LK77DK

  2. It's likely unimportant whether we use the durable or in-flight last index; the in-flight one likely makes more sense.

@tbg tbg added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) A-kv-replication Relating to Raft, consensus, and coordination. A-admission-control T-kv-replication labels Feb 13, 2023
@blathers-crl

blathers-crl bot commented Feb 13, 2023

cc @cockroachdb/replication

@irfansharif
Contributor

node that catches up on a lot of raft log (for example after restarting following a period of downtime) can end up with an inverted LSM that results from applying a large amount of rapidly appended log entries

X-ref (+cc @sumeerbhola): the review discussions and doc.go changes in #96642 talk about how we could "just" use flow control tokens during raft log catchup to avoid exceeding the IO threshold. I don't think it would've helped in the particular experiment we ran above, which uses the worst case for LSM writes (large write batches across a uniform key distribution: #95161), but I expect it to be effective for non-pathological cases, where I assume it's rarer to cause a build-up of "to-apply entry bytes" since we don't yet use separate goroutines for log appends and state machine application (#94854 and #94853).

@sumeerbhola
Collaborator

I don't quite understand these experiment results, hence some questions:
Presumably quorum has been achieved for the raft log entries that are being sent to this node doing catchup? If not, why not? If yes, is it not immediately appending to the raft log and applying to the state machine -- why?
Can you share a zoomed-in graph of both sub-levels and L0 sst-counts? I am curious when admission control would have kicked in, based on the current thresholds, assuming we did what @irfansharif and I have been discussing regarding catchup.

The middle graph shows log appends

Is this showing the activity on the range at the leaseholder, and not the appends as received by the catching up node?

@andrewbaptist
Collaborator

There are more graphs of this in #96521 and #95159, although maybe not everything you need. I think the entries are immediately appended and applied.

There is more subtlety to this, in that lease transfers are sent and then "fail" because the node they are sent to is so far behind on the raft log.

@irfansharif
Contributor

Fortunately the graphs from this experiment are still available here.

Presumably quorum has been achieved for the raft log entries that are being sent to this node doing catchup? If not, why not?

Quorum is achieved, yes.

If yes, is it not immediately appending to the raft log and applying to the state machine -- why?

I have a naive question. In the graphs below there's a clear point at which we stop receiving MsgApps. We continue applying commands long past that, from those earlier MsgApps. Does the MsgApp receive point correspond to log append? No, right? We could very well have received a MsgApp and not scheduled the raft processor to append (+apply) it? I'm confused since in the graphs above, @tbg annotated the point where we stopped receiving MsgApps as the point where we were "done appending".

[graph: MsgApps received stop at a clear point while command application continues well past it]

Can you share a zoomed-in graph of both sub-levels and L0 sst-counts? I am curious when admission control would have kicked in, based on the current thresholds

See graphs below. It would've kicked in almost instantly, while we were still issuing catchup MsgApps.

[graph: L0 sub-levels and sst-counts during catchup]

Is this showing the activity on the range at the leaseholder, and not the appends as received by the catching up node?

The middle graph above is showing the rate of MsgApps received. See my naive question above, asking whether that maps 1:1 to log appends on the catching up node.

@tbg
Member Author

tbg commented Feb 20, 2023

Does the MsgApp receive point correspond to log append?

The delay should be very short, since an MsgApp goes into the raft receive queue and this queue is fully emptied into the raft group on the next handle loop, which will emit them to storage in the subsequent one. So ~seconds on an overloaded node.

However, log application is bounded: we apply at most 64 MB per Ready, so if we manage to stuff in more than 64 MB of log appends per Ready, we would build up a backlog.

64 MB per Ready is quite a lot, so I'm not sure this can really happen - it was our best guess as to what's happening here.
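
For illustration, a hedged sketch of the kind of per-Ready byte budget described above; the constant and helper below are hypothetical, not the actual apply loop:

```go
package sketch

// maxApplyBytesPerReady mirrors the ~64 MB application bound mentioned above.
const maxApplyBytesPerReady = 64 << 20 // 64 MiB

type entry struct{ data []byte }

// takeCommittedBatch returns a prefix of the committed-but-unapplied entries
// whose total size stays within the per-Ready budget (always at least one
// entry). The remainder waits for later Ready iterations, which is how a
// backlog builds up if appends outpace application.
func takeCommittedBatch(committed []entry) (batch, rest []entry) {
	var size int
	for i, e := range committed {
		size += len(e.data)
		if size > maxApplyBytesPerReady && i > 0 {
			return committed[:i], committed[i:]
		}
	}
	return committed, nil
}
```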

@petermattis
Collaborator

Random idea: can we optimize application of a large number of Raft log entries? I believe log entries are applied via individual (unindexed?) Pebble batches. There would be some benefit to using fewer Pebble batches, and an even bigger benefit if we could figure out how to convert log entry application into an sstable to be ingested. That feels hard because log entry application might overwrite keys written by earlier log entries. There might be something doable here by providing a way to transform a Pebble batch into an sstable, omitting overwrites. The idea here might be a non-starter. Feels worthwhile to loop in some Storage folks to think about the possibilities here.
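
As a rough illustration of the "fewer Pebble batches" direction (not how the kvserver apply path actually works, which involves MVCC and below-raft side effects), the effects of several consecutive entries could be coalesced into one Pebble batch and committed once; later writes to the same key simply overwrite earlier ones inside the batch:

```go
package sketch

import "github.com/cockroachdb/pebble"

// kv is a simplified stand-in for the key/value effects of a raft entry.
type kv struct{ key, value []byte }

// applyEntriesCoalesced writes the effects of many entries through a single
// Pebble batch and commits once at the end, instead of one batch per entry.
func applyEntriesCoalesced(db *pebble.DB, entries [][]kv) error {
	b := db.NewBatch()
	defer b.Close()
	for _, writes := range entries {
		for _, w := range writes {
			if err := b.Set(w.key, w.value, nil); err != nil {
				return err
			}
		}
	}
	return b.Commit(pebble.Sync)
}
```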

@erikgrinaker
Contributor

Random idea: can we optimize application of a large number of Raft log entries?

Wrote up #98363 with a few comments.

@sumeerbhola
Collaborator

sumeerbhola commented Oct 19, 2023

I am not sure there is anything to do for AC here, given the "it was our best guess as to what's happening here" comment in #97044 (comment). If log append and state machine application happen near each other (< 15s apart), the existing AC flow control mechanisms should suffice.

And for node restarts there is an existing issue #98710. So I am removing the AC label.

@sumeerbhola sumeerbhola removed their assignment Oct 19, 2023
@exalate-issue-sync exalate-issue-sync bot added T-kv KV Team and removed T-kv-replication labels Jun 28, 2024