
kvserver: track per-replica and aggregate to-apply entry bytes #97044

Open

tbg opened this issue Feb 13, 2023 · 9 comments
Assignees
Labels
A-kv-replication Relating to Raft, consensus, and coordination. C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) T-kv KV Team

Comments

@tbg
Member

tbg commented Feb 13, 2023

Is your feature request related to a problem? Please describe.

In recent experiments, @andrewbaptist found^1 that because log entry appends are sequential writes whereas log entry application can cause random writes to the LSM, a node that catches up on a lot of raft log (for example after restarting following a period of downtime) can end up with an inverted LSM that results from applying a large amount of rapidly appended log entries.

This can be seen in the screenshot below. The top graph shows the inverted LSM (higher is more inverted). The middle graph shows log appends; they level off well before the LSM fully inverts. As a result, replica pausing in the bottom graph comes too late, since it only delays appending, which by that point has already completed.

Similar to how writes to the LSM don't account for compaction debt, appends to the raft log are even a step further back, because they don't even account for the writes to the LSM (which may have very different characteristics than the appends themselves, as we see here).

It's desirable to be able to change something about this behavior, but this issue does not attempt to propose a solution. Instead, we note that it would be helpful to at least have a quantity that can detect this debt before it inverts the LSM, as this will likely play a role in both short-term and long-term mitigations to this kind of overload.

[screenshot: inverted LSM (top), log appends (middle), replica pausing (bottom)]

Describe the solution you'd like

We could track the number of unapplied entry bytes per replica and also maintain a store-global aggregate. Per replica, we would have a gauge unappliedEntryBytes which tracks all bytes between the AppliedIndex and the LastIndex^2. Any change to this gauge would be reflected in a store-aggregate gauge, which could be read off quickly without a need to visit all replicas.

The gauge would need to be updated on raft log appends (taking care to do this properly on sideloading and also when the log tail is replaced), and on snapshots. Thanks to the local copy of the gauge, the global one can be "partially reset" accordingly.
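
For illustration only, a minimal Go sketch of the bookkeeping described above. All names (unappliedEntryBytes, storeUnappliedBytes, the onAppend/onApply/onSnapshot hooks) are hypothetical and not taken from the actual kvserver code; real code would hold the replica mutex around the per-replica field.

```go
package sketch

import "sync/atomic"

// Store-level aggregate, readable without visiting every replica.
var storeUnappliedBytes atomic.Int64

type Replica struct {
	// Bytes of raft log entries between AppliedIndex and LastIndex.
	// In real code this would be guarded by the replica's mutex.
	unappliedEntryBytes int64
}

// adjustUnappliedBytes applies a signed delta to the replica-local gauge and
// mirrors it into the store-wide aggregate.
func (r *Replica) adjustUnappliedBytes(delta int64) {
	r.unappliedEntryBytes += delta
	storeUnappliedBytes.Add(delta)
}

// onAppend is called when entries are appended to the raft log (including
// sideloaded payload sizes, and after accounting for a replaced log tail).
func (r *Replica) onAppend(entryBytes int64) { r.adjustUnappliedBytes(entryBytes) }

// onApply is called after entries have been applied to the state machine.
func (r *Replica) onApply(appliedBytes int64) { r.adjustUnappliedBytes(-appliedBytes) }

// onSnapshot zeroes the replica's contribution: the applied state jumps to the
// snapshot index, so the store aggregate is "partially reset" by the replica's
// previous value.
func (r *Replica) onSnapshot() { r.adjustUnappliedBytes(-r.unappliedEntryBytes) }
```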

Describe alternatives you've considered

Additional context

Jira issue: CRDB-24481

Epic CRDB-39898

Footnotes

  1. https://cockroachlabs.slack.com/archives/G01G8LK77DK/p1675457831537829?thread_ts=1675371760.534349&cid=G01G8LK77DK

  2. It's likely unimportant whether we use the durable or in-flight last index; the in-flight one likely makes more sense.

@tbg tbg added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) A-kv-replication Relating to Raft, consensus, and coordination. A-admission-control T-kv-replication labels Feb 13, 2023
@blathers-crl

blathers-crl bot commented Feb 13, 2023

cc @cockroachdb/replication

@irfansharif
Contributor

node that catches up on a lot of raft log (for example after restarting following a period of downtime) can end up with an inverted LSM that results from applying a large amount of rapidly appended log entries

X-ref (+cc @sumeerbhola): the review discussions and doc.go changes in #96642 talk about how we could "just" use flow control tokens during raft log catchup to avoid exceeding the IO threshold. I don't think it would've helped in the particular experiment we ran above, which uses the worst case for LSM writes (large write batches across a uniform key distribution: #95161), but I expect it to be effective for non-pathological cases, where I assume it's rarer to cause a build-up of "to-apply entry bytes" since we don't yet use separate goroutines for log appends and state machine application (#94854 and #94853).

@sumeerbhola
Collaborator

I don't quite understand these experiment results, hence some questions:
Presumably quorum has been achieved for the raft log entries that are being sent to this node doing catchup? If not, why not? If yes, is it not immediately appending to the raft log and applying to the state machine -- why?
Can you share a zoomed-in graph of both sub-levels and L0 sst-counts? I am curious when admission control would have kicked in, based on the current thresholds, assuming we did what @irfansharif and I have been discussing regarding catchup.

The middle graph shows log appends

Is this showing the activity on the range at the leaseholder, and not the appends as received by the catching up node?

@andrewbaptist
Collaborator

There are more graphs of this in #96521 and #95159, although maybe not everything you need. I think the entries are immediately appended and applied.

There is more subtlety to this, in that lease transfers are sent and then "fail" because the node they are sent to is so far behind on the raft log.

@irfansharif
Contributor

Fortunately the graphs from this experiment are still available here.

Presumably quorum has been achieved for the raft log entries that are being sent to this node doing catchup? If not, why not?

Quorum is achieved, yes.

If yes, is it not immediately appending to the raft log and applying to the state machine -- why?

I have a naive question. In the graphs below there's a clear point at which we stop receiving MsgApps. We continue applying commands long past that, from those earlier MsgApps. Does the MsgApp receive point correspond to log append? No, right? We could very well have received a MsgApp and not scheduled the raft processor to append (+apply) it? I'm confused since in the graphs above, @tbg annotated the point where we stopped receiving MsgApps as the point where we were "done appending".

[graph: MsgApps received stop at a clear point while command application continues well past it]

Can you share a zoomed-in graph of both sub-levels and L0 sst-counts? I am curious when admission control would have kicked in, based on the current thresholds

See graphs below. It would've kicked in almost instantly, while we were still issuing catchup MsgApps.

[graph: L0 sub-levels and sst-counts during catchup]

Is this showing the activity on the range at the leaseholder, and not the appends as received by the catching up node?

The middle graph above is showing the rate of MsgApps received. See my naive question above, asking whether that maps 1:1 to log appends on the catching up node.

@tbg
Member Author

tbg commented Feb 20, 2023

Does the MsgApp receive point correspond to log append?

The delay should be very short, since an MsgApp goes into the raft receive queue and this queue is fully emptied into the raft group on the next handle loop, which will emit them to storage in the subsequent one. So ~seconds on an overloaded node.

However, log application is bounded: we apply at most 64 MB per Ready, so if we manage to stuff in more than 64 MB of log appends per Ready, we would build up a backlog.

64 MB per Ready is quite a lot, so I'm not sure this can really happen - it was our best guess as to what's happening here.
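
For illustration, a hedged sketch of the kind of per-Ready byte budget described above; the constant and helper below are hypothetical, not the actual apply loop:

```go
package sketch

// maxApplyBytesPerReady mirrors the ~64 MB application bound mentioned above.
const maxApplyBytesPerReady = 64 << 20 // 64 MiB

type entry struct{ data []byte }

// takeCommittedBatch returns a prefix of the committed-but-unapplied entries
// whose total size stays within the per-Ready budget (always at least one
// entry). The remainder waits for later Ready iterations, which is how a
// backlog builds up if appends outpace application.
func takeCommittedBatch(committed []entry) (batch, rest []entry) {
	var size int
	for i, e := range committed {
		size += len(e.data)
		if size > maxApplyBytesPerReady && i > 0 {
			return committed[:i], committed[i:]
		}
	}
	return committed, nil
}
```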

@petermattis
Collaborator

Random idea: can we optimize application of a large number of Raft log entries? I believe log entries are applied via individual (unindexed?) Pebble batches. There would be some benefit to using fewer Pebble batches, and an even bigger benefit if we could figure out how to convert log entry application into an sstable to be ingested. That feels hard because log entry application might overwrite keys written by earlier log entries. There might be something doable here by providing a way to transform a Pebble batch into an sstable, omitting overwrites. The idea here might be a non-starter. Feels worthwhile to loop in some Storage folks to think about the possibilities here.
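
As a rough illustration of the "fewer Pebble batches" direction (not how the kvserver apply path actually works, which involves MVCC and below-raft side effects), the effects of several consecutive entries could be coalesced into one Pebble batch and committed once; later writes to the same key simply overwrite earlier ones inside the batch:

```go
package sketch

import "github.com/cockroachdb/pebble"

// kv is a simplified stand-in for the key/value effects of a raft entry.
type kv struct{ key, value []byte }

// applyEntriesCoalesced writes the effects of many entries through a single
// Pebble batch and commits once at the end, instead of one batch per entry.
func applyEntriesCoalesced(db *pebble.DB, entries [][]kv) error {
	b := db.NewBatch()
	defer b.Close()
	for _, writes := range entries {
		for _, w := range writes {
			if err := b.Set(w.key, w.value, nil); err != nil {
				return err
			}
		}
	}
	return b.Commit(pebble.Sync)
}
```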

@erikgrinaker
Contributor

Random idea: can we optimize application of a large number of Raft log entries?

Wrote up #98363 with a few comments.

@sumeerbhola
Collaborator

sumeerbhola commented Oct 19, 2023

I am not sure there is anything to do for AC here, given the "it was our best guess as to what's happening here" comment in #97044 (comment). If log append and state machine application happen near each other (< 15s apart), the existing AC flow control mechanisms should suffice.

And for node restarts there is an existing issue #98710. So I am removing the AC label.

@sumeerbhola sumeerbhola removed their assignment Oct 19, 2023
@exalate-issue-sync exalate-issue-sync bot added T-kv KV Team and removed T-kv-replication labels Jun 28, 2024