storage: return byte batches in ScanResponse #26553
Conversation
This is essentially a rework of #18913.
A definite simplification to skip the batch repr header to make concatenation easier. I wonder why we're not seeing a perf benefit.

Review status: 0 of 11 files reviewed at latest revision, all discussions resolved, some commit checks failed.

pkg/roachpb/api.go, line 251 at r1 (raw file):
The TODO from my PR still applies: we need to handle one response which has Rows set while the other response has BatchResponse set. Converting Rows into BatchResponse would be the straightforward solution. Also need to make sure this is tested (via a unit test in this package).

pkg/roachpb/api.proto, line 348 at r1 (raw file):

`bytes` is a bit general for my taste. `accepts_batch_repr`? Though I suppose the response is not precisely batch repr format. Perhaps `accepts_batch_response`, though that suffers from the "bit general" criticism as well.
pkg/roachpb/api.proto, line 365 at r1 (raw file):
I think the name here should match the `accepts_*` name. That is, `batch_repr` (if we decide to use `accepts_batch_repr`).

pkg/storage/engine/rocksdb.go, line 2700 at r1 (raw file):
This appears to be unused.
Review status: complete! 0 of 0 LGTMs obtained

pkg/roachpb/api.proto, line 348 at r1 (raw file):
I'd highly recommend making this an enum. Somehow bools in proto messages always end up later wanting to have been an enum from the start. In this case, I can easily imagine us wanting to add a columnar format at some point.
this all looks great to me, provided that the response format is not configurable, but is instead dictated by the cluster version

Review status: complete! 0 of 0 LGTMs obtained

pkg/roachpb/api.go, line 251 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…
not if you gate everything on a cluster version you don't
Review status: complete! 0 of 0 LGTMs obtained

pkg/roachpb/api.go, line 251 at r1 (raw file):

Previously, andreimatei (Andrei Matei) wrote…
Cluster versions don't flip atomically across the cluster. It'll work if you have the client code that sets the accept-format field controlled by a cluster version, but not if each node is deciding on the server-side what format to send.

pkg/roachpb/api.proto, line 348 at r1 (raw file):

Previously, danhhz (Daniel Harrison) wrote…
+1 to a `format` enum, but note that it's legal (from a proto wire format perspective) to replace a `bool` with an enum or an int (true and false become 1 and 0).
Force-pushed from c0c8af5 to fdfcf96
Picking this up again now. I don't think it's feasible to make the response format depend on the cluster version.
Reviewable status: complete! 0 of 0 LGTMs obtained
pkg/roachpb/api.go, line 251 at r1 (raw file):
Previously, bdarnell (Ben Darnell) wrote…
Cluster versions don't flip atomically across the cluster. It'll work if you have the client code that sets the accept-format field controlled by a cluster version, but not if each node is deciding on the server-side what format to send.
This code doesn't suffer from the lack of rows-batch merging because the kvfetcher handles both the batch format and the kv format. I don't see a way of avoiding that, do you? And if we still have to do that, what's the point of gating the request or response by cluster version?
pkg/roachpb/api.proto, line 348 at r1 (raw file):
Previously, bdarnell (Ben Darnell) wrote…
+1 to a `format` enum, but note that it's legal (from a proto wire format perspective) to replace a `bool` with an enum or an int (true and false become 1 and 0).
Done.
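For concreteness, here is a minimal proto sketch of the bool-to-enum swap being discussed. The enum name, value names, and field number are assumptions for illustration, not necessarily what the PR settled on; the wire-compatibility point holds because proto encodes both `bool` and `enum` as varints, so an old `true` decodes as enum value 1 and `false` as 0.

```proto
// Illustrative sketch only; names and field numbers are assumptions,
// not necessarily this PR's exact definitions.
syntax = "proto3";

enum ScanFormat {
  // 0 takes over from the old `false`: results returned as repeated
  // KeyValue messages in the rows field.
  KEY_VALUES = 0;
  // 1 takes over from the old `true`: results returned as a single
  // flat byte batch in the batch_response field.
  BATCH_RESPONSE = 1;
}

message ScanRequest {
  // Formerly a bool; reusing the same field number is wire-compatible
  // because bool and enum are both encoded as varints.
  ScanFormat scan_format = 4;
}
```

This also leaves room for the columnar format mentioned above: adding a third enum value later is a backward-compatible change, whereas a bool would have to be deprecated and replaced.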
pkg/roachpb/api.proto, line 348 at r1 (raw file):
Previously, petermattis (Peter Mattis) wrote…
`bytes` is a bit general for my taste. `accepts_batch_repr`? Though I suppose the response is not precisely batch repr format. Perhaps `accepts_batch_response`, though that suffers from the "bit general" criticism as well.
Done.
pkg/roachpb/api.proto, line 365 at r1 (raw file):
Previously, petermattis (Peter Mattis) wrote…
I think the name here should match the `accepts_*` name. That is, `batch_repr` (if we decide to use `accepts_batch_repr`).
Done.
pkg/storage/engine/rocksdb.go, line 2700 at r1 (raw file):
Previously, petermattis (Peter Mattis) wrote…
This appears to be unused.
Look again! It's used in both the new API and in the old pathway. Look at `buildScanResumeKey`.
Reviewed 11 of 11 files at r2.
Reviewable status: complete! 0 of 0 LGTMs obtained
pkg/roachpb/api.go, line 251 at r1 (raw file):
Previously, jordanlewis (Jordan Lewis) wrote…
This code doesn't suffer from the lack of rows-batch merging because the kvfetcher handles both the batch format and the kv format. I don't see a way of avoiding that, do you? And if we still have to do that, what's the point of gating the request or response by cluster version?
KVFetcher supports one or the other, but not both at the same time. Without a cluster version, you could see a scan split across servers of different versions, and the combined response would have both Rows and BatchResponse set. If callers can be updated to handle both in the same response, then it should be fine to do this without a cluster version, but I think the cluster version is probably the simpler approach (consider ordering - would the consumer need to merge-sort the two result sets together?)
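To make the mixed-version hazard concrete, here is a hedged Go sketch using stand-in types rather than the real `roachpb` definitions. It shows the shape of the combine step the TODO describes: detect (or normalize away) a response that carries both formats before appending.

```go
package main

import (
	"errors"
	"fmt"
)

// Stand-in types; the real definitions live in pkg/roachpb.
type KeyValue struct {
	Key, Value []byte
}

type ScanResponse struct {
	Rows          []KeyValue // legacy format
	BatchResponse []byte     // flat byte-batch format
}

// combine merges src into dst. In a mixed-version cluster, one partial
// response may carry Rows while the other carries BatchResponse; per the
// TODO above, the straightforward fix is to normalize to one format
// before appending. A naive append would also scramble key ordering,
// which is the merge-sort concern raised in the comment.
func combine(dst, src *ScanResponse) error {
	if (len(dst.Rows) > 0 && len(src.BatchResponse) > 0) ||
		(len(dst.BatchResponse) > 0 && len(src.Rows) > 0) {
		return errors.New("mixed response formats: normalize before merging")
	}
	dst.Rows = append(dst.Rows, src.Rows...)
	dst.BatchResponse = append(dst.BatchResponse, src.BatchResponse...)
	return nil
}

func main() {
	dst := &ScanResponse{Rows: []KeyValue{{Key: []byte("a"), Value: []byte("1")}}}
	src := &ScanResponse{BatchResponse: []byte("opaque batch bytes")}
	fmt.Println(combine(dst, src)) // mixed formats detected
}
```

Gating the format on a cluster version, as discussed below, makes this mixed case unreachable instead of requiring the normalization path.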
pkg/roachpb/api.proto, line 376 at r2 (raw file):
```proto
repeated KeyValue intent_rows = 3 [(gogoproto.nullable) = false];
// If set, the results of the scan in batch format - the key/value pairs are
```
This format is different from the BatchRepr format defined in rocksdb; I'd like for it to have a different name. This description also looks inaccurate - I don't see any varints in MVCCScanDecodeKeyValue.
While we're changing the on-the-wire format, do we want to think about things like prefix-compressing the keys?
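For readers following along: the format under discussion is a flat byte buffer of key/value pairs with fixed-width length prefixes rather than varints (hence the comment above). The decoder below is a sketch that assumes a simple little-endian length-prefixed layout; the actual layout is whatever the C++ side and `MVCCScanDecodeKeyValue` define.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// decodeKV pops one key/value pair off the front of a flat byte batch,
// returning the pair and the remaining buffer. The layout assumed here
// (two little-endian uint32 lengths, then key bytes, then value bytes)
// is illustrative; the real layout is defined by the storage engine.
func decodeKV(repr []byte) (key, value, rest []byte, err error) {
	if len(repr) < 8 {
		return nil, nil, nil, errors.New("short buffer")
	}
	keyLen := binary.LittleEndian.Uint32(repr[0:4])
	valLen := binary.LittleEndian.Uint32(repr[4:8])
	repr = repr[8:]
	if uint64(len(repr)) < uint64(keyLen)+uint64(valLen) {
		return nil, nil, nil, errors.New("truncated entry")
	}
	key = repr[:keyLen]
	value = repr[keyLen : keyLen+valLen]
	// Note: key and value alias the batch's backing array; no per-pair
	// allocation happens, which is the point of the format.
	return key, value, repr[keyLen+valLen:], nil
}

func main() {
	// Build a two-entry batch by hand, then walk it.
	var batch []byte
	for _, kv := range [][2]string{{"a", "1"}, {"bb", "22"}} {
		var lens [8]byte
		binary.LittleEndian.PutUint32(lens[0:4], uint32(len(kv[0])))
		binary.LittleEndian.PutUint32(lens[4:8], uint32(len(kv[1])))
		batch = append(batch, lens[:]...)
		batch = append(batch, kv[0]...)
		batch = append(batch, kv[1]...)
	}
	for len(batch) > 0 {
		k, v, rest, err := decodeKV(batch)
		if err != nil {
			panic(err)
		}
		fmt.Printf("%s=%s\n", k, v)
		batch = rest
	}
}
```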
pkg/storage/batcheval/cmd_scan.go, line 46 at r2 (raw file):
```go
if err != nil {
	return result.Result{}, err
}
```
Can we do `else engine.MVCCScan` here and share the NumKeys/ResumeSpan/ReadConsistency stuff after the two branches?
Or would it make sense to always call ScanToBytes and convert to KVs if required here instead of in MVCCScan?
Force-pushed from fdfcf96 to 14f0009
Reviewable status: complete! 0 of 0 LGTMs obtained
pkg/roachpb/api.go, line 251 at r1 (raw file):
Previously, bdarnell (Ben Darnell) wrote…
KVFetcher supports one or the other, but not both at the same time. Without a cluster version, you could see a scan split across servers of different versions, and the combined response would have both Rows and BatchResponse set. If callers can be updated to handle both in the same response, then it should be fine to do this without a cluster version, but I think the cluster version is probably the simpler approach (consider ordering - would the consumer need to merge-sort the two result sets together?)
Good point. So if we make the client request the new format only when the cluster version is new enough, then we guarantee that every node in the cluster is running software new enough to correctly return the new format.
pkg/roachpb/api.proto, line 376 at r2 (raw file):
Previously, bdarnell (Ben Darnell) wrote…
This format is different from the BatchRepr format defined in rocksdb; I'd like for it to have a different name. This description also looks inaccurate - I don't see any varints in MVCCScanDecodeKeyValue.
While we're changing the on-the-wire format, do we want to think about things like prefix-compressing the keys?
`batch_response` isn't different enough from `batchRepr`?
Do you think doing prefix compression is worth the computation cost? It seems like a tradeoff - decrease the on-the-wire size, but increase computation cost and code complexity on the send and receive side. I was hoping this change could be a simple win that also paves the way for an easier time introducing new formats. If we think that prefix compression is a good idea, maybe we could add it in a separate effort.
pkg/storage/batcheval/cmd_scan.go, line 46 at r2 (raw file):
Previously, bdarnell (Ben Darnell) wrote…
Can we do `else engine.MVCCScan` here and share the NumKeys/ResumeSpan/ReadConsistency stuff after the two branches? Or would it make sense to always call ScanToBytes and convert to KVs if required here instead of in MVCCScan?
Done.
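The resulting structure might look roughly like the sketch below, with simplified stand-in types and stub scan functions rather than the real `batcheval`/`engine` code: branch only on which scan variant to call, then set the shared trailer once.

```go
package main

import "fmt"

// Simplified stand-ins for the real roachpb/batcheval types.
type Span struct{ Key, EndKey []byte }

type ScanRequest struct {
	ScanFormat int // 0 = key/value rows, 1 = byte batch
}

type ScanResponse struct {
	Rows          [][2][]byte
	BatchResponse []byte
	NumKeys       int64
	ResumeSpan    *Span
}

// Stub scans standing in for the bytes- and KV-returning MVCC variants.
func scanToBytes(req ScanRequest) ([]byte, int64, *Span, error) {
	return []byte("a\x001"), 1, nil, nil
}

func scanToKVs(req ScanRequest) ([][2][]byte, int64, *Span, error) {
	return [][2][]byte{{[]byte("a"), []byte("1")}}, 1, nil, nil
}

// evalScan branches only on which scan variant to call, then shares the
// NumKeys/ResumeSpan trailer, as suggested in the review comment above.
func evalScan(req ScanRequest, reply *ScanResponse) error {
	var numKeys int64
	var resume *Span
	var err error
	if req.ScanFormat == 1 {
		reply.BatchResponse, numKeys, resume, err = scanToBytes(req)
	} else {
		reply.Rows, numKeys, resume, err = scanToKVs(req)
	}
	if err != nil {
		return err
	}
	// Shared trailer: set once after the two branches instead of being
	// duplicated in each.
	reply.NumKeys = numKeys
	reply.ResumeSpan = resume
	return nil
}

func main() {
	var reply ScanResponse
	if err := evalScan(ScanRequest{ScanFormat: 1}, &reply); err != nil {
		panic(err)
	}
	fmt.Println(reply.NumKeys, len(reply.BatchResponse))
}
```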
Force-pushed from 14f0009 to 383cea8
Force-pushed from 79a2c31 to e0e8d5f
RFAL
Reviewable status: complete! 0 of 0 LGTMs obtained
pkg/roachpb/api.go, line 251 at r1 (raw file):
Previously, jordanlewis (Jordan Lewis) wrote…
Good point. So if we make the client set the new format if the cluster version is new enough, then we guarantee that the whole cluster has software version new enough to correctly return the new format.
Done. Note that I added the check inside of dist sender itself, since it's quite painful to plumb a cluster setting into everywhere rowfetcher+kvfetcher are instantiated. This might be too gross to keep, though...
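A sketch of that client-side gating, with stand-in names for the cluster-version plumbing (the actual version key and APIs in the PR may differ). The important property is that only the requester consults the version gate, so a request's format is fixed before it fans out to servers:

```go
package main

import "fmt"

// Stand-ins for the real cluster-version plumbing.
type Version int

// Hypothetical version gate; the real code would use whichever cluster
// version introduced the batch response format.
const VersionScanBatchResponse Version = 1

type ClusterSettings struct{ active Version }

func (s ClusterSettings) IsActive(v Version) bool { return s.active >= v }

type ScanRequest struct {
	ScanFormat int // 0 = rows, 1 = byte batch
}

// prepareScan is where the sender (DistSender, per the comment above)
// decides the format: request the byte-batch format only once every
// node is known to understand it. Servers never choose the format
// themselves, which avoids the mixed-format responses discussed earlier.
func prepareScan(st ClusterSettings, req *ScanRequest) {
	if st.IsActive(VersionScanBatchResponse) {
		req.ScanFormat = 1
	}
}

func main() {
	req := ScanRequest{}
	prepareScan(ClusterSettings{active: 1}, &req)
	fmt.Println("scan format:", req.ScanFormat)
}
```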
Force-pushed from e0e8d5f to 99105f8
Force-pushed from 99105f8 to c07d6f7
Reviewed 16 of 16 files at r3.
Reviewable status: complete! 0 of 0 LGTMs obtained
pkg/roachpb/api.go, line 251 at r1 (raw file):
Previously, jordanlewis (Jordan Lewis) wrote…
Done. Note that I added the check inside of dist sender itself, since it's quite painful to plumb a cluster setting into everywhere rowfetcher+kvfetcher are instantiated. This might be too gross to keep, though...
I think I'm OK with the hack for one cycle.
pkg/roachpb/api.proto, line 376 at r2 (raw file):
Previously, jordanlewis (Jordan Lewis) wrote…
batch_response isn't different enough from batchRepr?
Do you think doing prefix compression is worth the computation cost? It seems like a tradeoff - decrease the on-the-wire size, but increase computation cost and code complexity on the send and receive side. I was hoping this change could be a simple win that also paves the way for an easier time introducing new formats. If we think that prefix compression is a good idea, maybe we could add it in a separate effort.
Fine with me.
pkg/storage/engine/mvcc.go, line 1725 at r3 (raw file):
```go
) ([]byte, int64, *roachpb.Span, []roachpb.Intent, error) {
	kvData, numKvs, resumeSpan, intents, err := mvccScanInternal(
		ctx, engine, iter, key, endKey, max, timestamp, consistent, tombstones, txn, reverse)
	if err != nil || resumeSpan != nil {
```
This doesn't look right. Can't resume spans be returned with non-empty kvData (which is being discarded here)?
Force-pushed from c07d6f7 to 799746d
Reviewable status: complete! 0 of 0 LGTMs obtained
pkg/storage/engine/mvcc.go, line 1725 at r3 (raw file):
Previously, bdarnell (Ben Darnell) wrote…
This doesn't look right. Can't resume spans be returned with non-empty kvData (which is being discarded here)?
Not from `mvccScanInternal`, which only returns the resume span if there were intents encountered. The caller of `mvccScanInternal` is responsible for determining the resume span in the normal case. I've cleaned up the comments to reflect this fact.
Reviewable status: complete! 0 of 0 LGTMs obtained
pkg/storage/engine/mvcc.go, line 1725 at r3 (raw file):
OK, I was looking at the "before" version of mvcc.go and it wasn't clear that this was a change in this PR.
> Not from mvccScanInternal, which only returns the resume span if there were intents encountered.

Not quite true - it also returns a resume span if `max == 0`. Maybe the caller should be responsible for this too, so that the `|| resumeSpan != nil` is redundant (a resume span will only be set if there is an error). I'm more comfortable treating errors as exceptional cases than having multiple return combinations that have to be considered.

`mvccScanInternal` will also never return `intents` with an error or resume span, so by this logic the return line should be `return nil, 0, resumeSpan, nil, err`.
Force-pushed from 799746d to ffc4a71
Reviewable status: complete! 0 of 0 LGTMs obtained
pkg/storage/engine/mvcc.go, line 1725 at r3 (raw file):
Previously, bdarnell (Ben Darnell) wrote…

OK, I was looking at the "before" version of mvcc.go and it wasn't clear that this was a change in this PR.

> Not from mvccScanInternal, which only returns the resume span if there were intents encountered.

Not quite true - it also returns a resume span if `max == 0`. Maybe the caller should be responsible for this too, so that the `|| resumeSpan != nil` is redundant (a resume span will only be set if there is an error). I'm more comfortable treating errors as exceptional cases than having multiple return combinations that have to be considered. `mvccScanInternal` will also never return `intents` with an error or resume span, so by this logic the return line should be `return nil, 0, resumeSpan, nil, err`.
Good point. The client is now responsible for dealing with `max == 0`.
Force-pushed from ffc4a71 to 872b672
@petermattis want to take another look?
Yes, just haven't had the chance yet. I'll get to this soon.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale)
pkg/storage/engine/mvcc.go, line 1731 at r4 (raw file):
```go
	return nil, 0, &roachpb.Span{Key: key, EndKey: endKey}, nil, nil
}
kvData, numKvs, resumeSpan, intents, err := mvccScanInternal(
	ctx, engine, iter, key, endKey, max, timestamp, consistent, tombstones, txn, reverse)
```
Nit: line length.
pkg/storage/engine/mvcc.go, line 1770 at r4 (raw file):
```go
	return nil, &roachpb.Span{Key: key, EndKey: endKey}, nil, nil
}
kvData, numKvs, resumeSpan, intents, err := mvccScanInternal(
	ctx, engine, iter, key, endKey, max, timestamp, consistent, tombstones, txn, reverse)
```
Nit: Can you wrap some of these super-long lines? We try to limit line length to 100 characters.
Force-pushed from 872b672 to 0a8c55b
TFTRs!
bors r+
26553: storage: return byte batches in ScanResponse r=jordanlewis a=jordanlewis

Co-authored-by: Jordan Lewis <jordanthelewis@gmail.com>
Build succeeded
Awesome work! Were there benchmark results that went along with this change?
Decent but not amazing gains.
Hmm, I wonder if making this change inadvertently exposed us to worse memory behavior, since the GC can no longer precisely throw away keys that aren't used - instead, the GC can only throw away one of these batches when nobody is referencing the batch anymore. I'm exploring this as a possibility for what's going on in #36842.
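That retention hazard is easy to demonstrate in miniature, independent of the cockroach code: a sub-slice of a batch keeps the entire backing array reachable, because Go's GC frees whole backing arrays, not sub-ranges. A small illustrative sketch:

```go
package main

import "fmt"

// extractKey returns a key out of a large batch. Because Go slices
// alias their backing array, the returned 5-byte slice keeps the whole
// 1 MiB batch reachable: the GC frees backing arrays, not sub-ranges.
func extractKey(batch []byte) []byte {
	return batch[100:105] // pins all of batch
}

// extractKeyCopy trades one small allocation for releasing the batch:
// once no other references remain, the 1 MiB array can be collected.
func extractKeyCopy(batch []byte) []byte {
	return append([]byte(nil), batch[100:105]...)
}

func main() {
	batch := make([]byte, 1<<20)
	k1 := extractKey(batch)     // holding k1 retains the full 1 MiB
	k2 := extractKeyCopy(batch) // holding k2 retains only 5 bytes
	fmt.Println(len(k1), len(k2))
}
```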
Previously, the MVCCScan API call completely deserialized the batch it
got from C++ into a slice of roachpb.KeyValue, and sent that slice of
roachpb.KeyValue over gRPC via ScanResponse. This is needlessly
expensive for several reasons.
- gRPC must re-serialize what we sent it to a flat byte stream. But, we
already had a flat byte stream to begin with, before inflating it into
KeyValues. In effect, we're doing pointless deserialization and
reserialization.
- We needed to dynamically allocate a slice of roachpb.KeyValue on every
scan request, in buildScanResponse. This was the second largest cause of
allocations in our system, beside the first copy from C++ to Go. But,
it's pointless, since we're just going to throw that slice away again
when we either serialize to the network or iterate over it and inflate
the KeyValues into rows later down the pipe.
Now, MVCCScan can optionally skip this inflation and return the raw
write batch that it got from C++. The txnKVFetcher and rowFetcher are
modified to use this option. They now deserialize keys from the write
batch as necessary.
This results in a large decrease in the number of allocations performed
per scan. When going over the network, only 1 object has to be
marshalled and demarshalled (the batch) instead of the number of
returned keys. Also, we don't have to allocate the initial slice of
[]KeyValue, or any of the slices within Key or Value, to return data.
I haven't delved into modifying the relevant unit tests yet, but logic tests pass and I've been playing around with the resultant binary for some performance testing. I don't see much of a concrete performance change, but pprof reports reduced allocations as I'd expect.
Release note: None