Columnar data structures and array-oriented APIs for decoding TSM data #10024
Conversation
Force-pushed from d65143c to eb057ad
Force-pushed from 0f2d534 to f03d890
LGTM
The basic changes make sense and there is good test coverage, so if/when we find bugs it should be easy to fix them.
LGTM, just a couple of nits/suggestions 👍
@@ -0,0 +1,994 @@
// Package simple8b implements the 64bit integer encoding algorithm as published
@stuartcarnie let's maybe just add a note that this is mainly github.com/jwilder/encoding and document the additions in the source code?
tsdb/engine/tsm1/bool_test.go (Outdated)
b.Fatalf("expected to read %d booleans, but read %d", size, n) | ||
} | ||
func BenchmarkBooleanDecoder_DecodeAll(b *testing.B) { | ||
benchmarks := []struct { |
nit: `[]int`?
tsdb/engine/tsm1/bool_test.go (Outdated)
}
for _, bm := range benchmarks {
	b.Run(fmt.Sprintf("%d", bm.n), func(b *testing.B) {
		size := bm.n
This really doesn't matter, but an alternative is to move the setup outside of `b.Run`; then you don't need to reset the timer.
I like it; going to update all the benchmarks I touched with this
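For reference, a minimal sketch of that pattern, using made-up sizes and a stand-in workload rather than the PR's actual decoders: the per-size setup happens before `b.Run`, so the closure measures only the work under test and no `b.ResetTimer` call is needed.

```go
package sketch

import (
	"bytes"
	"fmt"
	"testing"
)

// BenchmarkDecodeAllSketch demonstrates keeping setup outside b.Run.
// The sizes and the append "workload" are placeholders for the real
// encode/DecodeAll calls touched in this PR.
func BenchmarkDecodeAllSketch(b *testing.B) {
	for _, n := range []int{10, 1000, 100000} { // hypothetical sizes
		src := bytes.Repeat([]byte{0xAA}, n) // setup: stand-in for an encoded block
		b.Run(fmt.Sprintf("%d", n), func(b *testing.B) {
			dst := make([]byte, 0, n)
			for i := 0; i < b.N; i++ {
				dst = append(dst[:0], src...) // stand-in for the DecodeAll call
			}
			_ = dst
		})
	}
}
```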
@@ -0,0 +1,1007 @@
// Generated by tmpl
// https://github.com/benbjohnson/tmpl
Was this generated by https://github.com/benbjohnson/tmpl, or does this need to be updated?
Indeed this was generated by https://github.com/benbjohnson/tmpl
tsdb/arrayvalues.gen.go (Outdated)
}

// Include returns the subset values between min and max inclusive. The values must
// be deduplicated and sorted before calling Exclude or the results are undefined.
s/Exclude/Include
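For illustration, one possible shape for `Include` on a columnar array, assuming time-sorted, deduplicated parallel slices. The field names follow the diff excerpts in this conversation; the method body is a sketch, not the PR's generated code.

```go
package sketch

import "sort"

// FloatArray keeps timestamps and values in separate, index-aligned slices.
type FloatArray struct {
	Timestamps []int64
	Values     []float64
}

// Include keeps only the entries whose timestamp lies in [min, max], assuming
// min <= max and that Timestamps is sorted and deduplicated, per the doc comment.
func (a *FloatArray) Include(min, max int64) {
	lo := sort.Search(len(a.Timestamps), func(i int) bool { return a.Timestamps[i] >= min })
	hi := sort.Search(len(a.Timestamps), func(i int) bool { return a.Timestamps[i] > max })
	a.Timestamps = a.Timestamps[lo:hi]
	a.Values = a.Values[lo:hi]
}
```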
// Normally, both a and b should not contain duplicates. Due to a bug in older versions, it's
// possible stored blocks might contain duplicate values. Remove them if they exist before
// merging.
// a = a.Deduplicate()
Are we sure this is safe to do? Was the bug in post-1.0 releases?
I'd like to talk about this a bit more. I'm thinking 2.0; this may be important.
@edd, here is the commit for 1.0.0beta that adds the duplicate point check. Presumably the bug was discovered earlier, which may be this one, fixed in 0.11.0
if b.MaxTime() < a.MinTime() {
	var tmp FloatArray
	tmp.Timestamps = append(b.Timestamps, a.Timestamps...)
It didn't occur to me that this needed to be a temp slice. I would have thought `a.Timestamps = append(b.Timestamps, a.Timestamps...)` would be OK?
I am maintaining that `b` is not mutated. Perhaps it is OK to do this, as `len(b.Timestamps)` doesn't change. I'll check assumptions elsewhere to make sure this is OK.
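The concern here is a general Go property: `append` can write into `b`'s backing array when it has spare capacity, even though `len(b)` itself never changes. A small standalone illustration (not the PR's code):

```go
package main

import "fmt"

func main() {
	backing := make([]int64, 2, 4) // len 2, cap 4
	backing[0], backing[1] = 10, 20

	b := backing[:2]
	a := []int64{30, 40}

	merged := append(b, a...) // fits within cap(b), so no reallocation happens

	fmt.Println(len(b), b)   // 2 [10 20] -- b itself looks untouched
	fmt.Println(merged)      // [10 20 30 40]
	fmt.Println(backing[:4]) // [10 20 30 40] -- b's backing array was written to
}
```

So whether the temp slice is needed depends on whether `b.Timestamps` can ever carry spare capacity that other readers of `b` rely on.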
* Includes additional APIs for batch decoding of byte slices to improve performance
* Fix for `unpack120`, which was decoding 240 values rather than 120
These benchmarks will also be implemented for the batched decoders to compare performance.
* APIs decode an entire byte slice of encoded data into the provided `dst` slice
* APIs are stateless and in almost all cases avoid any allocations
* Intended to be used by future batch-oriented TSM block decode APIs
* Duplicated tests from the original iterator-based APIs
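To make the shape of these batch APIs concrete, here is a toy decoder in the same spirit: stateless, whole-block, writing into a caller-provided `dst` and reusing its capacity. The function name and the trivial little-endian "encoding" are invented for illustration; the real decoders handle the actual TSM encodings.

```go
package sketch

import (
	"encoding/binary"
	"fmt"
)

// TimestampDecodeAllToy decodes a whole "block" of little-endian uint64 values
// into dst, growing dst only when its capacity is too small. It keeps no state
// between calls, mirroring the stateless, allocation-averse style described above.
func TimestampDecodeAllToy(b []byte, dst []int64) ([]int64, error) {
	if len(b)%8 != 0 {
		return dst[:0], fmt.Errorf("truncated block: %d bytes", len(b))
	}
	n := len(b) / 8
	if cap(dst) < n {
		dst = make([]int64, n)
	}
	dst = dst[:n]
	for i := range dst {
		dst[i] = int64(binary.LittleEndian.Uint64(b[i*8:]))
	}
	return dst, nil
}
```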
* Separate slices for time and values
* Structured to be Arrow ready
* Batch decoders fill the time and value slices independently, which vastly improves performance (benchmarks linked in the PR)
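A compact sketch of the layout being described: the iterator-style path yields one (time, value) pair at a time, whereas the columnar path keeps two index-aligned slices that the timestamp and value decoders can each fill independently. `FloatPoint` is a made-up name used only for contrast; the `FloatArray` fields follow the diff excerpts above.

```go
package sketch

// Row-oriented: one struct per point, so times and values are interleaved in memory.
type FloatPoint struct {
	UnixNano int64
	Value    float64
}

// Columnar: parallel slices, one per component. The timestamp block and the value
// block of a TSM block can each be decoded straight into its own slice, and the
// layout maps naturally onto Arrow-style arrays.
type FloatArray struct {
	Timestamps []int64
	Values     []float64
}
```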
* These APIs will be used by the `TSMReader` and `KeyCursor` types via new APIs that follow a similar naming convention (Array)
During a run of megacheck the following issues were discovered:
Partial implementation of #9981, through to and including "Add batch block decoder APIs".
NOTE: The goal of this PR is to teach the low-level TSM decoders how to operate on entire blocks of encoded TSM data. These new APIs will be used by the storage read service in a follow-up PR.