Columnar data structures and array-oriented APIs for decoding TSM data #10024

Merged
merged 7 commits into master from sgc-batch
Jul 13, 2018

Conversation

stuartcarnie
Contributor

@stuartcarnie stuartcarnie commented Jul 2, 2018

Partial implementation of #9981 through to and including "Add batch block decoder APIs".

NOTE: The goal of this PR is to teach the low-level TSM decoders how to operate on entire blocks of encoded TSM data. These new APIs will be used by the storage read path in a follow-up PR.


  • Suggested approach is to review individual commits, which are grouped into specific units of work
  • Benchmarks for improvements to the typed decoders (Float, String, Boolean, Integer and Time)
  • Benchmarks for improvements to the TSM block decoders (Float, String, Boolean, Integer / Unsigned)
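
For orientation, here is a rough sketch of the columnar shape these APIs work with. The `FloatArray` layout matches the diff below; the package, function name, and body are illustrative only, not the exact code added in this PR.

```go
package sketch // hypothetical package; illustrative only

// FloatArray holds decoded timestamps and values as separate, parallel
// slices, matching the columnar layout used by the new batch decoders.
type FloatArray struct {
	Timestamps []int64
	Values     []float64
}

// decodeFloatBlock is a stand-in for a batch block decoder: it decodes an
// entire encoded TSM block into a in a single call, reusing a's backing
// slices where possible, instead of yielding one (time, value) pair per
// iterator step.
func decodeFloatBlock(block []byte, a *FloatArray) error {
	// 1. split block into its encoded timestamp and value sections
	// 2. decode all timestamps into a.Timestamps
	// 3. decode all values into a.Values
	// (decoding details elided in this sketch)
	_ = block
	a.Timestamps = a.Timestamps[:0]
	a.Values = a.Values[:0]
	return nil
}
```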

@stuartcarnie stuartcarnie self-assigned this Jul 2, 2018
@ghost ghost added the review label Jul 2, 2018
@stuartcarnie stuartcarnie force-pushed the sgc-batch branch 2 times, most recently from d65143c to eb057ad July 2, 2018 23:38
@influxdata influxdata deleted a comment from hercules-influx Jul 2, 2018
@influxdata influxdata deleted a comment from hercules-influx Jul 2, 2018
@influxdata influxdata deleted a comment from hercules-influx Jul 2, 2018
@influxdata influxdata deleted a comment from hercules-influx Jul 2, 2018
@influxdata influxdata deleted a comment from hercules-influx Jul 2, 2018
@influxdata influxdata deleted a comment from hercules-influx Jul 3, 2018
@stuartcarnie stuartcarnie requested review from e-dard and nathanielc July 3, 2018 14:45
@stuartcarnie stuartcarnie force-pushed the sgc-batch branch 2 times, most recently from 0f2d534 to f03d890 July 10, 2018 20:27
Contributor

@nathanielc nathanielc left a comment

LGTM

The basic changes make sense and there is good test coverage, so if/when we find bugs it should be easy to fix them.

Contributor

@e-dard e-dard left a comment

LGTM, just a couple of nits/suggestions 👍

@@ -0,0 +1,994 @@
// Package simple8b implements the 64bit integer encoding algorithm as published
Contributor

@stuartcarnie let's maybe just add a note that this is mainly github.com/jwilder/encoding and document the additions in the source code?

b.Fatalf("expected to read %d booleans, but read %d", size, n)
}
func BenchmarkBooleanDecoder_DecodeAll(b *testing.B) {
benchmarks := []struct {
Contributor

nit: []int ?

}
for _, bm := range benchmarks {
b.Run(fmt.Sprintf("%d", bm.n), func(b *testing.B) {
size := bm.n
Contributor

This really doesn't matter, but an alternative is to move the setup outside of b.Run, then you don't need to reset the timer.

Contributor Author

I like it; going to update all the benchmarks I touched with this
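
For reference, the suggested pattern looks roughly like this (a hypothetical, self-contained benchmark rather than the PR's actual code): because the per-size setup happens before `b.Run`, the sub-benchmark only times the work and `b.ResetTimer()` is never needed.

```go
package sketch

import (
	"fmt"
	"testing"
)

// Hypothetical benchmark illustrating the suggestion above.
func BenchmarkDecodeAllPattern(b *testing.B) {
	for _, size := range []int{1, 1000, 100000} {
		// Setup runs outside b.Run, so it is never measured and no
		// b.ResetTimer() call is required inside the sub-benchmark.
		src := make([]bool, size) // stand-in for encoding `size` values
		b.Run(fmt.Sprintf("%d", size), func(b *testing.B) {
			dst := make([]bool, 0, size)
			for i := 0; i < b.N; i++ {
				dst = append(dst[:0], src...) // stand-in for a DecodeAll call
				if len(dst) != size {
					b.Fatalf("expected %d values, got %d", size, len(dst))
				}
			}
		})
	}
}
```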

@@ -0,0 +1,1007 @@
// Generated by tmpl
// https://github.com/benbjohnson/tmpl
Contributor

Was this generated by https://github.com/benbjohnson/tmpl or does this need to be updated?

Contributor Author

Indeed this was generated by https://github.com/benbjohnson/tmpl

}

// Include returns the subset values between min and max inclusive. The values must
// be deduplicated and sorted before calling Exclude or the results are undefined.
Contributor

s/Exclude/Include

// Normally, both a and b should not contain duplicates. Due to a bug in older versions, it's
// possible stored blocks might contain duplicate values. Remove them if they exist before
// merging.
// a = a.Deduplicate()
Contributor

Are we sure this is safe to do? Was the bug in post-1.0 releases?

Contributor Author

I'd like to talk about this a bit more – I'm thinking 2.0, this may be important

Contributor Author

@stuartcarnie stuartcarnie Jul 13, 2018

@edd, here is the commit for 1.0.0beta that adds the duplicate point check. Presumably the bug was discovered earlier, which may be this one, fixed in 0.11.0
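
To make the trade-off concrete, here is a rough, hypothetical sketch of the defensive step being discussed (illustrative names, not this PR's merge code): the merge assumes sorted, duplicate-free inputs, so blocks that might have been written by the older, buggy releases get deduplicated first.

```go
package sketch

// dedupe removes adjacent duplicate timestamps from a sorted slice, in place.
func dedupe(ts []int64) []int64 {
	if len(ts) < 2 {
		return ts
	}
	out := ts[:1]
	for _, t := range ts[1:] {
		if t != out[len(out)-1] {
			out = append(out, t)
		}
	}
	return out
}

// mergeTimestamps merges two sorted timestamp slices. If blocks written by
// old releases may still contain duplicates, both inputs are deduplicated
// first; dropping that guard is only safe once no such blocks remain on disk.
func mergeTimestamps(a, b []int64) []int64 {
	a, b = dedupe(a), dedupe(b)
	out := make([]int64, 0, len(a)+len(b))
	i, j := 0, 0
	for i < len(a) && j < len(b) {
		switch {
		case a[i] < b[j]:
			out = append(out, a[i])
			i++
		case a[i] > b[j]:
			out = append(out, b[j])
			j++
		default: // equal timestamps: keep the newer block's point (b)
			out = append(out, b[j])
			i, j = i+1, j+1
		}
	}
	out = append(out, a[i:]...)
	return append(out, b[j:]...)
}
```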


if b.MaxTime() < a.MinTime() {
var tmp FloatArray
tmp.Timestamps = append(b.Timestamps, a.Timestamps...)
Contributor

It didn't occur to me that this needed to be a temp slice. I would have thought a.Timestamps = append(b.Timestamps, a.Timestamps...) would be OK?

Contributor Author

I am maintaining that b is not mutated. Perhaps it is ok to do this, as len(b.Timestamps) doesn't change. I'll check assumptions elsewhere to make sure this is ok.
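
The subtlety is Go's append aliasing: `append(b.Timestamps, a.Timestamps...)` writes into `b`'s backing array whenever it has spare capacity, even though `len(b.Timestamps)` itself never changes. A small standalone illustration (not code from the PR):

```go
package main

import "fmt"

func main() {
	// b has spare capacity beyond its length.
	b := make([]int64, 2, 8)
	b[0], b[1] = 10, 20
	a := []int64{30, 40}

	// The tmp approach in the diff above copies into a fresh slice, so b's
	// backing array is never written to.
	tmp := make([]int64, 0, len(b)+len(a))
	tmp = append(append(tmp, b...), a...)
	fmt.Println(tmp) // [10 20 30 40]

	// Appending directly reuses b's backing array because cap(b) is large
	// enough: a's elements land just past b's visible length.
	merged := append(b, a...)
	fmt.Println(merged) // [10 20 30 40]

	// b's length is unchanged, but its backing array now holds a's values,
	// which a later re-slice or append on b would expose or overwrite.
	fmt.Println(b[:4:4]) // [10 20 30 40]
}
```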

* includes additional APIs for batch decoding of byte slices to improve
  performance
* fix for `unpack120` that was decoding 240 values rather than 120
These benchmarks will also be implemented for the batched decoders to compare
performance.
* APIs decode an entire byte slice of encoded data into the provided
  `dst` slice
* APIs are stateless and in almost all cases avoid any allocations
* Intended to be used by future batch-oriented TSM block decode APIs
* duplicated tests from original iterator-based APIs
* separate slices for time and values
* structured to be Arrow ready
* batch decoders fill time and value slices independently, which
  vastly improves performance (benchmarks linked in PR)
* These APIs will be used by the `TSMReader` and `KeyCursor` types via new
  APIs that follow a similar naming convention (Array)
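
As a rough illustration of the "decode an entire byte slice into the provided `dst`" shape described above (the name and toy encoding below are stand-ins, not the decoders added in this PR):

```go
package sketch

// booleanDecodeAll is a hypothetical stateless batch decoder: it decodes a
// whole encoded block into dst in one call, growing dst only when its
// capacity is too small, so callers can reuse buffers and avoid allocations.
func booleanDecodeAll(src []byte, dst []bool) ([]bool, error) {
	n := len(src) * 8 // toy encoding: one bit per value
	if cap(dst) < n {
		dst = make([]bool, n)
	}
	dst = dst[:n]
	for i := range dst {
		dst[i] = src[i/8]&(1<<uint(i%8)) != 0
	}
	return dst, nil
}
```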
@hercules-influx
Contributor

During a run of megacheck the following issues were discovered:

/tmp/787470241/src/github.com/influxdata/influxdb/cmd/store/query/query.go:404:5: this value of line is never used (SA4006)
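
(SA4006 flags a value that is assigned to a variable but never read before being overwritten. A hypothetical illustration, unrelated to the actual code at query.go:404:)

```go
package sketch

import "bufio"

// skipHeader discards the first line and returns the second one. The first
// value assigned to line is overwritten without ever being read, which is
// exactly what staticcheck's SA4006 check reports.
func skipHeader(r *bufio.Reader) (string, error) {
	line, err := r.ReadString('\n') // SA4006: this value of line is never used
	if err != nil {
		return "", err
	}
	line, err = r.ReadString('\n')
	if err != nil {
		return "", err
	}
	return line, nil
}
```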

@stuartcarnie stuartcarnie merged commit 0841c51 into master Jul 13, 2018
@ghost ghost removed the review label Jul 13, 2018
@stuartcarnie stuartcarnie deleted the sgc-batch branch July 13, 2018 23:00