Columnar data structures and array-oriented APIs for decoding TSM data #10024

Merged
merged 7 commits into master from sgc-batch
Jul 13, 2018

Conversation

stuartcarnie
Contributor

@stuartcarnie stuartcarnie commented Jul 2, 2018

Partial implementation of #9981 through to and including "Add batch block decoder APIs".

NOTE: The goal of this PR is to teach the low-level TSM decoders how to operate on entire blocks of encoded TSM data. These new APIs will be used by the storage read path in a follow-up PR.


  • Suggested approach is to review individual commits, which are grouped into specific units of work
  • Benchmarks for improvements to the typed decoders (Float, String, Boolean, Integer and Time)
  • Benchmarks for improvements to the TSM block decoders (Float, String, Boolean, Integer / Unsigned)
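
For orientation, here is a rough sketch of the columnar shape these APIs work with. The `FloatArray` layout matches the diff below; the package, function name, and body are illustrative only, not the exact code added in this PR.

```go
package sketch // hypothetical package; illustrative only

// FloatArray holds decoded timestamps and values as separate, parallel
// slices, matching the columnar layout used by the new batch decoders.
type FloatArray struct {
	Timestamps []int64
	Values     []float64
}

// decodeFloatBlock is a stand-in for a batch block decoder: it decodes an
// entire encoded TSM block into a in a single call, reusing a's backing
// slices where possible, instead of yielding one (time, value) pair per
// iterator step.
func decodeFloatBlock(block []byte, a *FloatArray) error {
	// 1. split block into its encoded timestamp and value sections
	// 2. decode all timestamps into a.Timestamps
	// 3. decode all values into a.Values
	// (decoding details elided in this sketch)
	_ = block
	a.Timestamps = a.Timestamps[:0]
	a.Values = a.Values[:0]
	return nil
}
```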

@stuartcarnie stuartcarnie self-assigned this Jul 2, 2018
@ghost ghost added the review label Jul 2, 2018
@stuartcarnie stuartcarnie force-pushed the sgc-batch branch 2 times, most recently from d65143c to eb057ad July 2, 2018 23:38
@influxdata influxdata deleted a comment from hercules-influx Jul 2, 2018
@influxdata influxdata deleted a comment from hercules-influx Jul 2, 2018
@influxdata influxdata deleted a comment from hercules-influx Jul 2, 2018
@influxdata influxdata deleted a comment from hercules-influx Jul 2, 2018
@influxdata influxdata deleted a comment from hercules-influx Jul 2, 2018
@influxdata influxdata deleted a comment from hercules-influx Jul 3, 2018
@stuartcarnie stuartcarnie requested review from e-dard and nathanielc July 3, 2018 14:45
@stuartcarnie stuartcarnie force-pushed the sgc-batch branch 2 times, most recently from 0f2d534 to f03d890 July 10, 2018 20:27
Contributor

@nathanielc nathanielc left a comment

LGTM

The basic changes make sense and there is good test coverage, so if/when we find bugs it should be easy to fix them.

Contributor

@e-dard e-dard left a comment

LGTM, just a couple of nits/suggestions 👍

@@ -0,0 +1,994 @@
// Package simple8b implements the 64bit integer encoding algorithm as published
Contributor

@stuartcarnie let's maybe just add a note that this is mainly github.com/jwilder/encoding and document the additions in the source code?

b.Fatalf("expected to read %d booleans, but read %d", size, n)
}
func BenchmarkBooleanDecoder_DecodeAll(b *testing.B) {
benchmarks := []struct {
Contributor

nit: []int ?

}
for _, bm := range benchmarks {
b.Run(fmt.Sprintf("%d", bm.n), func(b *testing.B) {
size := bm.n
Contributor

This really doesn't matter, but an alternative is to move the setup outside of b.Run, then you don't need to reset the timer.

Contributor Author

I like it; going to update all the benchmarks I touched with this
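
For reference, the suggested pattern looks roughly like this (a hypothetical, self-contained benchmark rather than the PR's actual code): because the per-size setup happens before `b.Run`, the sub-benchmark only times the work and `b.ResetTimer()` is never needed.

```go
package sketch

import (
	"fmt"
	"testing"
)

// Hypothetical benchmark illustrating the suggestion above.
func BenchmarkDecodeAllPattern(b *testing.B) {
	for _, size := range []int{1, 1000, 100000} {
		// Setup runs outside b.Run, so it is never measured and no
		// b.ResetTimer() call is required inside the sub-benchmark.
		src := make([]bool, size) // stand-in for encoding `size` values
		b.Run(fmt.Sprintf("%d", size), func(b *testing.B) {
			dst := make([]bool, 0, size)
			for i := 0; i < b.N; i++ {
				dst = append(dst[:0], src...) // stand-in for a DecodeAll call
				if len(dst) != size {
					b.Fatalf("expected %d values, got %d", size, len(dst))
				}
			}
		})
	}
}
```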

@@ -0,0 +1,1007 @@
// Generated by tmpl
// https://github.com/benbjohnson/tmpl
Contributor

Was this generated by https://github.com/benbjohnson/tmpl or does this need to be updated?

Contributor Author

Indeed this was generated by https://github.com/benbjohnson/tmpl

}

// Include returns the subset values between min and max inclusive. The values must
// be deduplicated and sorted before calling Exclude or the results are undefined.
Contributor

s/Exclude/Include

// Normally, both a and b should not contain duplicates. Due to a bug in older versions, it's
// possible stored blocks might contain duplicate values. Remove them if they exist before
// merging.
// a = a.Deduplicate()
Contributor

Are we sure this is safe to do? Was the bug in post-1.0 releases?

Contributor Author

I'd like to talk about this a bit more – I'm thinking 2.0, this may be important

Contributor Author

@stuartcarnie stuartcarnie Jul 13, 2018

@edd, here is the commit for 1.0.0beta that adds the duplicate point check. Presumably the bug was discovered earlier, which may be this one, fixed in 0.11.0
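
To make the trade-off concrete, here is a rough, hypothetical sketch of the defensive step being discussed (illustrative names, not this PR's merge code): the merge assumes sorted, duplicate-free inputs, so blocks that might have been written by the older, buggy releases get deduplicated first.

```go
package sketch

// dedupe removes adjacent duplicate timestamps from a sorted slice, in place.
func dedupe(ts []int64) []int64 {
	if len(ts) < 2 {
		return ts
	}
	out := ts[:1]
	for _, t := range ts[1:] {
		if t != out[len(out)-1] {
			out = append(out, t)
		}
	}
	return out
}

// mergeTimestamps merges two sorted timestamp slices. If blocks written by
// old releases may still contain duplicates, both inputs are deduplicated
// first; dropping that guard is only safe once no such blocks remain on disk.
func mergeTimestamps(a, b []int64) []int64 {
	a, b = dedupe(a), dedupe(b)
	out := make([]int64, 0, len(a)+len(b))
	i, j := 0, 0
	for i < len(a) && j < len(b) {
		switch {
		case a[i] < b[j]:
			out = append(out, a[i])
			i++
		case a[i] > b[j]:
			out = append(out, b[j])
			j++
		default: // equal timestamps: keep the newer block's point (b)
			out = append(out, b[j])
			i, j = i+1, j+1
		}
	}
	out = append(out, a[i:]...)
	return append(out, b[j:]...)
}
```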


if b.MaxTime() < a.MinTime() {
var tmp FloatArray
tmp.Timestamps = append(b.Timestamps, a.Timestamps...)
Contributor

It didn't occur to me that this needed to be a temp slice. I would have thought a.Timestamps = append(b.Timestamps, a.Timestamps...) would be OK?

Contributor Author

I am maintaining that b is not mutated. Perhaps it is ok to do this, as len(b.Timestamps) doesn't change. I'll check assumptions elsewhere to make sure this is ok.
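
The subtlety is Go's append aliasing: `append(b.Timestamps, a.Timestamps...)` writes into `b`'s backing array whenever it has spare capacity, even though `len(b.Timestamps)` itself never changes. A small standalone illustration (not code from the PR):

```go
package main

import "fmt"

func main() {
	// b has spare capacity beyond its length.
	b := make([]int64, 2, 8)
	b[0], b[1] = 10, 20
	a := []int64{30, 40}

	// The tmp approach in the diff above copies into a fresh slice, so b's
	// backing array is never written to.
	tmp := make([]int64, 0, len(b)+len(a))
	tmp = append(append(tmp, b...), a...)
	fmt.Println(tmp) // [10 20 30 40]

	// Appending directly reuses b's backing array because cap(b) is large
	// enough: a's elements land just past b's visible length.
	merged := append(b, a...)
	fmt.Println(merged) // [10 20 30 40]

	// b's length is unchanged, but its backing array now holds a's values,
	// which a later re-slice or append on b would expose or overwrite.
	fmt.Println(b[:4:4]) // [10 20 30 40]
}
```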

* includes additional APIs for batch decoding of byte slices to improve
  performance
* fix for `unpack120` that was decoding 240 values rather than 120
These benchmarks will also be implemented for the batched decoders to compare
performance.
* APIs decode an entire byte slice of encoded data into the provided
  `dst` slice
* APIs are stateless and in almost all cases avoid any allocations
* Intended to be used by future batch-oriented TSM block decode APIs
* duplicated tests from original iterator-based APIs
* separate slices for time and values
* structured to be Arrow ready
* batch decoders fill time and value slices independently, which
  vastly improves performance (benchmarks linked in PR)
* These APIs will be used by the `TSMReader` and `KeyCursor` types via new
  APIs that follow a similar naming convention (Array)
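
As a rough illustration of the "decode an entire byte slice into the provided `dst`" shape described above (the name and toy encoding below are stand-ins, not the decoders added in this PR):

```go
package sketch

// booleanDecodeAll is a hypothetical stateless batch decoder: it decodes a
// whole encoded block into dst in one call, growing dst only when its
// capacity is too small, so callers can reuse buffers and avoid allocations.
func booleanDecodeAll(src []byte, dst []bool) ([]bool, error) {
	n := len(src) * 8 // toy encoding: one bit per value
	if cap(dst) < n {
		dst = make([]bool, n)
	}
	dst = dst[:n]
	for i := range dst {
		dst[i] = src[i/8]&(1<<uint(i%8)) != 0
	}
	return dst, nil
}
```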
@hercules-influx
Contributor

During a run of megacheck the following issues were discovered:

/tmp/787470241/src/github.com/influxdata/influxdb/cmd/store/query/query.go:404:5: this value of line is never used (SA4006)
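
(SA4006 flags a value that is assigned to a variable but never read before being overwritten. A hypothetical illustration, unrelated to the actual code at query.go:404:)

```go
package sketch

import "bufio"

// skipHeader discards the first line and returns the second one. The first
// value assigned to line is overwritten without ever being read, which is
// exactly what staticcheck's SA4006 check reports.
func skipHeader(r *bufio.Reader) (string, error) {
	line, err := r.ReadString('\n') // SA4006: this value of line is never used
	if err != nil {
		return "", err
	}
	line, err = r.ReadString('\n')
	if err != nil {
		return "", err
	}
	return line, nil
}
```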

@stuartcarnie stuartcarnie merged commit 0841c51 into master Jul 13, 2018
@ghost ghost removed the review label Jul 13, 2018
@stuartcarnie stuartcarnie deleted the sgc-batch branch July 13, 2018 23:00