Add sample function to query language #7415

desa · 2016-10-05T22:05:15Z

Required for all non-trivial PRs

Rebased/mergeable
Tests pass
CHANGELOG.md updated
Sign CLA (if not already signed)

Required only if applicable

You can erase any checkboxes below this note if they are not applicable to your Pull Request.

InfluxQL Spec updated
Provide example syntax

This PR introduces a new function sample to InfluxQL that will return a random sample of data. The random points are generated via reservoir sampling. The function works for all field types.

Example Usage

Suppose I insert the following data

cpu,host=A value=1
cpu,host=A value=2
cpu,host=A value=3

Then the query can output any of the following

SELECT sample(value, 2) FROM cpu

name: cpu
time                sample
----                ------
2016-10-05T22:00:58.69326808Z   1
2016-10-05T22:01:01.819955056Z  2

name: cpu
time                sample
----                ------
2016-10-05T22:00:58.69326808Z   1
2016-10-05T22:01:05.420453399Z  3

name: cpu
time                sample
----                ------
2016-10-05T22:01:01.819955056Z  2
2016-10-05T22:01:05.420453399Z  3

If the querier asks for a sample larger than the number of points it is querying, all points are returned.

SELECT sample(value, 4) FROM cpu

name: cpu
time                sample
----                ------
2016-10-05T22:00:58.69326808Z   1
2016-10-05T22:01:01.819955056Z  2
2016-10-05T22:01:05.420453399Z  3
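
For readers unfamiliar with reservoir sampling, here is a minimal sketch of the behaviour described above, not the code from this PR. The Point type, the Sample function, and the injected intn parameter are all hypothetical names chosen for illustration; the sketch also makes no attempt to return rows in time order.

```go
package main

import (
	"fmt"
	"math/rand"
)

// Point is a hypothetical stand-in for a stored point.
type Point struct {
	Time  int64
	Value float64
}

// Sample returns up to size points chosen by reservoir sampling.
// intn(n) must return an integer in [0, n); math/rand's Intn fits.
// If size >= len(points), every point is returned.
func Sample(points []Point, size int, intn func(int) int) []Point {
	if size <= 0 {
		return nil
	}
	reservoir := make([]Point, 0, size)
	for seen, p := range points {
		if len(reservoir) < size {
			// Fill the reservoir with the first size points.
			reservoir = append(reservoir, p)
			continue
		}
		// seen points precede p, so draw from [0, seen] and keep p
		// with probability size/(seen+1).
		if i := intn(seen + 1); i < size {
			reservoir[i] = p
		}
	}
	return reservoir
}

func main() {
	cpu := []Point{{1, 1}, {2, 2}, {3, 3}}
	fmt.Println(len(Sample(cpu, 2, rand.Intn))) // 2
	fmt.Println(len(Sample(cpu, 4, rand.Intn))) // 3: fewer points than requested
}
```

Passing the random source in as a function is only a convenience for this sketch: it makes the replacement step easy to exercise with a fixed sequence of draws.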

desa · 2016-10-05T22:06:09Z

Relevant issues: #7394 #484

desa · 2016-10-05T22:07:41Z

@jsternberg

jwilder · 2016-10-05T22:08:23Z

@jsternberg @nathanielc can you take a look?

jsternberg · 2016-10-05T22:18:27Z

influxql/ast.go

+	}
+
+	switch expr.Args[1].(type) {
+	case *IntegerLiteral, *NumberLiteral:


Accepting a number here is likely an error. A number literal is only used when there is a decimal. You can mentally just replace "Number" with "Float". Unless we want a random sampling of 2.5 points, this should probably stick with just accepting an IntegerLiteral.
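
A rough sketch of the stricter check suggested here, assuming pared-down stand-ins for the AST types. Expr, the two literal structs, validateSampleSize, and the error text are hypothetical; the real definitions live in influxql/ast.go and may differ.

```go
package main

import "fmt"

// Hypothetical, simplified stand-ins for the AST node types named in the diff.
type Expr interface{}

type IntegerLiteral struct{ Val int64 }
type NumberLiteral struct{ Val float64 }

// validateSampleSize accepts only an integer literal as the sample size
// and rejects everything else, including float (number) literals.
func validateSampleSize(arg Expr) error {
	switch arg.(type) {
	case *IntegerLiteral:
		return nil
	default:
		return fmt.Errorf("expected integer argument in sample(), got %T", arg)
	}
}

func main() {
	fmt.Println(validateSampleSize(&IntegerLiteral{Val: 2}))  // <nil>
	fmt.Println(validateSampleSize(&NumberLiteral{Val: 2.5})) // reports the unexpected type
}
```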

jsternberg · 2016-10-05T22:24:56Z

influxql/functions_test.go

+
+// TestSample_IsRandom attempts to verify that the subset of data that is returned
+// by sample is actually random.
+func TestSample_IsRandom(t *testing.T) {


Can you explain how this function works? I'm a bit confused about how it verifies the randomness of the sample.

It doesn't actually verify randomness, but it does verify that all possible combinations are seen within 6 iterations. I'm not sure what to call the test.
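
To illustrate the idea of checking that every combination shows up, here is a small sketch that is not the test from this PR. samplePair and allSamplesSeen are hypothetical helpers, and the iteration cap is a parameter rather than the 6 mentioned above.

```go
package main

import (
	"fmt"
	"math/rand"
)

// samplePair draws 2 of the values 1, 2, 3 with one reservoir step and
// returns them sorted, so that {2,1} and {1,2} count as the same sample.
func samplePair(intn func(int) int) [2]int {
	res := [2]int{1, 2} // reservoir filled with the first two values
	if i := intn(3); i < 2 {
		res[i] = 3
	}
	if res[0] > res[1] {
		res[0], res[1] = res[1], res[0]
	}
	return res
}

// allSamplesSeen reports whether all three possible pairs appear
// within maxIter draws.
func allSamplesSeen(intn func(int) int, maxIter int) bool {
	seen := map[[2]int]bool{}
	for i := 0; i < maxIter && len(seen) < 3; i++ {
		seen[samplePair(intn)] = true
	}
	return len(seen) == 3
}

func main() {
	// With a generous cap, missing one of the pairs is extremely unlikely.
	fmt.Println(allSamplesSeen(rand.Intn, 1000))
}
```

A check like this shows that no combination is unreachable; it says nothing about whether the combinations are equally likely.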

jsternberg · 2016-10-05T22:26:18Z

influxql/functions.gen.go

+	// Generate a random integer between 1 and the count and
+	// if that number is less than the length of the slice
+	// replace the point at that index rnd with p.
+	rnd := rand.Intn(r.count + 1)


Do you have an article or website that can be referenced to show that this produces two random points with only minimal bias from the pseudorandom generator? I just want to make sure this method doesn't have an accidental introduction of unnecessary bias.

The method described here http://eternallyconfuzzled.com/arts/jsw_art_rand.aspx is the same as what is implemented in rand.Intn.

Is that what you were looking for, or did you have something else in mind? I just realized I forgot to seed the RNG. I'll fix that now.

Along those lines, I think it's mostly correct as is.
See the second answer here.

It provides a nice algorithm for selecting a sample from an unknown sized set.

I saw two differences:

If the total number of points selected is less than size, it is technically not a valid sample. This may not matter in this case but we should discuss.

It seems that the replacement probability should be prob := size/count, where count is the number of points seen. Then, with probability prob, evict a random value from the list and replace it with the current point p.

So to avoid the division and double reads of a random value one could rearrange the expression from:

    prob := float64(size) / float64(count)
    if rand.Float64() < prob {
        i := rand.Intn(size) // evict a random slot in the reservoir
        r.points[i] = *p
    }

to

    i := rand.Intn(count + 1)
    if i < size {
        r.points[i] = *p
    }

With that we arrive at @desa's solution. I am not sure that reusing the single rand.Intn draw is OK, but it seems like it should be fine.

In summary, it looks correct to me, except that r.count needs to be incremented for all new points, not just the first size points.
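
As a sanity check on the rearrangement discussed above, the following sketch compares the two formulations empirically. It is an illustration rather than a proof: twoDraw, oneDraw, stepFunc, and frequencies are hypothetical names, and n stands for the number of points seen including the current one. In both variants each reservoir slot should be overwritten with probability 1/n.

```go
package main

import (
	"fmt"
	"math/rand"
)

// stepFunc returns the reservoir slot to overwrite, or -1 to discard the point.
type stepFunc func(rng *rand.Rand, size, n int) int

// twoDraw keeps the point with probability size/n, then picks a random slot.
func twoDraw(rng *rand.Rand, size, n int) int {
	if rng.Float64() < float64(size)/float64(n) {
		return rng.Intn(size)
	}
	return -1
}

// oneDraw uses a single draw from [0, n) as both the test and the slot.
func oneDraw(rng *rand.Rand, size, n int) int {
	if i := rng.Intn(n); i < size {
		return i
	}
	return -1
}

// frequencies returns the observed frequency of each slot, with the
// final entry holding the frequency of discards.
func frequencies(step stepFunc, size, n, trials int, seed int64) []float64 {
	rng := rand.New(rand.NewSource(seed))
	counts := make([]float64, size+1)
	for t := 0; t < trials; t++ {
		slot := step(rng, size, n)
		if slot < 0 {
			slot = size
		}
		counts[slot]++
	}
	for i := range counts {
		counts[i] /= float64(trials)
	}
	return counts
}

func main() {
	// Reservoir of 2, current point is the 5th seen: both variants should
	// give roughly 0.2 per slot and 0.6 discarded.
	fmt.Println(frequencies(twoDraw, 2, 5, 200000, 1))
	fmt.Println(frequencies(oneDraw, 2, 5, 200000, 1))
}
```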

nathanielc · 2016-10-05T22:53:50Z

influxql/functions.gen.go

+	// Fill the reservoir with the first n points
+	if r.count < len(r.points) {
+		r.points[r.count] = *p
+		r.count++


We need to count all points, not just the first size points.

Is the increment on https://github.com/influxdata/influxdb/blob/md-sample/influxql/functions.gen.go#L413 not sufficient?

Regardless, I'll change it so that I increment the counter at the start and decrement the index I set:

    if r.count-1 < len(r.points) {
        r.points[r.count-1] = *p
        return
    }

Should be fixed now.
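
A sketch of how the whole reducer might look with the counter incremented up front, under stated assumptions. The sampleReducer type, its fields, the FloatPoint stand-in, and the Aggregate and Emit method names are hypothetical and simplified; the generated code in influxql/functions.gen.go may be organised differently.

```go
package main

import (
	"fmt"
	"math/rand"
)

// FloatPoint is a simplified stand-in for a float-valued point.
type FloatPoint struct {
	Time  int64
	Value float64
}

// sampleReducer is a hypothetical reservoir that counts every point it sees.
type sampleReducer struct {
	count  int
	rng    *rand.Rand
	points []FloatPoint
}

func newSampleReducer(size int, seed int64) *sampleReducer {
	return &sampleReducer{
		rng:    rand.New(rand.NewSource(seed)),
		points: make([]FloatPoint, size),
	}
}

// Aggregate increments the counter first, so every call is counted.
func (r *sampleReducer) Aggregate(p *FloatPoint) {
	r.count++
	// Fill the reservoir with the first len(r.points) points.
	if r.count-1 < len(r.points) {
		r.points[r.count-1] = *p
		return
	}
	// r.count already includes p, so draw from [0, r.count).
	if i := r.rng.Intn(r.count); i < len(r.points) {
		r.points[i] = *p
	}
}

// Emit returns only the slots that were actually filled.
func (r *sampleReducer) Emit() []FloatPoint {
	n := r.count
	if n > len(r.points) {
		n = len(r.points)
	}
	return r.points[:n]
}

func main() {
	r := newSampleReducer(2, 1)
	for i := 1; i <= 3; i++ {
		r.Aggregate(&FloatPoint{Time: int64(i), Value: float64(i)})
	}
	fmt.Println(r.count, len(r.Emit())) // 3 2
}
```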

desa · 2016-10-06T15:37:08Z

@jsternberg @nathanielc fixed everything you brought up. Let me know if there's anything I missed.

nathanielc · 2016-10-06T15:38:05Z

@desa LGTM, sorry I missed the increment at the end of the function, thanks for clarifying that bit.

desa · 2016-10-06T15:41:48Z

@nathanielc no problem :). It was definitely not clear that the counter would always increment. Updated it so that it's clear now that every iteration increments the counter.

jsternberg

Squash the commits, but other than that, approved.

jsternberg · 2016-10-06T15:45:57Z

Also remember to add a changelog entry.

First Pass at implementing sample
Add sample iterators for all types
Remove size from sample struct
Fix off by one error when generating random number
Add benchmarks for sample iterator
Add test and associated fixes for off by one error
Add test for sample function
Remove NumericLiteral from sample function call
Make clear that the counter is incr w/ each call
Rename IsRandom to AllSamplesSeen
Add a rng for each reducer that is created
    The default rng that comes with math/rand has a global lock. To avoid having to worry about any contention on the lock, each reducer now has its own time seeded rng.
Add sample function to changelog
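
The per-reducer rng mentioned in the commit notes can be sketched roughly as follows. The reducer type, newReducer, and draw are hypothetical names used only for illustration. A *rand.Rand created with rand.New is not safe for concurrent use, so the sketch assumes each reducer is owned by a single goroutine.

```go
package main

import (
	"fmt"
	"math/rand"
	"sync"
	"time"
)

// reducer is a hypothetical type that owns its own random source
// instead of going through math/rand's locked package-level functions.
type reducer struct {
	rng *rand.Rand
}

// newReducer seeds a private generator from the current time.
func newReducer() *reducer {
	return &reducer{rng: rand.New(rand.NewSource(time.Now().UnixNano()))}
}

// draw returns an integer in [0, n) from the reducer's private source.
func (r *reducer) draw(n int) int {
	return r.rng.Intn(n)
}

func main() {
	var wg sync.WaitGroup
	for g := 0; g < 4; g++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			r := newReducer() // not shared with any other goroutine
			sum := 0
			for i := 0; i < 1000; i++ {
				sum += r.draw(10)
			}
			fmt.Println(id, sum)
		}(g)
	}
	wg.Wait()
}
```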

desa force-pushed the md-sample branch from 47a8e7f to 3a705ea Compare October 5, 2016 22:13

jsternberg suggested changes Oct 5, 2016

nathanielc reviewed Oct 5, 2016

jsternberg approved these changes Oct 6, 2016

desa force-pushed the md-sample branch from 0a7cfa2 to f9b8129 Compare October 6, 2016 16:42

desa merged commit 616d4d2 into master Oct 6, 2016

desa deleted the md-sample branch October 6, 2016 17:04

This was referenced Oct 6, 2016

Add sample() to query language influxdata/docs.influxdata.com-ARCHIVE#767

Closed

RANDOM() Aggregate Function #7394

Closed

timhallinflux added this to the 1.1.0 milestone Dec 19, 2016

desa commented Oct 5, 2016 (edited by beckettsean)
