Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add sample function to query language #7415

Merged
merged 1 commit into from
Oct 6, 2016
Merged

Add sample function to query language #7415

merged 1 commit into from
Oct 6, 2016

Conversation

desa
Copy link
Contributor

@desa desa commented Oct 5, 2016

Required for all non-trivial PRs
  • Rebased/mergable
  • Tests pass
  • CHANGELOG.md updated
  • Sign CLA (if not already signed)
Required only if applicable

You can erase any checkboxes below this note if they are not applicable to your Pull Request.

This PR introduces a new function sample to InfluxQL that will return a random sample of data. The random points are generated via reservoir sampling. The function works for all field types.

Example Usage

Suppose I insert the following data

cpu,host=A value=1
cpu,host=A value=2
cpu,host=A value=3

Then the query can out put any of the following

SELECT sample(n, 2) FROM cpu
name: cpu
time                sample
----                ------
2016-10-05T22:00:58.69326808Z   1
2016-10-05T22:01:01.819955056Z  2
name: cpu
time                sample
----                ------
2016-10-05T22:00:58.69326808Z   1
2016-10-05T22:01:05.420453399Z  3
name: cpu
time                sample
----                ------
2016-10-05T22:01:01.819955056Z  2
2016-10-05T22:01:05.420453399Z  3

If the querier asks for a sample larger than the number of point it is querying, all points are returned.

SELECT sample(n, 4) FROM cpu
name: cpu
time                sample
----                ------
2016-10-05T22:00:58.69326808Z   1
2016-10-05T22:01:01.819955056Z  2
2016-10-05T22:01:05.420453399Z  3

@desa
Copy link
Contributor Author

desa commented Oct 5, 2016

Relevant issues: #7394 #484

@desa
Copy link
Contributor Author

desa commented Oct 5, 2016

@jsternberg

@jwilder
Copy link
Contributor

jwilder commented Oct 5, 2016

@jsternberg @nathanielc can you take a look?

}

switch expr.Args[1].(type) {
case *IntegerLiteral, *NumberLiteral:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is likely an error for it to be a number. A number literal is only used when there is a decimal. You can mentally just replace "Number" with "Float". Unless we want a random sampling of 2.5 points, this should probably stick with just accepting an IntegerLiteral.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed


// TestSample_IsRandom attempts to verify that the subset of data that is returned
// by sample is actually random.
func TestSample_IsRandom(t *testing.T) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you explain how this function works? I'm a bit confused about how it verifies the randomness of the sample.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually doesn't verify randomness, but it does verify that all possible combinations are possible within 6 iterations. Not sure of what to call the test.

// Generate a random integer between 1 and the count and
// if that number is less than the length of the slice
// replace the point at that index rnd with p.
rnd := rand.Intn(r.count + 1)
Copy link
Contributor

@jsternberg jsternberg Oct 5, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you have an article or website that can be referenced to show that this produces two random points with only minimal bias from the pseudorandom generator? I just want to make sure this method doesn't have an accidental introduction of unnecessary bias.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The method described here http://eternallyconfuzzled.com/arts/jsw_art_rand.aspx is the same as what is implemented in rand.Intn

Is that what you were looking for, or did you have something else in mind. I just realized I forgot to seed the RNG. I'll fix that now.

Copy link
Contributor

@nathanielc nathanielc Oct 5, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Along those lines I think its mostly correct as is.
See the second answer here.

It provides a nice algorithm for selecting a sample from an unknown sized set.

I saw two differences:

  1. If the total number of points selected is less than size, it is technically not a valid sample. This may not matter in this case but we should discuss.
  2. It seems that rand value should be prob := size/count where count is the number of points seen. Then with probability prob evict a random value from the list and replace it with the current point p.

So to avoid the division and double reads of a random value one could rearrange the expression from:

prob := size / count
if rand.Float() < prod {
    i := rand.Intn(r.count+1)
    r.points[i] = *p
}

to

i := rand.Intn(count +1)
if i < size {
    r.points[i] = *p
}

With which we arrive at @desa's solution, I am not sure that reusing the rand Intn is ok but it seems like it should be fine.

In summary it is correct to me except r.count needs to be incremented for all new points not just the first size points.

// Fill the reservoir with the first n points
if r.count < len(r.points) {
r.points[r.count] = *p
r.count++
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to count all points, not just the first size points.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

@desa desa Oct 5, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Regardless, I'll change it so that I increment the counter at the start and decrement the index I set

    if r.count - 1 < len(r.points) {
        r.points[r.count - 1] = *p
        return
    }

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should be fixed now.

@desa
Copy link
Contributor Author

desa commented Oct 6, 2016

@jsternberg @nathanielc fixed everything you brought up. Let me know if there's anything I missed.

@nathanielc
Copy link
Contributor

@desa LGTM, sorry I missed the increment at the end of the function, thanks for clarifying that bit.

@desa
Copy link
Contributor Author

desa commented Oct 6, 2016

@nathanielc no problem :). It was definitely not clear that the counter would always increment. Updated it so that its clear now that every iteration increments the counter.

Copy link
Contributor

@jsternberg jsternberg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Squash the commits, but other than that, approved.

@jsternberg
Copy link
Contributor

Also remember to add a changelog entry.

First Pass at implementing sample

Add sample iterators for all types

Remove size from sample struct

Fix off by one error when generating random number

Add benchmarks for sample iterator

Add test and associated fixes for off by one error

Add test for sample function

Remove NumericLiteral from sample function call

Make clear that the counter is incr w/ each call

Rename IsRandom to AllSamplesSeen

Add a rng for each reducer that is created

The default rng that comes with math/rand has a global lock. To avoid
having to worry about any contention on the lock, each reducer now has
its own time seeded rng.

Add sample function to changelog
@desa desa merged commit 616d4d2 into master Oct 6, 2016
@desa desa deleted the md-sample branch October 6, 2016 17:04
@timhallinflux timhallinflux added this to the 1.1.0 milestone Dec 19, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants