-
Notifications
You must be signed in to change notification settings - Fork 3.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add sample function to query language #7415
Conversation
@jsternberg @nathanielc can you take a look? |
} | ||
|
||
switch expr.Args[1].(type) { | ||
case *IntegerLiteral, *NumberLiteral: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is likely an error for it to be a number. A number literal is only used when there is a decimal. You can mentally just replace "Number" with "Float". Unless we want a random sampling of 2.5 points, this should probably stick with just accepting an IntegerLiteral
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed
|
||
// TestSample_IsRandom attempts to verify that the subset of data that is returned | ||
// by sample is actually random. | ||
func TestSample_IsRandom(t *testing.T) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you explain how this function works? I'm a bit confused about how it verifies the randomness of the sample.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually doesn't verify randomness, but it does verify that all possible combinations are possible within 6 iterations. Not sure of what to call the test.
// Generate a random integer between 1 and the count and | ||
// if that number is less than the length of the slice | ||
// replace the point at that index rnd with p. | ||
rnd := rand.Intn(r.count + 1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you have an article or website that can be referenced to show that this produces two random points with only minimal bias from the pseudorandom generator? I just want to make sure this method doesn't have an accidental introduction of unnecessary bias.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The method described here http://eternallyconfuzzled.com/arts/jsw_art_rand.aspx is the same as what is implemented in rand.Intn
Is that what you were looking for, or did you have something else in mind. I just realized I forgot to seed the RNG. I'll fix that now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Along those lines I think its mostly correct as is.
See the second answer here.
It provides a nice algorithm for selecting a sample from an unknown sized set.
I saw two differences:
- If the total number of points selected is less than size, it is technically not a valid sample. This may not matter in this case but we should discuss.
- It seems that rand value should be
prob := size/count
where count is the number of points seen. Then with probabilityprob
evict a random value from the list and replace it with the current pointp
.
So to avoid the division and double reads of a random value one could rearrange the expression from:
prob := size / count
if rand.Float() < prod {
i := rand.Intn(r.count+1)
r.points[i] = *p
}
to
i := rand.Intn(count +1)
if i < size {
r.points[i] = *p
}
With which we arrive at @desa's solution, I am not sure that reusing the rand Intn is ok but it seems like it should be fine.
In summary it is correct to me except r.count needs to be incremented for all new points not just the first size points.
// Fill the reservoir with the first n points | ||
if r.count < len(r.points) { | ||
r.points[r.count] = *p | ||
r.count++ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need to count all points, not just the first size points.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the increment on https://github.com/influxdata/influxdb/blob/md-sample/influxql/functions.gen.go#L413 not sufficient?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Regardless, I'll change it so that I increment the counter at the start and decrement the index I set
if r.count - 1 < len(r.points) {
r.points[r.count - 1] = *p
return
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should be fixed now.
@jsternberg @nathanielc fixed everything you brought up. Let me know if there's anything I missed. |
@desa LGTM, sorry I missed the increment at the end of the function, thanks for clarifying that bit. |
@nathanielc no problem :). It was definitely not clear that the counter would always increment. Updated it so that its clear now that every iteration increments the counter. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Squash the commits, but other than that, approved.
Also remember to add a changelog entry. |
First Pass at implementing sample Add sample iterators for all types Remove size from sample struct Fix off by one error when generating random number Add benchmarks for sample iterator Add test and associated fixes for off by one error Add test for sample function Remove NumericLiteral from sample function call Make clear that the counter is incr w/ each call Rename IsRandom to AllSamplesSeen Add a rng for each reducer that is created The default rng that comes with math/rand has a global lock. To avoid having to worry about any contention on the lock, each reducer now has its own time seeded rng. Add sample function to changelog
Required for all non-trivial PRs
Required only if applicable
You can erase any checkboxes below this note if they are not applicable to your Pull Request.
This PR introduces a new function
sample
to InfluxQL that will return a random sample of data. The random points are generated via reservoir sampling. The function works for all field types.Example Usage
Suppose I insert the following data
Then the query can out put any of the following
If the querier asks for a sample larger than the number of point it is querying, all points are returned.