Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GROUP BY does not handle buckets correctly #2005

Closed
jnutzmann opened this issue Mar 18, 2015 · 2 comments
Closed

GROUP BY does not handle buckets correctly #2005

jnutzmann opened this issue Mar 18, 2015 · 2 comments

Comments

@jnutzmann
Copy link

I am running a fork of master right now looking to add the offset functionality that is being discussed in #1837. In the process of doing that, I discovered a bug in how buckets are created. Here is the symptom:

These two queries produce completely different results (for all rows), but have the same timestamps in the time column (note the difference in timestamp):

select count(value) from duty_cycle_hose_2 where time > '2015-03-18T15:00:00Z' - 11d group by time(1h) limit 5

select count(value) from duty_cycle_hose_2 where time > '2015-03-18T15:30:00Z' - 11d group by time(1h) limit 5

What's happening is there are different sources for the bucket times compared to the times that result. This code creates the array resultTimes that specifies the time in the results.

(line 137 in engine.go)

// ensure that the start time for the results is on the start of the window
startTimeBucket := m.TMin / m.interval * m.interval

for i, _ := range resultTimes {
    t := startTimeBucket + (int64(i) * m.interval * int64(m.stmt.Offset+1))
    resultTimes[i] = t
    // we always include time so we need one more column than we have aggregates
    vals := make([]interface{}, 0, len(aggregates)+1)
    resultValues[i] = append(vals, time.Unix(0, t).UTC())
}

The actual buckets are based off m.TMin and are created NextInterval(). This simply adds m.interval to T.Min to get the new bucket start time, so the buckets start from the minimum time in the query and not the time where in resultValues. This essentially shifts all the buckets by the modulo (in my example, 30 minutes).

I can create a PR to fix, but I need to know the desired outcome. The cleanest way (in code) would be to adjust T.Min to start at the natural boundary, however this might be counter-intuitive to users if they get points back that are before their specified start time. Thoughts?

@jnutzmann
Copy link
Author

To clarify, if your where clause contains time > '2015-03-18T15:00:00Z' you would get the "correct" buckets where values are between 15:00-16:00 for time 15:00, 16:00-17:00 for time 16:00, etc. But, if your where clause contains time > '2015-03-18T15:30:00Z' you would get the incorrect buckets where values are between 15:30-16:30 for time 15:00, 16:30-17:30 for time 16:00, etc. I discovered this when I was using now() and getting slightly different results every time the query returned.

@otoolep
Copy link
Contributor

otoolep commented Mar 20, 2015

Fixed by the patch. Thanks again @jnutzmann

@otoolep otoolep closed this as completed Mar 20, 2015
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants