-
Notifications
You must be signed in to change notification settings - Fork 3.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add "integral" function to InfluxQL #7591
Add "integral" function to InfluxQL #7591
Conversation
BTW, I propose a separate |
Documentation pull request is at influxdata/docs.influxdata.com-ARCHIVE#855 |
The "area under the curve" via the trapezoidal rule looks great. I reviewed the implementation of the aggregator and it makes sense to me. The function signature also makes sense to me. As for the edge cases around the group by time boundaries, @jsternberg do you know if the derivative takes into account the last point of the previous group by interval? I thought I remembered you working on that at some point. As for the rest of the code of boiler plate QL functions it looked good to me but I'll defer to @jsternberg on any of that code. |
Looks like |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Needs a changlog entry added.
@nathanielc derivative will try to take into account the previous point, but only when there is a group by interval. The only concern I have with this is that derivative will eliminate the first point so if there are 10 points, derivative will return 9. I figure that integral, which should be the opposite of derivative, should restore that point. While we wouldn't be able to restore the If this isn't an issue, then I'm fine with this. @nathanielc comments? |
This was my concern, as derivative will correctly take into account group by time boundaries, where the current implementation will skip the trapezoid that spans the group by time boundary. Specifically for this test case https://github.com/influxdata/influxdb/pull/7591/files#diff-4efc5fdc06e36abc4addef9f831c2345R2677 I would expect the integral over the entire time range to be the same as the sum of the integrals over each group by time interval. The data for the test case is (in line protocol):
The integral over the entire range is I would expect the behavior to be that the first group by interval value is In short just like derivative will carry over the last point from the previous interval if we are grouping by time, I think that integral should do the same. And obviously for the first interval there is nothing to carry over. @Tomcat-Engineering @jsternberg @jwilder Does my expectation make sense? Or would you expect each group by time interval to be independent? As for the |
The PS - the |
@nathanielc your expectation makes sense and I'll have a go at implementing it... can anyone point me at the relevant code from |
Thanks Tomcat! I have been waiting for an integral component for my energy monitoring. Makes it a bit tough to monitor usage without being able to get the kWh! Im looking forward to adding this to my dashboard, and ultimately using the cheers! |
Same here, integral function would be very useful. |
The basic problem is that it is not possible to efficiently find the neighbouring points immediately before and after the queried time interval. These are necessary to give consistent integral results, as discussed above, but you may have to search the whole of time to find them which would be slow. It may be possible to do clever things to pass points between adjacent group-by-time intervals, but I don't think we would want to do that as the results would be even more inconsistent: a particular time-group could then have four different integral values depending on whether it is at the beginning, middle or end of the query interval, or is the only group in the interval. Having thought about it for a while, I can't come up with anything better than my original algorithm, despite its limitations. Anyone else got any bright ideas? |
I think you're right. If we find a solution to that in the future, I assume it won't be a problem for us to add back on that one extra point since it will be at a different time than the previous values. Can you rebase this and resolve the conflicts? Unfortunately, we're in the weird period where 1.2 has been branched off into its own branch, but we're not merging 1.3 bound PRs to master yet. I also don't have time to look at this in full until probably next week. I'm sorry for the inconvenience. If you can also update the changelog to target 1.3, I'll try to get this merged early next week. Sorry for taking so long to respond to this. |
fn := NewFloatIntegralReducer(interval, groupByTime) | ||
return fn, fn | ||
} | ||
return &floatReduceFloatIterator{input: newBufFloatIterator(input), opt: opt, create: createFn}, nil |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does integral()
have the same behavior as derivative()
? I think this should be using the stream iterator instead. That way, it can get called on every incoming point rather than only after a group by interval is read.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
integral
returns a single total value aggregating all the incoming points, whereas derivative
returns a value for each incoming value, hence the different iterators.
See above discussion of the proposed cumulative_integral
function for an integral which returns a point for each incoming point.
@@ -2645,6 +2645,104 @@ func TestSelect_Elapsed_Boolean(t *testing.T) { | |||
} | |||
} | |||
|
|||
func CheckPoints(t *testing.T, ic *IteratorCreator, query string, expected [][]influxql.Point) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this is necessary. I think you can use the same pattern the other tests use. I see that this is used in other parts of the code:
if err != nil {
t.Fatal(err)
} else if a, err := Iterators(itrs).ReadAll(); err != nil {
t.Fatalf("unexpected point: %s", err)
} else if !deep.Equal(a, [][]influxql.Point{
{&influxql.FloatPoint{Name: "cpu", Tags: ParseTags("host=A"), Time: 0 * Second, Value: 19, Aggregated: 2}},
{&influxql.FloatPoint{Name: "cpu", Tags: ParseTags("host=B"), Time: 0 * Second, Value: 10, Aggregated: 1}},
{&influxql.FloatPoint{Name: "cpu", Tags: ParseTags("host=A"), Time: 10 * Second, Value: 2, Aggregated: 2}},
{&influxql.FloatPoint{Name: "cpu", Tags: ParseTags("host=A"), Time: 30 * Second, Value: 100, Aggregated: 1}},
}) {
t.Fatalf("unexpected points: %s", spew.Sdump(a))
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yup, I just thought that rather than duplicating the same code loads of times it would be best practice to write it once and wrap it in a function - but I'm happy to remove the function and get copy-pasting if that is the preferred style!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yea, if you can remove this that would be great. I understand the desire and I would consider a PR that changes everything for this pattern, but I don't want to mix test styles at the moment. When I've tried that in the past, it became unruly and I had to undo all of it.
I'm afraid I might be a bit confused about the direction of this PR and I think some new features (like subqueries) may also factor into that confusion. Can you explain to me in words which of the following queries are valid and what they are supposed to look like?
The queries are abbreviated so just assume they have the proper where clause. If I'm understanding correctly, all of those queries would return one value that would just be the sum of the area under the curve. Is that correct? I'm also thinking subqueries may influence what we do here. Before 1.2, you would have needed a special function to find the sum of an integral. We had one person who really needed a sum of derivatives functionality for a similar purpose. I'm thinking that we might want to make this function more similar to
Basically, make the proposed Thoughts? |
|
Perhaps the confusion is mostly about naming - if you expect However we define it, I think the current implementation is still useful enough to include, perhaps with the name This is exactly analogous to the Likewise, to get the total integral out of the |
Yea, you'll never be able to reverse the derivative completely and you'll still lose a point, but there will always be limitations and integrating in calculus already has the I think changing this to be the proposed @nathanielc thoughts? Maybe I'm off-base and I don't understand the use cases very well. I'm also not sure which time the integral should output. I would say the time of the second point rather than the first point (like derivative). |
|
For the use cases, if I understand the current implementation correctly, this function will basically resolve the need of time-weighted average, described in #7445, if divided by the time difference between last and first point in each interval. The result is still using interpolation between points, not steps for the value change as described in #7445, but will be already much better than in case of using simple mean() which doesn't consider time. Therefore it will be essential for downsampling time-unregular sensor data. |
I don't think integral will solve any problem of time weighted averages because I don't think integral performs any time weighted average. I think we need to implement the I'll have a longer post responding to the specifics of integral in a bit. I need some time to organize my thoughts into understandable words. |
The value of the integral on the interval divided by the duration of this interval is the average value on this interval, isn't it? |
I talked with @pauldix a bit today about the I'm a bit confused by the description of how integral works here, so I launched a graphite instance and grafana to try some simple graphs to see what these functions did without having to read the documentation. The descriptions are pretty accurate, but they're also confusing because the definition of I would also like to know if Going back to the difference between aggregate and transformation, I don't think an aggregate is correct for integral. I think we put that down in the documentation before anybody took a serious look at implementing it or deciding what that meant. An aggregate is something that will either be used on the entire span of the series and produce one value (usually without skipping values like derivative sometimes does) or it will perform that same operation on disparate intervals. Aggregates don't consider other intervals when calculating the current interval. I don't think that applies to integral and grafana categorizes both derivative and integral for graphite as transformations. So I think we should be thinking of integral as a transformation rather than an aggregate. I think an initial question to answer is this: should integral in InfluxDB match integral in Graphite or should we implement an actual integral? I'm also going to write up a follow-up of how I think integral should be implemented, but I wanted to write this to get some initial thoughts so we could be on the same page. |
Thanks for the writeup @jsternberg. Seeing this I think we should implement an actual integral function. Sounds like we already have the equivalent to Graphite's function with cumulative_sum. |
A few more comments on this and how we should interpret the query. I think we need to define what integral should encompass and determine the syntax. While it doesn't fit into an aggregate, it may also not be the same as derivative and I would like to explore that a bit. The integral is the area under the curve, but we also need to figure out which part of the area under the curve we're referring to. If we have an example where there are 100 points equally spaced apart, is it desirable to find the area between each of these points? That would result in 99 different areas. There's also the question of whether or not we're finding the cumulative sum of these areas. If you were to use an integral to reverse a derivative, the function that would graph that would be the cumulative sum of these areas. So if the original function is So the first question. Should the integral be the cumulative sum of the areas or should it just be the area of the current interval? My preference is that it is the area of the current interval (so from The next question. If you ask for an integral of a point and specify nothing else, should this produce a single value (the area under the entire curve) or should it produce the area under the curve between each point? The former might be what is expected by users, but the latter is more flexible because you can use the latter with a subquery and That also ties into bucketing. If we want to find the area under the curve for every 10 minute interval, we could use a subquery with a As confusing as all of that is, I propose two different syntaxes for consideration. First syntax: This has integral work as a pseudo-aggregate. It has some properties of streams and some properties of aggregates. It will likely need its own custom implementation since it doesn't neatly fit into either of them.
Second syntax:
I'm leaning towards the first. While I think it'll be slightly more complicated to implement because it doesn't have a precedent, I think it will be more clear to users. The second lends itself to using an aggregate within the function better than the first, but I don't like that syntax anyway and I don't consider the syntax necessary now that we have subqueries. The first also has an advantage in that it seems like it matches what people would expect more. The negative is the first does not make it easy to find the cumulative sum of the integral since it isn't easy to find the dividing line between each of the lines. But you can simulate it easily by just selecting a grouping time like 1s or 10s. Since I'm proposing that it will interpolate the points between the buckets, then you could have points 10 second apart from each other and still use a 1 second grouping since it will interpolate the areas of the buckets. Thoughts? |
Thanks for the notes @jsternberg. As far as I can tell, with one minor exception your "first syntax" is exactly what I have implemented. The minor exception is that my algorithm doesn't interpolate across time-bin boundaries in a group-by-time query. This is because it is really hard (or impossible?) to do this efficiently and I think would require re-writing large parts of the query engine. This was discussed quite a bit in previous comments, and I thought that we had come to the conclusion that we should release it as-is for now, with suitable notes in the documentation, because lots of users are waiting for this feature. PS - your conclusions on Graphite's integral are correct. You based the current InfluxQ |
PS - I think the idea of using any form of The first graph shows a simple time series with four points. Each square on the paper is one unit. The second graph is the result of the existing To get the correct results, we would need to define If you apply the cumulative version of your emit-lots-of points integrator to this interpolation you get back to the first curve (though the constant of integration is only correct because the first point happened to be at t=0). However, we will definitely not be popular if we define our general-purpose |
Ok, I'll review it with an assumption of the first syntax then. I would like the interpolation (and I don't think it would be too difficult), but let's get this merged. Now that I understand the syntax better, it'll be easier for me to review this. |
// this means that we allow the curve we are integrating | ||
// to have step changes in it, but who knows whether the points | ||
// actually arrive in any particular order...? | ||
if r.prev.Time == p.Time { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a potential error location. Points aren't guaranteed to be ordered. There are two ways to resolve this. Either have integral request the points be ordered from the storage engine (see median()
in influxql/select.go
) or by having this function operate on the entire slice instead (see median()
again).
// a timestamp of zero. Within a group-by-time, we can set the time to ZeroTime | ||
// and a higher level will change it to the start of the time group. | ||
func (r *FloatIntegralReducer) Emit() []FloatPoint { | ||
if r.groupByTime { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This shouldn't be needed. If you use ZeroTime
, one of the other iterators takes care of setting the start time correctly. Using 0
would result in the wrong time here.
@@ -882,6 +882,16 @@ func (opt IteratorOptions) ElapsedInterval() Interval { | |||
return Interval{Duration: time.Nanosecond} | |||
} | |||
|
|||
// IntegralInterval returns the time interval for the integral function. | |||
func (opt IteratorOptions) IntegralInterval() Interval { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks slightly different from DerivativeInterval()
. Is there a reason this is different?
@@ -279,7 +279,7 @@ func buildExprIterator(expr Expr, ic IteratorCreator, opt IteratorOptions, selec | |||
opt.Interval = Interval{} | |||
|
|||
return newHoltWintersIterator(input, opt, int(h.Val), int(m.Val), includeFitData, interval) | |||
case "derivative", "non_negative_derivative", "difference", "moving_average", "elapsed": | |||
case "derivative", "non_negative_derivative", "difference", "moving_average", "elapsed", "integral": |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
integral()
seems to be different than the rest of these. The rest of these are here because they aren't aggregates so they don't support GROUP BY
or fill()
. If this is going to be an aggregate, it needs to be moved to the area below this.
@@ -2645,6 +2645,104 @@ func TestSelect_Elapsed_Boolean(t *testing.T) { | |||
} | |||
} | |||
|
|||
func CheckPoints(t *testing.T, ic *IteratorCreator, query string, expected [][]influxql.Point) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yea, if you can remove this that would be great. I understand the desire and I would consider a PR that changes everything for this pattern, but I don't want to mix test styles at the moment. When I've tried that in the past, it became unruly and I had to undo all of it.
@Tomcat-Engineering I'm going to use your PR as a base for the functionality I described in the first syntax as we discussed before. I'll also be rebasing this onto a more recent commit and making it compatible with subqueries along with trying to implement the interpolation aspect I talked about. Just giving you a heads up. Is that ok? |
@jsternberg - sounds great, thanks a lot and sorry I didn't get around to doing it myself! |
I've done the fixup and adapted the PR. I'm going to close this one and open up a new PR. I still want to add the interpolation feature I mentioned above so integral gives more accurate results. I'm hesitant to merge it without that since adding it later will likely count as a breaking change. The difficulty is that, because of that, integral acts differently than both streams and aggregates and it's showing some holes in the method employed to implement both of those. I'll be continuing to work on that, but I wanted to get up what I currently had for the sake of openness. |
Closing this in favor of #8194. |
Required for all non-trivial PRs
Required only if applicable
I'll open a pull request against the documentation repository, but essentially this is an aggregator function which returns a single value for the "area under the curve".
You can optionally specify a time interval argument which determines the units of the returned values. This defaults to one second, so
will return the energy in units of watt-seconds, whereas
will return the energy in watt-hours.
I think the only potentially contentious/confusing thing is the behaviour at the "edges" of a time period. As for other InfluxQL functions, we only analyse the datapoints within the specified period (or group-by interval). This means that the integral does not include the time between the start of the interval and the first datapoint, nor the time between the last datapoint and the end of the interval.
This will lead to slight inconsistencies, such as a set of daily integral values not quite adding up to the corresponding weekly integral. The only way to solve that would be to look outside the time period being analysed to find the neighbouring points, so that we can interpolate the curve to the edges of the period, but I don't see a mechanism for doing that. It could be potentially very expensive, as you might have to search forwards and back through all time in case there were points to use.
We already have some slight inconsistencies of a similar nature - for example the
derivative
function doesn't produce a value for the first point in each interval (for the same reason) so a set of derivative series calculated for each day doesn't quite match the result of analysing the whole week at once.I propose that we just explain the above limitation in the documentation for now.