Wire up TOP aggregate #1821

pauldix · 2015-03-03T00:52:31Z

Should work exactly like #1820, but for top.

ecables · 2015-08-11T23:29:19Z

+1

zenvdeluca · 2015-08-17T12:48:31Z

excited for this feature ! nice job so far

pavel-odintsov · 2015-08-17T13:03:37Z

Brilliant! Looking for this feature for deep integration with https://github.com/FastVPSEestiOu/fastnetmon :)

pauldix · 2015-08-19T15:29:33Z

We'll need to update the parser validation to allow this. See #3407 for more context.

corylanou · 2015-08-25T21:11:04Z

Adding some scenarios for comment.

First, I'm suggesting we change the signature to be:

SELECT TOP(field, limit) FROM measurement

This will allow for queries as follows:

SELECT TOP(value, 10) FROM cpu
SELECT TOP(value, 5), host, region FROM cpu

Questions:

What happens if the top limit doesn't match the statement limit:

SELECT TOP(field, 5) FROM cpu LIMIT 20

Are we allowed to have additional fields:

SELECT TOP(field1, 5), field2, field3, tag1, tag2 FROM cpu

For the given data:

cpu,host=server01,region=east value=2.0 946684800000000000
cpu,host=server02,region=east value=3.0 946684800000000000
cpu,host=server03,region=east value=3.0 946684800000000000
cpu,host=server04,region=east value=5.0 946684800000000000
cpu,host=server01,region=east value=6.0 946684801000000000
cpu,host=server02,region=east value=4.0 946684801000000000
cpu,host=server03,region=east value=4.0 946684801000000000
cpu,host=server04,region=east value=3.0 946684801000000000

scenario 1

SELECT TOP(value, 3) FROM cpu

It would return this data:

time,top
946684800000000000,5.0 
946684801000000000,6.0 
???

It now has to decide between multiple values of 4.0. Does it default to time next?
If so, what is preferred, newer time or higher time?

Is time always returned or only if they specify it in the query?
Also, should the results be sorted by time if time isn't specified in the query?

scenario 2

SELECT TOP(value, 3), host FROM cpu

It would return this data:

time,top,host
946684800000000000,5.0,server04
946684801000000000,6.0,server01
???

Again, same scenario, how do we decide what tag to bring back, as it could be either server03, or server04

Summary

Overall, I think the hardest thing to create deterministic output will be choosing which is top when the value is the same.

/cc @pauldix @otoolep @dgnorton

dgnorton · 2015-08-25T21:28:16Z

My first thought on TOP limit and statement limit is that the result would be the lesser of the two. Also may need to consider how TOP interacts with OFFSET and SLIMIT.

pauldix · 2015-08-25T23:34:28Z

You can use the time as the first tie breaker and the series id as the next one. The TOP function should also support having more than one field/tag in there. The first argument is the value that will be evaluated, the other tags and fields will be forced to be unique. For instance if you have this data:

cpu,host=a value=99 10
cpu,host=b value=53 10
cpu,host=a value=88 20

And you did this query:

select top(value, 2), host from cpu

You'd get two data points for host a back. However, if you do this:

select top(value, host, 2) from cpu

You would get the top value for a and the top value for b

otoolep · 2015-08-25T23:37:30Z

This is interesting, similar to the work I am doing in #3823. Time makes sense, but should we break the tie on value? Whatever we do, we should do the same thing as we do in #3823, agreed? That way we have a consistent policy.

I think I prefer breaking the tie on the value, since the series ID is not that obvious to our customers, and changes from deployment to deployment (i.e. you could get different results from different clusters, even for the same data).

pauldix · 2015-08-25T23:39:19Z

@otoolep the point is you can't break the tie on value with this. top specifies the value so that's the thing that is tied. So we need to resort to time and even then we may need to resort to something else. Even the tags won't be guaranteed to be unique. The only thing that is guaranteed is the in-memory series ID in the DatabaseIndex

corylanou · 2015-08-26T14:41:48Z

so to be very literal:

cpu,host=a value=99 10
cpu,host=b value=53 10
cpu,host=a value=88 20

And you did this query:

select top(value, 2), host from cpu

You'd get two data points for host a back.

time, top, host
10,99,a
20,88,a

However, if you do this:

select top(value, host, 2) from cpu

time,top-value,top-host
10,99,a
10,53,b

Suppose the data looked like this:

cpu,host=a,region=east value=99 10
cpu,host=b,region=west value=53 10
cpu,host=a,region=east value=88 20

Could you do a query like:

select top(value, host, region, 2)

Would it bring back 3 rows then? The top for value->host->region like:

(you can see I'm uncertain what the column name is when there are multiple values in top)

time,top-value,top-host,top-region
10,99,a,east
10,53,b,west
20,88,b,east

NOTE: sorted by time, not value?

Also, can I have data such as:

cpu,host=a,region=east value=99,in=40,out=30 10
cpu,host=b,region=west value=53,in=50,out=20 10
cpu,host=a,region=east value=88,in=60,out=10 20

And do a query like:

select top(value, in, out, host, region, 2)

@pauldix

pauldix · 2015-08-27T14:26:37Z

@corylanou: The query with region you show should only bring back 2 values, because of the 2 argument you pass in. The first parameter is the value you're comparing, the other n parameters are the fields or tags whose combination must be unique, and the last parameter is the number of results per group by interval.

For example, say you have this data:

memory,host=a,service=redis value=1000 10
memory,host=b,service=mysql value=2000 10
memory,host=a,service=mysql value=1500 10

The following queries yield these results:

select top(value, host, 2) from memory
/* yields:
memory,host=a,service=redis value=1000 10
memory,host=b,service=mysql value=2000 10
*/

select top(value, host, service, 2) from memory
/* yields:
memory,host=b,service=mysql value=2000 10
memory,host=a,service=mysql value=1500 10
*/

The values should be sorted by time.

corylanou · 2015-08-27T14:54:17Z

@pauldix I'm not sure I'm clear on the difference between this query:

select top(value, host, service, 2) from memory

and this query:

select top(value, 2), host, service from memory

pauldix · 2015-08-27T15:36:17Z

@corylanou the first one says that each result in the set must have a unique host, service combination. The second one says, also return the host and service in the result. Technically, the first one should be:

select top(value, host, service, 2), host, service from memory

But it's useful shorthand to just include the host and service in the result even though it wasn't explicitly asked for.

The reason why the first type of query is useful is that you generally have multiple samples in a window. For example say we have this data:

memory,host=a,service=redis value=1000 10
memory,host=b,service=mysql value=2000 10
memory,host=b,service=redis value=1500 10
memory,host=a,service=redis value=1001 20
memory,host=b,service=mysql value=2001 20
memory,host=b,service=redis value=1502 20
memory,host=a,service=redis value=1002 30
memory,host=b,service=mysql value=2002 30
memory,host=b,service=redis value=1502 30

Then you can see the results:

select top(value, 2), host, service from memory
/* yields:
memory,host=b,service=mysql value=2001 20
memory,host=b,service=mysql value=2002 30
*/

select top(value, host, 2)
/* yields:
memory,host=b,service=mysql value=2002 30
memory,host=a,service=redis value=1002 30
*/

select top(value, service, 2)
/* yields:
memory,host=b,service=mysql value=2002 30
memory,host=b,service=redis value=1502 30
*/

select top(value, host, service, 2)
/* yields
memory,host=a,service=redis value=1002 30
memory,host=b,service=mysql value=2002 30
*/

corylanou · 2015-08-27T15:48:27Z

@pauldix so for this query:

select top(value, host, service, 2) from memory

are we effectively "re-writing' it behind the scenes to behave like this query:

select top(value, host, service, 2), host, service from memory

Otherwise, I would expect that the host,service would not come back in the results, which doesn't seem very useful.

Also, would this query then be valid:

select top(value, host, 2), service from memory

corylanou · 2015-09-03T15:56:26Z

@pauldix

SELECT TOP(value, host, 5) FROM memory

memory,host=a value=1000 10
memory,host=b value=2000 10
memory,host=c value=1500 10
memory,host=a value=1001 20
memory,host=b value=2001 20
memory,host=c value=1502 20
memory,host=a value=1002 30
memory,host=b value=2002 30
memory,host=a value=3002 40
memory,host=b value=3003 40
memory,host=c value=4003 40

To summarize the above, we have

4 host=a
4 host=b
3 host=c

So we need to return at least one value for each a, b, and c.

The question is how do we determine the next two values? Are they the next top regardless of host?
Are then the next top uniquely for a, b, c? and pick the top 2 out of those 3?

I'm going to move forward with always the top of each slice,and if we don't have an even number to match the last set of slicing, then take the top of that last set uniquely by tag/field.

corylanou · 2015-09-03T17:13:10Z

Also, I think your last result on the previous comment should be:

select top(value, host, service, 2)

memory,host=b,service=mysql value=2002 30
memory,host=b,service=redis value=1502 30

not

memory,host=a,service=redis value=1002 30
memory,host=b,service=mysql value=2002 30

To get the above, you can issue this query:

select top(value, host, 2), service

corylanou · 2015-09-08T13:58:10Z

Closed by #3930

Jhors2 · 2015-09-10T20:20:06Z

I started using the TOP aggregate which works VERY close to what we are looking for. With that being said, I think it makes a lot of sense for the TOP group by time to be agnostic of the group by time bucket that represents the granularity of data that is returned.

What I mean is we would prefer to do something like "select top(value, pop, 5, 10m) from cpuUtil group by time(1m)". This would allow a lot of flexibility for the top aggregate. The aggregate can look at buckets of 10 minute windows while still returning a point every minute for really granular data, and better visualization. The current query is quite useful for one of ad-hoc queries, however if you want this to be consistently graphed it can be difficult to get what you want.

otoolep · 2015-09-10T21:47:15Z

@corylanou -- any comment on the feedback above?

pauldix · 2015-09-11T02:32:40Z

I stupidly left this comment in the Grafana thread, but it really belongs here since it's related to the InfluxDB query language.

Tags are associated with the series itself, so in the example result set it doesn't make sense. There are multiple servers in the result of each series so a single tag wouldn't map. In fact, each of the series returned would be best represented as a scatter plot of a different color (all dots in the same series the same color).

This is because the points may or may not be connected to each other. For example if you have a group by time of 10m, and top of 2, you could have 2 totally different servers in each 10m window.

It's more like you want to have two queries in one. The first, which executes a top function on some window of time, and then a second that looks for those specific values and groups by the tags. For example:

-- query 1
select top(value, host, 2) from cpu where time > now() - 10m group by region
-- results show server1, server2 in region west and server3, server4 in region east

-- query 2
select max(value) from cpu where time > now() - 4h and
  server = 'server1' OR
  server = 'server2' OR
  server = 'server3' OR
  server = 'server4'
group by time(1m)

It would be good to have a shorthand to represent that kind of query. Not sure what the cleanest way to do it would be though. It'll probably look janky any way we do it.

Jhors2 · 2015-09-11T02:46:13Z

You are absolutely correct, this really should be two queries, The idea behind connecting the points is the fact that consistently the same results would return for what the "top" aggregate time bucket is queried as.

In my example of top(value, pop, 5, 10m). We can then insure that in that 10 minute span the same tag set will return. In that tag set I want 10 points strung together once per minute to draw a consistent line. This would get away from a scatter plot and be easier to visualize.

Optionally the short hand you are indicating could be accomplished by adding or removing the additional timespan in the top aggregate, this seems to me that it would be clean, albeit the coverage on the parser wouldn't be the cleanest thing to implement.

I hope that makes some sense.

guilhermefigueiredo · 2015-09-17T21:12:32Z

Anyone know how to use top and limit? Ex:

select top(value, cidr, 2) from networks where direction = 'incoming' and resource = 'pps' and time > now() - 10m group by cidr limit 2

limit does not work

corylanou · 2015-09-17T21:15:37Z

Can you provide some clarification on what "does not work" means? What results did you get vs. what did you expect?

guilhermefigueiredo · 2015-09-17T21:46:15Z

hello!

I resolved that using SLIMIT

SELECT top(value, cidr, 1) FROM "networks" WHERE "direction" = 'incoming' AND "resource" = 'pps' AND $timeFilter GROUP BY cidr SLIMIT 5

Now apparently he takes one top of each cidr and limited to 5 results.

pauldix mentioned this issue Mar 3, 2015

DISTINCT error with 0.9rc6 #1804

Closed

beckettsean added this to the 0.9.0 milestone Apr 20, 2015

toddboom added the area/functions label Apr 23, 2015

beckettsean mentioned this issue Apr 30, 2015

Top N of an Avg Value #2467

Closed

toddboom modified the milestones: 0.9.0, 0.9.1 May 8, 2015

beckettsean modified the milestones: 0.9.2, 0.9.1 Jun 16, 2015

beckettsean mentioned this issue Jun 19, 2015

Top N support? #3055

Closed

beckettsean added status/help-wanted difficulty/high This issue needs to be broken down into smaller units of work. labels Jul 15, 2015

beckettsean modified the milestones: 0.9.4, 0.9.2 Jul 15, 2015

pauldix assigned corylanou Aug 24, 2015

pauldix mentioned this issue Aug 25, 2015

[feature request] Highest average, etc. #3680

Closed

corylanou mentioned this issue Sep 1, 2015

Wire up Top aggregate function #3930

Merged

corylanou closed this as completed Sep 8, 2015

mvadu mentioned this issue Jan 13, 2016

[Feature Request] Allow for Nested aggregate in TOP #5345

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Wire up TOP aggregate #1821

Wire up TOP aggregate #1821

pauldix commented Mar 3, 2015

ecables commented Aug 11, 2015

zenvdeluca commented Aug 17, 2015

pavel-odintsov commented Aug 17, 2015

pauldix commented Aug 19, 2015

corylanou commented Aug 25, 2015

dgnorton commented Aug 25, 2015

pauldix commented Aug 25, 2015

otoolep commented Aug 25, 2015

pauldix commented Aug 25, 2015

corylanou commented Aug 26, 2015

pauldix commented Aug 27, 2015

corylanou commented Aug 27, 2015

pauldix commented Aug 27, 2015

corylanou commented Aug 27, 2015

corylanou commented Sep 3, 2015

corylanou commented Sep 3, 2015

corylanou commented Sep 8, 2015

Jhors2 commented Sep 10, 2015

otoolep commented Sep 10, 2015

pauldix commented Sep 11, 2015

Jhors2 commented Sep 11, 2015

guilhermefigueiredo commented Sep 17, 2015

corylanou commented Sep 17, 2015

guilhermefigueiredo commented Sep 17, 2015

Wire up TOP aggregate #1821

Wire up TOP aggregate #1821

Comments

pauldix commented Mar 3, 2015

ecables commented Aug 11, 2015

zenvdeluca commented Aug 17, 2015

pavel-odintsov commented Aug 17, 2015

pauldix commented Aug 19, 2015

corylanou commented Aug 25, 2015

For the given data:

scenario 1

scenario 2

Summary

dgnorton commented Aug 25, 2015

pauldix commented Aug 25, 2015

otoolep commented Aug 25, 2015

pauldix commented Aug 25, 2015

corylanou commented Aug 26, 2015

pauldix commented Aug 27, 2015

corylanou commented Aug 27, 2015

pauldix commented Aug 27, 2015

corylanou commented Aug 27, 2015

corylanou commented Sep 3, 2015

corylanou commented Sep 3, 2015

corylanou commented Sep 8, 2015

Jhors2 commented Sep 10, 2015

otoolep commented Sep 10, 2015

pauldix commented Sep 11, 2015

Jhors2 commented Sep 11, 2015

guilhermefigueiredo commented Sep 17, 2015

corylanou commented Sep 17, 2015

guilhermefigueiredo commented Sep 17, 2015