Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Wire up TOP aggregate #1821

Closed
pauldix opened this issue Mar 3, 2015 · 24 comments
Closed

Wire up TOP aggregate #1821

pauldix opened this issue Mar 3, 2015 · 24 comments
Assignees
Labels
area/functions difficulty/high This issue needs to be broken down into smaller units of work.
Milestone

Comments

@pauldix
Copy link
Member

pauldix commented Mar 3, 2015

Should work exactly like #1820, but for top.

@beckettsean beckettsean added this to the 0.9.0 milestone Apr 20, 2015
@toddboom toddboom modified the milestones: 0.9.0, 0.9.1 May 8, 2015
@beckettsean beckettsean modified the milestones: 0.9.2, 0.9.1 Jun 16, 2015
@beckettsean beckettsean added status/help-wanted difficulty/high This issue needs to be broken down into smaller units of work. labels Jul 15, 2015
@beckettsean beckettsean modified the milestones: 0.9.4, 0.9.2 Jul 15, 2015
@ecables
Copy link

ecables commented Aug 11, 2015

+1

@zenvdeluca
Copy link

excited for this feature ! nice job so far

@pavel-odintsov
Copy link

Brilliant! Looking for this feature for deep integration with https://github.com/FastVPSEestiOu/fastnetmon :)

@pauldix
Copy link
Member Author

pauldix commented Aug 19, 2015

We'll need to update the parser validation to allow this. See #3407 for more context.

@corylanou
Copy link
Contributor

Adding some scenarios for comment.

First, I'm suggesting we change the signature to be:

SELECT TOP(field, limit) FROM measurement

This will allow for queries as follows:

SELECT TOP(value, 10) FROM cpu
SELECT TOP(value, 5), host, region FROM cpu

Questions:

  • What happens if the top limit doesn't match the statement limit:
SELECT TOP(field, 5) FROM cpu LIMIT 20
  • Are we allowed to have additional fields:
SELECT TOP(field1, 5), field2, field3, tag1, tag2 FROM cpu

For the given data:

cpu,host=server01,region=east value=2.0 946684800000000000
cpu,host=server02,region=east value=3.0 946684800000000000
cpu,host=server03,region=east value=3.0 946684800000000000
cpu,host=server04,region=east value=5.0 946684800000000000
cpu,host=server01,region=east value=6.0 946684801000000000
cpu,host=server02,region=east value=4.0 946684801000000000
cpu,host=server03,region=east value=4.0 946684801000000000
cpu,host=server04,region=east value=3.0 946684801000000000

scenario 1

SELECT TOP(value, 3) FROM cpu

It would return this data:

time,top
946684800000000000,5.0 
946684801000000000,6.0 
???

It now has to decide between multiple values of 4.0. Does it default to time next?
If so, what is preferred, newer time or higher time?

Is time always returned or only if they specify it in the query?
Also, should the results be sorted by time if time isn't specified in the query?

scenario 2

SELECT TOP(value, 3), host FROM cpu

It would return this data:

time,top,host
946684800000000000,5.0,server04
946684801000000000,6.0,server01
???

Again, same scenario, how do we decide what tag to bring back, as it could be either server03, or server04

Summary

Overall, I think the hardest thing to create deterministic output will be choosing which is top when the value is the same.

/cc @pauldix @otoolep @dgnorton

@dgnorton
Copy link
Contributor

My first thought on TOP limit and statement limit is that the result would be the lesser of the two. Also may need to consider how TOP interacts with OFFSET and SLIMIT.

@pauldix
Copy link
Member Author

pauldix commented Aug 25, 2015

You can use the time as the first tie breaker and the series id as the next one. The TOP function should also support having more than one field/tag in there. The first argument is the value that will be evaluated, the other tags and fields will be forced to be unique. For instance if you have this data:

cpu,host=a value=99 10
cpu,host=b value=53 10
cpu,host=a value=88 20

And you did this query:

select top(value, 2), host from cpu

You'd get two data points for host a back. However, if you do this:

select top(value, host, 2) from cpu

You would get the top value for a and the top value for b

@otoolep
Copy link
Contributor

otoolep commented Aug 25, 2015

This is interesting, similar to the work I am doing in #3823. Time makes sense, but should we break the tie on value? Whatever we do, we should do the same thing as we do in #3823, agreed? That way we have a consistent policy.

I think I prefer breaking the tie on the value, since the series ID is not that obvious to our customers, and changes from deployment to deployment (i.e. you could get different results from different clusters, even for the same data).

@pauldix
Copy link
Member Author

pauldix commented Aug 25, 2015

@otoolep the point is you can't break the tie on value with this. top specifies the value so that's the thing that is tied. So we need to resort to time and even then we may need to resort to something else. Even the tags won't be guaranteed to be unique. The only thing that is guaranteed is the in-memory series ID in the DatabaseIndex

@corylanou
Copy link
Contributor

so to be very literal:

cpu,host=a value=99 10
cpu,host=b value=53 10
cpu,host=a value=88 20

And you did this query:

select top(value, 2), host from cpu

You'd get two data points for host a back.

time, top, host
10,99,a
20,88,a

However, if you do this:

select top(value, host, 2) from cpu

time,top-value,top-host
10,99,a
10,53,b

Suppose the data looked like this:

cpu,host=a,region=east value=99 10
cpu,host=b,region=west value=53 10
cpu,host=a,region=east value=88 20

Could you do a query like:

select top(value, host, region, 2)

Would it bring back 3 rows then? The top for value->host->region like:

(you can see I'm uncertain what the column name is when there are multiple values in top)

time,top-value,top-host,top-region
10,99,a,east
10,53,b,west
20,88,b,east

NOTE: sorted by time, not value?

Also, can I have data such as:

cpu,host=a,region=east value=99,in=40,out=30 10
cpu,host=b,region=west value=53,in=50,out=20 10
cpu,host=a,region=east value=88,in=60,out=10 20

And do a query like:

select top(value, in, out, host, region, 2)

@pauldix

@pauldix
Copy link
Member Author

pauldix commented Aug 27, 2015

@corylanou: The query with region you show should only bring back 2 values, because of the 2 argument you pass in. The first parameter is the value you're comparing, the other n parameters are the fields or tags whose combination must be unique, and the last parameter is the number of results per group by interval.

For example, say you have this data:

memory,host=a,service=redis value=1000 10
memory,host=b,service=mysql value=2000 10
memory,host=a,service=mysql value=1500 10

The following queries yield these results:

select top(value, host, 2) from memory
/* yields:
memory,host=a,service=redis value=1000 10
memory,host=b,service=mysql value=2000 10
*/

select top(value, host, service, 2) from memory
/* yields:
memory,host=b,service=mysql value=2000 10
memory,host=a,service=mysql value=1500 10
*/

The values should be sorted by time.

@corylanou
Copy link
Contributor

@pauldix I'm not sure I'm clear on the difference between this query:

select top(value, host, service, 2) from memory

and this query:

select top(value, 2), host, service from memory

@pauldix
Copy link
Member Author

pauldix commented Aug 27, 2015

@corylanou the first one says that each result in the set must have a unique host, service combination. The second one says, also return the host and service in the result. Technically, the first one should be:

select top(value, host, service, 2), host, service from memory

But it's useful shorthand to just include the host and service in the result even though it wasn't explicitly asked for.

The reason why the first type of query is useful is that you generally have multiple samples in a window. For example say we have this data:

memory,host=a,service=redis value=1000 10
memory,host=b,service=mysql value=2000 10
memory,host=b,service=redis value=1500 10
memory,host=a,service=redis value=1001 20
memory,host=b,service=mysql value=2001 20
memory,host=b,service=redis value=1502 20
memory,host=a,service=redis value=1002 30
memory,host=b,service=mysql value=2002 30
memory,host=b,service=redis value=1502 30

Then you can see the results:

select top(value, 2), host, service from memory
/* yields:
memory,host=b,service=mysql value=2001 20
memory,host=b,service=mysql value=2002 30
*/

select top(value, host, 2)
/* yields:
memory,host=b,service=mysql value=2002 30
memory,host=a,service=redis value=1002 30
*/

select top(value, service, 2)
/* yields:
memory,host=b,service=mysql value=2002 30
memory,host=b,service=redis value=1502 30
*/

select top(value, host, service, 2)
/* yields
memory,host=a,service=redis value=1002 30
memory,host=b,service=mysql value=2002 30
*/

@corylanou
Copy link
Contributor

@pauldix so for this query:

select top(value, host, service, 2) from memory

are we effectively "re-writing' it behind the scenes to behave like this query:

select top(value, host, service, 2), host, service from memory

Otherwise, I would expect that the host,service would not come back in the results, which doesn't seem very useful.

Also, would this query then be valid:

select top(value, host, 2), service from memory

@corylanou
Copy link
Contributor

@pauldix

SELECT TOP(value, host, 5) FROM memory
memory,host=a value=1000 10
memory,host=b value=2000 10
memory,host=c value=1500 10
memory,host=a value=1001 20
memory,host=b value=2001 20
memory,host=c value=1502 20
memory,host=a value=1002 30
memory,host=b value=2002 30
memory,host=a value=3002 40
memory,host=b value=3003 40
memory,host=c value=4003 40

To summarize the above, we have

  • 4 host=a
  • 4 host=b
  • 3 host=c

So we need to return at least one value for each a, b, and c.

The question is how do we determine the next two values? Are they the next top regardless of host?
Are then the next top uniquely for a, b, c? and pick the top 2 out of those 3?

I'm going to move forward with always the top of each slice,and if we don't have an even number to match the last set of slicing, then take the top of that last set uniquely by tag/field.

@corylanou
Copy link
Contributor

Also, I think your last result on the previous comment should be:

select top(value, host, service, 2)
memory,host=b,service=mysql value=2002 30
memory,host=b,service=redis value=1502 30

not

memory,host=a,service=redis value=1002 30
memory,host=b,service=mysql value=2002 30

To get the above, you can issue this query:

select top(value, host, 2), service

@corylanou
Copy link
Contributor

Closed by #3930

@Jhors2
Copy link

Jhors2 commented Sep 10, 2015

I started using the TOP aggregate which works VERY close to what we are looking for. With that being said, I think it makes a lot of sense for the TOP group by time to be agnostic of the group by time bucket that represents the granularity of data that is returned.

What I mean is we would prefer to do something like "select top(value, pop, 5, 10m) from cpuUtil group by time(1m)". This would allow a lot of flexibility for the top aggregate. The aggregate can look at buckets of 10 minute windows while still returning a point every minute for really granular data, and better visualization. The current query is quite useful for one of ad-hoc queries, however if you want this to be consistently graphed it can be difficult to get what you want.

@otoolep
Copy link
Contributor

otoolep commented Sep 10, 2015

@corylanou -- any comment on the feedback above?

@pauldix
Copy link
Member Author

pauldix commented Sep 11, 2015

I stupidly left this comment in the Grafana thread, but it really belongs here since it's related to the InfluxDB query language.

Tags are associated with the series itself, so in the example result set it doesn't make sense. There are multiple servers in the result of each series so a single tag wouldn't map. In fact, each of the series returned would be best represented as a scatter plot of a different color (all dots in the same series the same color).

This is because the points may or may not be connected to each other. For example if you have a group by time of 10m, and top of 2, you could have 2 totally different servers in each 10m window.

It's more like you want to have two queries in one. The first, which executes a top function on some window of time, and then a second that looks for those specific values and groups by the tags. For example:

-- query 1
select top(value, host, 2) from cpu where time > now() - 10m group by region
-- results show server1, server2 in region west and server3, server4 in region east

-- query 2
select max(value) from cpu where time > now() - 4h and
  server = 'server1' OR
  server = 'server2' OR
  server = 'server3' OR
  server = 'server4'
group by time(1m)

It would be good to have a shorthand to represent that kind of query. Not sure what the cleanest way to do it would be though. It'll probably look janky any way we do it.

@Jhors2
Copy link

Jhors2 commented Sep 11, 2015

You are absolutely correct, this really should be two queries, The idea behind connecting the points is the fact that consistently the same results would return for what the "top" aggregate time bucket is queried as.

In my example of top(value, pop, 5, 10m). We can then insure that in that 10 minute span the same tag set will return. In that tag set I want 10 points strung together once per minute to draw a consistent line. This would get away from a scatter plot and be easier to visualize.

Optionally the short hand you are indicating could be accomplished by adding or removing the additional timespan in the top aggregate, this seems to me that it would be clean, albeit the coverage on the parser wouldn't be the cleanest thing to implement.

I hope that makes some sense.

@guilhermefigueiredo
Copy link

Anyone know how to use top and limit? Ex:

select top(value, cidr, 2) from networks where direction = 'incoming' and resource = 'pps' and time > now() - 10m group by cidr limit 2

limit does not work

@corylanou
Copy link
Contributor

Can you provide some clarification on what "does not work" means? What results did you get vs. what did you expect?

@guilhermefigueiredo
Copy link

hello!

I resolved that using SLIMIT

SELECT top(value, cidr, 1) FROM "networks" WHERE "direction" = 'incoming' AND "resource" = 'pps' AND $timeFilter GROUP BY cidr SLIMIT 5

Now apparently he takes one top of each cidr and limited to 5 results.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/functions difficulty/high This issue needs to be broken down into smaller units of work.
Projects
None yet
Development

No branches or pull requests