-
Notifications
You must be signed in to change notification settings - Fork 3.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Wire up TOP aggregate #1821
Comments
+1 |
excited for this feature ! nice job so far |
Brilliant! Looking for this feature for deep integration with https://github.com/FastVPSEestiOu/fastnetmon :) |
We'll need to update the parser validation to allow this. See #3407 for more context. |
Adding some scenarios for comment. First, I'm suggesting we change the signature to be: SELECT TOP(field, limit) FROM measurement This will allow for queries as follows: SELECT TOP(value, 10) FROM cpu
SELECT TOP(value, 5), host, region FROM cpu Questions:
SELECT TOP(field, 5) FROM cpu LIMIT 20
SELECT TOP(field1, 5), field2, field3, tag1, tag2 FROM cpu For the given data:
scenario 1SELECT TOP(value, 3) FROM cpu It would return this data:
It now has to decide between multiple values of Is time always returned or only if they specify it in the query? scenario 2SELECT TOP(value, 3), host FROM cpu It would return this data:
Again, same scenario, how do we decide what tag to bring back, as it could be either SummaryOverall, I think the hardest thing to create deterministic output will be choosing which is top when the value is the same. |
My first thought on |
You can use the time as the first tie breaker and the series id as the next one. The
And you did this query: select top(value, 2), host from cpu You'd get two data points for host select top(value, host, 2) from cpu You would get the top value for |
This is interesting, similar to the work I am doing in #3823. Time makes sense, but should we break the tie on value? Whatever we do, we should do the same thing as we do in #3823, agreed? That way we have a consistent policy. I think I prefer breaking the tie on the value, since the series ID is not that obvious to our customers, and changes from deployment to deployment (i.e. you could get different results from different clusters, even for the same data). |
@otoolep the point is you can't break the tie on value with this. top specifies the value so that's the thing that is tied. So we need to resort to time and even then we may need to resort to something else. Even the tags won't be guaranteed to be unique. The only thing that is guaranteed is the in-memory series ID in the |
so to be very literal:
And you did this query: select top(value, 2), host from cpu You'd get two data points for host a back.
However, if you do this: select top(value, host, 2) from cpu
Suppose the data looked like this:
Could you do a query like:
Would it bring back 3 rows then? The top for value->host->region like: (you can see I'm uncertain what the column name is when there are multiple values in top)
NOTE: sorted by time, not value? Also, can I have data such as:
And do a query like:
|
@corylanou: The query with region you show should only bring back 2 values, because of the 2 argument you pass in. The first parameter is the value you're comparing, the other n parameters are the fields or tags whose combination must be unique, and the last parameter is the number of results per group by interval. For example, say you have this data:
The following queries yield these results: select top(value, host, 2) from memory
/* yields:
memory,host=a,service=redis value=1000 10
memory,host=b,service=mysql value=2000 10
*/
select top(value, host, service, 2) from memory
/* yields:
memory,host=b,service=mysql value=2000 10
memory,host=a,service=mysql value=1500 10
*/ The values should be sorted by time. |
@pauldix I'm not sure I'm clear on the difference between this query: select top(value, host, service, 2) from memory and this query: select top(value, 2), host, service from memory |
@corylanou the first one says that each result in the set must have a unique host, service combination. The second one says, also return the host and service in the result. Technically, the first one should be: select top(value, host, service, 2), host, service from memory But it's useful shorthand to just include the host and service in the result even though it wasn't explicitly asked for. The reason why the first type of query is useful is that you generally have multiple samples in a window. For example say we have this data:
Then you can see the results: select top(value, 2), host, service from memory
/* yields:
memory,host=b,service=mysql value=2001 20
memory,host=b,service=mysql value=2002 30
*/
select top(value, host, 2)
/* yields:
memory,host=b,service=mysql value=2002 30
memory,host=a,service=redis value=1002 30
*/
select top(value, service, 2)
/* yields:
memory,host=b,service=mysql value=2002 30
memory,host=b,service=redis value=1502 30
*/
select top(value, host, service, 2)
/* yields
memory,host=a,service=redis value=1002 30
memory,host=b,service=mysql value=2002 30
*/ |
@pauldix so for this query: select top(value, host, service, 2) from memory are we effectively "re-writing' it behind the scenes to behave like this query: select top(value, host, service, 2), host, service from memory Otherwise, I would expect that the Also, would this query then be valid: select top(value, host, 2), service from memory |
SELECT TOP(value, host, 5) FROM memory
To summarize the above, we have
So we need to return at least one value for each The question is how do we determine the next two values? Are they the next top regardless of host? I'm going to move forward with always the top of each slice,and if we don't have an even number to match the last set of slicing, then take the top of that last set uniquely by tag/field. |
Also, I think your last result on the previous comment should be: select top(value, host, service, 2)
not
To get the above, you can issue this query: select top(value, host, 2), service |
Closed by #3930 |
I started using the TOP aggregate which works VERY close to what we are looking for. With that being said, I think it makes a lot of sense for the TOP group by time to be agnostic of the group by time bucket that represents the granularity of data that is returned. What I mean is we would prefer to do something like "select top(value, pop, 5, 10m) from cpuUtil group by time(1m)". This would allow a lot of flexibility for the top aggregate. The aggregate can look at buckets of 10 minute windows while still returning a point every minute for really granular data, and better visualization. The current query is quite useful for one of ad-hoc queries, however if you want this to be consistently graphed it can be difficult to get what you want. |
@corylanou -- any comment on the feedback above? |
I stupidly left this comment in the Grafana thread, but it really belongs here since it's related to the InfluxDB query language. Tags are associated with the series itself, so in the example result set it doesn't make sense. There are multiple servers in the result of each series so a single tag wouldn't map. In fact, each of the series returned would be best represented as a scatter plot of a different color (all dots in the same series the same color). This is because the points may or may not be connected to each other. For example if you have a group by time of 10m, and top of 2, you could have 2 totally different servers in each 10m window. It's more like you want to have two queries in one. The first, which executes a top function on some window of time, and then a second that looks for those specific values and groups by the tags. For example: -- query 1
select top(value, host, 2) from cpu where time > now() - 10m group by region
-- results show server1, server2 in region west and server3, server4 in region east
-- query 2
select max(value) from cpu where time > now() - 4h and
server = 'server1' OR
server = 'server2' OR
server = 'server3' OR
server = 'server4'
group by time(1m) It would be good to have a shorthand to represent that kind of query. Not sure what the cleanest way to do it would be though. It'll probably look janky any way we do it. |
You are absolutely correct, this really should be two queries, The idea behind connecting the points is the fact that consistently the same results would return for what the "top" aggregate time bucket is queried as. In my example of top(value, pop, 5, 10m). We can then insure that in that 10 minute span the same tag set will return. In that tag set I want 10 points strung together once per minute to draw a consistent line. This would get away from a scatter plot and be easier to visualize. Optionally the short hand you are indicating could be accomplished by adding or removing the additional timespan in the top aggregate, this seems to me that it would be clean, albeit the coverage on the parser wouldn't be the cleanest thing to implement. I hope that makes some sense. |
Anyone know how to use top and limit? Ex: select top(value, cidr, 2) from networks where direction = 'incoming' and resource = 'pps' and time > now() - 10m group by cidr limit 2 limit does not work |
Can you provide some clarification on what "does not work" means? What results did you get vs. what did you expect? |
hello! I resolved that using SLIMIT SELECT top(value, cidr, 1) FROM "networks" WHERE "direction" = 'incoming' AND "resource" = 'pps' AND $timeFilter GROUP BY cidr SLIMIT 5 Now apparently he takes one top of each cidr and limited to 5 results. |
Should work exactly like #1820, but for top.
The text was updated successfully, but these errors were encountered: