Add a new quantile histogram aggregation for numeric fields #50386

Open
agirbal opened this issue Dec 19, 2019 · 3 comments
Labels
:Analytics/Aggregations Aggregations >enhancement >feature Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo)

agirbal commented Dec 19, 2019

This issue is related to #31828 and to some extent #28993. It would be more useful for most of our use cases than #31828 (cc @pcsanwald).
I talked about this feature to @smayzak and @AlonaNadler a bit.

Problem: when building a histogram on a numeric field (on the X-axis), it is very common for the documents to be concentrated in a tiny portion of the histogram. A common example: if you plot against, say, "user request latency" of a production system, 90+% of the documents are going to be concentrated in the first bucket. It is a long-tail problem that is common to most production datasets. Trying to filter out the higher values is very tedious, and you still end up with a histogram that is not conducive to any analysis or conclusions.

Ideal solution: most data analysis (that we base decisions on) instead uses a quantile distribution on the X-axis, meaning that each bucket represents an equal portion of the data. For example, the first bucket would be the 10% of users with the best "request latency" (call it p0-10), the next would be the 10-20% best (p10-20), and so on, and the last bucket would be the 10% of users with the worst performance (p90-100). In turn this lets the operator do very clear analysis: "this change in my software is hurting performance by 5% for my 10% best-connected users but improves it by 15% for my p90 users, so it's a very positive change." The buckets could either each cover an equal portion of the dataset or, better, be customizable as ranges of percentile ranks, just like you do in the percentiles value function.
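
To make the difference concrete, here is a minimal client-side sketch in plain Python (not Elasticsearch code). It compares fixed-width bucket edges with quantile edges on a synthetic long-tailed dataset. The function names and the synthetic latencies are assumptions made up for illustration.

```python
from bisect import bisect_right
from statistics import quantiles


def fixed_width_edges(values, n_buckets=10):
    """Edges of n_buckets equal-width buckets spanning min..max."""
    lo, hi = min(values), max(values)
    step = (hi - lo) / n_buckets
    return [lo + i * step for i in range(n_buckets + 1)]


def quantile_edges(values, n_buckets=10):
    """Edges chosen so each bucket holds roughly the same share of the data."""
    cuts = quantiles(values, n=n_buckets, method="inclusive")
    return [min(values)] + cuts + [max(values)]


def bucket_counts(values, edges):
    """Count values per bucket; buckets are [edge_i, edge_i+1), last one closed."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        i = min(bisect_right(edges, v) - 1, len(counts) - 1)
        counts[i] += 1
    return counts


# Synthetic long-tailed latencies (ms): 900 fast requests, 100 slow ones.
latencies = [10 + i // 10 for i in range(900)] + [1000 + 50 * i for i in range(100)]

fixed_counts = bucket_counts(latencies, fixed_width_edges(latencies))
quant_counts = bucket_counts(latencies, quantile_edges(latencies))

print("fixed-width:", fixed_counts)  # almost everything lands in the first bucket
print("quantile:   ", quant_counts)  # roughly even buckets
```

With fixed-width buckets the 900 fast requests all fall into the first bucket, while the quantile edges spread the same data roughly evenly, which is the p0-10 ... p90-100 view described above.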

Workaround: as suggested by @jpountz, you can do a pre-flight request to ES to obtain the quantile bucket bounds, then make a second request for a standard histogram with known buckets. I have done this and it works, but it is extremely cumbersome and not really a viable solution, beyond being a fun experiment. I had to create a complex HTML form to pick the fields, percentiles, the function to apply to the Y-axis, etc., then hack together a complex URL query string to generate the Kibana histogram, which is guaranteed to break. From there, the display in Kibana is not really shareable: you can't change the time window or any filter without redoing the whole thing, because the buckets need to be recalculated.
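
For reference, the two-request workaround can be sketched roughly as follows. This is not the HTML form or URL hack mentioned above, only an illustration in Python that builds the two request bodies: a `percentiles` aggregation for the bounds, then a `range` aggregation using those bounds. The field name `latency_ms`, the aggregation names `bounds` and `by_quantile`, and the helper functions are assumptions. Sending the requests is left to whatever client you use.

```python
def percentile_request(field, percents):
    """Pre-flight body: ask ES for the values at the given percentile ranks."""
    return {
        "size": 0,
        "aggs": {"bounds": {"percentiles": {"field": field, "percents": percents}}},
    }


def range_request(field, percents, bounds_response):
    """Second body: a range aggregation whose buckets follow the percentile bounds.

    The first and last ranges are left open-ended so no document is dropped
    (range buckets include `from` and exclude `to`).
    """
    values = bounds_response["aggregations"]["bounds"]["values"]
    # Keyed percentile responses use keys such as "10.0" or "99.9".
    edges = [values[str(float(p))] for p in percents]
    ranks = [0] + list(percents) + [100]

    ranges = []
    for i in range(len(ranks) - 1):
        bucket = {"key": f"p{ranks[i]:g}-{ranks[i + 1]:g}"}
        if i > 0:
            bucket["from"] = edges[i - 1]
        if i < len(edges):
            bucket["to"] = edges[i]
        ranges.append(bucket)

    return {
        "size": 0,
        "aggs": {"by_quantile": {"range": {"field": field, "ranges": ranges}}},
    }


# Example with a hand-written response standing in for the pre-flight result.
percents = [10, 50, 90]
fake_response = {
    "aggregations": {"bounds": {"values": {"10.0": 12.0, "50.0": 40.0, "90.0": 950.0}}}
}
first_body = percentile_request("latency_ms", percents)
second_body = range_request("latency_ms", percents, fake_response)
print(second_body["aggs"]["by_quantile"]["range"]["ranges"])
```

Any change to the time window or filters changes the percentile values, so both requests have to be rerun each time. That is exactly why this approach does not fit a saved Kibana visualization.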

Note that there are already Kibana tickets about this:
elastic/kibana#3905 and elastic/kibana#3757.
But for this to work seamlessly in Kibana, it really seems that ES should support it as a native aggregation.
Thanks much!

@elasticmachine
Collaborator

Pinging @elastic/es-analytics-geo (:Analytics/Aggregations)

@rjernst rjernst added the Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) label May 4, 2020
@talevy
Contributor

talevy commented Jul 23, 2020

It would be great to do this in two passes, on sorted data. Blocked on multi-pass aggregation support.

@elasticsearchmachine
Collaborator

Pinging @elastic/es-analytical-engine (Team:Analytics)


8 participants