
Paging support for aggregations #4915

Closed
aaneja opened this issue Jan 27, 2014 · 207 comments

@aaneja

aaneja commented Jan 27, 2014

Terms aggregation does not support a way to page through the buckets returned.
To work around this, I've been trying to set 'min_doc_count' to limit the buckets returned and using an 'exclude' filter to exclude already-'seen' pages.

Will this result in better runtime performance on the ES cluster as compared to getting all the buckets and then doing my own paging logic client-side?
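For illustration, a sketch of what that workaround's request body might look like, built here in Python (the field name and seen terms are hypothetical; in 1.x `exclude` takes a regex, so the seen terms are joined into an alternation):

```python
def terms_agg_with_exclude(field, seen_terms, size=10):
    """Sketch of the exclude-based paging workaround: request the next page of
    buckets while excluding terms already seen on earlier pages."""
    return {
        "size": 0,
        "aggs": {
            "page": {
                "terms": {
                    "field": field,
                    "size": size,
                    # 'exclude' is a regex here; newer versions also
                    # accept an array of exact values
                    "exclude": "|".join(seen_terms),
                }
            }
        },
    }

# Page 2: skip the hypothetical terms that page 1 returned
body = terms_agg_with_exclude("category", ["shoes", "hats"])
```

Note the cluster still computes counts for every term on each request; the exclude only trims what is returned.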

@jpountz
Contributor

jpountz commented Feb 3, 2014

Paging is tricky to implement because document counts for terms aggregations are not exact when shard_size is less than the field cardinality and sorting on count desc. So weird things may happen like the first term of the 2nd page having a higher count than the last element of the first page, etc.

Regarding your question, terms aggregations run in two phases on the shard-level: first they compute counts for every possible term, and then they pick the top shard_size ones. Increasing size (or shard_size) only makes the 2nd step more costly. Given that the runtime of the first step is linear with the number of matched documents and that the runtime of the 2nd step is O(#unique_values * log(shard_size)), if you only have a limited number of unique values compared to the number of matched documents, doing the paging on client-side would be more efficient. On the other hand, on high-cardinality-fields, your first approach based on an exclude would probably be better.

As a side-note, min_doc_count has no effect on runtime performance when it is greater than or equal to 1. Only min_doc_count=0 is more costly given that it requires Elasticsearch to also fetch terms that are not contained in any match.
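To make that trade-off concrete, here is a toy cost model of the second (top-k selection) phase — the numbers are purely illustrative:

```python
import math

def second_phase_cost(unique_values, shard_size):
    """Rough cost of the shard-level 'pick the top shard_size buckets' step,
    modelled as O(#unique_values * log(shard_size)) per the comment above."""
    return unique_values * math.log2(max(shard_size, 2))

# Low cardinality: fetching all buckets once and paging client-side is cheap.
low_card_all = second_phase_cost(unique_values=1_000, shard_size=1_000)

# High cardinality: requesting all buckets is far more expensive than paging
# with a small shard_size plus an exclude filter.
high_card_all = second_phase_cost(unique_values=10_000_000, shard_size=10_000_000)
high_card_paged = second_phase_cost(unique_values=10_000_000, shard_size=100)
```

The first (counting) phase is linear in matched documents either way, so it drops out of the comparison.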

@haschrisg

@jpountz would storing the results of an aggregation in a new index be feasible? In general, it'd be great to have a way of dealing with both aggregations with high cardinality, and nested aggregations that produce a large number (millions) of results -- even if the cost of that is that they're not sorted properly when paging.

@jpountz
Contributor

jpountz commented Mar 12, 2014

If it makes sense for your use case, this is something that you could consider implementing client-side: run these costly aggregations hourly or daily, store the result in an index, and use that index between two runs to explore the results of the aggregation.
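As a sketch of that pattern, the periodic job might flatten the buckets into documents for a dedicated "materialised aggregation" index that clients then search and paginate cheaply until the next run (the document shape is an assumption for illustration):

```python
def buckets_to_docs(buckets, run_timestamp):
    """Turn terms-agg buckets into plain documents. Indexing these into a
    dedicated index lets clients page through them with ordinary searches
    until the next scheduled run refreshes the data."""
    return [
        {
            "_id": bucket["key"],
            "term": bucket["key"],
            "doc_count": bucket["doc_count"],
            "computed_at": run_timestamp,
        }
        for bucket in buckets
    ]

docs = buckets_to_docs(
    [{"key": "alice", "doc_count": 42}, {"key": "bob", "doc_count": 7}],
    "2014-03-12T00:00:00Z",
)
```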

@apatrida
Contributor

apatrida commented Sep 2, 2014

When sorting by term instead of count, why would paging not be possible? For example, a terms aggregation with a top_hits aggregation could produce an overly large result set without paging on the terms aggregation. Not all aggregations want to sort by count.

loren added a commit to GSA/asis that referenced this issue Sep 15, 2014
- This is a stopgap fix until we figure out what changes we want to make to the API w/r/t the "total" field, as we don't have an inexpensive way of determining the total number of buckets. elastic/elasticsearch#4915
@tugberkugurlu
Contributor

I can see that this may not be possible but for a top_hits aggregation, I really need this functionality. I have the below aggregation query:

POST sport/_search
{
  "size": 0,
  "query": {
    "filtered": {
      "query": {
        "match_all": {}
      },
      "filter": {
        "bool": {
          "must": [
            {
              "range": {
                "defense_strength": {
                  "lte": 83.43
                }
              }
            },
            {
              "range": {
                "forward_strength": {
                  "gte": 91
                }
              }
            }
          ]
        }
      }
    }
  }, 
  "aggs": {
    "top_teams": {
        "terms": {
          "field": "primaryId"
        },
        "aggs": {
          "top_team_hits": {
            "top_hits": {
              "sort": [
                {
                    "forward_strength": {
                        "order": "desc"
                    }
                }
              ],
              "_source": {
                  "include": [
                      "name"
                  ]
              },
              "from": 0,
              "size" : 1
            }
          }
        }
      }
    }
}

This produces the result below for an insanely cheap index (with a low number of docs):

    {
         "took": 2,
         "timed_out": false,
         "_shards": {
                "total": 5,
                "successful": 5,
                "failed": 0
         },
         "hits": {
                "total": 5,
                "max_score": 0,
                "hits": []
         },
         "aggregations": {
                "top_teams": {
                     "buckets": [
                            {
                                 "key": "541afdfc532aec0f305c2c48",
                                 "doc_count": 2,
                                 "top_team_hits": {
                                        "hits": {
                                             "total": 2,
                                             "max_score": null,
                                             "hits": [
                                                    {
                                                         "_index": "sport",
                                                         "_type": "football_team",
                                                         "_id": "y6jZ31xoQMCXaK23rPQgjA",
                                                         "_score": null,
                                                         "_source": {
                                                                "name": "Barcelona"
                                                         },
                                                         "sort": [
                                                                98.32
                                                         ]
                                                    }
                                             ]
                                        }
                                 }
                            },
                            {
                                 "key": "541afe08532aec0f305c5f28",
                                 "doc_count": 2,
                                 "top_team_hits": {
                                        "hits": {
                                             "total": 2,
                                             "max_score": null,
                                             "hits": [
                                                    {
                                                         "_index": "sport",
                                                         "_type": "football_team",
                                                         "_id": "hewWI0ZpTki4OgOeneLn1Q",
                                                         "_score": null,
                                                         "_source": {
                                                                "name": "Arsenal"
                                                         },
                                                         "sort": [
                                                                94.3
                                                         ]
                                                    }
                                             ]
                                        }
                                 }
                            },
                            {
                                 "key": "541afe09532aec0f305c5f2b",
                                 "doc_count": 1,
                                 "top_team_hits": {
                                        "hits": {
                                             "total": 1,
                                             "max_score": null,
                                             "hits": [
                                                    {
                                                         "_index": "sport",
                                                         "_type": "football_team",
                                                         "_id": "x-_YBX5jSba8qsEuB8guTQ",
                                                         "_score": null,
                                                         "_source": {
                                                                "name": "Real Madrid"
                                                         },
                                                         "sort": [
                                                                91.34
                                                         ]
                                                    }
                                             ]
                                        }
                                 }
                            }
                     ]
                }
         }
    }

What I need here is the ability to get the first 2 aggregation buckets in one request and the rest (in this case, only 1) in another request.
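Lacking server-side support, that slicing has to happen client-side over the full bucket list, e.g.:

```python
def page_buckets(buckets, page, per_page):
    """Client-side paging over already-returned aggregation buckets.
    The full bucket list is still computed and transferred every time."""
    start = page * per_page
    return buckets[start:start + per_page]

buckets = [{"key": "barca"}, {"key": "arsenal"}, {"key": "real"}]
first_page = page_buckets(buckets, page=0, per_page=2)   # first 2 buckets
second_page = page_buckets(buckets, page=1, per_page=2)  # the remaining 1
```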

@missingpixel

If paging aggregations is not possible, how do we use ES for online stores where products of different colours are grouped together? Or, what if there are five million authors in the example at: http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/top-hits.html ? Aggregate them and perform pagination in-memory?

If that's not possible, what else can be done in place of grouping in Solr?

Thank you

@adrienbrault

A parameter allowing clients to skip the first X buckets of the response would be nice.

@adrienbrault

@clintongormley Why was this issue closed?

@bobbyhubbard

Reopen please?

@mikelrob

mikelrob commented Jan 2, 2015

+1 for pagination while sorted on term not doc count

@android-programmer

+1

@daniel1028

Can you re-open this please?

I understand that aggregation pagination will create performance issues with larger numbers of records. But it will not affect smaller numbers of records, right?

The performance issue will happen only if we have more records. Why don't we have this support at least for smaller sets of records?

Why do we have to hesitate to add this support because of large amounts of data? If we had it, it would be very helpful for paginating smaller amounts of data.

Maybe we can inform users that this support is efficient only for smaller amounts of data, and that as the amount of data increases, performance will suffer.

@clintongormley

We have been against adding pagination support to the terms (and related) aggregations because it hides the cost of generating the aggregations from the user. Not only that, it can produce incorrect ordering because term based aggregations are approximate.

That said, we support pagination on search requests, which are similarly costly (although accurate).

While some users will definitely shoot themselves in the foot with pagination (eg #4915 (comment)), not supporting pagination does limit some legitimate use cases.

I'll reopen this ticket for further discussion.

@byronvoorbach
Contributor

I would love to see this feature added to ES, but I understand the cost at which it would come.
I'm currently working for a client that needed such a feature, but since it didn't exist yet we solved it by doing 2 queries:

The first query has a terms aggregation on the field on which we want grouping, and orders the aggregation based on the doc score. We set the size of the aggregation to 0, so that we get all buckets for that query.
We then parse the result and get the keys from the buckets corresponding to the given size and offset (e.g. buckets 30-40 for page 3).

We then perform a new query, filtering all results based on the keys from the first query. Alongside the query is a terms aggregation (on the same field as before), and we add a top_hits aggregation to get the results for those (10) buckets.

This way we don't have to load all 40 buckets and get the top_hits for all of them, which increases performance.

Loading all buckets and the top 10 hits per bucket took around 20 seconds for a certain query. With the above change we managed to bring it back to 100ms.

Info:

  • ~60 million records
  • around 1500 buckets for average query
  • around 300 documents per bucket

This might help someone out as a workaround till such a feature exists within Elasticsearch
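In request-body form, the two steps might be sketched as below (Python for illustration; the field names, the max-of-`_score` ordering trick, and the 1.x-era details — `size: 0` meaning "all buckets", the `filtered` query — are assumptions based on the description above):

```python
def step1_all_bucket_keys(query, group_field):
    """Step 1: one cheap pass that returns only bucket keys, ordered by the
    best document score in each bucket (no top_hits yet)."""
    return {
        "size": 0,
        "query": query,
        "aggs": {
            "groups": {
                "terms": {
                    "field": group_field,
                    "size": 0,  # ES 1.x: size 0 meant "return all buckets"
                    "order": {"top_score": "desc"},
                },
                "aggs": {"top_score": {"max": {"script": "_score"}}},
            }
        },
    }

def page_of_keys(buckets, offset, per_page):
    """Pick the bucket keys for the requested page, e.g. 30-40 for page 3."""
    return [b["key"] for b in buckets[offset:offset + per_page]]

def step2_top_hits_for_page(query, group_field, keys):
    """Step 2: re-run the query restricted to this page's keys and fetch
    the top_hits for just those buckets."""
    return {
        "size": 0,
        "query": {
            "filtered": {
                "query": query,
                "filter": {"terms": {group_field: keys}},
            }
        },
        "aggs": {
            "groups": {
                "terms": {"field": group_field, "size": len(keys)},
                "aggs": {"hits": {"top_hits": {"size": 10}}},
            }
        },
    }
```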

@davidvgalbraith

Hey! I too would like paging for aggregations. That's all.

@bauneroni

I'd also love to see this someday, but I do understand the burden (haven't used that word in a long time) and the costs of implementing it. This feature would be quite handy for my client's application, which operates on ~250GB+ of data.

Well, yeah.. what he^ said 👍

@vinusebastian

@aaneja with respect to "Terms aggregation does not support a way to page through the buckets returned. To work around this, I've been trying to set 'min_doc_count' to limit the buckets returned and using an 'exclude' filter, to exclude already 'seen' pages. Will this result in better running time performance on the ES cluster as compared to getting all the buckets and then doing my own paging logic client side?"

How did you exclude already-seen pages? Or how did you keep track of seen pages? Also, what did you learn about performance issues with such an approach?

@dakrone
Member

dakrone commented Apr 10, 2015

We discussed this and one potential idea is to add the ability to specify a start_term for aggregations, that would allow the aggregation to skip all of the preceding terms, then the client could implement the paging by retrieving the first page of aggregations, then sending the same request with the start_term being the last term of the previous results. Otherwise the aggregation will still incur the overhead of computing the results and sub-aggregations for each of the "skipped" buckets.
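A client-side simulation of the proposed (and, to be clear, hypothetical — never implemented) start_term parameter over term-sorted buckets:

```python
def page_after(sorted_terms, start_term=None, size=10):
    """Return the next page of terms, skipping everything up to and
    including start_term (terms assumed sorted ascending)."""
    if start_term is not None:
        sorted_terms = [t for t in sorted_terms if t > start_term]
    return sorted_terms[:size]

terms = ["ant", "bee", "cat", "dog", "eel"]
page1 = page_after(terms, size=2)                        # ['ant', 'bee']
page2 = page_after(terms, start_term=page1[-1], size=2)  # ['cat', 'dog']
```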

To better understand this, it would be extremely useful to get more use-cases out of why people need this and how they would use it, so please do add those to this ticket.

@2e3s

2e3s commented Apr 10, 2015

+1 for that. There may be tens of thousands of unique terms by which we group and gather statistics via sub-aggregations. The result can be sorted by any of these sub-aggregations, so it's going to be costly anyway, but its current speed and precision in ES are more than bearable. If we didn't have to send such big JSON payloads between servers and hold them in PHP (which doesn't handle this well at all for now), it would be fine. I've even thought of writing a plugin to do this simple job, but it would still require computing the sorting sub-aggregation.

@a0s

a0s commented Apr 29, 2015

+1

2 similar comments
@benneq

benneq commented May 6, 2015

+1

@pauleil

pauleil commented May 6, 2015

+1

@sadokmtir

I want to get the second 10 buckets from an aggregation. How can I do that?
Is pagination supported yet?

@felixbarny
Member

Nope. You have to request 20 buckets :(

@sadokmtir

sadokmtir commented Sep 5, 2016

But this is only a simple example to keep things clear. In reality it would be more like 1000 buckets per page. So, as I understand it, it is not possible to do that through aggregations in Elasticsearch?

@felixbarny
Member

That's right. It's currently not possible. That's what this issue is about. But it seems to be very hard to implement because the operation may be distributed across many nodes/shards.

@felixbarny
Member

Because this feature is by far the most requested one, I feel like Elastic should care more about it.

Maybe write a blog post which explains why it is difficult to implement and what the alternatives are. Additionally, it would be great if they posted one update per month here about the state of the internal discussions and the progress made towards a solution of this issue.

Ignoring this issue or closing it as "won't fix" would upset a lot of folks.

(no offense towards Elastic here)


P.S.: please use the 👍 feature of GitHub instead of commenting with +1 (otherwise every participant receives a useless notification).

@bpolaszek

Hi everyone,

I'm trying to switch a BI application from Solr 5.4 to ElasticSearch 2.3.5, and this was a feature I was using in Solr - https://cwiki.apache.org/confluence/display/solr/Result+Grouping

It would be great if Elastic implements it :)

@azngeek

azngeek commented Oct 26, 2016

+1

@bioform

bioform commented Oct 27, 2016

I have a simple use case.
We have:

  1. a list of Models
  2. each Model has a "version" (integer)
  3. each Model has a GUID
  4. Models with the same GUID are "different versions of the same model", so Models with the same GUID have different "version" field values
  5. we use ES to search models by text query (we search through all Model versions), and we need to show only the highest Model version in the search results
  6. we need to paginate the above results

I don't see any way to solve this task without aggregation pagination
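For what it's worth, the "highest version per GUID" part maps naturally onto a terms agg with a single top_hits sorted by version descending — the missing piece is paginating the GUID buckets. A sketch (field names taken from the comment, the rest assumed):

```python
def latest_version_query(text_query):
    """Group matching docs by GUID and keep only the highest-version doc
    per group. Paging over the GUID buckets is what this issue asks for."""
    return {
        "size": 0,
        "query": {"match": {"text": text_query}},
        "aggs": {
            "by_guid": {
                "terms": {"field": "guid"},
                "aggs": {
                    "latest": {
                        "top_hits": {
                            "size": 1,
                            "sort": [{"version": {"order": "desc"}}],
                        }
                    }
                },
            }
        },
    }
```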

@cqeescalona

+1

@sourcec0de

I am performing a pipeline aggregation over a large number of buckets. All I need is the bucket average, not the buckets themselves. It would save significant bandwidth if there were at the very least an option to return only the result of the pipeline agg, not the buckets.

@prasadKodeInCloud

+1

5 similar comments
@jc-mendez

+1

@azgard121

+1

@dmitriytitov

+1

@makeyang
Contributor

makeyang commented Nov 4, 2016

+1

@itachi04199

+1

@jvkumar

jvkumar commented Nov 16, 2016

+1

@jvkumar

jvkumar commented Nov 16, 2016

@jpountz any idea if this feature is in the future roadmap or this will never happen? What is the exact status of this issue?

@markharwood
Contributor

There isn't a single "this feature" being discussed here so it's worth grouping the different problems that I think have been articulated here:

Paging terms sorted by derived values

Several requests ( #4915 (comment), #4915 (comment) and #4915 (comment) ) seem to require doc_count or a sub-agg to drive the sort order, and as has been pointed out, this is hard/impossible to implement efficiently in a distributed system (but see the "Exhaustive Analysis" section below).

Paging terms sorted by term value

This is potentially achievable with an "after term X" parameter. We don't have any work in progress on this at present. Requested in #4915 (comment) and #4915 (comment)

Search result collapsing

Many requests are not about paging aggregations per se, but about using the terms agg and top_hits to group documents in search results in a way that limits the number of docs returned under any one group. Requested in #4915 (comment), #4915 (comment), #4915 (comment), #4915 (comment), #4915 (comment), #4915 (comment) and #4915 (comment)

Before we even consider pagination or distributed systems, this is a tricky requirement even on a single shard, one which recently saw work in Lucene. The approach that change uses is best explained with an analogy: if I were making a compilation album of 1967's top hit records:

  1. A vanilla query's results might look like a "Best of the Beatles" album - no diversity
  2. A "grouping query" would produce "The 10 top-selling artists of 1967 - some killer and quite a lot of filler"
  3. A "diversified" query would be the top 20 hit records of that year - with a max of 3 Beatles hits to maintain diversity

When people use a terms agg and top_hits together they are trying to implement option 2) which is not great due to the "filler". Option 3) is implemented using the diversified sampler agg with a top_hits agg.

The bad news with diversified queries is that even on a single shard system we cannot implement paging sensibly (if the top-20 hits are page one of search results on what page would it make sense to introduce some more of the high-quality Beatles hits and relax the diversity constraint?). There's no good single answer to that question. There's also the issue of "backfilling" too when there's no diversity to be had in the data. When we also throw in the distributed aspects to this problem too there's little hope for a good solution.

Exhaustive Analysis

Some comments suggest some users simply want to analyse a high-cardinality field and doing so in one agg request today requires too much memory. In this case pagination isn't necessarily a requirement to serve pages to end users in the right order - it's merely a way of breaking up a big piece of analysis into manageable bite-sized requests. Imagine trying to run through all user-accounts sorting their logged events by last-access date to see which accounts should be marked as dormant. The top-level term (account_id) is not important to the overall order but we do want to sort the child aggs by a term (access_date).
I certainly ran into this issue myself, and the good news is I think we have an approach to tackle it; a PR is in progress.
Basically, terms are divided into unsorted partitions, but within each partition the client can run an agg to sort top-level terms by any child aggs etc. in a way that filters most of the garbage out. The client would have to do the work of gluing top results from each partitioned request together to get a final result, but this is perhaps the most achievable step forward in the short term.
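That partition-based approach (which became #21487) can be sketched as a client loop: request each partition separately, then glue the per-partition winners together. The `include.partition`/`num_partitions` syntax matches what eventually shipped, but treat the details as version-dependent:

```python
def partition_request(field, partition, num_partitions, size=100):
    """Ask for one unsorted partition of the term space; within the
    partition, buckets can still be sorted by any child agg."""
    return {
        "size": 0,
        "aggs": {
            "accounts": {
                "terms": {
                    "field": field,
                    "include": {"partition": partition, "num_partitions": num_partitions},
                    "size": size,
                }
            }
        },
    }

def merge_partitions(per_partition_buckets, top_n):
    """Client-side glue: combine each partition's buckets into one final top-N."""
    merged = [b for buckets in per_partition_buckets for b in buckets]
    return sorted(merged, key=lambda b: b["doc_count"], reverse=True)[:top_n]

final = merge_partitions(
    [[{"key": "a", "doc_count": 5}], [{"key": "b", "doc_count": 9}]],
    top_n=1,
)
```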

@harish0688

+1

@markharwood
Contributor

Closing in favour of #21487, which provides a way to break terms aggregations into an arbitrary number of partitions which clients can page through.

I recognise this will not solve all of the different use cases raised on this ticket, which I tried to summarise in my last comment. Because of the diverse concerns listed here, any remaining requirements for which #21487 is not a solution should ideally be broken out into separate new issues.

For those looking for the "Paging terms sorted by term value" use case I would ask is the sort order important or was it just a way of subdividing a big request into manageable chunks? If the latter then #21487 should work fine for you. If there is a genuine need for the former then please open another issue where we can focus in on solving that particular problem.

For the "Search result collapsing" use case I outlined, the conclusion is that there is no single diversity policy that meets all needs, and that even if there were one, implementing it would be prohibitively complex/slow/resource-intensive. Again, we can open another issue to debate that specific requirement, but I'm pessimistic about the chances of reaching a satisfactory conclusion.

Thanks for all your comments :)

@paullovessearch

paullovessearch commented Nov 24, 2016

Added #21785 to capture the search results collapsing use case.

@stopsopa

+1

@elastic elastic locked and limited conversation to collaborators Dec 16, 2016