Terms aggs Partitions doesn't work as expected #23740

gmoskovicz · 2017-03-24T15:13:00Z

Elasticsearch version: 5.2

Plugins installed: []

JVM version: JVM 1.8

OS version: ANY

Steps to repro:

[1] Add some sample data


POST test/test
{
  "recip_summary" : 1,
  "amount": 50
}

POST test/test
{
  "recip_summary" : 2,
  "amount": 100
}

POST test/test
{
  "recip_summary" : 3,
  "amount": 200
}

[2] Create an aggregation with 3 partitions and size:1

POST test/_search
{
  "size": 0,
  "aggs": {
    "recipients": {
      "terms": {
        "field": "recip_summary",
        "size": 1,
        "include": {
            "partition": 0,
            "num_partitions": 3
         }
      }
    }
  }
}

[3] Verify that for partition 0 you don't get any results

{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 3,
    "max_score": 0,
    "hits": []
  },
  "aggregations": {
    "recipients": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": []
    }
  }
}

[4] But in partitions 1 and 2 you get results

So basically, shouldn't this return 1 document per partition, having 3 documents and 3 partitions?

The text was updated successfully, but these errors were encountered:

javanna · 2017-03-24T16:55:43Z

This is not a bug. The important bit is that every document belongs to one and only one partition. With low number of terms, it is possible that a partition is empty, or that partitions don't hold exactly the same number of values. The terms partitioning is very similar to the way we shard documents. We get a hash of the field value and perform a modulo N where N is the total number of partitions. With high enough cardinality, partitions will be evenly distributed. If you play around with the field values you will see documents moving from one partition to another etc.

gmoskovicz · 2017-03-24T17:01:50Z

@javanna so we can say the partitions are not reliable? Same thing with sorting, if we cannot ensure that partitions don't contain the same amount of documents, then we cannot ensure that the correct sorting is being applied, correct?

Also, if this is the case, should we mention in the documentation that this is not a correct way of paginating, but just a way of retrieving certain values? I don't see this being reliable in terms of results. Looks similar than the cardinality aggregation, where we have an error %. Maybe this is the same case.

gmoskovicz · 2017-03-24T17:02:27Z

BTW: i added 3/4/5/6 partitions and i randomly get results in different partitions with no specific logic.

markharwood · 2017-03-24T17:08:36Z

The behaviour you are seeing is with only 3 docs.
The hashing gives a random distribution.
Imagine you roll a dice only 6 times. You don't necessarily expect to end up with the values 1,2,3,4,5 and 6. You wouldn't raise a bug for that. However if you roll the dice a billion times you'd expect to get roughly even counts for all those numbers - and if you don't, then we'd have an issue.

In this agg each unique term is a roll of the dice where you get to pick how many numbers are on the dice and which one you want to filter on. The only difference with the dice analogy is that for the same term you will always get the same result which is why different shards can both be guaranteed to examine the same set of terms given the choice of partition number.

gmoskovicz · 2017-03-24T17:25:47Z

@markharwood @javanna thank you very much for the explanation! 😃

Any reason why we don't choose the right configuration by default then, so we have everything evenly distributed across partitions?

markharwood · 2017-03-24T17:52:34Z

Unlike shards-for-docs the partitions-for-terms are not fixed at index time. Terms are partitioned into a number of partitions chosen at search-time. On an index with a million docs you might need only 3 partitions but later on if you have a billion docs you may decide you need to query using 10 partitions. It's not a fixed choice. It's this sort of filtering logic:

int numRequestedPartitionsByUser = 6;
int usersChosenPartition = 3;

for all terms on shard:
	if term.hashCode() % numRequestedPartitionsByUser == usersChosenPartition
		// 1 in 6 terms will land in this bucket..
		accept term
	else
		reject term

javanna · 2017-03-24T17:54:37Z

As for the pagination comparison, partitioning is a good alternative that addresses the same problem (many terms that can't be returned in one go, and pagination is not supported), but it's a trade-off, it is not exactly the same as pagination. I read up our docs on this and I don't see it comparing partitioning with pagination. We may want to add a note about partitions not necessarily being evenly distributed with low cardinality fields.

gmoskovicz · 2017-03-24T17:57:02Z

@javanna agree that we don't mention this, but given the context where this was created, and the issue that closed (the one about pagination). I believe that it's a good thing to add a warning saying that it's an alternative to get fewer results, but it's not "Paginating" aggregations?

markharwood · 2017-03-24T18:29:00Z

it's not "Paginating" aggregations?

Where are we making this claim? The only documentation that I'm aware of is here and that does not make the claim.

but given the context where this was created, and the issue that closed (the one about pagination)

If you mean issue 4915 I don't think we need to worry about managing expectations following its closure - it was was closed for a lack of focus and we spawned new targeted issues which delivered partitioned aggs and field collapsing of search results.

I realise we left some of 4915's wishes behind as impractical along the way. Even putting aside the performance concerns, if you just think about how paging would be expected to work from an end-user perspective when what we have is a tree with many branches rather than a simple list then you appreciate "paging aggregations" was a vague notion rather than a realistic goal.

gmoskovicz · 2017-03-24T18:38:35Z

@markharwood

Sorry if i wasn't clear, what i meant is just to clarify this. From the history many people will look at this as a way to partitioning data to do some sort of pagination, or return different result per run. But there is some clear difference around this.

I believe that it's a good thing to add a warning saying that it's an alternative to get fewer results, but it's not "Paginating" aggregations?

I am just saying that we could add something to the docs, better explanation on how this works, similar on what we have around how terms aggregations work. Not claiming that this should be pagination, or that this meant to be it.

Still, IMO it could be very nice to have clear documentation around how this should be expected to work, the tradeoffs, and why not to expect this to be pagination.

markharwood · 2017-03-27T07:27:24Z

Still, IMO it could be very nice to have clear documentation around how this should be expected to work, the tradeoffs, and why not to expect this to be pagination.

Agreed. That sounds like it could be a section in a revision of "the definitive guide..." rather than the API reference docs.

javanna · 2017-04-13T12:58:49Z

I am reopening this as we could improve our docs and add some more explanation about how partitioning works.

vadim82 · 2017-06-29T17:14:28Z

So given that partitioning is not the right approach to paging aggregated data... is there any solution out there that can be used to do this? We have situations where we need to do a term aggregation and we need to show the user sorted paged results. Are there any issues in the pipeline to address this concern? I imagine it must be a common scenario that people want to do with ElasticSearch.

The comment about documentation was right on, because after reading about partitioning, that is exactly what we tried to use it for.

markharwood · 2017-06-30T09:40:46Z

Are there any issues in the pipeline to address this concern?

No. You may find field collapsing provides a solution to some of your problems but lets not hijack this github issue on partitioning with discussion of your paging use case - that would be best tackled on discuss forums

javanna · 2017-09-25T09:19:25Z

I looked again at our docs but honestly I didn't find a way to improve them as for explaining partitioning of terms aggs. I am closing this issue then. Feel free to open a PR if you have ideas on how to clarify this docs section.

repakaravikanth · 2017-12-11T04:40:48Z

Hi,

I am trying to use partitions for my pagination support. As I mentioned previously here, the terms can be in any partition depend on filed hash value. But, if we traverse all partitions can we get all terms without duplicates? What happens if there is a client side delay(few secs to mins) in between retrieving partitions results? Is it guaranteed to get the next partition results are unique and non-duplicate to previous partition result assuming there is no change in requested partitions count?

Thanks.

markharwood · 2017-12-11T10:28:48Z

I am trying to use partitions for my pagination support.

Probably the wrong place to ask for help. This is intended as a place for reporting issues, not a discussion forum. I have responded to the identical question you raised in the discuss forum.

gmoskovicz added the >bug label Mar 24, 2017

javanna removed the >bug label Mar 24, 2017

javanna closed this as completed Mar 24, 2017

javanna added >docs General docs changes help wanted adoptme labels Apr 13, 2017

javanna reopened this Apr 13, 2017

javanna closed this as completed Sep 25, 2017

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Terms aggs Partitions doesn't work as expected #23740

Terms aggs Partitions doesn't work as expected #23740

gmoskovicz commented Mar 24, 2017

javanna commented Mar 24, 2017

gmoskovicz commented Mar 24, 2017

gmoskovicz commented Mar 24, 2017

markharwood commented Mar 24, 2017 •

edited

Loading

gmoskovicz commented Mar 24, 2017

markharwood commented Mar 24, 2017

javanna commented Mar 24, 2017

gmoskovicz commented Mar 24, 2017

markharwood commented Mar 24, 2017

gmoskovicz commented Mar 24, 2017

markharwood commented Mar 27, 2017

javanna commented Apr 13, 2017

vadim82 commented Jun 29, 2017

markharwood commented Jun 30, 2017

javanna commented Sep 25, 2017

repakaravikanth commented Dec 11, 2017

markharwood commented Dec 11, 2017

Terms aggs Partitions doesn't work as expected #23740

Terms aggs Partitions doesn't work as expected #23740

Comments

gmoskovicz commented Mar 24, 2017

javanna commented Mar 24, 2017

gmoskovicz commented Mar 24, 2017

gmoskovicz commented Mar 24, 2017

markharwood commented Mar 24, 2017 • edited Loading

gmoskovicz commented Mar 24, 2017

markharwood commented Mar 24, 2017

javanna commented Mar 24, 2017

gmoskovicz commented Mar 24, 2017

markharwood commented Mar 24, 2017

gmoskovicz commented Mar 24, 2017

markharwood commented Mar 27, 2017

javanna commented Apr 13, 2017

vadim82 commented Jun 29, 2017

markharwood commented Jun 30, 2017

javanna commented Sep 25, 2017

repakaravikanth commented Dec 11, 2017

markharwood commented Dec 11, 2017

markharwood commented Mar 24, 2017 •

edited

Loading