Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Terms aggs Partitions doesn't work as expected #23740

Closed
gmoskovicz opened this issue Mar 24, 2017 · 17 comments
Closed

Terms aggs Partitions doesn't work as expected #23740

gmoskovicz opened this issue Mar 24, 2017 · 17 comments
Labels
>docs General docs changes help wanted adoptme

Comments

@gmoskovicz
Copy link
Contributor

Elasticsearch version: 5.2

Plugins installed: []

JVM version: JVM 1.8

OS version: ANY

Steps to repro:

[1] Add some sample data


POST test/test
{
  "recip_summary" : 1,
  "amount": 50
}

POST test/test
{
  "recip_summary" : 2,
  "amount": 100
}

POST test/test
{
  "recip_summary" : 3,
  "amount": 200
}

[2] Create an aggregation with 3 partitions and size:1

POST test/_search
{
  "size": 0,
  "aggs": {
    "recipients": {
      "terms": {
        "field": "recip_summary",
        "size": 1,
        "include": {
            "partition": 0,
            "num_partitions": 3
         }
      }
    }
  }
}

[3] Verify that for partition 0 you don't get any results

{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 3,
    "max_score": 0,
    "hits": []
  },
  "aggregations": {
    "recipients": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": []
    }
  }
}

[4] But in partitions 1 and 2 you get results

So basically, shouldn't this return 1 document per partition, having 3 documents and 3 partitions?

@javanna javanna removed the >bug label Mar 24, 2017
@javanna
Copy link
Member

javanna commented Mar 24, 2017

This is not a bug. The important bit is that every document belongs to one and only one partition. With low number of terms, it is possible that a partition is empty, or that partitions don't hold exactly the same number of values. The terms partitioning is very similar to the way we shard documents. We get a hash of the field value and perform a modulo N where N is the total number of partitions. With high enough cardinality, partitions will be evenly distributed. If you play around with the field values you will see documents moving from one partition to another etc.

@javanna javanna closed this as completed Mar 24, 2017
@gmoskovicz
Copy link
Contributor Author

@javanna so we can say the partitions are not reliable? Same thing with sorting, if we cannot ensure that partitions don't contain the same amount of documents, then we cannot ensure that the correct sorting is being applied, correct?

Also, if this is the case, should we mention in the documentation that this is not a correct way of paginating, but just a way of retrieving certain values? I don't see this being reliable in terms of results. Looks similar than the cardinality aggregation, where we have an error %. Maybe this is the same case.

@gmoskovicz
Copy link
Contributor Author

BTW: i added 3/4/5/6 partitions and i randomly get results in different partitions with no specific logic.

@markharwood
Copy link
Contributor

markharwood commented Mar 24, 2017

The behaviour you are seeing is with only 3 docs.
The hashing gives a random distribution.
Imagine you roll a dice only 6 times. You don't necessarily expect to end up with the values 1,2,3,4,5 and 6. You wouldn't raise a bug for that. However if you roll the dice a billion times you'd expect to get roughly even counts for all those numbers - and if you don't, then we'd have an issue.

In this agg each unique term is a roll of the dice where you get to pick how many numbers are on the dice and which one you want to filter on. The only difference with the dice analogy is that for the same term you will always get the same result which is why different shards can both be guaranteed to examine the same set of terms given the choice of partition number.

@gmoskovicz
Copy link
Contributor Author

@markharwood @javanna thank you very much for the explanation! 😃

Any reason why we don't choose the right configuration by default then, so we have everything evenly distributed across partitions?

@markharwood
Copy link
Contributor

Unlike shards-for-docs the partitions-for-terms are not fixed at index time. Terms are partitioned into a number of partitions chosen at search-time. On an index with a million docs you might need only 3 partitions but later on if you have a billion docs you may decide you need to query using 10 partitions. It's not a fixed choice. It's this sort of filtering logic:

int numRequestedPartitionsByUser = 6;
int usersChosenPartition = 3;

for all terms on shard:
	if term.hashCode() % numRequestedPartitionsByUser == usersChosenPartition
		// 1 in 6 terms will land in this bucket..
		accept term
	else
		reject term

@javanna
Copy link
Member

javanna commented Mar 24, 2017

As for the pagination comparison, partitioning is a good alternative that addresses the same problem (many terms that can't be returned in one go, and pagination is not supported), but it's a trade-off, it is not exactly the same as pagination. I read up our docs on this and I don't see it comparing partitioning with pagination. We may want to add a note about partitions not necessarily being evenly distributed with low cardinality fields.

@gmoskovicz
Copy link
Contributor Author

@javanna agree that we don't mention this, but given the context where this was created, and the issue that closed (the one about pagination). I believe that it's a good thing to add a warning saying that it's an alternative to get fewer results, but it's not "Paginating" aggregations?

@markharwood
Copy link
Contributor

it's not "Paginating" aggregations?

Where are we making this claim? The only documentation that I'm aware of is here and that does not make the claim.

but given the context where this was created, and the issue that closed (the one about pagination)

If you mean issue 4915 I don't think we need to worry about managing expectations following its closure - it was was closed for a lack of focus and we spawned new targeted issues which delivered partitioned aggs and field collapsing of search results.

I realise we left some of 4915's wishes behind as impractical along the way. Even putting aside the performance concerns, if you just think about how paging would be expected to work from an end-user perspective when what we have is a tree with many branches rather than a simple list then you appreciate "paging aggregations" was a vague notion rather than a realistic goal.

@gmoskovicz
Copy link
Contributor Author

@markharwood

Sorry if i wasn't clear, what i meant is just to clarify this. From the history many people will look at this as a way to partitioning data to do some sort of pagination, or return different result per run. But there is some clear difference around this.

I believe that it's a good thing to add a warning saying that it's an alternative to get fewer results, but it's not "Paginating" aggregations?

I am just saying that we could add something to the docs, better explanation on how this works, similar on what we have around how terms aggregations work. Not claiming that this should be pagination, or that this meant to be it.

Still, IMO it could be very nice to have clear documentation around how this should be expected to work, the tradeoffs, and why not to expect this to be pagination.

@markharwood
Copy link
Contributor

Still, IMO it could be very nice to have clear documentation around how this should be expected to work, the tradeoffs, and why not to expect this to be pagination.

Agreed. That sounds like it could be a section in a revision of "the definitive guide..." rather than the API reference docs.

@javanna javanna added >docs General docs changes help wanted adoptme labels Apr 13, 2017
@javanna
Copy link
Member

javanna commented Apr 13, 2017

I am reopening this as we could improve our docs and add some more explanation about how partitioning works.

@javanna javanna reopened this Apr 13, 2017
@vadim82
Copy link

vadim82 commented Jun 29, 2017

So given that partitioning is not the right approach to paging aggregated data... is there any solution out there that can be used to do this? We have situations where we need to do a term aggregation and we need to show the user sorted paged results. Are there any issues in the pipeline to address this concern? I imagine it must be a common scenario that people want to do with ElasticSearch.

The comment about documentation was right on, because after reading about partitioning, that is exactly what we tried to use it for.

@markharwood
Copy link
Contributor

Are there any issues in the pipeline to address this concern?

No. You may find field collapsing provides a solution to some of your problems but lets not hijack this github issue on partitioning with discussion of your paging use case - that would be best tackled on discuss forums

@javanna
Copy link
Member

javanna commented Sep 25, 2017

I looked again at our docs but honestly I didn't find a way to improve them as for explaining partitioning of terms aggs. I am closing this issue then. Feel free to open a PR if you have ideas on how to clarify this docs section.

@javanna javanna closed this as completed Sep 25, 2017
@repakaravikanth
Copy link

Hi,

I am trying to use partitions for my pagination support. As I mentioned previously here, the terms can be in any partition depend on filed hash value. But, if we traverse all partitions can we get all terms without duplicates? What happens if there is a client side delay(few secs to mins) in between retrieving partitions results? Is it guaranteed to get the next partition results are unique and non-duplicate to previous partition result assuming there is no change in requested partitions count?

Thanks.

@markharwood
Copy link
Contributor

I am trying to use partitions for my pagination support.

Probably the wrong place to ask for help. This is intended as a place for reporting issues, not a discussion forum. I have responded to the identical question you raised in the discuss forum.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
>docs General docs changes help wanted adoptme
Projects
None yet
Development

No branches or pull requests

5 participants