Terms aggs Partitions doesn't work as expected #23740
Comments
This is not a bug. The important bit is that every document belongs to one and only one partition. With a low number of terms, it is possible that a partition is empty, or that partitions don't hold exactly the same number of values. Terms partitioning is very similar to the way we shard documents: we take a hash of the field value and compute it modulo N, where N is the total number of partitions. With high enough cardinality, terms will be evenly distributed across partitions. If you play around with the field values you will see documents moving from one partition to another. |
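A minimal sketch of the hash-modulo-N assignment described above, in Python. The MD5-based hash, the helper names, and the sample terms are assumptions for illustration only; the hash Elasticsearch uses internally may differ. It shows that each term lands in exactly one partition and that only large term sets spread out roughly evenly.

```python
import hashlib
from collections import Counter

def partition_of(term, num_partitions):
    """Assign a term to a partition via hash modulo N (MD5 is a stand-in hash)."""
    digest = hashlib.md5(term.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions

def partition_sizes(terms, num_partitions):
    """Count how many distinct terms land in each partition."""
    counts = Counter(partition_of(t, num_partitions) for t in set(terms))
    return [counts.get(p, 0) for p in range(num_partitions)]

few_terms = ["alpha", "beta", "gamma"]              # hypothetical low-cardinality field
many_terms = [f"term-{i}" for i in range(30_000)]   # hypothetical high-cardinality field

print(partition_sizes(few_terms, 3))   # no guarantee of one term per partition
print(partition_sizes(many_terms, 3))  # roughly even
```

With three terms and three partitions nothing forces a one-to-one spread, which matches the empty partition in the report.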
@javanna so can we say the partitions are not reliable? The same goes for sorting: if we cannot ensure that partitions contain the same number of documents, then we cannot ensure that the correct sorting is being applied, correct? Also, if this is the case, should we mention in the documentation that this is not a correct way of paginating, but just a way of retrieving certain values? I don't see this being reliable in terms of results. It looks similar to the cardinality aggregation, where we have an error percentage. Maybe this is the same case. |
BTW: I tried 3, 4, 5, and 6 partitions, and the results land in different partitions with no apparent logic. |
The behaviour you are seeing is with only 3 docs. In this agg each unique term is a roll of the dice, where you get to pick how many numbers are on the dice and which one you want to filter on. The only difference from the dice analogy is that the same term always gets the same result, which is why different shards are guaranteed to examine the same set of terms for a given choice of partition number. |
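To illustrate the point about shards, here is a small Python sketch. The two shard contents, the MD5 stand-in hash, and the function names are hypothetical. Because the partition depends only on the term itself, a term held by both shards is selected by both or by neither for any requested partition.

```python
import hashlib

def partition_of(term, num_partitions):
    """Deterministic partition for a term (MD5 is a stand-in hash)."""
    digest = hashlib.md5(term.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions

def shard_terms_in_partition(shard_terms, partition, num_partitions):
    """Terms a single shard would consider for the requested partition."""
    return {t for t in shard_terms if partition_of(t, num_partitions) == partition}

# Two hypothetical shards holding overlapping sets of terms.
shard_a = {"red", "green", "blue", "cyan"}
shard_b = {"green", "blue", "magenta", "yellow"}
shared = shard_a & shard_b

for partition in range(3):
    from_a = shard_terms_in_partition(shard_a, partition, 3)
    from_b = shard_terms_in_partition(shard_b, partition, 3)
    # A term present on both shards is picked by both or by neither.
    assert from_a & shared == from_b & shared
```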
@markharwood @javanna thank you very much for the explanation! 😃 Any reason why we don't choose the right configuration by default then, so we have everything evenly distributed across partitions? |
Unlike shards-for-docs the partitions-for-terms are not fixed at index time. Terms are partitioned into a number of partitions chosen at search-time. On an index with a million docs you might need only 3 partitions but later on if you have a billion docs you may decide you need to query using 10 partitions. It's not a fixed choice. It's this sort of filtering logic:
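As a rough Python sketch of that kind of filter (not the snippet originally posted; the MD5 stand-in hash, the function names, and the `user-N` terms are assumptions), a term is kept only when its hash modulo the requested partition count equals the requested partition. The partition count is an argument of each request, so it can change from 3 to 10 without reindexing.

```python
import hashlib

def in_partition(term, partition, num_partitions):
    """True when the term's hash modulo num_partitions equals the requested partition."""
    digest = hashlib.md5(term.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions == partition

def terms_for(all_terms, partition, num_partitions):
    """Terms that pass the filter for one partition of a search-time partitioning."""
    return sorted(t for t in all_terms if in_partition(t, partition, num_partitions))

terms = [f"user-{i}" for i in range(1000)]  # hypothetical distinct terms

# The partition count is a per-request choice: 3 today, 10 later.
for n in (3, 10):
    pages = [terms_for(terms, p, n) for p in range(n)]
    flat = [t for page in pages for t in page]
    assert sorted(flat) == sorted(terms)  # every term appears exactly once
```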
|
As for the pagination comparison, partitioning is a good alternative that addresses the same problem (many terms that can't be returned in one go, and pagination is not supported), but it is a trade-off and not exactly the same as pagination. I reread our docs on this and I don't see them comparing partitioning with pagination. We may want to add a note about partitions not necessarily being evenly distributed with low-cardinality fields. |
@javanna agreed that we don't mention this. But given the context in which this was created, and the issue that was closed (the one about pagination), I believe it would be good to add a warning saying that it's an alternative for getting fewer results, but that it's not "paginating" aggregations. |
Where are we making this claim? The only documentation that I'm aware of is here and that does not make the claim.
If you mean issue 4915, I don't think we need to worry about managing expectations following its closure. It was closed for a lack of focus, and we spawned new targeted issues which delivered partitioned aggs and field collapsing of search results. I realise we left some of 4915's wishes behind as impractical along the way. Even putting aside the performance concerns, if you think about how paging would be expected to work from an end-user perspective when what we have is a tree with many branches rather than a simple list, then you appreciate that "paging aggregations" was a vague notion rather than a realistic goal. |
Sorry if I wasn't clear; what I meant is just to clarify this. Given the history, many people will look at this as a way of partitioning data to do some sort of pagination, or to return a different result per run. But there are some clear differences.
I am just saying that we could add something to the docs: a better explanation of how this works, similar to what we have on how terms aggregations work. I am not claiming that this should be pagination, or that it was meant to be. Still, IMO it would be very nice to have clear documentation about how this is expected to work, the trade-offs, and why not to expect it to be pagination. |
Agreed. That sounds like it could be a section in a revision of "the definitive guide..." rather than the API reference docs. |
I am reopening this as we could improve our docs and add some more explanation about how partitioning works. |
So given that partitioning is not the right approach to paging aggregated data... is there any solution out there that can be used to do this? We have situations where we need to do a term aggregation and we need to show the user sorted paged results. Are there any issues in the pipeline to address this concern? I imagine it must be a common scenario that people want to do with ElasticSearch. The comment about documentation was right on, because after reading about partitioning, that is exactly what we tried to use it for. |
No. You may find field collapsing provides a solution to some of your problems, but let's not hijack this GitHub issue on partitioning with discussion of your paging use case. That would be best tackled on the discuss forums. |
I looked again at our docs but honestly I didn't find a way to improve how they explain partitioning of terms aggs, so I am closing this issue. Feel free to open a PR if you have ideas on how to clarify this section of the docs. |
Hi, I am trying to use partitions for pagination support. As mentioned previously here, a term can land in any partition depending on the field's hash value. But if we traverse all partitions, do we get all terms without duplicates? What happens if there is a client-side delay (a few seconds to minutes) between retrieving partition results? Is the next partition's result guaranteed to be unique, with no duplicates of the previous partition's result, assuming the requested partition count does not change? Thanks. |
Probably the wrong place to ask for help. This is intended as a place for reporting issues, not a discussion forum. I have responded to the identical question you raised in the discuss forum. |
Elasticsearch version: 5.2
Plugins installed: []
JVM version: JVM 1.8
OS version: ANY
Steps to repro:
[1] Add some sample data
[2] Create an aggregation with 3 partitions and size:1
[3] Verify that for partition 0 you don't get any results
[4] But in partitions 1 and 2 you get results
So basically, shouldn't this return 1 document per partition, having 3 documents and 3 partitions?
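For reference, the requests in step [2] would look roughly like the bodies built below. This is a sketch, not the reporter's actual request: the field name `my_field` and the aggregation name `by_term` are hypothetical, and the bodies are plain Python dicts mirroring the `include` / `partition` / `num_partitions` options of the terms aggregation.

```python
import json

def partitioned_terms_request(field, partition, num_partitions, size):
    """Build a search body asking for one partition of a terms aggregation."""
    return {
        "size": 0,  # no hits, aggregation results only
        "aggs": {
            "by_term": {  # hypothetical aggregation name
                "terms": {
                    "field": field,
                    "include": {
                        "partition": partition,
                        "num_partitions": num_partitions,
                    },
                    "size": size,
                }
            }
        },
    }

# One body per partition: 3 partitions, at most 1 bucket returned from each.
bodies = [partitioned_terms_request("my_field", p, 3, 1) for p in range(3)]
print(json.dumps(bodies[0], indent=2))
```

Each body asks for at most one bucket from the terms whose hash falls in that partition, so a partition that received no terms returns no buckets.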