[DOCS] Clarify field data cache behavior #64375
Changes from 5 commits
7430993
57411de
391ab35
c121d75
87e148a
784de58
b83b080
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -33,12 +33,14 @@ GET my-index-000001/_search | |
|
||
<1> Querying on the `_id` field (also see the <<query-dsl-ids-query,`ids` query>>) | ||
|
||
The value of the `_id` field is also accessible in aggregations or for sorting, | ||
but doing so is discouraged as it requires to load a lot of data in memory. In | ||
case sorting or aggregating on the `_id` field is required, it is advised to | ||
duplicate the content of the `_id` field in another field that has `doc_values` | ||
enabled. | ||
|
||
The `_id` field is not available by default for use with aggregations or sorting. | ||
To aggregate or sort by the `_id` field, it is recommended to | ||
duplicate the `_id` field onto a `keyword` field using the <<copy-to, `copy_to` mapping parameter>>. | ||
Really small comment, the link text is usually just the parameter name: <<copy-to, `copy_to`>>
I just realized that it's not possible to use
Okay I'll clarify that, that wasn't clear in the original text. Going to move this entire section to the top. |
||
|
||
It is not recommended to enable `_id` fields to be aggregated using the <<modules-fielddata, in-memory field data cache>>, | ||
Since we soon plan to entirely remove the ability to sort/aggregate on
It looks like we forgot to mention |
||
but it is possible. This can be done by <<cluster-update-settings, changing the cluster setting>> | ||
to `"indices.id_field_data.enabled": true`. Enabling this setting and then aggregating on the `_id` | ||
field will use significant memory and show deprecation warnings in the logs. | ||
|
||
[NOTE] | ||
================================================== | ||
|
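For reference while reviewing, the cluster setting named in the added text could be applied with a request along these lines. This is only a sketch: using `persistent` rather than `transient` is an assumption, and the added text itself discourages enabling the setting because of the memory cost and deprecation warnings.

```console
PUT _cluster/settings
{
  "persistent": {
    "indices.id_field_data.enabled": true
  }
}
```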
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -34,11 +34,12 @@ to be enabled. | |
* Operations on parent and child documents from a `join` field, including | ||
`has_child` queries and `parent` aggregations. | ||
|
||
NOTE: The global ordinal mapping is an on-heap data structure. When measuring | ||
memory usage, Elasticsearch counts the memory from global ordinals as | ||
'fielddata'. Global ordinals memory is included in the | ||
<<fielddata-circuit-breaker, fielddata circuit breaker>>, and is returned | ||
under `fielddata` in the <<cluster-nodes-stats, node stats>> response. | ||
NOTE: The global ordinal mapping uses heap memory as part of the
jtibshirani marked this conversation as resolved.
|
||
<<modules-fielddata, field data cache>>. Aggregations that include high | ||
cardinality values can use a significant amount of heap memory, and | ||
could exceed the threshold of the | ||
<<fielddata-circuit-breaker, field data circuit breaker>>. | ||
It is recommended to set a specific limit for the field data cache size. | ||
We are actually still discussing this recommendation in #59829, perhaps we could hold off on adding this sentence until we have a conclusion. Also maybe "Aggregations that include high cardinality values" -> "Aggregations on high cardinality fields"? |
||
|
||
==== Loading global ordinals | ||
|
||
|
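As a rough illustration of the note above, the memory attributed to field data (which, per the original wording, includes global ordinals) can be inspected with requests like the following. This is a sketch for reviewers, not part of the proposed docs change; the `fields=*` and `v=true` parameters are chosen here only for readability.

```console
GET _nodes/stats/indices/fielddata?fields=*

GET _cat/fielddata?v=true
```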
This file was deleted.
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -120,11 +120,12 @@ PUT my-index-000001/_doc/4?routing=1&refresh | |
<2> `answer` is the name of the join for this document | ||
<3> The parent id of this child document | ||
|
||
==== Parent-join and performance. | ||
==== Parent-join and performance | ||
|
||
The join field shouldn't be used like joins in a relation database. In Elasticsearch the key to good performance | ||
is to de-normalize your data into documents. Each join field, `has_child` or `has_parent` query adds a | ||
significant tax to your query performance. | ||
significant tax to your query performance. It also increases the usage of the JVM heap on the | ||
I'm not actually sure it's a significant contributor to heap usage, since only one |
||
<<modules-fielddata, field data cache>>. | ||
|
||
The only case where the join field makes sense is if your data contains a one-to-many relationship where | ||
one entity significantly outnumbers the other entity. An example of such case is a use case with products | ||
|
A few small comments to make the language more precise: