[DOCS] Clarify field data cache behavior #64375

Merged: 7 commits, Nov 20, 2020
2 changes: 1 addition & 1 deletion docs/plugins/mapper-annotated-text.asciidoc
@@ -18,7 +18,7 @@ include::install_remove.asciidoc[]
[[mapper-annotated-text-usage]]
==== Using the `annotated-text` field

The `annotated-text` tokenizes text content as per the more common `text` field (see
The `annotated-text` tokenizes text content as per the more common {ref}/text.html[`text`] field (see
"limitations" below) but also injects any marked-up annotation tokens directly into
the search index:

@@ -181,7 +181,7 @@ Each option will hold up to `shard_size` values in memory while performing de-du
- hold ordinals of the field as determined by the Lucene index (`global_ordinals`)
- hold hashes of the field values - with potential for hash collisions (`bytes_hash`)

The default setting is to use `global_ordinals` if this information is available from the Lucene index and reverting to `map` if not.
The default setting is to use <<eager-global-ordinals,`global_ordinals`>> if this information is available from the Lucene index and reverting to `map` if not.
The `bytes_hash` setting may prove faster in some cases but introduces the possibility of false positives in de-duplication logic due to the possibility of hash collisions.
Please note that Elasticsearch will ignore the choice of execution hint if it is not applicable and that there is no backward compatibility guarantee on these hints.

@@ -553,7 +553,7 @@ A description of the different collection modes can be found in the
There are different mechanisms by which terms aggregations can be executed:

- by using field values directly in order to aggregate data per-bucket (`map`)
- by using global ordinals of the field and allocating one bucket per global ordinal (`global_ordinals`)
- by using <<eager-global-ordinals,global ordinals>> of the field and allocating one bucket per global ordinal (`global_ordinals`)

Elasticsearch tries to have sensible defaults so this is something that generally doesn't need to be configured.

4 changes: 2 additions & 2 deletions docs/reference/cat/fielddata.asciidoc
@@ -4,8 +4,8 @@
<titleabbrev>cat fielddata</titleabbrev>
++++

Returns the amount of heap memory currently used by fielddata on every data node
in the cluster.
Returns the amount of heap memory currently used by the
<<modules-fielddata, field data cache>> on every data node in the cluster.
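
For illustration only (this snippet is not part of the change above), a minimal request against the cat fielddata API might look like the following. The `v=true` parameter is assumed here only to print column headers; the fields reported depend on what is loaded in your cluster.

[source,console]
--------------------------------------------------
# Sketch: list field data cache usage per node and field, with headers
GET /_cat/fielddata?v=true
--------------------------------------------------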


[[cat-fielddata-api-request]]
2 changes: 1 addition & 1 deletion docs/reference/cluster/stats.asciidoc
@@ -246,7 +246,7 @@ activities.
`fielddata`::
(object)
Contains statistics about the field data cache of selected nodes.
Contains statistics about the <<modules-fielddata, field data cache>> of selected nodes.
+
.Properties of `fielddata`
[%collapsible%open]
29 changes: 15 additions & 14 deletions docs/reference/how-to/search-speed.asciidoc
@@ -303,13 +303,14 @@ may become much worse.
[discrete]
=== Warm up global ordinals

Global ordinals are a data-structure that is used in order to run
<<search-aggregations-bucket-terms-aggregation,`terms`>> aggregations on
<<keyword,`keyword`>> fields. They are loaded lazily in memory because
Elasticsearch does not know which fields will be used in `terms` aggregations
and which fields won't. You can tell Elasticsearch to load global ordinals
eagerly when starting or refreshing a shard by configuring mappings as
described below:
<<eager-global-ordinals,Global ordinals>> are a data structure that is used to
optimize the performance of aggregations. They are calculated lazily and stored in
the JVM heap as part of the <<modules-fielddata, field data cache>>. For fields
that are heavily used for bucketing aggregations, you can tell {es} to construct
and cache the global ordinals before requests are received. This should be done
carefully because it will increase heap usage and can make <<indices-refresh, refreshes>>
take longer. The option can be updated dynamically on an existing mapping by
setting the <<eager-global-ordinals, eager global ordinals>> mapping parameter:

[source,console]
--------------------------------------------------
@@ -392,19 +393,19 @@ right number of replicas for you is

=== Tune your queries with the Profile API

You can also analyse how expensive each component of your queries and
aggregations is using the {ref}/search-profile.html[Profile API]. This might
allow you to tune your queries to be less expensive, resulting in a positive
performance result and reduced load. Also note that Profile API payloads can be
easily visualised for better readability in the
{kibana-ref}/xpack-profiler.html[Search Profiler], which is a Kibana dev tools
UI available in all X-Pack licenses, including the free X-Pack Basic license.

Some caveats to the Profile API are that:

- the Profile API as a debugging tool adds significant overhead to search execution and can also have a very verbose output
- given the added overhead, the resulting took times are not reliable indicators of actual took time, but can be used comparatively between clauses for relative timing differences
- the Profile API is best for exploring possible reasons behind the most costly clauses of a query but isn't intended for accurately measuring absolute timings of each clause

[[faster-phrase-queries]]
=== Faster phrase queries with `index_phrases`
18 changes: 9 additions & 9 deletions docs/reference/mapping/fields/id-field.asciidoc
@@ -3,10 +3,12 @@

Each document has an `_id` that uniquely identifies it, which is indexed
so that documents can be looked up either with the <<docs-get,GET API>> or the
<<query-dsl-ids-query,`ids` query>>.
<<query-dsl-ids-query,`ids` query>>. The `_id` can either be assigned at
indexing time, or a unique `_id` can be generated by {es}. This field is not
configurable in the mappings.

The value of the `_id` field is accessible in certain queries (`term`,
`terms`, `match`, `query_string`, `simple_query_string`).
The value of the `_id` field is accessible in queries such as `term`,
`terms`, `match`, and `query_string`.

[source,console]
--------------------------
@@ -33,12 +35,10 @@ GET my-index-000001/_search

<1> Querying on the `_id` field (also see the <<query-dsl-ids-query,`ids` query>>)

The value of the `_id` field is also accessible in aggregations or for sorting,
but doing so is discouraged as it requires to load a lot of data in memory. In
case sorting or aggregating on the `_id` field is required, it is advised to
duplicate the content of the `_id` field in another field that has `doc_values`
enabled.

The `_id` field is restricted from use in aggregations, sorting, and scripting.
In case sorting or aggregating on the `_id` field is required, it is advised to
duplicate the content of the `_id` field into another field that has
`doc_values` enabled.
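
As a rough sketch of the advice above (not part of this change), the `_id` value can be copied into a separate `keyword` field at index time and that field used for sorting. The field name `doc_id` is hypothetical, and the example assumes `my-index-000001` does not exist yet. `keyword` fields have `doc_values` enabled by default, so no extra mapping parameter is needed.

[source,console]
--------------------------------------------------
# Hypothetical `doc_id` field that mirrors the document `_id`
PUT my-index-000001
{
  "mappings": {
    "properties": {
      "doc_id": { "type": "keyword" }
    }
  }
}

# The client writes the same value to `_id` and to `doc_id`
PUT my-index-000001/_doc/1?refresh
{
  "doc_id": "1"
}

# Sort on the copy instead of on `_id`
GET my-index-000001/_search
{
  "sort": [ { "doc_id": "asc" } ]
}
--------------------------------------------------

The duplication has to be done by whatever indexes the documents; this sketch does not assume any automatic copying of `_id`.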

[NOTE]
==================================================
2 changes: 0 additions & 2 deletions docs/reference/mapping/params.asciidoc
@@ -49,8 +49,6 @@ include::params/eager-global-ordinals.asciidoc[]

include::params/enabled.asciidoc[]

include::params/fielddata.asciidoc[]

include::params/format.asciidoc[]

include::params/ignore-above.asciidoc[]
9 changes: 4 additions & 5 deletions docs/reference/mapping/params/eager-global-ordinals.asciidoc
@@ -34,11 +34,10 @@ to be enabled.
* Operations on parent and child documents from a `join` field, including
`has_child` queries and `parent` aggregations.

NOTE: The global ordinal mapping is an on-heap data structure. When measuring
memory usage, Elasticsearch counts the memory from global ordinals as
'fielddata'. Global ordinals memory is included in the
<<fielddata-circuit-breaker, fielddata circuit breaker>>, and is returned
under `fielddata` in the <<cluster-nodes-stats, node stats>> response.
NOTE: The global ordinal mapping uses heap memory as part of the
<<modules-fielddata, field data cache>>. Aggregations on high cardinality fields
can use a lot of memory and trigger the <<fielddata-circuit-breaker, field data
circuit breaker>>.
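
One way to see how much heap global ordinals and other field data are using is the node stats API. The request below is a sketch for illustration and not part of this change; the `fields=*` wildcard is an assumption chosen to break the totals down by every field.

[source,console]
--------------------------------------------------
# Sketch: per-node field data memory, broken down by field
GET /_nodes/stats/indices/fielddata?fields=*
--------------------------------------------------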

==== Loading global ordinals

134 changes: 0 additions & 134 deletions docs/reference/mapping/params/fielddata.asciidoc

This file was deleted.

4 changes: 2 additions & 2 deletions docs/reference/mapping/types/parent-join.asciidoc
@@ -120,11 +120,11 @@ PUT my-index-000001/_doc/4?routing=1&refresh
<2> `answer` is the name of the join for this document
<3> The parent id of this child document

==== Parent-join and performance.
==== Parent-join and performance

The join field shouldn't be used like joins in a relational database. In Elasticsearch the key to good performance
is to de-normalize your data into documents. Each join field, `has_child` or `has_parent` query adds a
significant tax to your query performance.
significant tax to your query performance. It can also trigger <<eager-global-ordinals, global ordinals>> to be built.

The only case where the join field makes sense is if your data contains a one-to-many relationship where
one entity significantly outnumbers the other entity. An example of such case is a use case with products