diff --git a/docs/plugins/mapper-annotated-text.asciidoc b/docs/plugins/mapper-annotated-text.asciidoc index 4a30da47d62c2..9307b6aaefe13 100644 --- a/docs/plugins/mapper-annotated-text.asciidoc +++ b/docs/plugins/mapper-annotated-text.asciidoc @@ -18,7 +18,7 @@ include::install_remove.asciidoc[] [[mapper-annotated-text-usage]] ==== Using the `annotated-text` field -The `annotated-text` tokenizes text content as per the more common `text` field (see +The `annotated-text` tokenizes text content as per the more common {ref}/text.html[`text`] field (see "limitations" below) but also injects any marked-up annotation tokens directly into the search index: diff --git a/docs/reference/aggregations/bucket/diversified-sampler-aggregation.asciidoc b/docs/reference/aggregations/bucket/diversified-sampler-aggregation.asciidoc index f062a940432d8..4b829255db38d 100644 --- a/docs/reference/aggregations/bucket/diversified-sampler-aggregation.asciidoc +++ b/docs/reference/aggregations/bucket/diversified-sampler-aggregation.asciidoc @@ -181,7 +181,7 @@ Each option will hold up to `shard_size` values in memory while performing de-du - hold ordinals of the field as determined by the Lucene index (`global_ordinals`) - hold hashes of the field values - with potential for hash collisions (`bytes_hash`) -The default setting is to use `global_ordinals` if this information is available from the Lucene index and reverting to `map` if not. +The default setting is to use <> if this information is available from the Lucene index and reverting to `map` if not. The `bytes_hash` setting may prove faster in some cases but introduces the possibility of false positives in de-duplication logic due to the possibility of hash collisions. Please note that Elasticsearch will ignore the choice of execution hint if it is not applicable and that there is no backward compatibility guarantee on these hints. 
diff --git a/docs/reference/aggregations/bucket/significantterms-aggregation.asciidoc b/docs/reference/aggregations/bucket/significantterms-aggregation.asciidoc index a3df511e57a92..6bd36945eae73 100644 --- a/docs/reference/aggregations/bucket/significantterms-aggregation.asciidoc +++ b/docs/reference/aggregations/bucket/significantterms-aggregation.asciidoc @@ -553,7 +553,7 @@ A description of the different collection modes can be found in the There are different mechanisms by which terms aggregations can be executed: - by using field values directly in order to aggregate data per-bucket (`map`) - - by using global ordinals of the field and allocating one bucket per global ordinal (`global_ordinals`) + - by using <> of the field and allocating one bucket per global ordinal (`global_ordinals`) Elasticsearch tries to have sensible defaults so this is something that generally doesn't need to be configured. diff --git a/docs/reference/cat/fielddata.asciidoc b/docs/reference/cat/fielddata.asciidoc index e855517d01380..149ce423423e8 100644 --- a/docs/reference/cat/fielddata.asciidoc +++ b/docs/reference/cat/fielddata.asciidoc @@ -4,8 +4,8 @@ cat fielddata ++++ -Returns the amount of heap memory currently used by fielddata on every data node -in the cluster. +Returns the amount of heap memory currently used by the +<> on every data node in the cluster. [[cat-fielddata-api-request]] diff --git a/docs/reference/cluster/stats.asciidoc b/docs/reference/cluster/stats.asciidoc index af5b402c5a399..48707ee8a94cd 100644 --- a/docs/reference/cluster/stats.asciidoc +++ b/docs/reference/cluster/stats.asciidoc @@ -246,7 +246,7 @@ activities. `fielddata`:: (object) -Contains statistics about the field data cache of selected nodes. +Contains statistics about the <> of selected nodes. 
+ .Properties of `fielddata` [%collapsible%open] diff --git a/docs/reference/how-to/search-speed.asciidoc b/docs/reference/how-to/search-speed.asciidoc index 79df665127edc..e51c7fa2b7821 100644 --- a/docs/reference/how-to/search-speed.asciidoc +++ b/docs/reference/how-to/search-speed.asciidoc @@ -303,13 +303,14 @@ may become much worse. [discrete] === Warm up global ordinals -Global ordinals are a data-structure that is used in order to run -<> aggregations on -<> fields. They are loaded lazily in memory because -Elasticsearch does not know which fields will be used in `terms` aggregations -and which fields won't. You can tell Elasticsearch to load global ordinals -eagerly when starting or refreshing a shard by configuring mappings as -described below: +<> are a data structure that is used to +optimize the performance of aggregations. They are calculated lazily and stored in +the JVM heap as part of the <>. For fields +that are heavily used for bucketing aggregations, you can tell {es} to construct +and cache the global ordinals before requests are received. This should be done +carefully because it will increase heap usage and can make <> +take longer. The option can be updated dynamically on an existing mapping by +setting the <> mapping parameter: [source,console] -------------------------------------------------- @@ -392,19 +393,19 @@ right number of replicas for you is === Tune your queries with the Profile API -You can also analyse how expensive each component of your queries and -aggregations are using the {ref}/search-profile.html[Profile API]. This might -allow you to tune your queries to be less expensive, resulting in a positive -performance result and reduced load. 
Also note that Profile API payloads can be -easily visualised for better readability in the -{kibana-ref}/xpack-profiler.html[Search Profiler], which is a Kibana dev tools +You can also analyse how expensive each component of your queries and +aggregations are using the {ref}/search-profile.html[Profile API]. This might +allow you to tune your queries to be less expensive, resulting in a positive +performance result and reduced load. Also note that Profile API payloads can be +easily visualised for better readability in the +{kibana-ref}/xpack-profiler.html[Search Profiler], which is a Kibana dev tools UI available in all X-Pack licenses, including the free X-Pack Basic license. Some caveats to the Profile API are that: - the Profile API as a debugging tool adds significant overhead to search execution and can also have a very verbose output - given the added overhead, the resulting took times are not reliable indicators of actual took time, but can be used comparatively between clauses for relative timing differences - - the Profile API is best for exploring possible reasons behind the most costly clauses of a query but isn't intended for accurately measuring absolute timings of each clause + - the Profile API is best for exploring possible reasons behind the most costly clauses of a query but isn't intended for accurately measuring absolute timings of each clause [[faster-phrase-queries]] === Faster phrase queries with `index_phrases` diff --git a/docs/reference/mapping/fields/id-field.asciidoc b/docs/reference/mapping/fields/id-field.asciidoc index 33f1e8eb7178c..1e963dd6de7d7 100644 --- a/docs/reference/mapping/fields/id-field.asciidoc +++ b/docs/reference/mapping/fields/id-field.asciidoc @@ -3,10 +3,12 @@ Each document has an `_id` that uniquely identifies it, which is indexed so that documents can be looked up either with the <> or the -<>. +<>. The `_id` can either be assigned at +indexing time, or a unique `_id` can be generated by {es}. 
This field is not +configurable in the mappings. -The value of the `_id` field is accessible in certain queries (`term`, -`terms`, `match`, `query_string`, `simple_query_string`). +The value of the `_id` field is accessible in queries such as `term`, +`terms`, `match`, and `query_string`. [source,console] -------------------------- @@ -33,12 +35,10 @@ GET my-index-000001/_search <1> Querying on the `_id` field (also see the <>) -The value of the `_id` field is also accessible in aggregations or for sorting, -but doing so is discouraged as it requires to load a lot of data in memory. In -case sorting or aggregating on the `_id` field is required, it is advised to -duplicate the content of the `_id` field in another field that has `doc_values` -enabled. - +The `_id` field is restricted from use in aggregations, sorting, and scripting. +In case sorting or aggregating on the `_id` field is required, it is advised to +duplicate the content of the `_id` field into another field that has +`doc_values` enabled. [NOTE] ================================================== diff --git a/docs/reference/mapping/params.asciidoc b/docs/reference/mapping/params.asciidoc index f8a039cce241d..cbf21f55f8d71 100644 --- a/docs/reference/mapping/params.asciidoc +++ b/docs/reference/mapping/params.asciidoc @@ -52,8 +52,6 @@ include::params/eager-global-ordinals.asciidoc[] include::params/enabled.asciidoc[] -include::params/fielddata.asciidoc[] - include::params/format.asciidoc[] include::params/ignore-above.asciidoc[] diff --git a/docs/reference/mapping/params/eager-global-ordinals.asciidoc b/docs/reference/mapping/params/eager-global-ordinals.asciidoc index 4b1ae5f626f71..76f2f41656469 100644 --- a/docs/reference/mapping/params/eager-global-ordinals.asciidoc +++ b/docs/reference/mapping/params/eager-global-ordinals.asciidoc @@ -34,11 +34,10 @@ to be enabled. * Operations on parent and child documents from a `join` field, including `has_child` queries and `parent` aggregations. 
-NOTE: The global ordinal mapping is an on-heap data structure. When measuring -memory usage, Elasticsearch counts the memory from global ordinals as -'fielddata'. Global ordinals memory is included in the -<>, and is returned -under `fielddata` in the <> response. +NOTE: The global ordinal mapping uses heap memory as part of the +<>. Aggregations on high cardinality fields +can use a lot of memory and trigger the <>. ==== Loading global ordinals diff --git a/docs/reference/mapping/params/fielddata.asciidoc b/docs/reference/mapping/params/fielddata.asciidoc deleted file mode 100644 index 1faa82a53f310..0000000000000 --- a/docs/reference/mapping/params/fielddata.asciidoc +++ /dev/null @@ -1,134 +0,0 @@ -[[fielddata]] -=== `fielddata` - -Most fields are <> by default, which makes them -searchable. Sorting, aggregations, and accessing field values in scripts, -however, requires a different access pattern from search. - -Search needs to answer the question _"Which documents contain this term?"_, -while sorting and aggregations need to answer a different question: _"What is -the value of this field for **this** document?"_. - -Most fields can use index-time, on-disk <> for this -data access pattern, but <> fields do not support `doc_values`. - -Instead, `text` fields use a query-time *in-memory* data structure called -`fielddata`. This data structure is built on demand the first time that a -field is used for aggregations, sorting, or in a script. It is built by -reading the entire inverted index for each segment from disk, inverting the -term ↔︎ document relationship, and storing the result in memory, in the JVM -heap. - -[[fielddata-disabled-text-fields]] -==== Fielddata is disabled on `text` fields by default - -Fielddata can consume a *lot* of heap space, especially when loading high -cardinality `text` fields. Once fielddata has been loaded into the heap, it -remains there for the lifetime of the segment. 
Also, loading fielddata is an -expensive process which can cause users to experience latency hits. This is -why fielddata is disabled by default. - -If you try to sort, aggregate, or access values from a script on a `text` -field, you will see this exception: - -[literal] -Fielddata is disabled on text fields by default. Set `fielddata=true` on -[`your_field_name`] in order to load fielddata in memory by uninverting the -inverted index. Note that this can however use significant memory. - -[[before-enabling-fielddata]] -==== Before enabling fielddata - -Before you enable fielddata, consider why you are using a `text` field for -aggregations, sorting, or in a script. It usually doesn't make sense to do -so. - -A text field is analyzed before indexing so that a value like -`New York` can be found by searching for `new` or for `york`. A `terms` -aggregation on this field will return a `new` bucket and a `york` bucket, when -you probably want a single bucket called `New York`. - -Instead, you should have a `text` field for full text searches, and an -unanalyzed <> field with <> -enabled for aggregations, as follows: - -[source,console] ---------------------------------- -PUT my-index-000001 -{ - "mappings": { - "properties": { - "my_field": { <1> - "type": "text", - "fields": { - "keyword": { <2> - "type": "keyword" - } - } - } - } - } -} ---------------------------------- - -<1> Use the `my_field` field for searches. -<2> Use the `my_field.keyword` field for aggregations, sorting, or in scripts. 
- -[[enable-fielddata-text-fields]] -==== Enabling fielddata on `text` fields - -You can enable fielddata on an existing `text` field using the -<> as follows: - -[source,console] ------------------------------------ -PUT my-index-000001/_mapping -{ - "properties": { - "my_field": { <1> - "type": "text", - "fielddata": true - } - } -} ------------------------------------ -// TEST[continued] - -<1> The mapping that you specify for `my_field` should consist of the existing - mapping for that field, plus the `fielddata` parameter. - -[[field-data-filtering]] -==== `fielddata_frequency_filter` - -Fielddata filtering can be used to reduce the number of terms loaded into -memory, and thus reduce memory usage. Terms can be filtered by _frequency_: - -The frequency filter allows you to only load terms whose document frequency falls -between a `min` and `max` value, which can be expressed an absolute -number (when the number is bigger than 1.0) or as a percentage -(eg `0.01` is `1%` and `1.0` is `100%`). Frequency is calculated -*per segment*. Percentages are based on the number of docs which have a -value for the field, as opposed to all docs in the segment. 
- -Small segments can be excluded completely by specifying the minimum -number of docs that the segment should contain with `min_segment_size`: - -[source,console] --------------------------------------------------- -PUT my-index-000001 -{ - "mappings": { - "properties": { - "tag": { - "type": "text", - "fielddata": true, - "fielddata_frequency_filter": { - "min": 0.001, - "max": 0.1, - "min_segment_size": 500 - } - } - } - } -} --------------------------------------------------- diff --git a/docs/reference/mapping/types/parent-join.asciidoc b/docs/reference/mapping/types/parent-join.asciidoc index 67274e85caf1e..4960bcae5880c 100644 --- a/docs/reference/mapping/types/parent-join.asciidoc +++ b/docs/reference/mapping/types/parent-join.asciidoc @@ -120,11 +120,11 @@ PUT my-index-000001/_doc/4?routing=1&refresh <2> `answer` is the name of the join for this document <3> The parent id of this child document -==== Parent-join and performance. +==== Parent-join and performance The join field shouldn't be used like joins in a relation database. In Elasticsearch the key to good performance is to de-normalize your data into documents. Each join field, `has_child` or `has_parent` query adds a -significant tax to your query performance. +significant tax to your query performance. It can also trigger <> to be built. The only case where the join field makes sense is if your data contains a one-to-many relationship where one entity significantly outnumbers the other entity. An example of such case is a use case with products diff --git a/docs/reference/mapping/types/text.asciidoc b/docs/reference/mapping/types/text.asciidoc index 100561310e2ef..12ccbb0ccac69 100644 --- a/docs/reference/mapping/types/text.asciidoc +++ b/docs/reference/mapping/types/text.asciidoc @@ -146,3 +146,112 @@ The following parameters are accepted by `text` fields: <>:: Metadata about the field. 
+ +[[fielddata-mapping-param]] +==== `fielddata` mapping parameter + +`text` fields are searchable by default, but are not available by default for +aggregations, sorting, or scripting. If you try to sort, aggregate, or access +values from a script on a `text` field, you will see this exception: + +[literal] +Fielddata is disabled on text fields by default. Set `fielddata=true` on +[`your_field_name`] in order to load fielddata in memory by uninverting the +inverted index. Note that this can however use significant memory. + +Field data is the only way to access the analyzed tokens from a full text field +in aggregations, sorting, or scripting. For example, a full text value like `New York` +is analyzed into the tokens `new` and `york`. Aggregating on these tokens requires field data. + +[[before-enabling-fielddata]] +==== Before enabling fielddata + +It usually doesn't make sense to enable fielddata on text fields. Field data +is stored in the heap with the <> because it +is expensive to calculate. Calculating the field data can cause latency spikes, and +the increased heap usage can cause cluster performance issues. + +Most users who want to do more with text fields use <> +by having both a `text` field for full text searches, and an +unanalyzed <> field for aggregations, as follows: + +[source,console] +--------------------------------- +PUT my-index-000001 +{ + "mappings": { + "properties": { + "my_field": { <1> + "type": "text", + "fields": { + "keyword": { <2> + "type": "keyword" + } + } + } + } + } +} +--------------------------------- + +<1> Use the `my_field` field for searches. +<2> Use the `my_field.keyword` field for aggregations, sorting, or in scripts. 
+ +[[enable-fielddata-text-fields]] +==== Enabling fielddata on `text` fields + +You can enable fielddata on an existing `text` field using the +<> as follows: + +[source,console] +----------------------------------- +PUT my-index-000001/_mapping +{ + "properties": { + "my_field": { <1> + "type": "text", + "fielddata": true + } + } +} +----------------------------------- +// TEST[continued] + +<1> The mapping that you specify for `my_field` should consist of the existing + mapping for that field, plus the `fielddata` parameter. + +[[field-data-filtering]] +==== `fielddata_frequency_filter` mapping parameter + +Fielddata filtering can be used to reduce the number of terms loaded into +memory, and thus reduce memory usage. Terms can be filtered by _frequency_: + +The frequency filter allows you to only load terms whose document frequency falls +between a `min` and `max` value, which can be expressed as an absolute +number (when the number is bigger than 1.0) or as a percentage +(eg `0.01` is `1%` and `1.0` is `100%`). Frequency is calculated +*per segment*. Percentages are based on the number of docs which have a +value for the field, as opposed to all docs in the segment. 
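The `min`/`max` rule described above can be sketched in Python. This is only an illustration of the documented semantics, not Elasticsearch's implementation: `resolve_threshold` and `term_is_loaded` are hypothetical names, treating both bounds as inclusive is an assumption, and `min_segment_size` is deliberately left out.

```python
def resolve_threshold(value: float, docs_with_field: int) -> float:
    """Turn a `min` or `max` setting into a document-count threshold.

    Values greater than 1.0 are treated as absolute document counts.
    Values up to and including 1.0 are treated as a fraction of the
    documents in the segment that have a value for the field.
    """
    if value > 1.0:
        return value
    return value * docs_with_field


def term_is_loaded(doc_freq: int, docs_with_field: int,
                   min_freq: float, max_freq: float) -> bool:
    """Return True if a term with this per-segment frequency would be kept."""
    # Assumption: both bounds are inclusive.
    low = resolve_threshold(min_freq, docs_with_field)
    high = resolve_threshold(max_freq, docs_with_field)
    return low <= doc_freq <= high


if __name__ == "__main__":
    # A segment where 10,000 documents have a value for the field,
    # filtered with min=0.001 (0.1%) and max=0.1 (10%).
    docs_with_field = 10_000
    for doc_freq in (5, 500, 2_000):
        print(doc_freq, term_is_loaded(doc_freq, docs_with_field, 0.001, 0.1))
```

With these settings the thresholds work out to roughly 10 and 1,000 documents, so a term that appears in 500 documents is kept, while terms that appear in 5 or 2,000 documents are dropped.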
+ +Small segments can be excluded completely by specifying the minimum +number of docs that the segment should contain with `min_segment_size`: + +[source,console] +-------------------------------------------------- +PUT my-index-000001 +{ + "mappings": { + "properties": { + "tag": { + "type": "text", + "fielddata": true, + "fielddata_frequency_filter": { + "min": 0.001, + "max": 0.1, + "min_segment_size": 500 + } + } + } + } +} +-------------------------------------------------- diff --git a/docs/reference/modules/indices/circuit_breaker.asciidoc b/docs/reference/modules/indices/circuit_breaker.asciidoc index d06b3f27c11c5..2fd929f85cedb 100644 --- a/docs/reference/modules/indices/circuit_breaker.asciidoc +++ b/docs/reference/modules/indices/circuit_breaker.asciidoc @@ -32,11 +32,10 @@ The parent-level breaker can be configured with the following settings: [[fielddata-circuit-breaker]] [discrete] ==== Field data circuit breaker -The field data circuit breaker allows Elasticsearch to estimate the amount of -memory a field will require to be loaded into memory. It can then prevent the -field data loading by raising an exception. By default the limit is configured -to 40% of the maximum JVM heap. It can be configured with the following -parameters: +The field data circuit breaker estimates the heap memory required to load a +field into the <>. If loading the field would +cause the cache to exceed a predefined memory limit, the circuit breaker stops the +operation and returns an error. 
[[fielddata-circuit-breaker-limit]] // tag::fielddata-circuit-breaker-limit-tag[] diff --git a/docs/reference/modules/indices/fielddata.asciidoc b/docs/reference/modules/indices/fielddata.asciidoc index 5a2bbac9f379d..1383bf74d6d4c 100644 --- a/docs/reference/modules/indices/fielddata.asciidoc +++ b/docs/reference/modules/indices/fielddata.asciidoc @@ -1,22 +1,30 @@ [[modules-fielddata]] === Field data cache settings -The field data cache is used mainly when sorting on or computing aggregations -on a field. It loads all the field values to memory in order to provide fast -document based access to those values. The field data cache can be -expensive to build for a field, so its recommended to have enough memory -to allocate it, and to keep it loaded. +The field data cache contains <> and <>, +which are both used to support aggregations on certain field types. +Since these are on-heap data structures, it is important to monitor the cache's use. -The amount of memory used for the field -data cache can be controlled using `indices.fielddata.cache.size`. Note: -reloading the field data which does not fit into your cache will be expensive -and perform poorly. +[discrete] +[[fielddata-sizing]] +==== Cache size + +The entries in the cache are expensive to build, so the default behavior is +to keep the cache loaded in memory. The default cache size is unlimited, +causing the cache to grow until it reaches the limit set by the <>. This behavior can be configured. + +If the cache size limit is set, the cache will begin clearing the least-recently-updated +entries in the cache. This setting can automatically avoid the circuit breaker limit, +at the cost of rebuilding the cache as needed. + +If the circuit breaker limit is reached, further requests that increase the cache +size will be prevented. In this case you should manually <>. `indices.fielddata.cache.size`:: (<>) -The max size of the field data cache, eg `30%` of node heap space, or an -absolute value, eg `12GB`. 
Defaults to unbounded. Also see -<>. +The max size of the field data cache, eg `38%` of node heap space, or an +absolute value, eg `12GB`. Defaults to unbounded. If you choose to set it, +it should be smaller than the <> limit. [discrete] [[fielddata-monitoring]] @@ -24,5 +32,4 @@ absolute value, eg `12GB`. Defaults to unbounded. Also see You can monitor memory usage for field data as well as the field data circuit breaker using -<> - +the <> or the <>. diff --git a/docs/reference/redirects.asciidoc b/docs/reference/redirects.asciidoc index 831c1bb351484..616cf86242bd2 100644 --- a/docs/reference/redirects.asciidoc +++ b/docs/reference/redirects.asciidoc @@ -1186,3 +1186,8 @@ See <>. === Matrix aggregations See <>. + +[[fielddata]] +=== `fielddata` mapping parameter + +See <>.
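The `fielddata.asciidoc` hunk recommends keeping `indices.fielddata.cache.size` below the field data circuit breaker limit. A minimal Python sketch of that check follows. The helper names `to_bytes` and `cache_fits_under_breaker` are hypothetical, and the `40%` breaker limit used in the example is taken from the text this diff removes from `circuit_breaker.asciidoc`; treat it as an assumption and confirm the limit that applies to your version. Byte-size units are interpreted as powers of 1024.

```python
import re

# Multipliers for byte-size suffixes (assumed to be powers of 1024).
_UNITS = {"b": 1, "kb": 1024, "mb": 1024**2, "gb": 1024**3, "tb": 1024**4}


def to_bytes(setting: str, heap_bytes: int) -> int:
    """Convert a size such as '38%' or '12GB' into a number of bytes.

    Percentages are resolved against the node's JVM heap size.
    """
    text = setting.strip().lower()
    if text.endswith("%"):
        return int(heap_bytes * float(text[:-1]) / 100)
    match = re.fullmatch(r"(\d+(?:\.\d+)?)\s*([kmgt]?b)", text)
    if match is None:
        raise ValueError(f"unrecognised size: {setting!r}")
    number, unit = match.groups()
    return int(float(number) * _UNITS[unit])


def cache_fits_under_breaker(cache_size: str, breaker_limit: str,
                             heap_bytes: int) -> bool:
    """Return True if the cache size is strictly below the breaker limit."""
    return to_bytes(cache_size, heap_bytes) < to_bytes(breaker_limit, heap_bytes)


if __name__ == "__main__":
    heap = 16 * 1024**3  # hypothetical node with a 16 GB heap
    for size in ("38%", "12GB"):
        print(size, cache_fits_under_breaker(size, "40%", heap))
```

On the hypothetical 16 GB heap, `38%` stays below a `40%` breaker limit while an absolute `12GB` does not, which shows why an absolute cache size has to be checked against the heap of each node it is applied to.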