Add doc_count field mapper (#64503)

Bucket aggregations compute bucket doc_count values by incrementing the doc_count by 1 for every document collected in the bucket. When using summary fields (such as aggregate_metric_double) one field may represent more than one document. To provide this functionality we have implemented a new field mapper (named doc_count field mapper). This field is a positive integer representing the number of documents aggregated in a single summary field. Bucket aggregations will check if a field of type doc_count exists in a document and will take this value into consideration when computing doc counts.
elastic · Nov 3, 2020 · 4dc833f
1 parent 4add5cb
Showing 22 changed files with 786 additions and 63 deletions.
diff --git a/docs/reference/mapping/fields.asciidoc b/docs/reference/mapping/fields.asciidoc
@@ -29,6 +29,13 @@ fields can be customized when a mapping is created.
     The size of the `_source` field in bytes, provided by the
     {plugins}/mapper-size.html[`mapper-size` plugin].
 
+[discrete]
+=== Doc count metadata field
+
+<<mapping-doc-count-field,`_doc_count`>>::
+
+    A custom field used for storing doc counts when a document represents pre-aggregated data.
+
 [discrete]
 === Indexing metadata fields
 
@@ -55,6 +62,7 @@ fields can be customized when a mapping is created.
 
     Application specific metadata.
 
+include::fields/doc-count-field.asciidoc[]
 
 include::fields/field-names-field.asciidoc[]
 
@@ -69,4 +77,3 @@ include::fields/meta-field.asciidoc[]
 include::fields/routing-field.asciidoc[]
 
 include::fields/source-field.asciidoc[]
-
diff --git a/docs/reference/mapping/fields/doc-count-field.asciidoc b/docs/reference/mapping/fields/doc-count-field.asciidoc
@@ -0,0 +1,118 @@
+[[mapping-doc-count-field]]
+=== `_doc_count` field
+++++
+<titleabbrev>_doc_count</titleabbrev>
+++++
+
+Bucket aggregations always return a field named `doc_count` showing the number of documents that were aggregated and partitioned
+in each bucket. The computation is simple: `doc_count` is incremented by 1 for every document collected
+in each bucket.
+
+While this simple approach is effective when computing aggregations over individual documents, it fails to accurately represent
+documents that store pre-aggregated data (such as `histogram` or `aggregate_metric_double` fields), because one summary field may
+represent multiple documents.
+
+To allow for correct computation of the number of documents when working with pre-aggregated data, we have introduced a
+metadata field type named `_doc_count`. `_doc_count` must always be a positive integer representing the number of documents
+aggregated in a single summary field.
+
+When field `_doc_count` is added to a document, all bucket aggregations will respect its value and increment the bucket `doc_count`
+by the value of the field. If a document does not contain any `_doc_count` field, `_doc_count = 1` is implied by default.
+
+[IMPORTANT]
+========
+* A `_doc_count` field can only store a single positive integer per document. Nested arrays are not allowed.
+* If a document contains no `_doc_count` fields, aggregators will increment by 1, which is the default behavior.
+========
+
+[[mapping-doc-count-field-example]]
+==== Example
+
+The following <<indices-create-index, create index>> API request creates a new index with the following field mappings:
+
+* `my_histogram`, a `histogram` field used to store percentile data
+* `my_text`, a `keyword` field used to store a title for the histogram
+
+[source,console]
+--------------------------------------------------
+PUT my_index
+{
+  "mappings" : {
+    "properties" : {
+      "my_histogram" : {
+        "type" : "histogram"
+      },
+      "my_text" : {
+        "type" : "keyword"
+      }
+    }
+  }
+}
+--------------------------------------------------
+
+The following <<docs-index_,index>> API requests store pre-aggregated data for
+two histograms: `histogram_1` and `histogram_2`.
+
+[source,console]
+--------------------------------------------------
+PUT my_index/_doc/1
+{
+  "my_text" : "histogram_1",
+  "my_histogram" : {
+      "values" : [0.1, 0.2, 0.3, 0.4, 0.5],
+      "counts" : [3, 7, 23, 12, 6]
+   },
+  "_doc_count": 45 <1>
+}
+
+PUT my_index/_doc/2
+{
+  "my_text" : "histogram_2",
+  "my_histogram" : {
+      "values" : [0.1, 0.25, 0.35, 0.4, 0.45, 0.5],
+      "counts" : [8, 17, 8, 7, 6, 2]
+   },
+  "_doc_count": 62 <1>
+}
+--------------------------------------------------
+<1> Field `_doc_count` must be a positive integer storing the number of documents aggregated to produce each histogram.
+
+If we run the following <<search-aggregations-bucket-terms-aggregation, terms aggregation>> on `my_index`:
+
+[source,console]
+--------------------------------------------------
+GET /_search
+{
+    "aggs" : {
+        "histogram_titles" : {
+            "terms" : { "field" : "my_text" }
+        }
+    }
+}
+--------------------------------------------------
+
+We will get the following response:
+
+[source,console-result]
+--------------------------------------------------
+{
+    ...
+    "aggregations" : {
+        "histogram_titles" : {
+            "doc_count_error_upper_bound": 0,
+            "sum_other_doc_count": 0,
+            "buckets" : [
+                {
+                    "key" : "histogram_2",
+                    "doc_count" : 62
+                },
+                {
+                    "key" : "histogram_1",
+                    "doc_count" : 45
+                }
+            ]
+        }
+    }
+}
+--------------------------------------------------
+// TESTRESPONSE[skip:test not setup]
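
The counting rule documented above (increment a bucket by `_doc_count`, default to 1 when the field is absent, accept only a single positive integer) can be illustrated with a minimal Python sketch. This is not Elasticsearch's implementation: `effective_doc_count` and `terms_doc_counts` are hypothetical names, the documents are plain dicts reduced to the fields that matter, and raising `ValueError` on a bad value is an assumption about how a violation might be surfaced.

```python
from collections import Counter

DOC_COUNT_FIELD = "_doc_count"


def effective_doc_count(doc):
    """Return the number of documents that `doc` stands for.

    Hypothetical helper: a missing `_doc_count` means 1, and any value that
    is not a single positive integer is rejected.
    """
    if DOC_COUNT_FIELD not in doc:
        return 1
    value = doc[DOC_COUNT_FIELD]
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(value, bool) or not isinstance(value, int) or value <= 0:
        raise ValueError(f"{DOC_COUNT_FIELD} must be a positive integer, got {value!r}")
    return value


def terms_doc_counts(docs, field):
    """Sum effective doc counts per distinct value of `field`."""
    counts = Counter()
    for doc in docs:
        if field in doc:
            counts[doc[field]] += effective_doc_count(doc)
    return dict(counts)


# The two pre-aggregated documents from the example, histogram payload omitted.
docs = [
    {"my_text": "histogram_1", "_doc_count": 45},
    {"my_text": "histogram_2", "_doc_count": 62},
]
print(terms_doc_counts(docs, "my_text"))
# {'histogram_1': 45, 'histogram_2': 62}
```

The per-bucket totals match the `doc_count` values in the response above. The ordering of buckets in a real response is decided by the aggregation, not by this sketch.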
diff --git a/...api-spec/src/main/resources/rest-api-spec/test/search.aggregation/370_doc_count_field.yml b/...api-spec/src/main/resources/rest-api-spec/test/search.aggregation/370_doc_count_field.yml
@@ -0,0 +1,150 @@
+setup:
+  - do:
+      indices.create:
+        index: test_1
+        body:
+          settings:
+            number_of_replicas: 0
+          mappings:
+            properties:
+              str:
+                type: keyword
+              number:
+                type: integer
+
+  - do:
+      bulk:
+        index: test_1
+        refresh: true
+        body:
+          - '{"index": {}}'
+          - '{"_doc_count": 10, "str": "abc", "number" : 500, "unmapped": "abc" }'
+          - '{"index": {}}'
+          - '{"_doc_count": 5, "str": "xyz", "number" : 100, "unmapped": "xyz" }'
+          - '{"index": {}}'
+          - '{"_doc_count": 7, "str": "foo", "number" : 100, "unmapped": "foo" }'
+          - '{"index": {}}'
+          - '{"_doc_count": 1, "str": "foo", "number" : 200, "unmapped": "foo" }'
+          - '{"index": {}}'
+          - '{"str": "abc", "number" : 500, "unmapped": "abc" }'
+
+---
+"Test numeric terms agg with doc_count":
+  - skip:
+      version: " - 7.99.99"
+      reason: "Doc count fields are only implemented in 8.0"
+
+  - do:
+      search:
+        rest_total_hits_as_int: true
+        body: { "size" : 0, "aggs" : { "num_terms" : { "terms" : { "field" : "number" } } } }
+
+  - match: { hits.total: 5 }
+  - length: { aggregations.num_terms.buckets: 3 }
+  - match: { aggregations.num_terms.buckets.0.key: 100 }
+  - match: { aggregations.num_terms.buckets.0.doc_count: 12 }
+  - match: { aggregations.num_terms.buckets.1.key: 500 }
+  - match: { aggregations.num_terms.buckets.1.doc_count: 11 }
+  - match: { aggregations.num_terms.buckets.2.key: 200 }
+  - match: { aggregations.num_terms.buckets.2.doc_count: 1 }
+
+
+---
+"Test keyword terms agg with doc_count":
+  - skip:
+      version: " - 7.99.99"
+      reason: "Doc count fields are only implemented in 8.0"
+  - do:
+      search:
+        rest_total_hits_as_int: true
+        body: { "size" : 0, "aggs" : { "str_terms" : { "terms" : { "field" : "str" } } } }
+
+  - match: { hits.total: 5 }
+  - length: { aggregations.str_terms.buckets: 3 }
+  - match: { aggregations.str_terms.buckets.0.key: "abc" }
+  - match: { aggregations.str_terms.buckets.0.doc_count: 11 }
+  - match: { aggregations.str_terms.buckets.1.key: "foo" }
+  - match: { aggregations.str_terms.buckets.1.doc_count: 8 }
+  - match: { aggregations.str_terms.buckets.2.key: "xyz" }
+  - match: { aggregations.str_terms.buckets.2.doc_count: 5 }
+
+---
+
+"Test unmapped string terms agg with doc_count":
+  - skip:
+      version: " - 7.99.99"
+      reason: "Doc count fields are only implemented in 8.0"
+  - do:
+      bulk:
+        index: test_2
+        refresh: true
+        body:
+          - '{"index": {}}'
+          - '{"_doc_count": 10, "str": "abc" }'
+          - '{"index": {}}'
+          - '{"str": "abc" }'
+  - do:
+      search:
+        index: test_2
+        rest_total_hits_as_int: true
+        body: { "size" : 0, "aggs" : { "str_terms" : { "terms" : { "field" : "str.keyword" } } } }
+
+  - match: { hits.total: 2 }
+  - length: { aggregations.str_terms.buckets: 1 }
+  - match: { aggregations.str_terms.buckets.0.key: "abc" }
+  - match: { aggregations.str_terms.buckets.0.doc_count: 11 }
+
+---
+"Test composite str_terms agg with doc_count":
+  - skip:
+      version: " - 7.99.99"
+      reason: "Doc count fields are only implemented in 8.0"
+  - do:
+      search:
+        rest_total_hits_as_int: true
+        body: { "size" : 0, "aggs" :
+          { "composite_agg" : { "composite" :
+               {
+                 "sources": ["str_terms": { "terms": { "field": "str" } }]
+               }
+           }
+         }
+      }
+
+  - match: { hits.total: 5 }
+  - length: { aggregations.composite_agg.buckets: 3 }
+  - match: { aggregations.composite_agg.buckets.0.key.str_terms: "abc" }
+  - match: { aggregations.composite_agg.buckets.0.doc_count: 11 }
+  - match: { aggregations.composite_agg.buckets.1.key.str_terms: "foo" }
+  - match: { aggregations.composite_agg.buckets.1.doc_count: 8 }
+  - match: { aggregations.composite_agg.buckets.2.key.str_terms: "xyz" }
+  - match: { aggregations.composite_agg.buckets.2.doc_count: 5 }
+
+
+---
+"Test composite num_terms agg with doc_count":
+  - skip:
+      version: " - 7.99.99"
+      reason: "Doc count fields are only implemented in 8.0"
+  - do:
+      search:
+        rest_total_hits_as_int: true
+        body: { "size" : 0, "aggs" :
+          { "composite_agg" :
+              { "composite" :
+                {
+                  "sources": ["num_terms" : { "terms" : { "field" : "number" } }]
+                }
+            }
+          }
+        }
+
+  - match: { hits.total: 5 }
+  - length: { aggregations.composite_agg.buckets: 3 }
+  - match: { aggregations.composite_agg.buckets.0.key.num_terms: 100 }
+  - match: { aggregations.composite_agg.buckets.0.doc_count: 12 }
+  - match: { aggregations.composite_agg.buckets.1.key.num_terms: 200 }
+  - match: { aggregations.composite_agg.buckets.1.doc_count: 1 }
+  - match: { aggregations.composite_agg.buckets.2.key.num_terms: 500 }
+  - match: { aggregations.composite_agg.buckets.2.doc_count: 11 }
+
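
As a cross-check of the expected values in this test file, the fixture from the `setup` section can be replayed in a few lines of Python. This is only a sketch for reviewers, not part of the commit: `bucket_counts`, `terms_order`, and `composite_order` are hypothetical helpers, the `unmapped` field is left out, and the orderings (doc count descending for `terms`, key ascending for `composite`) are read off the assertions in the tests rather than taken from the implementation.

```python
from collections import defaultdict

# The five documents indexed by the setup section of the test.
FIXTURE = [
    {"_doc_count": 10, "str": "abc", "number": 500},
    {"_doc_count": 5, "str": "xyz", "number": 100},
    {"_doc_count": 7, "str": "foo", "number": 100},
    {"_doc_count": 1, "str": "foo", "number": 200},
    {"str": "abc", "number": 500},  # no _doc_count, so it counts as 1
]


def bucket_counts(docs, field):
    """Sum `_doc_count` (default 1) per distinct value of `field`."""
    totals = defaultdict(int)
    for doc in docs:
        totals[doc[field]] += doc.get("_doc_count", 1)
    return totals


def terms_order(docs, field):
    """Buckets as (key, doc_count), largest doc_count first.

    Ties are broken by key here; the fixture contains no ties, so that
    choice is an assumption that does not affect the result.
    """
    return sorted(bucket_counts(docs, field).items(), key=lambda kv: (-kv[1], kv[0]))


def composite_order(docs, field):
    """Buckets as (key, doc_count), ascending by key."""
    return sorted(bucket_counts(docs, field).items())


print(terms_order(FIXTURE, "number"))      # [(100, 12), (500, 11), (200, 1)]
print(terms_order(FIXTURE, "str"))         # [('abc', 11), ('foo', 8), ('xyz', 5)]
print(composite_order(FIXTURE, "number"))  # [(100, 12), (200, 1), (500, 11)]
```

Five documents are indexed, so `hits.total` stays at 5 even though the bucket doc counts sum to 24.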