Add doc_count field mapper (#64503)
Bucket aggregations compute bucket doc_count values by incrementing the doc_count by 1 for every document collected in the bucket.

When using summary fields (such as aggregate_metric_double), one field may represent more than one document. To count such documents correctly, we have implemented a new field mapper (named doc_count field mapper). This field is a positive integer representing the number of documents aggregated in a single summary field.

Bucket aggregations will check if a field of type doc_count exists in a document and will take this value into consideration when computing doc counts.
csoulios authored Nov 3, 2020
1 parent 4add5cb commit 4dc833f
Showing 22 changed files with 786 additions and 63 deletions.
9 changes: 8 additions & 1 deletion docs/reference/mapping/fields.asciidoc
@@ -29,6 +29,13 @@ fields can be customized when a mapping is created.
The size of the `_source` field in bytes, provided by the
{plugins}/mapper-size.html[`mapper-size` plugin].

[discrete]
=== Doc count metadata field

<<mapping-doc-count-field,`_doc_count`>>::

A custom field used for storing doc counts when a document represents pre-aggregated data.

[discrete]
=== Indexing metadata fields

@@ -55,6 +62,7 @@ fields can be customized when a mapping is created.

Application specific metadata.

include::fields/doc-count-field.asciidoc[]

include::fields/field-names-field.asciidoc[]

@@ -69,4 +77,3 @@ include::fields/meta-field.asciidoc[]
include::fields/routing-field.asciidoc[]

include::fields/source-field.asciidoc[]

118 changes: 118 additions & 0 deletions docs/reference/mapping/fields/doc-count-field.asciidoc
@@ -0,0 +1,118 @@
[[mapping-doc-count-field]]
=== `_doc_count` field
++++
<titleabbrev>_doc_count</titleabbrev>
++++

Bucket aggregations always return a field named `doc_count` that shows the number of documents aggregated and partitioned
into each bucket. Computing the value of `doc_count` is simple: it is incremented by 1 for every document collected
in the bucket.

While this simple approach is effective when computing aggregations over individual documents, it fails to accurately represent
documents that store pre-aggregated data (such as `histogram` or `aggregate_metric_double` fields), because one summary field may
represent multiple documents.

To allow for correct computation of the number of documents when working with pre-aggregated data, we have introduced a
metadata field type named `_doc_count`. `_doc_count` must always be a positive integer representing the number of documents
aggregated in a single summary field.

When field `_doc_count` is added to a document, all bucket aggregations will respect its value and increment the bucket `doc_count`
by the value of the field. If a document does not contain any `_doc_count` field, `_doc_count = 1` is implied by default.

[IMPORTANT]
========
* A `_doc_count` field can only store a single positive integer per document. Nested arrays are not allowed.
* If a document contains no `_doc_count` fields, aggregators will increment by 1, which is the default behavior.
========
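
The counting rule described above can be sketched outside Elasticsearch. The Python below is only an illustration, not the mapper's actual implementation: the function name `bucket_doc_counts` and the sample documents are hypothetical, and the validation simply mirrors the "single positive integer" constraint stated above.

```python
from collections import defaultdict


def bucket_doc_counts(docs, field):
    """Group documents by `field`, weighting each one by `_doc_count`.

    Hypothetical helper: a document without `_doc_count` counts as 1,
    matching the default behavior described in the text.
    """
    counts = defaultdict(int)
    for doc in docs:
        if field not in doc:
            continue  # a document without the bucketing field joins no bucket
        weight = doc.get("_doc_count", 1)
        # `_doc_count` must be a single positive integer (bool is excluded
        # explicitly because it is a subclass of int in Python).
        if isinstance(weight, bool) or not isinstance(weight, int) or weight <= 0:
            raise ValueError("_doc_count must be a positive integer")
        counts[doc[field]] += weight
    return dict(counts)


# Illustrative documents (not taken from the commit).
docs = [
    {"title": "a", "_doc_count": 45},
    {"title": "b", "_doc_count": 62},
    {"title": "a"},  # no `_doc_count`, so it contributes 1
]

print(bucket_doc_counts(docs, "title"))  # {'a': 46, 'b': 62}
```

Bucket `a` receives 45 from the pre-aggregated document plus 1 from the plain document, which is the behavior the two bullet points above describe.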

[[mapping-doc-count-field-example]]
==== Example

The following <<indices-create-index, create index>> API request creates a new index with the following field mappings:

* `my_histogram`, a `histogram` field used to store percentile data
* `my_text`, a `keyword` field used to store a title for the histogram

[source,console]
--------------------------------------------------
PUT my_index
{
  "mappings" : {
    "properties" : {
      "my_histogram" : {
        "type" : "histogram"
      },
      "my_text" : {
        "type" : "keyword"
      }
    }
  }
}
--------------------------------------------------

The following <<docs-index_,index>> API requests store pre-aggregated data for
two histograms: `histogram_1` and `histogram_2`.

[source,console]
--------------------------------------------------
PUT my_index/_doc/1
{
  "my_text" : "histogram_1",
  "my_histogram" : {
      "values" : [0.1, 0.2, 0.3, 0.4, 0.5],
      "counts" : [3, 7, 23, 12, 6]
   },
  "_doc_count": 45 <1>
}

PUT my_index/_doc/2
{
  "my_text" : "histogram_2",
  "my_histogram" : {
      "values" : [0.1, 0.25, 0.35, 0.4, 0.45, 0.5],
      "counts" : [8, 17, 8, 7, 6, 2]
   },
  "_doc_count": 62 <1>
}
--------------------------------------------------
<1> Field `_doc_count` must be a positive integer storing the number of documents aggregated to produce each histogram.

If we run the following <<search-aggregations-bucket-terms-aggregation, terms aggregation>> on `my_index`:

[source,console]
--------------------------------------------------
GET /_search
{
    "aggs" : {
        "histogram_titles" : {
            "terms" : { "field" : "my_text" }
        }
    }
}
--------------------------------------------------

We will get the following response:

[source,console-result]
--------------------------------------------------
{
    ...
    "aggregations" : {
        "histogram_titles" : {
            "doc_count_error_upper_bound": 0,
            "sum_other_doc_count": 0,
            "buckets" : [
                {
                    "key" : "histogram_2",
                    "doc_count" : 62
                },
                {
                    "key" : "histogram_1",
                    "doc_count" : 45
                }
            ]
        }
    }
}
--------------------------------------------------
// TESTRESPONSE[skip:test not setup]
@@ -0,0 +1,150 @@
setup:
  - do:
      indices.create:
        index: test_1
        body:
          settings:
            number_of_replicas: 0
          mappings:
            properties:
              str:
                type: keyword
              number:
                type: integer

  - do:
      bulk:
        index: test_1
        refresh: true
        body:
          - '{"index": {}}'
          - '{"_doc_count": 10, "str": "abc", "number" : 500, "unmapped": "abc" }'
          - '{"index": {}}'
          - '{"_doc_count": 5, "str": "xyz", "number" : 100, "unmapped": "xyz" }'
          - '{"index": {}}'
          - '{"_doc_count": 7, "str": "foo", "number" : 100, "unmapped": "foo" }'
          - '{"index": {}}'
          - '{"_doc_count": 1, "str": "foo", "number" : 200, "unmapped": "foo" }'
          - '{"index": {}}'
          - '{"str": "abc", "number" : 500, "unmapped": "abc" }'

---
"Test numeric terms agg with doc_count":
  - skip:
      version: " - 7.99.99"
      reason: "Doc count fields are only implemented in 8.0"

  - do:
      search:
        rest_total_hits_as_int: true
        body: { "size" : 0, "aggs" : { "num_terms" : { "terms" : { "field" : "number" } } } }

  - match: { hits.total: 5 }
  - length: { aggregations.num_terms.buckets: 3 }
  - match: { aggregations.num_terms.buckets.0.key: 100 }
  - match: { aggregations.num_terms.buckets.0.doc_count: 12 }
  - match: { aggregations.num_terms.buckets.1.key: 500 }
  - match: { aggregations.num_terms.buckets.1.doc_count: 11 }
  - match: { aggregations.num_terms.buckets.2.key: 200 }
  - match: { aggregations.num_terms.buckets.2.doc_count: 1 }


---
"Test keyword terms agg with doc_count":
  - skip:
      version: " - 7.99.99"
      reason: "Doc count fields are only implemented in 8.0"
  - do:
      search:
        rest_total_hits_as_int: true
        body: { "size" : 0, "aggs" : { "str_terms" : { "terms" : { "field" : "str" } } } }

  - match: { hits.total: 5 }
  - length: { aggregations.str_terms.buckets: 3 }
  - match: { aggregations.str_terms.buckets.0.key: "abc" }
  - match: { aggregations.str_terms.buckets.0.doc_count: 11 }
  - match: { aggregations.str_terms.buckets.1.key: "foo" }
  - match: { aggregations.str_terms.buckets.1.doc_count: 8 }
  - match: { aggregations.str_terms.buckets.2.key: "xyz" }
  - match: { aggregations.str_terms.buckets.2.doc_count: 5 }

---
"Test unmapped string terms agg with doc_count":
  - skip:
      version: " - 7.99.99"
      reason: "Doc count fields are only implemented in 8.0"
  - do:
      bulk:
        index: test_2
        refresh: true
        body:
          - '{"index": {}}'
          - '{"_doc_count": 10, "str": "abc" }'
          - '{"index": {}}'
          - '{"str": "abc" }'
  - do:
      search:
        index: test_2
        rest_total_hits_as_int: true
        body: { "size" : 0, "aggs" : { "str_terms" : { "terms" : { "field" : "str.keyword" } } } }

  - match: { hits.total: 2 }
  - length: { aggregations.str_terms.buckets: 1 }
  - match: { aggregations.str_terms.buckets.0.key: "abc" }
  - match: { aggregations.str_terms.buckets.0.doc_count: 11 }

---
"Test composite str_terms agg with doc_count":
  - skip:
      version: " - 7.99.99"
      reason: "Doc count fields are only implemented in 8.0"
  - do:
      search:
        rest_total_hits_as_int: true
        body: { "size" : 0, "aggs" :
          { "composite_agg" : { "composite" :
            {
              "sources": ["str_terms": { "terms": { "field": "str" } }]
            }
          }
          }
        }

  - match: { hits.total: 5 }
  - length: { aggregations.composite_agg.buckets: 3 }
  - match: { aggregations.composite_agg.buckets.0.key.str_terms: "abc" }
  - match: { aggregations.composite_agg.buckets.0.doc_count: 11 }
  - match: { aggregations.composite_agg.buckets.1.key.str_terms: "foo" }
  - match: { aggregations.composite_agg.buckets.1.doc_count: 8 }
  - match: { aggregations.composite_agg.buckets.2.key.str_terms: "xyz" }
  - match: { aggregations.composite_agg.buckets.2.doc_count: 5 }


---
"Test composite num_terms agg with doc_count":
  - skip:
      version: " - 7.99.99"
      reason: "Doc count fields are only implemented in 8.0"
  - do:
      search:
        rest_total_hits_as_int: true
        body: { "size" : 0, "aggs" :
          { "composite_agg" :
            { "composite" :
              {
                "sources": ["num_terms" : { "terms" : { "field" : "number" } }]
              }
            }
          }
        }

  - match: { hits.total: 5 }
  - length: { aggregations.composite_agg.buckets: 3 }
  - match: { aggregations.composite_agg.buckets.0.key.num_terms: 100 }
  - match: { aggregations.composite_agg.buckets.0.doc_count: 12 }
  - match: { aggregations.composite_agg.buckets.1.key.num_terms: 200 }
  - match: { aggregations.composite_agg.buckets.1.doc_count: 1 }
  - match: { aggregations.composite_agg.buckets.2.key.num_terms: 500 }
  - match: { aggregations.composite_agg.buckets.2.doc_count: 11 }
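
As a cross-check on the expected values in the tests above, the arithmetic can be reproduced by hand. The Python sketch below is not part of the commit: the helper names are hypothetical, and the two ordering functions assume the default bucket orders (a `terms` aggregation sorts by `doc_count` descending with ties broken by ascending key, while a `composite` aggregation returns buckets in ascending key order).

```python
from collections import Counter

# The five documents from the `setup` bulk request (unmapped field omitted).
DOCS = [
    {"_doc_count": 10, "str": "abc", "number": 500},
    {"_doc_count": 5, "str": "xyz", "number": 100},
    {"_doc_count": 7, "str": "foo", "number": 100},
    {"_doc_count": 1, "str": "foo", "number": 200},
    {"str": "abc", "number": 500},  # no `_doc_count`, counts as 1
]


def weighted_counts(docs, field):
    """Sum `_doc_count` (default 1) per distinct value of `field`."""
    counts = Counter()
    for doc in docs:
        counts[doc[field]] += doc.get("_doc_count", 1)
    return counts


def terms_order(counts):
    """Assumed default `terms` order: doc_count descending, then key ascending."""
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))


def composite_order(counts):
    """Assumed default `composite` order: key ascending."""
    return sorted(counts.items())


by_number = weighted_counts(DOCS, "number")
by_str = weighted_counts(DOCS, "str")

print(terms_order(by_number))      # [(100, 12), (500, 11), (200, 1)]
print(composite_order(by_number))  # [(100, 12), (200, 1), (500, 11)]
print(terms_order(by_str))         # [('abc', 11), ('foo', 8), ('xyz', 5)]
```

Note that `hits.total` stays at 5 in every test: `_doc_count` changes the bucket `doc_count` values only, not the number of matching documents.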
