Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add doc_count field mapper #64503

Merged
merged 6 commits into from
Nov 3, 2020

Conversation

csoulios
Copy link
Contributor

@csoulios csoulios commented Nov 2, 2020

Bucket aggregations compute bucket doc_count values by incrementing the doc_count by 1 for every document collected in the bucket.

When using summary fields (such as aggregate_metric_double) one field may represent more than one document. To provide this functionality we have implemented a new field mapper (named doc_count field mapper). This field is a positive integer representing the number of documents aggregated in a single summary field.

Bucket aggregations will check if a field of type doc_count exists in a document and will take this value into consideration when computing doc counts.

Note: This PR is the same as #58339 but has been opened against master branch. For the complete discussion on this PR check #58339

added tests

Build fixes

Added tests for doc_count fieldmapper

doc count tests

Resolve conflicts after merge from master

Added yaml test for doc_count field type

Minor changes to test

Fix issue with not-registering field mapper

Simplify terms agg test

Add doc_count provider in the buckets aggregator

Initialize doc_count provider once

doc_count provider is initialized once in the
buckets agg constructor.

Added tests for FieldBasedDocCountProvider

Added more tests to DocCountFieldMapper

Updated branch to fix build after merge

Added validation for single doc_count field

Added validation to ensure only a single doc_count
field exists in the mapping

Added version skips to fix broken tests

Added documentation for doc_count

Changes to address review comments:

- Minor doc change
- Added yml test that merges template with multiple doc_count
- Changed DocCountFieldType indexing to TextSearchInfo.NONE

Use _doc_count as Lucene field for doc count

Minor change: field rename

Minor change to yml test.

Fix errors from merge

Converted _doc_count to metadata field type

Throw an error if parsed value is not a number

Make _doc_count field a metadata field

Fixed broken tests

Fix bug in low cardinality ordinal terms aggs

Update docs that _doc_count is a metadata field

Fix broken ML tests

Fix errors after merge

Addressed review comments

Addressed reviewer comments

Added DocCountFieldTypeTests

Make composite agg respect _doc_count field

Cleaned up/simplified DocCountProvider class

DocCountProvider rethrows IOException instead of swallowing it

Set familyTypeName of _doc_count to integer

Revert changes to AggregatorTestCase

Doc changes
@csoulios csoulios added :Analytics/Aggregations Aggregations :Search Foundations/Mapping Index mappings, including merging and defining field types v8.0.0 labels Nov 2, 2020
@csoulios csoulios requested a review from jimczi November 2, 2020 20:06
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-analytics-geo (:Analytics/Aggregations)

@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-search (:Search/Mapping)

@elasticmachine elasticmachine added Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) Team:Search Meta label for search team labels Nov 2, 2020
@csoulios csoulios changed the title Initial version of doc_count field mapper Add doc_count field mapper Nov 2, 2020
Copy link
Contributor

@jimczi jimczi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@csoulios
Copy link
Contributor Author

csoulios commented Nov 3, 2020

@elasticmachine run elasticsearch-ci/bwc

1 similar comment
@csoulios
Copy link
Contributor Author

csoulios commented Nov 3, 2020

@elasticmachine run elasticsearch-ci/bwc

@csoulios csoulios merged commit 4dc833f into elastic:master Nov 3, 2020
@csoulios csoulios deleted the doc_count-field-mapper_master branch November 3, 2020 15:47
dimitris-athanasiou added a commit that referenced this pull request Nov 3, 2020
This field is added from version 7.11 onwards. We are
adding it to the list of ignored fields for data frame analytics
in 7.10 to avoid failing to start an outlier detection job
in a mixed cluster environment.

Relates #64503
dimitris-athanasiou added a commit to dimitris-athanasiou/elasticsearch that referenced this pull request Nov 3, 2020
dimitris-athanasiou added a commit that referenced this pull request Nov 3, 2020
dimitris-athanasiou added a commit to dimitris-athanasiou/elasticsearch that referenced this pull request Nov 3, 2020
dimitris-athanasiou added a commit that referenced this pull request Nov 3, 2020
tylersmalley pushed a commit to tylersmalley/kibana that referenced this pull request Nov 3, 2020
A doc_count field mapper was added in
elastic/elasticsearch#64503

This is currently failing the ES promotion. After this promotion, we can
go back and update this test to be an exact match on the body if we
think that is the desired assertion.

Signed-off-by: Tyler Smalley <tyler.smalley@elastic.co>
cjcenizal pushed a commit to elastic/kibana that referenced this pull request Nov 4, 2020
…82547)

* [test] Updates rollup test to allow incoming doc_count field mapper

A doc_count field mapper was added in
elastic/elasticsearch#64503

This is currently failing the ES promotion. After this promotion, we can
go back and update this test to be an exact match on the body if we
think that is the desired assertion.

Signed-off-by: Tyler Smalley <tyler.smalley@elastic.co>
cjcenizal added a commit to elastic/kibana that referenced this pull request Nov 4, 2020
…82547) (#82554)

* [test] Updates rollup test to allow incoming doc_count field mapper

A doc_count field mapper was added in
elastic/elasticsearch#64503

This is currently failing the ES promotion. After this promotion, we can
go back and update this test to be an exact match on the body if we
think that is the desired assertion.

Signed-off-by: Tyler Smalley <tyler.smalley@elastic.co>

Co-authored-by: Tyler Smalley <tyler.smalley@elastic.co>
@polyfractal polyfractal added the >new-field-mapper Added when a new field type / mapper is being introduced label Nov 4, 2020
csoulios added a commit to csoulios/elasticsearch that referenced this pull request Nov 6, 2020
After merging _doc_count PR in 7.x (elastic#64594), we can enable
the skipped ML dataframe analytics.

Those tests had been broken after merging elastic#64503 in master
csoulios added a commit that referenced this pull request Nov 6, 2020
After merging _doc_count PR in 7.x (#64594), we can enable
the skipped ML dataframe analytics.

Those tests had been broken after merging #64503 in master

long value = parser.longValue(false);
if (value <= 0) {
throw new IllegalArgumentException("Field [" + fieldType().name() + "] must be a positive integer.");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe add the value that was parsed to the error message? this would make issues easier to investigate

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jpountz thanks for the input. Now that this has been merged, I will push this change in a subsequent PR.

csoulios added a commit that referenced this pull request Dec 3, 2020
After merging _doc_count field type in v7.11.0 (#64594), this PR lowers the minimum compatibility
version from v8.0.0 to v7.11.0

Relates to #64503
csoulios added a commit that referenced this pull request Dec 3, 2020
After merging _doc_count field type in v7.11.0 (#64594), this PR lowers the minimum compatibility
version from v8.0.0 to v7.11.0

Relates to #64503
csoulios added a commit that referenced this pull request Dec 3, 2020
A while back, Lucene introduced the ability to index custom term frequencies, ie. giving users 
the ability to provide a numeric value that should be indexed as a term frequency rather than 
letting Lucene compute the term frequency by itself based on the number of occurrences of 
a term.

This PR modifies the _doc_count field so that it is stored as Lucene custom term frequency.

A benefit of moving to custom term frequencies is that Lucene will automatically compute global term 
statistics like totalTermFreq which will let us know the sum of the values of the _doc_count field across 
an entire shard. This could in-turn be useful to generalize optimizations to rollup indices,
 e.g. buckets aggregations where all documents fall into the same bucket.

Relates to #64503
csoulios added a commit that referenced this pull request Dec 3, 2020
A while back, Lucene introduced the ability to index custom term frequencies, ie. giving users 
the ability to provide a numeric value that should be indexed as a term frequency rather than 
letting Lucene compute the term frequency by itself based on the number of occurrences of 
a term.

This PR modifies the _doc_count field so that it is stored as Lucene custom term frequency.

A benefit of moving to custom term frequencies is that Lucene will automatically compute global term 
statistics like totalTermFreq which will let us know the sum of the values of the _doc_count field across 
an entire shard. This could in-turn be useful to generalize optimizations to rollup indices,
 e.g. buckets aggregations where all documents fall into the same bucket.

Relates to #64503
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Analytics/Aggregations Aggregations >feature >new-field-mapper Added when a new field type / mapper is being introduced :Search Foundations/Mapping Index mappings, including merging and defining field types Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) Team:Search Meta label for search team v7.11.0 v8.0.0-alpha1
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants