-
Notifications
You must be signed in to change notification settings - Fork 25k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add doc_count field mapper #64503
Add doc_count field mapper #64503
Conversation
added tests Build fixes Added tests for doc_count fieldmapper doc count tests Resolve conflicts after merge from master Added yaml test for doc_count field type Minor changes to test Fix issue with not-registering field mapper Simplify terms agg test Add doc_count provider in the buckets aggregator Initialize doc_count provider once doc_count provider is initialized once in the buckets agg constructor. Added tests for FieldBasedDocCountProvider Added more tests to DocCountFieldMapper Updated branch to fix build after merge Added validation for single doc_count field Added validation to ensure only a single doc_count field exists in the mapping Added version skips to fix broken tests Added documentation for doc_count Changes to address review comments: - Minor doc change - Added yml test that merges template with multiple doc_count - Changed DocCountFieldType indexing to TextSearchInfo.NONE Use _doc_count as Lucene field for doc count Minor change: field rename Minor change to yml test. Fix errors from merge Converted _doc_count to metadata field type Throw an error if parsed value is not a number Make _doc_count field a metadata field Fixed broken tests Fix bug in low cardinality ordinal terms aggs Update docs that _doc_count is a metadata field Fix broken ML tests Fix errors after merge Addressed review comments Addressed reviewer comments Added DocCountFieldTypeTests Make composite agg respect _doc_count field Cleaned up/simplified DocCountProvider class DocCountProvider rethrows IOException instead of swallowing it Set familyTypeName of _doc_count to integer Revert changes to AggregatorTestCase Doc changes
Pinging @elastic/es-analytics-geo (:Analytics/Aggregations) |
Pinging @elastic/es-search (:Search/Mapping) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
@elasticmachine run elasticsearch-ci/bwc |
1 similar comment
@elasticmachine run elasticsearch-ci/bwc |
This field is added from version 7.11 onwards. We are adding it to the list of ignored fields for data frame analytics in 7.10 to avoid failing to start an outlier detection job in a mixed cluster environment. Relates #64503
Because of elastic#64503. This adds to elastic#64540.
A doc_count field mapper was added in elastic/elasticsearch#64503 This is currently failing the ES promotion. After this promotion, we can go back and update this test to be an exact match on the body if we think that is the desired assertion. Signed-off-by: Tyler Smalley <tyler.smalley@elastic.co>
…82547) * [test] Updates rollup test to allow incoming doc_count field mapper A doc_count field mapper was added in elastic/elasticsearch#64503 This is currently failing the ES promotion. After this promotion, we can go back and update this test to be an exact match on the body if we think that is the desired assertion. Signed-off-by: Tyler Smalley <tyler.smalley@elastic.co>
…82547) (#82554) * [test] Updates rollup test to allow incoming doc_count field mapper A doc_count field mapper was added in elastic/elasticsearch#64503 This is currently failing the ES promotion. After this promotion, we can go back and update this test to be an exact match on the body if we think that is the desired assertion. Signed-off-by: Tyler Smalley <tyler.smalley@elastic.co> Co-authored-by: Tyler Smalley <tyler.smalley@elastic.co>
After merging _doc_count PR in 7.x (elastic#64594), we can enable the skipped ML dataframe analytics. Those tests had been broken after merging elastic#64503 in master
|
||
long value = parser.longValue(false); | ||
if (value <= 0) { | ||
throw new IllegalArgumentException("Field [" + fieldType().name() + "] must be a positive integer."); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe add the value that was parsed to the error message? this would make issues easier to investigate
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jpountz thanks for the input. Now that this has been merged, I will push this change in a subsequent PR.
A while back, Lucene introduced the ability to index custom term frequencies, ie. giving users the ability to provide a numeric value that should be indexed as a term frequency rather than letting Lucene compute the term frequency by itself based on the number of occurrences of a term. This PR modifies the _doc_count field so that it is stored as Lucene custom term frequency. A benefit of moving to custom term frequencies is that Lucene will automatically compute global term statistics like totalTermFreq which will let us know the sum of the values of the _doc_count field across an entire shard. This could in-turn be useful to generalize optimizations to rollup indices, e.g. buckets aggregations where all documents fall into the same bucket. Relates to #64503
A while back, Lucene introduced the ability to index custom term frequencies, ie. giving users the ability to provide a numeric value that should be indexed as a term frequency rather than letting Lucene compute the term frequency by itself based on the number of occurrences of a term. This PR modifies the _doc_count field so that it is stored as Lucene custom term frequency. A benefit of moving to custom term frequencies is that Lucene will automatically compute global term statistics like totalTermFreq which will let us know the sum of the values of the _doc_count field across an entire shard. This could in-turn be useful to generalize optimizations to rollup indices, e.g. buckets aggregations where all documents fall into the same bucket. Relates to #64503
Bucket aggregations compute bucket
doc_count
values by incrementing thedoc_count
by 1 for every document collected in the bucket.When using summary fields (such as
aggregate_metric_double
) one field may represent more than one document. To provide this functionality we have implemented a new field mapper (nameddoc_count
field mapper). This field is a positive integer representing the number of documents aggregated in a single summary field.Bucket aggregations will check if a field of type
doc_count
exists in a document and will take this value into consideration when computing doc counts.Note: This PR is the same as #58339 but has been opened against master branch. For the complete discussion on this PR check #58339