[ML] add new bucket_correlation aggregation with initial count_correlation function #72133

Merged · 9 commits · May 10, 2021
@@ -0,0 +1,139 @@
[role="xpack"]
[testenv="basic"]
[[search-aggregations-bucket-correlation-aggregation]]
=== Bucket Correlation Aggregation
++++
<titleabbrev>Bucket Correlation Aggregation</titleabbrev>
Contributor comment: As above.

Suggested change:
-<titleabbrev>Bucket Correlation Aggregation</titleabbrev>
+<titleabbrev>Bucket correlation aggregation</titleabbrev>

++++

A sibling pipeline aggregation which executes a correlation function on the
configured sibling multi-bucket aggregation.


[[bucket-correlation-agg-syntax]]
==== Parameters

`buckets_path`::
(Required, string)
Path to the buckets that contain one set of values to correlate.
For syntax, see <<buckets-path-syntax>>.

`function`::
(Required, object)
The correlation function to execute.
+
.Properties of `function`
[%collapsible%open]
====
`count_correlation`:::
(Required^*^, object)
The configuration to calculate a count correlation. This function is designed to determine the correlation
of a term value with a given metric. Consequently, there are some restrictions:

* The `buckets_path` must point to a `_count` metric.
* The total count of all the `buckets_path` count values must be less than or equal to `indicator.doc_count`.
* Before using this function, an initial calculation must be performed to gather the required `indicator` values.

.Properties of `count_correlation`
[%collapsible%open]
=====
`indicator`:::
(Required, object)
The indicator with which to correlate the configured `buckets_path` values.

.Properties of `indicator`
[%collapsible%open]
======
`expectations`:::
(Required, array)
An array of numbers with which to correlate the configured `buckets_path` values. The length of this value must always equal
the number of buckets returned by the `buckets_path`.

`fractions`:::
(Optional, array)
An array of fractions to use when averaging and calculating variance. Use this if the pre-calculated data and the
`buckets_path` have known gaps. The length of `fractions`, if provided, must equal the length of `expectations`.

`doc_count`:::
(Required, integer)
The total number of documents that initially created the `expectations`. This value must be greater than or equal to the sum
of all values in the `buckets_path`, as it is the originating superset of data to which the term values are correlated.
======
=====
====
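Conceptually, the function correlates the per-bucket count values with the `indicator.expectations`, optionally weighted by `fractions`. The following Python sketch is an illustration only (a weighted Pearson-style correlation under assumed semantics, not the exact Elasticsearch implementation; the function name and weighting scheme are assumptions):

```python
import math

def count_correlation(counts, expectations, doc_count, fractions=None):
    # Illustrative only: the real count_correlation function may use a
    # different formula internally.
    n = len(counts)
    if len(expectations) != n:
        raise ValueError("expectations length must equal the number of buckets")
    if sum(counts) > doc_count:
        raise ValueError("total count must be <= indicator.doc_count")
    # uniform weights unless fractions are supplied for known gaps
    fractions = fractions if fractions is not None else [1.0] * n
    total = sum(fractions)
    weights = [f / total for f in fractions]
    mean_c = sum(w * c for w, c in zip(weights, counts))
    mean_e = sum(w * e for w, e in zip(weights, expectations))
    cov = sum(w * (c - mean_c) * (e - mean_e)
              for w, c, e in zip(weights, counts, expectations))
    var_c = sum(w * (c - mean_c) ** 2 for w, c in zip(weights, counts))
    var_e = sum(w * (e - mean_e) ** 2 for w, e in zip(weights, expectations))
    if var_c == 0.0 or var_e == 0.0:
        return 0.0
    return cov / math.sqrt(var_c * var_e)
```

Counts that rise and fall in step with the expectations yield a value near 1, inverse movement yields a value near -1.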

==== Syntax

A `bucket_correlation` aggregation looks like this in isolation:

[source,js]
--------------------------------------------------
{
  "bucket_correlation": {
    "buckets_path": "range_values>_count", <1>
    "function": {
      "count_correlation": { <2>
        "indicator": {
          "expectations": [...],
          "doc_count": 10000
        }
      }
    }
  }
}
--------------------------------------------------
// NOTCONSOLE
<1> The buckets containing the values to correlate against.
<2> The correlation function definition.


[[bucket-correlation-agg-example]]
==== Example

The following snippet correlates the individual terms in the field `service.version.keyword` with the `latency` metric. Not shown
is the pre-calculation of the `latency` indicator values, which was done utilizing the
<<search-aggregations-metrics-percentile-aggregation,percentiles>> aggregation.

Since this example has 50 individual range buckets and 50 expectation values, they are elided for brevity.
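The elided `ranges` and `expectations` could be derived from the pre-computed latency percentiles. The following Python helper is one plausible way to build them, not the documented procedure; the helper name and the midpoint choice for expectations are assumptions for illustration:

```python
def build_ranges_and_expectations(percentile_values):
    """Build keyed range buckets and per-range expectation values from a
    sorted list of pre-computed percentile values (e.g. latency percentiles).
    Illustrative only: adapt to your actual percentiles aggregation output."""
    ranges = []
    expectations = []
    lower = None
    for value in percentile_values:
        bucket = {"to": value}
        if lower is not None:
            bucket["from"] = lower
        ranges.append(bucket)
        # use the range midpoint as the expected value for that bucket
        expectations.append(value if lower is None else (lower + value) / 2)
        lower = value
    return ranges, expectations
```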

[source,js]
-------------------------------------------------
GET apm-7.12.0-transaction-generated/_search
{
  "size": 0,
  "aggs": {
    "field_terms": {
      "terms": {
        "field": "service.version.keyword", <1>
        "size": 20
      },
      "aggs": {
        "latency_range": {
          "range": { <2>
            "field": "transaction.duration.us",
            "ranges": [...],
            "keyed": true
          }
        },
        "correlation": {
          "bucket_correlation": { <3>
            "buckets_path": "latency_range>_count",
            "function": {
              "count_correlation": {
                "indicator": {
                  "expectations": [...],
                  "doc_count": 20000
                }
              }
            }
          }
        }
      }
    }
  }
}
-------------------------------------------------
// NOTCONSOLE
Member comment: We should make this a fully real example, I think. It'd be a pain to make the setup for it, but without that we can't be sure it works.

@benwtrent (Member, Author) replied on May 6, 2021: @nik9000 there is a non-doc integration test that covers this. I can attempt to do a set up, but it's going to take a bit to generate data and write out the 50 ranges.

Member reply: It's important that we test these so they don't break eventually. I can't tell you the number of times I've broken stuff in the docs without noticing it. I mean, since we made the tests it's much rarer, but I still can't tell you.

You can totally use the setup stuff in docs/build.gradle - over there you can write for loops and stuff to emit values.

If we have a response on the page, we should assert that it came from a request on the page - but it's totally OK to use stuff like ... and filter_path to shrink it. No one is going to read a huge response anyway.

<1> The term buckets containing a range aggregation and the bucket correlation aggregation. Both are used to calculate
the correlation of the term values with the latency.
<2> The range aggregation on the latency field. The ranges were created referencing the percentiles of the latency field.
<3> The bucket correlation aggregation that calculates the correlation of the number of term values within each range
and the previously calculated indicator values.
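Each term bucket receives a single numeric correlation result under the aggregation's name (as a `value` field). A hedged Python sketch of post-processing such a response follows; the response shape mirrors the request above, and the version keys and values are illustrative, not from a real run:

```python
def rank_terms_by_correlation(response):
    """Sort the term buckets of a bucket_correlation response by their
    correlation value, highest first. Assumes the aggregation names used
    in the example request above ("field_terms", "correlation")."""
    buckets = response["aggregations"]["field_terms"]["buckets"]
    ranked = [(b["key"], b["correlation"]["value"]) for b in buckets]
    return sorted(ranked, key=lambda kv: kv[1], reverse=True)

# Illustrative response fragment, not produced by a real run.
example_response = {
    "aggregations": {
        "field_terms": {
            "buckets": [
                {"key": "7.11.2", "correlation": {"value": 0.12}},
                {"key": "7.12.0", "correlation": {"value": 0.94}},
            ]
        }
    }
}

print(rank_terms_by_correlation(example_response))
```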
Contributor comment: I think we should also have an example response. It seems to be present for most other aggs.

Member Author reply: See below :)

@@ -53,7 +53,7 @@
import org.elasticsearch.xpack.core.ml.inference.trainedmodel.tree.Tree;
import org.elasticsearch.xpack.core.ml.inference.trainedmodel.tree.TreeNode;
import org.elasticsearch.xpack.core.ml.job.config.JobState;
-import org.elasticsearch.xpack.ml.inference.aggs.InferencePipelineAggregationBuilder;
+import org.elasticsearch.xpack.ml.aggs.inference.InferencePipelineAggregationBuilder;
import org.elasticsearch.xpack.ml.inference.loadingservice.ModelLoadingService;
import org.elasticsearch.xpack.ml.support.BaseMlIntegTestCase;
import org.junit.Before;
@@ -6,7 +6,6 @@
*/
package org.elasticsearch.xpack.ml.integration;

-import static java.util.Collections.emptyList;
import static org.elasticsearch.xpack.core.ClientHelper.ML_ORIGIN;
import static org.hamcrest.Matchers.containsString;

@@ -27,14 +26,10 @@
import org.elasticsearch.cluster.service.MasterService;
import org.elasticsearch.common.settings.ClusterSettings;
import org.elasticsearch.common.settings.Settings;
-import org.elasticsearch.common.xcontent.NamedXContentRegistry;
import org.elasticsearch.common.xcontent.XContentType;
-import org.elasticsearch.search.SearchModule;
import org.elasticsearch.threadpool.ThreadPool;
import org.elasticsearch.xpack.core.ml.action.PutJobAction;
import org.elasticsearch.xpack.core.ml.action.UpdateJobAction;
-import org.elasticsearch.xpack.core.ml.dataframe.analyses.MlDataFrameAnalysisNamedXContentProvider;
-import org.elasticsearch.xpack.core.ml.inference.MlInferenceNamedXContentProvider;
import org.elasticsearch.xpack.core.ml.job.config.AnalysisConfig;
import org.elasticsearch.xpack.core.ml.job.config.AnalysisLimits;
import org.elasticsearch.xpack.core.ml.job.config.DataDescription;
@@ -181,12 +176,4 @@ private AnalysisConfig.Builder createAnalysisConfig(String byFieldName) {
return new AnalysisConfig.Builder(Collections.singletonList(detector.build()));
}

-    @Override
-    public NamedXContentRegistry xContentRegistry() {
-        List<NamedXContentRegistry.Entry> namedXContent = new ArrayList<>();
-        namedXContent.addAll(new MlDataFrameAnalysisNamedXContentProvider().getNamedXContentParsers());
-        namedXContent.addAll(new MlInferenceNamedXContentProvider().getNamedXContentParsers());
-        namedXContent.addAll(new SearchModule(Settings.EMPTY, emptyList()).getNamedXContents());
-        return new NamedXContentRegistry(namedXContent);
-    }

Contributor comment: Lots of files are changed in this PR because of the refactoring of pulling those into the super class. I would really appreciate splitting this refactoring into its own PR that can be merged before this one does. It would help keep this one clean and readable.
}