
[ML] add new bucket_correlation aggregation with initial count_correlation function #72133

Merged
merged 9 commits into elastic:master on May 10, 2021

Conversation

Member

@benwtrent benwtrent commented Apr 22, 2021

This commit adds a new pipeline aggregation that allows correlating bucketed values within the aggregation framework.

The initial function is a `count_correlation` function. Its purpose is to correlate the count in a consistent number of buckets with a pre-calculated indicator. The indicator and the aggregated buckets should relate to the same metrics within the documents.

Example: correlating terms within `service.version.keyword` with latency percentiles. The percentiles and the provided correlation indicator both refer to the same source data from which the indicator was previously calculated:

```
GET apm-7.12.0-transaction-generated/_search
{
  "size": 0,
  "aggs": {
    "field_terms": {
      "terms": {
        "field": "service.version.keyword",
        "size": 20
      },
      "aggs": {
        "latency_range": {
          "range": {
            "field": "transaction.duration.us",
            "ranges": [<snip>],
            "keyed": true
          }
        },
        "correlation": {
          "bucket_correlation": {
            "buckets_path": "latency_range>_count",
            "count_correlation": {
              "indicator": {
                 "expectations": [<snip>],
                 "doc_count": 20000
               }
            }
          }
        }
      }
    }
  }
}
```
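In rough terms, the function asks how well each term's per-range doc counts track the pre-computed indicator. A minimal Python sketch of that idea (an illustration with made-up numbers, not the actual implementation, which also weights by the indicator's `doc_count`):

```python
import math

def pearson(xs, ys):
    """Plain Pearson correlation between two equal-length series."""
    n = len(xs)
    x_mean, y_mean = sum(xs) / n, sum(ys) / n
    cov = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys)) / n
    x_var = sum((x - x_mean) ** 2 for x in xs) / n
    y_var = sum((y - y_mean) ** 2 for y in ys) / n
    return cov / math.sqrt(x_var * y_var)

# Hypothetical data: the indicator's expectation per latency range,
# and one term bucket's doc counts over the same ranges.
expectations = [10.0, 20.0, 40.0, 80.0]
counts = [5, 11, 19, 41]

score = pearson(expectations, counts)  # close to 1.0: counts track the indicator
```

A term whose counts are concentrated in ranges the indicator flags as interesting scores near 1.0; an unrelated term scores near 0.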

@benwtrent benwtrent force-pushed the feature/ml-bucket-correlation-agg branch 2 times, most recently from 4cadeee to 0f5bf18 on April 23, 2021 18:19
@benwtrent benwtrent marked this pull request as ready for review April 23, 2021 18:19
@elasticmachine elasticmachine added the Team:ML label Apr 23, 2021
@elasticmachine
Collaborator

Pinging @elastic/ml-core (Team:ML)

@benwtrent benwtrent force-pushed the feature/ml-bucket-correlation-agg branch from 0f5bf18 to 0c1970a on April 23, 2021 18:25
@benwtrent benwtrent force-pushed the feature/ml-bucket-correlation-agg branch from 0c1970a to 5216280 on April 23, 2021 20:10
Contributor

@szabosteve szabosteve left a comment

Left a couple of minor suggestions.

[[search-aggregations-bucket-correlation-aggregation]]
=== Bucket Correlation Aggregation
++++
<titleabbrev>Bucket Correlation Aggregation</titleabbrev>
Contributor

As above.

Suggested change
<titleabbrev>Bucket Correlation Aggregation</titleabbrev>
<titleabbrev>Bucket correlation aggregation</titleabbrev>

@benwtrent
Member Author

@elasticmachine update branch

Contributor

@szabosteve szabosteve left a comment

Docs are LGTM! Thanks for writing them!

@nik9000 nik9000 self-requested a review May 5, 2021 13:41
Contributor

@dimitris-athanasiou dimitris-athanasiou left a comment

Good stuff! There are a few things to work through after my first pass.

the correlation of the term values with the latency.
<2> The range aggregation on the latency field. The ranges were created referencing the percentiles of the latency field.
<3> The bucket correlation aggregation that calculates the correlation of the number of term values within each range
and the previously calculated indicator values.
Contributor

I think we should also have an example response. It seems to be present for most other aggs.

Member Author

See below :)

for (BulkItemResponse itemResponse : bulkResponse) {
    if (itemResponse.isFailed()) {
        failures++;
        logger.error("Item response failure [{}]", itemResponse.getFailureMessage());
Contributor

Not sure we should log for each failure here. I think you could just log bulkResponse.buildFailureMessage() to have a single log entry.

@@ -1089,7 +1091,14 @@ protected Clock getClock() {
(parser, name) -> InferencePipelineAggregationBuilder.parse(modelLoadingService, getLicenseState(), name, parser));
spec.addResultReader(InternalInferenceAggregation::new);

return Collections.singletonList(spec);
return Arrays.asList(
Contributor

As this is the first time we're adding a second agg, I think it'd be nice to make this list here read nice and simple. How about we create private methods that create the spec for each one so the list in this method reads nice?


void validate(PipelineAggregationBuilder.ValidationContext context, String bucketPath);

static double sum(double[] xs) {
Contributor

We might want to use MovingFunctions.sum instead. It also deals with NaN. It's worth checking it out.
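For reference, the NaN handling being suggested amounts to skipping non-finite entries rather than letting a single NaN poison the total. A hedged sketch of that behavior (check `MovingFunctions.sum` itself for the exact semantics):

```python
import math

def nan_safe_sum(values):
    """Sum the values, skipping NaN entries instead of propagating them."""
    return sum(v for v in values if not math.isnan(v))

nan_safe_sum([1.0, float("nan"), 2.5])  # the NaN is ignored
```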


@Override
public int hashCode() {
int result = Objects.hash(docCount);
Contributor

Objects.hash(docCount, Arrays.hashCode(expectations), Arrays.hashCode(fractions)) seems to be a good way to do this without the 31s.

Member Author

This is simply the IntelliJ autogenerated one. I can change it :)

@Override
protected void validate(ValidationContext context) {
    final String firstAgg = bucketsPaths[0].split("[>\\.]")[0];
Contributor

The fact that the validation here is the same as in BucketMetricsPipelineAggregationBuilder made me wonder whether the bucket correlation agg should extend that class. I didn't look further if that's possible but I think it's worth checking it out.

Member Author

BucketMetricsPipelineAggregationBuilder

This class supports things we don't want to support, like user-provided formats and gap policies. I could override those methods to prevent them from being set (it can also be prevented in the parser).

I will see if extending that class will work.

Member Author

IIRC, I did not inherit from that class because of an initial design choice. But that has since changed. So, hopefully, it works nicely

Member Author

@dimitris-athanasiou I think to do this, I would need to do a refactoring of BucketMetricsPipelineAggregationBuilder to either allow format and gapPolicy to be accessed directly in the sub-class, or that all accesses go through their respective getters.

Contributor

Could we not pass in the zero gap policy?

Member

I'd feel more comfortable merging with the similar code and performing a more mechanical refactoring in a follow-up than pushing it into this.

Member Author

Could we not pass in the zero gap policy?

I am not sure where we would do that. Also, when serializing to XContent, the gap_policy is written. I suppose we could call gapPolicy(GapPolicy) in the subclass ctor, but that felt...messy (calling a method in a ctor).

}

private final CorrelationFunction correlationFunction;

Contributor

I don't see anything about gap policy here. Should we not support various gap policies? Or is it not applicable for correlation?

Member Author

No, we shouldn't. There are two gap policies: Skip and Zero. Skip breaks things, as we need the two arrays to be the same length.
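The length problem is easy to demonstrate: under a skip policy an empty bucket drops out of the series entirely, while the indicator's expectations array keeps its original length, so the two series can no longer be paired up. A small sketch (`resolve_gaps` is a hypothetical helper, not the actual GapPolicy code):

```python
def resolve_gaps(counts, policy):
    """counts uses None to mark a gap (an empty bucket)."""
    if policy == "zero":
        return [0 if c is None else c for c in counts]  # length preserved
    if policy == "skip":
        return [c for c in counts if c is not None]     # length shrinks!
    raise ValueError(f"unknown gap policy: {policy}")

expectations = [10.0, 20.0, 40.0]   # the indicator always has 3 entries
counts = [5, None, 19]              # the middle bucket is a gap

assert len(resolve_gaps(counts, "zero")) == len(expectations)
assert len(resolve_gaps(counts, "skip")) != len(expectations)
```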

import java.util.List;
import java.util.Objects;

public class CorrelativeValue implements Writeable, ToXContentObject {
Contributor

A comment explaining what this class encapsulates would benefit readability I think.

Contributor

I also wonder if this should be renamed to CorrelationIndicator.

if (indicator.getFractions() == null) {
    double sum = CorrelationFunction.sum(indicator.getExpectations());
    xMean = sum / indicator.getExpectations().length;
    double var = 0;
Contributor

There is MovingFunctions.stdDev which calculates this as part of it. I wonder if it makes sense to add a MovingFunctions.variance method and reuse it from MovingFunctions.stdDev. If that feels a bit risky, I still think we might want to consider NaN handling here.

final double yMean = weight;
final double yVar = (1 - weight) * yMean * yMean + weight * (1 - yMean) * (1 - yMean);
double xyCov = 0;
if (indicator.getFractions() != null) {
Contributor

Nit, but I think this would be easier to read if you inverted the if logic here (or the one above) so that in both cases we deal first with the same case (whether or not there are fractions).
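As an aside on the quoted snippet: the `yVar` expression is just the variance of an indicator (Bernoulli) variable with mean `weight`, since `(1 - w)·w² + w·(1 - w)²` factors to `w·(1 - w)`. A quick numeric check:

```python
def y_var(w):
    """yVar exactly as written in the snippet, for an indicator variable with mean w."""
    return (1 - w) * w * w + w * (1 - w) * (1 - w)

# (1 - w) * w^2 + w * (1 - w)^2 = w * (1 - w) * (w + (1 - w)) = w * (1 - w)
for w in (0.0, 0.25, 0.5, 0.9, 1.0):
    assert abs(y_var(w) - w * (1 - w)) < 1e-12
```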

+ "]. Unable to calculate correlation"
);
}
final double xMean;
Contributor

Is the correlation calculation here based on something known (e.g. Pearson)? If yes, it would be good to add a comment explaining this.

Member Author

See comment on method

}
}
-------------------------------------------------
// NOTCONSOLE
Member

We should make this a fully real example, I think. It'd be a pain to make the setup for it, but without that we can't be sure it works.

Member Author

@benwtrent benwtrent May 6, 2021

@nik9000 there is a non-doc integration test that covers this.

I can attempt to do a setup, but it's gonna take a bit to generate data and write out the 50 ranges.

Member

It's important that we test these so they don't break eventually. I can't tell you the number of times I've broken stuff in the docs without noticing it. I mean, since we made the tests I can tell you it's much rarer, but I still can't tell you.

You can totally use the setup stuff in docs/build.gradle - over there you can write for loops and stuff to emit values.

If we have a response on the page we should assert that it came from a request on the page - but it's totally OK to use stuff like ... and filter_path to shrink it. No one is going to read a huge response anyway.


import static org.hamcrest.Matchers.closeTo;

public class BucketCorrelationAggregationIT extends MlSingleNodeTestCase {
Member

If I had to pick either yaml or single node test I'd pick yaml. It's less expressive which is a pain but we are in the process of using the yaml files to assert things about backwards compatibility that we just can't do with single node tests.

);
}

private double pearsonCorrelation(double[] xs, int[] ys) {
Member

We do this sort of thing in the aggs tests as well from time to time. We often have a simple example we can assert produces the numbers we expect, but other times we fire random numbers into the thing, make an idealized implementation, and assert that they are the same. It's nice to make sure none of the distributed "stuff" got in the way. And there really isn't a substitute for randomized data for finding weird edges.
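That randomized cross-check pattern might look like the following sketch: compare a naive two-pass Pearson implementation against an algebraically equivalent one-pass form on random inputs (both functions here are hypothetical stand-ins for the production and reference code):

```python
import math
import random

def pearson_two_pass(xs, ys):
    """Idealized reference: compute means first, then covariance and variances."""
    n = len(xs)
    xm, ym = sum(xs) / n, sum(ys) / n
    cov = sum((x - xm) * (y - ym) for x, y in zip(xs, ys))
    xv = sum((x - xm) ** 2 for x in xs)
    yv = sum((y - ym) ** 2 for y in ys)
    return cov / math.sqrt(xv * yv)

def pearson_one_pass(xs, ys):
    """Algebraically equivalent single-pass form built from raw sums."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs)
    syy = sum(y * y for y in ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    num = n * sxy - sx * sy
    den = math.sqrt(n * sxx - sx * sx) * math.sqrt(n * syy - sy * sy)
    return num / den

# Fire random data at both implementations and assert they agree.
random.seed(0)
for _ in range(100):
    n = random.randint(3, 40)
    xs = [random.uniform(-100.0, 100.0) for _ in range(n)]
    ys = [random.uniform(-100.0, 100.0) for _ in range(n)]
    assert abs(pearson_two_pass(xs, ys) - pearson_one_pass(xs, ys)) < 1e-8
```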

return corXY/xs.length;
}

private static double sum(double[] xs) {
Member

Is DoubleStream.of(xs).sum() short enough that you wouldn't need to make a whole function for it? It's up to you.

Member Author

I am simply using MovingFunctions.sum now in a subsequent commit.

private static final ParseField FUNCTION = new ParseField("function");

@SuppressWarnings("unchecked")
public static final ConstructingObjectParser<BucketCorrelationAggregationBuilder, String> PARSER = new ConstructingObjectParser<>(
Member

If InstantiatingObjectParser works for you it's generally a little nicer. I'm not sure if it does, but it's worth looking at.

Member Author

Looks like the context isn't passed in anywhere, so I wouldn't be able to use InstantiatingObjectParser.

@Override
protected void validate(ValidationContext context) {
    final String firstAgg = bucketsPaths[0].split("[>\\.]")[0];
Member

I'd feel more comfortable merging with the similar code and performing a more mechanical refactoring in a follow-up than pushing it into this.

public boolean equals(Object o) {
    if (this == o) return true;
    if (o == null || getClass() != o.getClass()) return false;
    if (super.equals(o) == false) return false;
Member

super.equals already does the reference and class comparison so you can probably skip that bit.

- do:
search:
index: store
body: >
Member

I tend to just write these in yaml. If you like it better this way that's fine - it is easier to copy and paste it into curl or whatever.

Contributor

@dimitris-athanasiou dimitris-athanasiou left a comment

LGTM! Just a rogue backtick in a Javadoc comment.

@benwtrent benwtrent merged commit 8069e9b into elastic:master May 10, 2021
@benwtrent benwtrent deleted the feature/ml-bucket-correlation-agg branch May 10, 2021 16:46
benwtrent added a commit to benwtrent/elasticsearch that referenced this pull request May 10, 2021
…ation function (elastic#72133)

benwtrent added a commit that referenced this pull request May 10, 2021
…correlation function (#72133) (#72896)

* [ML] add new bucket_correlation aggregation with initial count_correlation function (#72133)

benwtrent added a commit that referenced this pull request Jul 12, 2021
NamedWriteable Serialization was not declared in the original implementation for 7.x.

This commit fixes this. Relates to: #72133

closes: elastic/kibana#105047
benwtrent added a commit that referenced this pull request Jul 12, 2021
…75234)

NamedWriteable Serialization was not declared in the original implementation for 7.x.

This commit fixes this. Relates to: #72133

closes: elastic/kibana#105047
Labels
>enhancement, :ml (Machine learning), Team:ML, v7.14.0, v8.0.0-alpha1
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants