Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove Summary until better understand correlation with time #199

Merged
merged 1 commit into from
Aug 13, 2020

Conversation

bogdandrutu
Copy link
Member

Summary aggregated data are hard to add/sub to build different temporalities, and from the protocol perspective we should not try to associate them with a time window because this will not give us too much benefit.

This makes OTLP compatible with other protocols that support reporting of quantiles like Prometheus/OpenMetrics, but we need to think of a different way to support default aggregation for ValueRecorder, during the OTLP meeting Tue we discussed the possibility to use DDSketch as the default aggregation for these data, but that is not decided and a decision is needed for that independent of this PR.

@jkwatson
Copy link
Contributor

I can understand why percentile/quantile values are non-aggregatable, but min/max/sum/count are certainly mergeable. Are we throwing out the baby with the bathwater with this?

@jmacd
Copy link
Contributor

jmacd commented Aug 12, 2020

During the meeting, I made a note that the Prometheus Summary data type does not fit the definition of Delta or Cumulative, as far as I know. @paivagustavo made a note about this in the OTel-Go code with a link to https://www.robustperception.io/how-does-a-prometheus-summary-work which explains this topic.

After thinking about it, maybe we should add a new class of Temporality to say neither Delta nor Cumulative? I'm not sure whether there is a way to classify other kinds of Temporality.

@jmacd
Copy link
Contributor

jmacd commented Aug 12, 2020

If there is a new kind of temporality which is similar to "Delta" but does not convey differencing, the word might be Interval. The Interval temporality are those fields which are computed over a window: Min value, Max value, Last value (and First value) are fields that are meaningful and useful over an interval, and they are also mergeable.

Also @jkwatson I still support the idea that Summary should be a broad class of summary statistic, including Min, Max, Sum, and Count. I think about adding Last Value, as discussed in open-telemetry/oteps#117, which would fall into Interval temporality. (There is also "M2", which is a way to summarize variance.)

I wrote the meeting notes from today's OTLP meeting after the fact:
https://docs.google.com/document/d/1LfDVyBJlIewwm3a0JtDtEjkusZjzQE3IAix8b0Fxy3Y/edit#heading=h.z0c9k3cymbcq

We discussed the default aggregation for ValueRecorder instruments, which is a related topic if you take the viewpoint that Summary describe some interval. This was previously discussed here: open-telemetry/opentelemetry-specification#636. (CC: @mikezvi)

@jkwatson
Copy link
Contributor

If there is a new kind of temporality which is similar to "Delta" but does not convey differencing, the word might be Interval. The Interval temporality are those fields which are computed over a window: Min value, Max value, Last value (and First value) are fields that are meaningful and useful over an interval, and they are also mergeable.

Interesting distinction, and yes, agreed. Is it important to note this difference in the OTLP?

@bogdandrutu
Copy link
Member Author

My personal thought here: In case of Histogram and Sum we can translate Delta to/from Cumulative (with an extra state), but this is not the case for Summary at all (even only for Count/Sum/Min/Max) because we cannot translate Cumulative -> Delta. Indeed we can do Delta -> Cumulative for Count/Sum/Min/Max but not the opposite.

There is a good question about how do we represent the correlation between the Aggregated values and Time in case of a Summary aggregation, and my answer is "I don't know yet" but what I know is "Temporality as defined today is not correct for Summary". So indeed I may "throw out the baby with the bathwater" but I have more reasons to justify that the current "Temporality" does not applying to Summary than vice-versa.

To summarize:

  • Temporality in the way we defined it applies to "sum/count" components of the Summary. Guarantees that we can calculate delta to/from cumulative. As the article that explains Prometheus case, sum/count are "Sum" aggregations so they can have temporality.
  • Min/Max may be "mergeable" but they are not "differentiable" so sending "cumulative" min/max will make impossible to support backends that requires "delta".
  • Adding a new "Aggregation" or extending Summary with Min/Max faces the same problem as mentioned before.

So we can have "temporality" for the "sum/count" components of the Summary and treat everything else as "Gauge" which means no temporality associated with them because we don't know how to "merge" or "differentiate".

@bogdandrutu
Copy link
Member Author

To clarify the goal: the current Summary has caveats and we try to release a stable protocol sooner that later, we can go ahead and completely remove Summary for the moment until we better understand all the implications (that means protocol is not ready for GA, but it is stable so everything that is defined is clear and we are most likely not going to change). Then rethink the Summary, this way we can unblock 5 instruments, and can unblock the collector work.

@bogdandrutu
Copy link
Member Author

@jkwatson would you support removing Summary entirely for the moment to give us some time to rethink this while still making progress, or the tradeoff that is right now to remove only "Temporality".

After a second thought I think removing entire "Summary" may be better, because we can rethink the entire problem and don't have any legacy to deal with.

@jkwatson
Copy link
Contributor

@jkwatson would you support removing Summary entirely for the moment to give us some time to rethink this while still making progress, or the tradeoff that is right now to remove only "Temporality".

After a second thought I think removing entire "Summary" may be better, because we can rethink the entire problem and don't have any legacy to deal with.

For the proto? Definitely. I have no problem with that at all. 👍

@bogdandrutu bogdandrutu changed the title Remove temporality from the Summary Remove Summary until better understand correlation with time Aug 12, 2020
@jmacd
Copy link
Contributor

jmacd commented Aug 12, 2020

In the short term, we could also do what Promtheus does when it turns Summary values into timeseries points in the remote write API, which is to turn each percentile value into its own metric (as it does for Histogram buckets), e.g., metric_name.p95, metric_name.p50.

@jmacd
Copy link
Contributor

jmacd commented Aug 12, 2020

What are we going to do about ValueRecorder, in this case? How should we represent MinMaxSumCount? Are we moving in the direction of using Histogram or DDSketch as the ValueRecorder default?

@bogdandrutu
Copy link
Member Author

@jmacd:

What are we going to do about ValueRecorder, in this case? How should we represent MinMaxSumCount? Are we moving in the direction of using Histogram or DDSketch as the ValueRecorder default?

As I left in the proto "TBD". I think we have a clear goal to release something stable (I would say even incomplete) because that unblocks us to start supporting everything except ValueRecorder which probably takes 1-2 weeks to change all the languages etc during this time we can make a final decision on that.

@bogdandrutu
Copy link
Member Author

In the short term, we could also do what Promtheus does when it turns Summary values into timeseries points in the remote write API, which is to turn each percentile value into its own metric (as it does for Histogram buckets), e.g., metric_name.p95, metric_name.p50.

In the "short short" term I would not support Summary at all, then if we get to a point close to GA and not resolve this would do what you suggested 👍

@bogdandrutu
Copy link
Member Author

@open-telemetry/specs-metrics-approvers please review, I want tomorrow to have submitted in the repo a version of the protocol (may be incomplete) that is ready to be published :)

* Add exemplars to proto

* handle just exemplars, nit fixes

* comments

* rawvalue -> exemplar, remove sample_count
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants