-
Notifications
You must be signed in to change notification settings - Fork 266
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Remove Summary until better understand correlation with time #199
Conversation
I can understand why percentile/quantile values are non-aggregatable, but min/max/sum/count are certainly mergeable. Are we throwing out the baby with the bathwater with this? |
During the meeting, I made a note that the Prometheus Summary data type does not fit the definition of Delta or Cumulative, as far as I know. @paivagustavo made a note about this in the OTel-Go code with a link to https://www.robustperception.io/how-does-a-prometheus-summary-work which explains this topic. After thinking about it, maybe we should add a new class of Temporality to say neither Delta nor Cumulative? I'm not sure whether there is a way to classify other kinds of Temporality. |
If there is a new kind of temporality which is similar to "Delta" but does not convey differencing, the word might be Interval. The Interval temporality are those fields which are computed over a window: Min value, Max value, Last value (and First value) are fields that are meaningful and useful over an interval, and they are also mergeable. Also @jkwatson I still support the idea that Summary should be a broad class of summary statistic, including Min, Max, Sum, and Count. I think about adding Last Value, as discussed in open-telemetry/oteps#117, which would fall into Interval temporality. (There is also "M2", which is a way to summarize variance.) I wrote the meeting notes from today's OTLP meeting after the fact: We discussed the default aggregation for |
Interesting distinction, and yes, agreed. Is it important to note this difference in the OTLP? |
My personal thought here: In case of Histogram and Sum we can translate Delta to/from Cumulative (with an extra state), but this is not the case for Summary at all (even only for Count/Sum/Min/Max) because we cannot translate Cumulative -> Delta. Indeed we can do Delta -> Cumulative for Count/Sum/Min/Max but not the opposite. There is a good question about how do we represent the correlation between the Aggregated values and Time in case of a Summary aggregation, and my answer is "I don't know yet" but what I know is "Temporality as defined today is not correct for Summary". So indeed I may "throw out the baby with the bathwater" but I have more reasons to justify that the current "Temporality" does not applying to Summary than vice-versa. To summarize:
So we can have "temporality" for the "sum/count" components of the Summary and treat everything else as "Gauge" which means no temporality associated with them because we don't know how to "merge" or "differentiate". |
To clarify the goal: the current Summary has caveats and we try to release a stable protocol sooner that later, we can go ahead and completely remove Summary for the moment until we better understand all the implications (that means protocol is not ready for GA, but it is stable so everything that is defined is clear and we are most likely not going to change). Then rethink the Summary, this way we can unblock 5 instruments, and can unblock the collector work. |
@jkwatson would you support removing After a second thought I think removing entire "Summary" may be better, because we can rethink the entire problem and don't have any legacy to deal with. |
For the proto? Definitely. I have no problem with that at all. 👍 |
In the short term, we could also do what Promtheus does when it turns Summary values into timeseries points in the remote write API, which is to turn each percentile value into its own metric (as it does for Histogram buckets), e.g., metric_name.p95, metric_name.p50. |
What are we going to do about ValueRecorder, in this case? How should we represent MinMaxSumCount? Are we moving in the direction of using Histogram or DDSketch as the ValueRecorder default? |
As I left in the proto "TBD". I think we have a clear goal to release something stable (I would say even incomplete) because that unblocks us to start supporting everything except |
In the "short short" term I would not support Summary at all, then if we get to a point close to GA and not resolve this would do what you suggested 👍 |
@open-telemetry/specs-metrics-approvers please review, I want tomorrow to have submitted in the repo a version of the protocol (may be incomplete) that is ready to be published :) |
* Add exemplars to proto * handle just exemplars, nit fixes * comments * rawvalue -> exemplar, remove sample_count
Summary aggregated data are hard to add/sub to build different temporalities, and from the protocol perspective we should not try to associate them with a time window because this will not give us too much benefit.
This makes OTLP compatible with other protocols that support reporting of quantiles like Prometheus/OpenMetrics, but we need to think of a different way to support default aggregation for ValueRecorder, during the OTLP meeting Tue we discussed the possibility to use DDSketch as the default aggregation for these data, but that is not decided and a decision is needed for that independent of this PR.