-
Notifications
You must be signed in to change notification settings - Fork 1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add capability to have configurable aggregation temporality for OTLP Registry #3625
Add capability to have configurable aggregation temporality for OTLP Registry #3625
Conversation
…registry. OTLP format supports delta and cumulative temporality. But currently in micrometer, the support for OTLP Metrics with Delta Temporality is not available. This commit adds the capability to export meters with delta temporality. Below are some behaviours this PR includes * capability to configure aggregation Temporality via property `otlp.aggregationTemporality` * Delta Temporality is supported by using StepMeters wherever applicable * Capability to export max as part of the HistogramDataPoint for Delta Temporality
...ations/micrometer-registry-otlp/src/main/java/io/micrometer/registry/otlp/OtlpStepTimer.java
Outdated
Show resolved
Hide resolved
🛠 Lift Auto-fixSome of the Lift findings in this PR can be automatically fixed. You can download and apply these changes in your local project directory of your branch to review the suggestions before committing.1 # Download the patch
curl https://lift.sonatype.com/api/patch/github.com/micrometer-metrics/micrometer/3625.diff -o lift-autofixes.diff
# Apply the patch with git
git apply lift-autofixes.diff
# Review the changes
git diff Want it all in a single command? Open a terminal in your project's directory and copy and paste the following command: curl https://lift.sonatype.com/api/patch/github.com/micrometer-metrics/micrometer/3625.diff | git apply Once you're satisfied, commit and push your changes in your project. Footnotes |
...entations/micrometer-registry-otlp/src/main/java/io/micrometer/registry/otlp/OtlpConfig.java
Outdated
Show resolved
Hide resolved
...entations/micrometer-registry-otlp/src/main/java/io/micrometer/registry/otlp/OtlpConfig.java
Outdated
Show resolved
Hide resolved
...ns/micrometer-registry-otlp/src/main/java/io/micrometer/registry/otlp/OtlpMeterRegistry.java
Outdated
Show resolved
Hide resolved
...ns/micrometer-registry-otlp/src/main/java/io/micrometer/registry/otlp/OtlpMeterRegistry.java
Outdated
Show resolved
Hide resolved
...ns/micrometer-registry-otlp/src/main/java/io/micrometer/registry/otlp/OtlpMeterRegistry.java
Outdated
Show resolved
Hide resolved
...ns/micrometer-registry-otlp/src/main/java/io/micrometer/registry/otlp/OtlpMeterRegistry.java
Outdated
Show resolved
Hide resolved
...ns/micrometer-registry-otlp/src/main/java/io/micrometer/registry/otlp/OtlpMeterRegistry.java
Outdated
Show resolved
Hide resolved
...ter-registry-otlp/src/main/java/io/micrometer/registry/otlp/OtlpStepDistributionSummary.java
Outdated
Show resolved
Hide resolved
import io.micrometer.core.instrument.distribution.*; | ||
import io.micrometer.core.instrument.step.StepDistributionSummary; | ||
|
||
class OtlpStepDistributionSummary extends StepDistributionSummary { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not 100% sure that this is a good idea but this class seem the same as OtlpCumulativeDistributionSummary
except:
TimeWindowFixedBoundaryHistogram
/DistributionStatisticConfig
/etc. are created differently- Cumulative has
getStartTimeNanos
So I was thinking if we should have the common logic in a common place, e.g.:
- Having only one
OtlpDistributionSummary
that can handle both (getStartTimeNanos
will be there in the step case too). - Having a common parent/super class that cumulative/step can extend and provide the
TimeWindowFixedBoundaryHistogram
/DistributionStatisticConfig
/etc. for example in ctor.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think OtlpStepTimer
has the same situation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's right see one of my TODOs in the PR description. This was one of my POC codes and it needs some re-designing and better code re-organization. Will start on that part and welcome any suggestions there.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Feel free to ping us if you need some help.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah. I might need some especially in restructuring some of these classes. Already, OtlpMeterRegisrty does what I believe was not How registries were meant to work doing 2 things - Step and Cumulative. And the same goes with the Timers and DistributionSummaries because there are different things inherently - Step and Cumulative. So, any help on this would be helpful.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I want to throw in the idea of get-and-reset semantics for Histograms. Basically, you record the data, snapshot it, and then reset the histogram on every export. This circumvents the whole timer and rotation code, which I am not sure is required for a delta-based export. Just leaving this here as an idea, let me know if there is something I am missing
...ations/micrometer-registry-otlp/src/main/java/io/micrometer/registry/otlp/OtlpStepTimer.java
Outdated
Show resolved
Hide resolved
cbe7b1b
to
daabc04
Compare
daabc04
to
7854a88
Compare
...entations/micrometer-registry-otlp/src/main/java/io/micrometer/registry/otlp/OtlpConfig.java
Outdated
Show resolved
Hide resolved
769ae14
to
487af47
Compare
- Well, I would prefer having something like a StepWindowHistogram for Step based registries or would improve the existing implementation to have improved accuracy. But adding tests for the existing histogram implementation.
} | ||
|
||
@Test | ||
void timerWithHistogram() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@shakuzen / @jonatan-ivanov I believe you guys are already aware that the TImeWindowFixedBoundaryHistogram will have very poor accuracy for StepBased registries.
To Summarise:
The issue with that is it uses a sliding bucket which creates 3 buckets(assumption, step=1min, the default buffer length is 3). And these buckets are rotated every 20 seconds (step/buffer length) which makes the histogram always error-prone I,e it will always undercount the values. This is the reason we used DeltaHistogramCounts implementation in SignalFxMeterRegistry for our histogram use-cases. With OTLP coming into the picture, I think we should at least have a similar implementation (or basic StepBased Histogram) which will give more accurate information. And moreover, the percentiles don't align with Step Boundaries which also makes it difficult to correlate them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We'll need to change this. SignalFx was the first step-based registry to support percentile histograms, so we hadn't taken close enough look at this in general, I think. The histogram support is indeed not right, which we can fix after merging this, but I don't usually like merging something I know has a bug.
The percentiles and max not aligning with step boundaries was intentional (there's a relevant old issue with the history #1993), but I don't think it makes sense in the OTLP context because the specification is clear about what the meaning of the max or bucket counts is supposed to represent, and it needs to align with the time interval.
...crometer-registry-otlp/src/main/java/io/micrometer/registry/otlp/AggregationTemporality.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
First of all: Thanks for taking this on! I think this is a great initiative. I have experience both in OTel and a little bit in Micrometer exporters, so feel free to ping me if you are stuck somewhere. I have left a few comments. In general, I agree with you that there might be an opportunity to improve the histogram reporting for step-based histograms. I will give this another go next week, and try it out too.
...gistry-otlp/src/main/java/io/micrometer/registry/otlp/OtlpCumulativeDistributionSummary.java
Show resolved
Hide resolved
@@ -79,13 +79,9 @@ public HistogramSnapshot takeSnapshot() { | |||
return snapshot; | |||
} | |||
|
|||
CountAtBucket[] histogramCounts = this.monotonicBucketCountHistogram.takeSnapshot(0, 0, 0).histogramCounts(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not necessarily related to this PR, and might be addressed in a follow-up (also untested, I just happen to have fixed a similar bug recently): the fact that super.takeSnapshot()
and monotonicBucketCountHistogram.takeSnapshot()
are not synchronized might lead to discrepancies between the count and the bucket counts if some other thread records data in the time between the two method calls.
...ns/micrometer-registry-otlp/src/main/java/io/micrometer/registry/otlp/OtlpMeterRegistry.java
Outdated
Show resolved
Hide resolved
...ns/micrometer-registry-otlp/src/main/java/io/micrometer/registry/otlp/OtlpMeterRegistry.java
Outdated
Show resolved
Hide resolved
import io.micrometer.core.instrument.distribution.*; | ||
import io.micrometer.core.instrument.step.StepDistributionSummary; | ||
|
||
class OtlpStepDistributionSummary extends StepDistributionSummary { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I want to throw in the idea of get-and-reset semantics for Histograms. Basically, you record the data, snapshot it, and then reset the histogram on every export. This circumvents the whole timer and rotation code, which I am not sure is required for a delta-based export. Just leaving this here as an idea, let me know if there is something I am missing
@pirgeo Thanks for the review sure will reach out for any help. I believe one thing on top of our minds was what to do with |
8e1d972
to
8b8c350
Compare
Small update: as I was looking into how to do this without code duplication, I thought having some integration tests could be timely, so: #3668 |
@jonatan-ivanov what is the missing part to move forward with this PR? |
@jonatan-ivanov / @shakuzen Is there any update on this? |
# Conflicts: # implementations/micrometer-registry-otlp/src/main/java/io/micrometer/registry/otlp/OtlpCumulativeDistributionSummary.java # implementations/micrometer-registry-otlp/src/main/java/io/micrometer/registry/otlp/OtlpCumulativeTimer.java # implementations/micrometer-registry-otlp/src/main/java/io/micrometer/registry/otlp/OtlpMeterRegistry.java # implementations/micrometer-registry-otlp/src/test/java/io/micrometer/registry/otlp/OtlpMeterRegistryTest.java
I resolved the conflicts (merged |
I did some further refactoring and added integration tests, most of the things look good to me, but I have two concerns:
to
In the first case, meters published together can have different timestamps, in the second they can't. E.g.: let's say we have two gages and pulling them can take 5 seconds, in the first case their timestamps will be ~5 seconds apart, in the second case they will be off. I think the original approach is closer to the truth, what do you think?
We are sending the data to the collector (click to reveal an example request)
But the Prometheus response does not have them:
Additionally, these tests are timing out on CircleCI and I cannot re-trigger them since for some reason they are in @lenin-jaganathan's org. :o |
|
Yeah, Prometheus not supporting delta is ok (it can't really), but I would have assumed that the OTel collector can accept OTLP with delta and export it in the Prometheus format (as cumulative). :( |
@jonatan-ivanov I did verify the Histogram datapoint after your comment and I don't see any issues with them. I see the data points in our back end and it just works fine. Do you want me to revert your last 2 commits? Edit: I am not sure what Prometheus does exactly, the spec say the Dleta Histograms should be DROPPED or CONVERTED for Prometheus. (https://opentelemetry.io/docs/reference/specification/compatibility/prometheus_and_openmetrics/#histograms-1) |
👍🏼 |
…Metrics will use the StepWindow As Per (Otlp Spec)[https://opentelemetry.io/docs/reference/specification/metrics/data-model/#metric-points], the following seems to hold true - When the aggregation temporality is “delta”, we expect to have no overlap in time windows for metric streams - So, we will be using the step window for start and endtime - For cumulative, we will be using the start time of the meter and the time when the meter is read.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I gave it another pass and tried running it with a collector that simply logs what it receives. Looks like it all works to me. I haven't done extensive tests with a backend yet though.
.../micrometer-registry-otlp/src/main/java/io/micrometer/registry/otlp/OtlpCumulativeTimer.java
Show resolved
Hide resolved
@lenin-jaganathan Thank you for the changes! @shakuzen I think this is in a good enough shape to merge it, what do you think? |
timer.record(111, TimeUnit.MILLISECONDS); | ||
|
||
// This is where TimeWindowMax can be painful and make no sense at all. Need to | ||
// re-visit this. Probably StepMax is what might fit good for OTLP. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we need a different Max implementation as you say. TimeWindowMax doesn't fit the interval as it should for OTLP.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are some remaining known issues, but if everyone thinks it would still be beneficial to get this in the RC1 (scheduled for Monday) for wider/easier testing while we fix the remaining issues before GA, I'm okay with that.
@lenin-jaganathan has already pointed out the issues in comments throughout the pull request:
- partial publish on close fix is not applied to this registry with delta temporality (maybe we can do something in the TCK to catch this proactively in registry implementations)
- time window max is not right for OTLP delta
- time window histogram is not right for OTLP delta
Depending on what we do about #2818 that may also cause issues here.
@shakuzen I do have a couple of POCs for StepBased Histogram and Max. But I just don't want this to look like a huge PR with too many new things. I would have a PR for StepHistogram and StepMax immediately after this PR. But If you think differently, I can just pile more commit on this. No preferences there, just a matter of getting Delta Registry support in to start with. |
Sounds good. Please do follow-up with PRs for the other things. On the partial step issue, is there any reason we can't make |
@shakuzen Follow-up PR for to address Max and histogram - #3749 Let me just get back on partial publishing. I see an issue with partial publishing in some other Step Registries too (see SignalFxTimer which was made an AbstractTimer just to have a custom Histogram implementation and now since it is no more Step the issue still exists.) And the same may be the case for OTLP as my new PR makes the OTLP Timer extend AbstarctTimer. |
Yeah, it's on my list of things to do post-RC1 to more thoroughly check specific-registry implementations for this kind of thing. But it's expected if you make something not extend a Step meter, you're indicating you want to be responsible for any of the behavior that the step meters provide, such as the behavior on registry close. We can't really do anything about that. It depends on each registry implementation what's best for it, but I have to imagine it would be easier to extend and customize StepTimer than not in most cases. |
I completely agree with the flexibility offered here. But since StepMeter is the interface that defines that we can perform a report with a different value on close, and this being package-private makes it someone cannot implement this behavior. I would think about something that lets implementers give an implementation for this last-minute publish but still let the StepRegistry close that.
Technically, we do that since we almost override everything in the StepRegistry in OTLP. But, doing something like that will make me assume that OTLP is a StepRegistry which it is not as it can be either Step or Cumulative. |
In Micrometer 1.11.0-RC1, a new property was introduced in OtlpConfig to define aggregation temporality. See micrometer-metrics/micrometer#3625
In Micrometer 1.11.0-RC1, a new property was introduced in OtlpConfig to define aggregation temporality. See micrometer-metrics/micrometer#3625
In Micrometer 1.11.0-RC1, a new property was introduced in OtlpConfig to define aggregation temporality. See micrometer-metrics/micrometer#3625
OTLP format supports delta and cumulative temporality. But currently in micrometer, the support for OTLP Metrics with Delta Temporality is not available. This PR adds the capability to export meters with delta temporality(#3145). Below are some behaviors this PR includes
otlp.aggregationTemporality
Implementation Information:
When Delta Temporality is enabled the following applies,
StartTimeUnixNano
) for the meter will be set to the start of the previous step.TimeUnixNano
) for the meter will be set to the end of the previous step or the step for which data is being reported.(the previous step - indicates the last fully completed step.)
TODO:
Make the support for Delta Histogram using a StepHistogram(basically the count, and total should match with the value recorded in Histogram) as mentioned in the OTLP protocolPossibly rename some of the classes and minimize code duplications.fixes #3145