
Semantic conventions for up metric on Resources. #1078 #1102

Conversation

@cwildman commented Oct 16, 2020

Fixes #1078

Changes

Describe the up metric on Resources.

PR Description & Discussion

This PR lays the groundwork for an up metric to identify Resources that are healthy or unhealthy. There are a few open questions about where this metric should be generated and whether it should have any labels. I'll mention those questions here, but for now I have tried to leave the semantic convention fairly open-ended.

  1. Where in a metric pipeline should this metric be synthesized?

The up metric makes sense in a telemetry system using a pull model, because the component doing the pulling already has to keep state about which targets it is pulling metrics from and when. If a pull from a specific target ever fails, this component can synthesize the up metric with a value of 0, indicating that target is unhealthy (see the sketch after this list).

In a telemetry system using a push model, it's less clear how beneficial the up metric is. If the server side were to synthesize the up metric in a push system, there are a few problems: (1) the server would be forced to keep state about the Resources that have reported so that it can generate the "unhealthy" up metric when they stop reporting, and (2) is a Resource that stops reporting always unhealthy, or has it perhaps simply ceased to exist? If the client side were to synthesize the up metric in a push system, can it ever really report unhealthy? If the process pushing these metrics has died, it can't report itself as unhealthy. The indication that a Resource is unhealthy then becomes the absence of this metric, which seems superfluous, since the absence of any metric from this Resource could be used to determine that.

In summary, I think it would be useful to generate the up metric in receivers that use a pull model. I am less convinced it makes sense to synthesize the up metric in all exporters, where "unhealthy" just becomes the absence of the metric. I could see generating the up metric in Prometheus exporters that push data (Prometheus remote write) for backwards compatibility. A Prometheus exporter that simply acts as a scrape target shouldn't need to synthesize up, since the Prometheus server will do that automatically.

  2. This metric should be label-free

I agree that this metric doesn't need any explicit labels, but we at least want some attributes from the Resource this metric is reported for to be added to the metric when it is exported. Without those attributes the metric is useless. Maybe that is obvious and doesn't need to be called out in the semantic convention.
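To make the pull-model case concrete, here is a minimal Go sketch of the idea, with hypothetical `scrapeTarget` and `DataPoint` placeholders rather than any real OpenTelemetry API: the component doing the pulling already knows which targets it polled, so it can emit `resource.up` as 1 or 0 per target.

```go
package main

import (
	"fmt"
	"time"
)

// DataPoint is a hypothetical, minimal representation of a metric sample.
type DataPoint struct {
	Name      string
	Value     float64
	Resource  map[string]string // attributes identifying the scraped Resource
	Timestamp time.Time
}

// scrapeTarget is a stand-in for a real pull of metrics from one target.
// It returns the scraped points, or an error if the target is unreachable.
func scrapeTarget(target string) ([]DataPoint, error) {
	// ... perform the HTTP request, parse the exposition format, etc. ...
	return nil, fmt.Errorf("connection refused: %s", target)
}

// scrapeAll pulls every known target and synthesizes resource.up per target:
// 1 when the scrape succeeded, 0 when it failed.
func scrapeAll(targets []string) []DataPoint {
	var out []DataPoint
	for _, t := range targets {
		points, err := scrapeTarget(t)
		up := 1.0
		if err != nil {
			up = 0.0 // the puller knows this target is unreachable
		}
		out = append(out, points...)
		out = append(out, DataPoint{
			Name:      "resource.up",
			Value:     up,
			Resource:  map[string]string{"scrape.target": t},
			Timestamp: time.Now(),
		})
	}
	return out
}

func main() {
	for _, p := range scrapeAll([]string{"10.0.0.1:9090"}) {
		fmt.Printf("%s{%v} = %v\n", p.Name, p.Resource, p.Value)
	}
}
```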

Related issues #1078

Mentioning @jmacd and @bogdandrutu for reviews.

@cwildman requested review from a team October 16, 2020 17:22
@justinfoote (Member) left a comment


I love this simple proposal.
I have one question for the group. Should we specify exporter behavior with a SHOULD?

They SHOULD inherit the attributes of the encapsulating Resource ...

If not, maybe we should make this a MAY?

@cwildman (Author)

@arminru I think I addressed your feedback. Thank you for the review!

@justinfoote (Member)

@open-telemetry/specs-metrics-approvers this is ready for review!


## Metric Instruments

The following metric instruments SHOULD be synthesized every time a Resource produces some
Contributor

I am wondering whether this SHOULD should be restricted to OTLP exporters. I think we explicitly do not want to export this from a Prometheus exporter, since the receiving end will synthesize up.

We could elaborate on the purpose. Suppose a process starts a metrics SDK but the application registers no instrumentation packages and no instruments. A push OTLP exporter is installed and now wants to push a single collection interval consisting of 0 metrics. This metric exists to show that the resource is up.
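A minimal sketch of that empty-interval scenario, using hypothetical `Metric` and `pushBatch` placeholders rather than the real OTLP exporter API: the exporter appends `resource.up` to every collection interval, so an interval with zero application metrics still shows the Resource is up.

```go
package main

import "fmt"

// Metric is a hypothetical, minimal metric sample.
type Metric struct {
	Name  string
	Value float64
}

// pushBatch stands in for a real push exporter sending one collection interval.
func pushBatch(resource map[string]string, metrics []Metric) {
	fmt.Printf("pushing %d metric(s) for resource %v\n", len(metrics), resource)
}

// exportInterval appends resource.up to every collection interval, so an
// interval with zero application metrics still shows the Resource is up.
func exportInterval(resource map[string]string, collected []Metric) {
	batch := append(collected, Metric{Name: "resource.up", Value: 1})
	pushBatch(resource, batch)
}

func main() {
	res := map[string]string{"service.name": "checkout", "host.name": "host-1"}
	// The application registered no instrumentation packages and no
	// instruments, so the SDK collected nothing this interval.
	exportInterval(res, nil)
}
```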

Author

Ok that is a good use case for an up metric with OTLP exporters that I hadn't thought of.

Regarding whether this SHOULD should be restricted to OTLP exporters, what about the Prometheus remote write exporter? That is non-OTLP, but I think we want the up metric there.

Contributor

I see. @cwildman, yes! Maybe we need a term to say when this rule applies, like "original", to indicate when instrumentation is generated, particularly when it is first aggregated into an exposition format. I used the term "aggregated" thinking of the Collector: it will not apply the up metric in any of its exporters, but it could apply the metric in some of its receivers. OTel SDKs that export Prometheus remote write would apply this rule, agreed.


| Name | Instrument | Units | Description |
|----------------------|---------------|--------------|-------------|
| `resource.up` | ValueRecorder | 1 | A value of `1` indicates the Resource is healthy, `0` if the Resource is unhealthy. |
Contributor

The name resource.up is great, but I don't think we should use the word "healthy". Maybe just "produced telemetry"?

For a push exporter, a resource that is not up means "no longer pushing". When might we see a push exporter explicitly push a zero for resource.up? I was thinking through connection error states where some metric data is dropped. Should the resource.up variable be set to zero when data is lost?

Contributor

I was thinking of this as an UpDownSumObserver. In a metrics pipeline we could drop resource attributes, lowering cardinality, and the values will naturally aggregate as a sum. This makes the resource.up metric effectively a cardinality counter.
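A small worked example of the cardinality-counter point, with hypothetical types rather than a real SDK aggregator: three Resources each report `resource.up = 1`; dropping the resource attributes and summing yields 3, the number of reporting Resources.

```go
package main

import "fmt"

// point is a hypothetical resource.up sample tagged with its Resource attributes.
type point struct {
	resource map[string]string
	value    float64
}

// sumAfterDroppingResource discards the resource attributes and sums the values.
// For resource.up (always 1 per live Resource) this counts reporting Resources.
func sumAfterDroppingResource(points []point) float64 {
	var sum float64
	for _, p := range points {
		sum += p.value
	}
	return sum
}

func main() {
	points := []point{
		{resource: map[string]string{"host.name": "host-1"}, value: 1},
		{resource: map[string]string{"host.name": "host-2"}, value: 1},
		{resource: map[string]string{"host.name": "host-3"}, value: 1},
	}
	fmt.Println(sumAfterDroppingResource(points)) // 3: a cardinality counter
}
```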

@cwildman (Author) commented Oct 28, 2020

"produced telemetry" makes sense to me.

For a push exporter, a resource that is not up means "no longer pushing". When might we see a push exporter explicitly push a zero for resource.up? I was thinking through connection error states where some metric data is dropped. Should the resource.up variable be set to zero when data is lost?

If I'm following you correctly, isn't it possible that the resource.up metric itself is lost as part of those connection errors? If that's the case, I feel like this would be an unreliable signal.

I was thinking of this as an UpDownSumObserver. In a metrics pipeline we could drop resource attributes, lowering cardinality, and the values will naturally aggregate as a sum. This makes the resource.up metric effectively a cardinality counter.

I'm curious why UpDownSumObserver instead of just SumObserver in that scenario, given that we never have a value less than zero?

Contributor

If that's the case, I feel like this would be an unreliable signal.

True. I was thinking about this case but didn't say anything. The resource.up metric might be treated specially inside an SDK (we might say MUST) so that it is buffered even when other timeseries are dropped. That kind of requirement could perhaps be added in the ../protocol/exporter.md document.

@jmacd (Contributor) commented Oct 23, 2020

In a telemetry system using a push model, it's less clear how beneficial the up metric is.

There are differences in the push model, and one of the reasons we might not want to generate resource.up when pushing is that it's implied by the existence of any other metric data over the same time period.

This means that we can standardize one metric that MUST be built in to OTel SDKs. In the OTel-Go contrib runtime package there is currently a "runtime.uptime" metric. Here's an alternative approach, then: create a semantic-conventional metric resource.uptime, defined as the wall-clock time since the process started. The existence of a reported resource.uptime value implies the resource was up. (I prefer "resource.uptime" to "runtime.uptime".)
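Before the caveats in the next paragraph, a minimal sketch of that alternative, assuming a hypothetical observer-style callback rather than the actual OTel-Go instrument API: the process records its start time and reports wall-clock seconds since start on each collection; any reported value implies the Resource was up.

```go
package main

import (
	"fmt"
	"time"
)

var processStart = time.Now()

// observeUptime is the kind of callback an asynchronous instrument would run
// on each collection: it reports wall-clock seconds since the process started.
func observeUptime() (name string, value float64) {
	return "resource.uptime", time.Since(processStart).Seconds()
}

func main() {
	// Simulate two collection intervals.
	for i := 0; i < 2; i++ {
		time.Sleep(100 * time.Millisecond)
		name, v := observeUptime()
		fmt.Printf("%s = %.3f s\n", name, v)
	}
}
```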

However, an uptime metric doesn't solve one of the problems I was hoping to solve here. An end user who has built up monitoring and alerting over Prometheus data is used to an up metric. We'd like to help this end user transition to OTel in two ways: (1) replacing the Prometheus server with an OTel Collector running a Prometheus/OpenMetrics receiver, and (2) replacing Prometheus clients with OTel SDKs.

Whether (1) or (2), the user's alerting and monitoring should keep working. We could logically synthesize the up metric from resource.uptime as stated above, but a Prometheus receiver in case (1) can't. Besides, it's simpler just to use an explicit metric.

I worry a little about introducing a name conflict so early on. If OTel uses resource.up and Prometheus users are expecting up, we're also going to have to standardize translation stages. Although I like resource.up, I wonder if we could agree on plain up: for Prometheus compatibility, to save 9 bytes, and to reduce the number of translations.

@cwildman (Author)

There are differences in the push model, and one of the reasons we might not want to generate resource.up when pushing is that it's implied by the existence of any other metric data over the same time period.

Totally agree; I tried to articulate this in the description, but you've said it much more clearly.

Whether (1) or (2), the user's alerting and monitoring should keep working. We could logically synthesize the up metric from resource.uptime as stated above, but a Prometheus receiver in case (1) can't.

In case (1), shouldn't the Prometheus receiver be responsible for synthesizing the up metric, since it is replacing the Prometheus server, which normally does that?

I worry a little about introducing a name conflict so early on. If OTel uses resource.up and Prometheus users are expecting up, we're also going to have to standardize translation stages. Although I like resource.up, I wonder if we could agree on plain up: for Prometheus compatibility, to save 9 bytes, and to reduce the number of translations.

What exactly do you mean by standardize translation stages? I'm picturing we could have this as a default relabel rule for places where it's obviously necessary. For example, a Prometheus remote write exporter should make sure resource.uptime is relabeled to up by default; other exporters that aren't likely to be pushing to legacy systems wouldn't need this rule. I worry that breaking the naming scheme for these metrics to provide backwards compatibility with other systems ends up being long-running tech debt, whereas relabel rules allow us to provide backwards compatibility without breaking the naming scheme; it's just unclear how often creating the relabel rule would require effort from the user. You are certainly in a much better position to weigh that tradeoff.
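To make the relabel idea concrete, here is a hedged Go sketch of the kind of default rename an exporter targeting a legacy Prometheus backend could apply just before sending. The `Metric` type and rename table are hypothetical, not a real exporter configuration, and the source name shown is `resource.up` from the table above; whichever name the convention settles on would map to up in the same way.

```go
package main

import "fmt"

// Metric is a hypothetical, minimal metric sample.
type Metric struct {
	Name  string
	Value float64
}

// defaultRenames is an illustrative rename table an exporter pushing to a
// legacy Prometheus backend could apply by default, so the OTel naming scheme
// stays intact while existing up-based alerting keeps working.
var defaultRenames = map[string]string{
	"resource.up": "up",
}

// relabel rewrites metric names in place just before export.
func relabel(metrics []Metric, renames map[string]string) {
	for i := range metrics {
		if newName, ok := renames[metrics[i].Name]; ok {
			metrics[i].Name = newName
		}
	}
}

func main() {
	batch := []Metric{{Name: "resource.up", Value: 1}}
	relabel(batch, defaultRenames)
	fmt.Println(batch[0].Name) // "up", as legacy Prometheus alerting expects
}
```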

@jmacd (Contributor) commented Oct 30, 2020

What exactly do you mean by standardize translation stages? I'm picturing we could have this as a default relabel rule for places where it's obviously necessary.

I think you've got the picture. This extra relabeling is just more complexity for the sake of using resource.up vs. up. On the other hand, other metrics conventions will be encountered in the real world, and we're probably going to have to get good at this sort of translation in practice. I prefer the qualified name resource.up.

github-actions bot commented Nov 7, 2020

This PR was marked stale due to lack of activity. It will be closed in 7 days.

github-actions bot added the Stale label Nov 7, 2020
@MrAlias removed the Stale label Nov 11, 2020
@cwildman (Author)

@jmacd @bogdandrutu, per your discussion at last week's metrics meeting, we have two semantic conventions we want to define:

  1. A resource.up metric that can be used for compatibility with Prometheus up.
  2. A resource.uptime metric that can be used to track the time since the process started.

I'd like to keep this PR focused on item 1, the compatibility issue. Do we want a separate OTel issue for item 2?

@cwildman force-pushed the resource-metrics-semantic-conventions branch from 3caa9d1 to 94ef57b on November 12, 2020 19:32
github-actions bot

This PR was marked stale due to lack of activity. It will be closed in 7 days.

github-actions bot added the Stale label Nov 20, 2020
github-actions bot

Closed as inactive. Feel free to reopen if this PR is still being worked on.

github-actions bot closed this Nov 27, 2020
@jgals commented Dec 4, 2020

@jmacd and @bogdandrutu, any thoughts on the above proposal and question from @cwildman?

@jmacd reopened this Dec 4, 2020
github-actions bot removed the Stale label Dec 5, 2020
@jmacd (Contributor) commented Dec 11, 2020

Answering @jgals's question: let's focus on the first item described above.

A resource.up metric that can be used for compatibility with Prometheus up

I've filed #1273, which will let us derive the uptime from the resource.up metric.

github-actions bot

This PR was marked stale due to lack of activity. It will be closed in 7 days.

github-actions bot added the Stale label Dec 19, 2020
github-actions bot

Closed as inactive. Feel free to reopen if this PR is still being worked on.

github-actions bot closed this Dec 26, 2020