
Semantic conventions for up metric on Resources. #1078 #1102

Conversation

@cwildman commented Oct 16, 2020

Fixes #1078

Changes

Describe the up metric on Resources.

PR Description & Discussion

This PR lays the groundwork for an up metric to identify Resources that are healthy or unhealthy. There are a few open questions about where this metric should be generated and whether it should have any labels. I'll mention those questions here, but for now I have tried to leave the semantic convention fairly open-ended.

  1. Where in a metric pipeline should this metric be synthesized?

The up metric makes sense in a telemetry system using a pull model, because the component doing the pulling already has to keep state about which targets it is pulling metrics from and when. If a pull from a specific target ever fails, this component can synthesize the up metric with a value of 0, indicating that target is unhealthy (see the sketch after this list).

In a telemetry system using a push model, it's less clear how beneficial the up metric is. If the server side were to synthesize the up metric in a push system, there are a few problems: (1) the server would be forced to keep state about the Resources that have reported so that it can generate the "unhealthy" up metric when they stop reporting, and (2) is a Resource that stops reporting always unhealthy, or has it perhaps simply ceased to exist? If the client side were to synthesize the up metric in a push system, can it ever really report unhealthy? If the process pushing these metrics has died, it can't report itself as unhealthy. The indication that a Resource is unhealthy then becomes the absence of this metric, which seems superfluous, since the absence of any metric from this Resource could be used to determine that.

In summary, I think it would be useful to generate the up metric in receivers that use a pull model. I am less convinced it makes sense to synthesize the up metric in all exporters, where "unhealthy" just becomes the absence of the metric. I could see generating the up metric in Prometheus exporters that push data (Prometheus remote write) for backwards compatibility. A Prometheus exporter that simply acts as a scrape target shouldn't need to synthesize up, since the Prometheus server will do that automatically.

  2. This metric should be label-free

I agree that this metric doesn't need any explicit labels, but we at least want some attributes from the Resource this metric is reported for to be added to the metric when it is exported. Without those attributes the metric is useless. Maybe that is obvious and doesn't need to be called out in the semantic convention.
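To make the pull-model case concrete, here is a minimal Go sketch of the idea, with hypothetical `scrapeTarget` and `DataPoint` placeholders rather than any real OpenTelemetry API: the component doing the pulling already knows which targets it polled, so it can emit `resource.up` as 1 or 0 per target.

```go
package main

import (
	"fmt"
	"time"
)

// DataPoint is a hypothetical, minimal representation of a metric sample.
type DataPoint struct {
	Name      string
	Value     float64
	Resource  map[string]string // attributes identifying the scraped Resource
	Timestamp time.Time
}

// scrapeTarget is a stand-in for a real pull of metrics from one target.
// It returns the scraped points, or an error if the target is unreachable.
func scrapeTarget(target string) ([]DataPoint, error) {
	// ... perform the HTTP request, parse the exposition format, etc. ...
	return nil, fmt.Errorf("connection refused: %s", target)
}

// scrapeAll pulls every known target and synthesizes resource.up per target:
// 1 when the scrape succeeded, 0 when it failed.
func scrapeAll(targets []string) []DataPoint {
	var out []DataPoint
	for _, t := range targets {
		points, err := scrapeTarget(t)
		up := 1.0
		if err != nil {
			up = 0.0 // the puller knows this target is unreachable
		}
		out = append(out, points...)
		out = append(out, DataPoint{
			Name:      "resource.up",
			Value:     up,
			Resource:  map[string]string{"scrape.target": t},
			Timestamp: time.Now(),
		})
	}
	return out
}

func main() {
	for _, p := range scrapeAll([]string{"10.0.0.1:9090"}) {
		fmt.Printf("%s{%v} = %v\n", p.Name, p.Resource, p.Value)
	}
}
```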

Related issues #1078

Mentioning @jmacd and @bogdandrutu for reviews.

@cwildman requested review from a team October 16, 2020 17:22
@justinfoote (Member) left a comment


I love this simple proposal.
I have one question for the group. Should we specify exporter behavior with a SHOULD?

They SHOULD inherit the attributes of the encapsulating Resource ...

If not, maybe we should make this a MAY?

@cwildman (Author)

@arminru I think I addressed your feedback. Thank you for the review!

@justinfoote (Member)

@open-telemetry/specs-metrics-approvers this is ready for review!


## Metric Instruments

The following metric instruments SHOULD be synthesized every time a Resource produces some
Contributor

I am wondering whether this SHOULD should be restricted to OTLP exporters. I think we explicitly do not want to export this from a Prometheus exporter, since the receiving end will synthesize up.

We could elaborate on the purpose. Suppose a process starts a metrics SDK but the application registers no instrumentation packages and no instruments. A push OTLP exporter is installed and now wants to push a single collection interval consisting of 0 metrics. This metric exists to show that the resource is up.
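A minimal sketch of that empty-interval scenario, using hypothetical `Metric` and `pushBatch` placeholders rather than the real OTLP exporter API: the exporter appends `resource.up` to every collection interval, so an interval with zero application metrics still shows the Resource is up.

```go
package main

import "fmt"

// Metric is a hypothetical, minimal metric sample.
type Metric struct {
	Name  string
	Value float64
}

// pushBatch stands in for a real push exporter sending one collection interval.
func pushBatch(resource map[string]string, metrics []Metric) {
	fmt.Printf("pushing %d metric(s) for resource %v\n", len(metrics), resource)
}

// exportInterval appends resource.up to every collection interval, so an
// interval with zero application metrics still shows the Resource is up.
func exportInterval(resource map[string]string, collected []Metric) {
	batch := append(collected, Metric{Name: "resource.up", Value: 1})
	pushBatch(resource, batch)
}

func main() {
	res := map[string]string{"service.name": "checkout", "host.name": "host-1"}
	// The application registered no instrumentation packages and no
	// instruments, so the SDK collected nothing this interval.
	exportInterval(res, nil)
}
```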

Author

Ok that is a good use case for an up metric with OTLP exporters that I hadn't thought of.

Regarding whether this SHOULD should be restricted to OTLP exporters, what about the Prometheus remote write exporter? That is non-OTLP, but I think we want the up metric there.

Contributor

I see. @cwildman, yes! Maybe we need a term to say when this rule applies, like "original", to indicate when instrumentation is generated, particularly when it is first aggregated into an exposition format. I used the term "aggregated" thinking of the Collector: it will not apply the up metric in any of its exporters, but it could apply the metric in some of its receivers. OTel SDKs that export Prometheus remote write would apply this rule, agreed.


| Name | Instrument | Units | Description |
|----------------------|---------------|--------------|-------------|
| `resource.up` | ValueRecorder | 1 | A value of `1` indicates the Resource is healthy, `0` if the Resource is unhealthy. |
Contributor

The name resource.up is great, but I don't think we should use the word "healthy". Maybe just "produced telemetry"?

For a push exporter, a resource that is not up means "no longer pushing". When might we see a push exporter explicitly push a zero for resource.up? I was thinking through connection error states where some metric data is dropped. Should the resource.up variable be set to zero when data is lost?

Contributor

I was thinking of this as an UpDownSumObserver. In a metrics pipeline we could drop resource attributes, lowering cardinality, and the values will naturally aggregate as a sum. This makes the resource.up metric effectively a cardinality counter.
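A small worked example of the cardinality-counter point, with hypothetical types rather than a real SDK aggregator: three Resources each report `resource.up = 1`; dropping the resource attributes and summing yields 3, the number of reporting Resources.

```go
package main

import "fmt"

// point is a hypothetical resource.up sample tagged with its Resource attributes.
type point struct {
	resource map[string]string
	value    float64
}

// sumAfterDroppingResource discards the resource attributes and sums the values.
// For resource.up (always 1 per live Resource) this counts reporting Resources.
func sumAfterDroppingResource(points []point) float64 {
	var sum float64
	for _, p := range points {
		sum += p.value
	}
	return sum
}

func main() {
	points := []point{
		{resource: map[string]string{"host.name": "host-1"}, value: 1},
		{resource: map[string]string{"host.name": "host-2"}, value: 1},
		{resource: map[string]string{"host.name": "host-3"}, value: 1},
	}
	fmt.Println(sumAfterDroppingResource(points)) // 3: a cardinality counter
}
```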

@cwildman (Author) commented Oct 28, 2020

"produced telemetry" makes sense to me.

For a push exporter, a resource that is not up means "no longer pushing". When might we see a push exporter explicitly push a zero for resource.up? I was thinking through connection error states where some metric data is dropped. Should the resource.up variable be set to zero when data is lost?

If I'm following you correctly, isn't it possible that the resource.up metric itself is lost as part of those connection errors? If that's the case, I feel like this would be an unreliable signal.

I was thinking of this as an UpDownSumObserver. In a metrics pipeline we could drop resource attributes, lowering cardinality, and the values will naturally aggregate as a sum. This makes the resource.up metric effectively a cardinality counter.

I'm curious why UpDownSumObserver instead of just SumObserver in that scenario, given that we never have a value less than zero?

Contributor

If that's the case, I feel like this would be an unreliable signal.

True. I was thinking about this case but didn't say anything. The resource.up metric might be treated specially inside an SDK (we might say MUST) so that it is buffered even when other timeseries are dropped. That kind of requirement could perhaps be added in the ../protocol/exporter.md document.

@jmacd (Contributor) commented Oct 23, 2020

In a telemetry system using a push model, it's less clear how beneficial the up metric is.

There are differences in the push model, and one of the reasons we might not want to generate resource.up when pushing is that it's implied by the existence of any other metric data over the same time period.

This means that we can standardize one metric that MUST be built in to OTel SDKs. In the OTel-Go contrib runtime package there is currently a "runtime.uptime" metric. Here's an alternative approach, then: create a semantic-conventional metric resource.uptime, defined as the wall-clock time since the process started. The existence of a reported resource.uptime value implies the resource was up. (I prefer "resource.uptime" to "runtime.uptime".)
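Before the caveats in the next paragraph, a minimal sketch of that alternative, assuming a hypothetical observer-style callback rather than the actual OTel-Go instrument API: the process records its start time and reports wall-clock seconds since start on each collection; any reported value implies the Resource was up.

```go
package main

import (
	"fmt"
	"time"
)

var processStart = time.Now()

// observeUptime is the kind of callback an asynchronous instrument would run
// on each collection: it reports wall-clock seconds since the process started.
func observeUptime() (name string, value float64) {
	return "resource.uptime", time.Since(processStart).Seconds()
}

func main() {
	// Simulate two collection intervals.
	for i := 0; i < 2; i++ {
		time.Sleep(100 * time.Millisecond)
		name, v := observeUptime()
		fmt.Printf("%s = %.3f s\n", name, v)
	}
}
```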

However, an uptime metric doesn't solve one of the problems I was hoping to solve here. An end user who has built up monitoring and alerting over Prometheus data is used to an up metric. We'd like to help this end user transition to OTel in two ways: (1) replacing the Prometheus server with an OTel Collector running a Prometheus/OpenMetrics receiver, and (2) replacing Prometheus clients with OTel SDKs.

Whether (1) or (2), the user's alerting and monitoring should keep working. We could logically synthesize the up metric from resource.uptime as stated above, but a Prometheus receiver in case (1) can't. Besides, it's simpler just to use an explicit metric.

I worry a little about introducing a name conflict so early on. If OTel uses resource.up and Prometheus users are expecting up, we're also going to have to standardize translation stages. Although I like resource.up, I wonder if we could agree on plain up: for Prometheus compatibility, to save 9 bytes, and to reduce the number of translations.

@cwildman (Author)

There are differences in the push model, and one of the reasons we might not want to generate resource.up when pushing is that it's implied by the existence of any other metric data over the same time period.

Totally agree; I tried to articulate this in the description, but you've said it much more clearly.

Whether (1) or (2), the user's alerting and monitoring should keep working. We could logically synthesize the up metric from resource.uptime as stated above, but a Prometheus receiver in case (1) can't.

In case (1), shouldn't the Prometheus receiver be responsible for synthesizing the up metric, since it is replacing the Prometheus server, which normally does that?

I worry a little about introducing a name conflict so early on. If OTel uses resource.up and Prometheus users are expecting up, we're also going to have to standardize translation stages. Although I like resource.up, I wonder if we could agree on plain up: for Prometheus compatibility, to save 9 bytes, and to reduce the number of translations.

What exactly do you mean by standardize translation stages? I'm picturing we could have this as a default relabel rule for places where it's obviously necessary. For example, a Prometheus remote write exporter should make sure resource.uptime is relabeled to up by default; other exporters that aren't likely to be pushing to legacy systems wouldn't need this rule. I worry that breaking the naming scheme for these metrics to provide backwards compatibility with other systems ends up being long-running tech debt, whereas relabel rules allow us to provide backwards compatibility without breaking the naming scheme; it's just unclear how often creating the relabel rule would require effort from the user. You are certainly in a much better position to weigh that tradeoff.
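To make the relabel idea concrete, here is a hedged Go sketch of the kind of default rename an exporter targeting a legacy Prometheus backend could apply just before sending. The `Metric` type and rename table are hypothetical, not a real exporter configuration, and the source name shown is `resource.up` from the table above; whichever name the convention settles on would map to up in the same way.

```go
package main

import "fmt"

// Metric is a hypothetical, minimal metric sample.
type Metric struct {
	Name  string
	Value float64
}

// defaultRenames is an illustrative rename table an exporter pushing to a
// legacy Prometheus backend could apply by default, so the OTel naming scheme
// stays intact while existing up-based alerting keeps working.
var defaultRenames = map[string]string{
	"resource.up": "up",
}

// relabel rewrites metric names in place just before export.
func relabel(metrics []Metric, renames map[string]string) {
	for i := range metrics {
		if newName, ok := renames[metrics[i].Name]; ok {
			metrics[i].Name = newName
		}
	}
}

func main() {
	batch := []Metric{{Name: "resource.up", Value: 1}}
	relabel(batch, defaultRenames)
	fmt.Println(batch[0].Name) // "up", as legacy Prometheus alerting expects
}
```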

@jmacd (Contributor) commented Oct 30, 2020

What exactly do you mean by standardize translation stages? I'm picturing we could have this as a default relabel rule for places where it's obviously necessary.

I think you've got the picture. This extra relabeling is just more complexity for the sake of using resource.up vs. up. On the other hand, other metrics conventions will be encountered in the real world, and we're probably going to have to get good at this sort of translation in practice. I prefer the qualified name resource.up.

github-actions bot commented Nov 7, 2020

This PR was marked stale due to lack of activity. It will be closed in 7 days.

github-actions bot added the Stale label Nov 7, 2020
@MrAlias removed the Stale label Nov 11, 2020
@cwildman (Author)

@jmacd @bogdandrutu, per your discussion at last week's metrics meeting, we have two semantic conventions we want to define:

  1. A resource.up metric that can be used for compatibility with Prometheus up.
  2. A resource.uptime metric that can be used to track the time since the process started.

I'd like to keep this PR focused on item 1, the compatibility issue. Do we want a separate OTel issue for item 2?

@cwildman force-pushed the resource-metrics-semantic-conventions branch from 3caa9d1 to 94ef57b on November 12, 2020 19:32
github-actions bot

This PR was marked stale due to lack of activity. It will be closed in 7 days.

github-actions bot added the Stale label Nov 20, 2020
github-actions bot

Closed as inactive. Feel free to reopen if this PR is still being worked on.

github-actions bot closed this Nov 27, 2020
@jgals commented Dec 4, 2020

@jmacd and @bogdandrutu, any thoughts on the above proposal and question from @cwildman?

@jmacd reopened this Dec 4, 2020
github-actions bot removed the Stale label Dec 5, 2020
@jmacd (Contributor) commented Dec 11, 2020

Answering @jgals's question: let's focus on the first item described above.

A resource.up metric that can be used for compatibility with Prometheus up

I've filed #1273, which will let us derive the uptime from the resource.up metric.

github-actions bot

This PR was marked stale due to lack of activity. It will be closed in 7 days.

github-actions bot added the Stale label Dec 19, 2020
github-actions bot

Closed as inactive. Feel free to reopen if this PR is still being worked on.

github-actions bot closed this Dec 26, 2020