
[exporter/prometheusremotewrite] Fix: Don't drop batch in case of failure to translate metrics #29729

Merged

Conversation

Contributor

@rapphil rapphil commented Dec 11, 2023

Description: Don't drop the whole batch when translation from OTel to Prometheus fails. Instead, with this PR we try to send to Prometheus all the metrics that were translated successfully and log a warning for the ones that failed to translate.

This PR also adds support for telemetry in this component so that it is possible to inspect how the translation process is behaving and identify failed translations.

I opted not to include the number of time series that failed translation because I don't want to make assumptions about how the FromMetrics function works. Instead we just publish whether there was any failure during the translation process and the number of time series returned.

Link to tracking Issue: #15281

Testing: Unit tests were added to cover the case of mixed metrics, where some succeed translation and some fail.
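
As a rough sketch of the approach described above (this is not the merged diff; the helper names recordFailure and export are illustrative stand-ins for the exporter's internals, and the FromMetrics settings are left at their defaults):

package sketch

import (
	"context"

	"github.com/prometheus/prometheus/prompb"
	"go.opentelemetry.io/collector/pdata/pmetric"
	"go.uber.org/zap"

	"github.com/open-telemetry/opentelemetry-collector-contrib/pkg/translator/prometheusremotewrite"
)

// pushMetrics sketches the new behaviour: a translation error no longer aborts
// the batch. The failure is counted and logged, and whatever time series were
// translated are still handed to the export path.
func pushMetrics(ctx context.Context, logger *zap.Logger, md pmetric.Metrics,
	recordFailure func(context.Context),
	export func(context.Context, map[string]*prompb.TimeSeries) error,
) error {
	tsMap, err := prometheusremotewrite.FromMetrics(md, prometheusremotewrite.Settings{})
	if err != nil {
		// Returning the error here would make the exporter helper treat the
		// whole batch as a permanent error and drop it.
		recordFailure(ctx)
		logger.Warn("failed to translate metrics, exporting remaining metrics",
			zap.Error(err), zap.Int("translated", len(tsMap)))
	}
	// Only export/network errors are surfaced to the exporter helper.
	return export(ctx, tsMap)
}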

Don't return an error to the exporter helper when translation from
OpenTelemetry metrics to Prometheus metrics fails (which can happen
for several reasons). Instead, log the error at warn level and try to
send as much data as possible.

Signed-off-by: Raphael Silva <rapphil@gmail.com>
@rapphil rapphil force-pushed the rapphil-prw-fix-metric-translation branch from 44b6709 to ac29886 on December 11, 2023 02:01
@rapphil rapphil changed the title Fix: Don't return error for metric translation in the prw exporter [prometheusremotewriteexporter] Fix: Don't drop batch in case of failure to translate metrics Dec 11, 2023
Comment on lines 138 to 139
prwe.settings.Logger.Warn("Failed to translate metrics: %s", zap.Error(err))
prwe.settings.Logger.Warn("Exporting remaining %s metrics.", zap.Int("converted", len(tsMap)))
Contributor

Are there known cases where OTel -> Prometheus translation will fail? If so, this may be a noisy log line to emit at the warning level. Instead of logging at the warn level, could we ensure that the exporter generates and emits a metric that counts the number of failed translations?

Contributor Author

This change will not increase the number of error messages: in its present state the component already produces an error message on failed translations, namely the permanent-error message logged by the exporter helper.

Please refer to the following current error message for an example of a failure:

2022-10-19T11:27:07.375+0800    error   exporterhelper/queued_retry.go:395      Exporting failed. The error is not retryable. Dropping data.    {"kind": "exporter", "data_type": "metrics", "name": "prometheusremotewrite", "error": "Permanent error: invalid temporality and type combination;

I'd rather put this new feature in a different PR.

Contributor

This PR was marked stale due to lack of activity. It will be closed in 14 days.

@github-actions github-actions bot added the Stale label Dec 27, 2023
@rapphil rapphil changed the title [prometheusremotewriteexporter] Fix: Don't drop batch in case of failure to translate metrics [exporter/prometheusremotewrite] Fix: Don't drop batch in case of failure to translate metrics Jan 7, 2024
@github-actions github-actions bot removed the Stale label Jan 8, 2024
@rapphil rapphil requested a review from bryan-aguilar January 12, 2024 17:11
exporter/prometheusremotewriteexporter/exporter.go (outdated, resolved)
Comment on lines 183 to 184
prwe.settings.Logger.Debug("failed to translate metrics %s", zap.Error(err))
prwe.settings.Logger.Debug("exporting remaining %s metrics", zap.Int("translated", len(tsMap)))
Contributor

Suggested change
prwe.settings.Logger.Debug("failed to translate metrics %s", zap.Error(err))
prwe.settings.Logger.Debug("exporting remaining %s metrics", zap.Int("translated", len(tsMap)))
prwe.settings.Logger.Debug("failed to translate metrics, zap.Error(err))
prwe.settings.Logger.Debug("exporting remaining metrics", zap.Int("count", len(tsMap)))

The %s is for formatted strings. Zap uses structured logging instead.

I'm also wondering if we want to adjust these log lines to be clearer. I think they may confuse users. I think the message intent is:

  1. Some metrics failed to translate. This was not a catastrophic failure.
  2. N metrics were successfully translated. The PRWE will attempt to export these.

exporter/prometheusremotewriteexporter/exporter_test.go (outdated, resolved)
exporter/prometheusremotewriteexporter/exporter_test.go (outdated, resolved)
}

func (p *prwTelemetryOtel) recordTranslationFailure(ctx context.Context) {
p.failedTranslations.Add(ctx, 1)
Member

Should these include the component ID as an attribute? It can be important to identify which component is failing when multiple exporters of the same type are configured.

Contributor Author

good point. done!
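
For reference, a hedged sketch of attaching the component ID as a metric attribute with the OpenTelemetry Go metric API (the type and field names here are illustrative, not necessarily the merged code):

package prwtelemetry

import (
	"context"

	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/metric"
)

// prwTelemetryOtel holds the counter plus the attributes recorded with every
// data point, e.g. attribute.String("exporter", <component ID>).
type prwTelemetryOtel struct {
	failedTranslations metric.Int64Counter
	otelAttrs          []attribute.KeyValue
}

func (p *prwTelemetryOtel) recordTranslationFailure(ctx context.Context) {
	// Tagging with the component ID lets several prometheusremotewrite
	// exporters in one collector be told apart in the emitted telemetry.
	p.failedTranslations.Add(ctx, 1, metric.WithAttributes(p.otelAttrs...))
}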

// TODO: create helper functions similar to the processor helper: BuildCustomMetricName
prefix := "exporter/" + metadata.Type + "/"

failedTranslations, _ := meter.Int64Counter(prefix+"failed_translations",
Member

Don't ignore instrument creation errors. These should be propagated up the call stack if they can't be handled here.

Contributor Author

I don't know why I thought it was ok to do this. fixed.
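
A hedged sketch of propagating the instrument-creation error instead of discarding it (the constructor name and description text are illustrative assumptions, not the merged wording):

package prwtelemetry

import "go.opentelemetry.io/otel/metric"

// newFailedTranslationsCounter returns the error from Int64Counter to the
// caller instead of throwing it away, so a broken meter surfaces at exporter
// start-up rather than silently at runtime.
func newFailedTranslationsCounter(meter metric.Meter, prefix string) (metric.Int64Counter, error) {
	counter, err := meter.Int64Counter(prefix+"failed_translations",
		metric.WithDescription("Number of translation operations that failed to translate metrics from OTel to Prometheus"),
		metric.WithUnit("1"),
	)
	if err != nil {
		return nil, err
	}
	return counter, nil
}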

Comment on lines 38 to 39
telemetry := newPRWTelemetry(set)
prwe, err := newPRWExporter(prwCfg, set, telemetry)
Member

Should this be done inside newPRWExporter()? It's already receiving the settings object that newPRWTelemetry() needs, and the telemetry object isn't used outside of the new exporter function.

Contributor Author

This was done because of the tests. I think it is OK to create the real telemetry object and then swap in a mock one. I updated the code.

Comment on lines 183 to 184
prwe.settings.Logger.Debug("failed to translate metrics %s", zap.Error(err))
prwe.settings.Logger.Debug("exporting remaining %s metrics", zap.Int("translated", len(tsMap)))
Member

Suggested change
prwe.settings.Logger.Debug("failed to translate metrics %s", zap.Error(err))
prwe.settings.Logger.Debug("exporting remaining %s metrics", zap.Int("translated", len(tsMap)))
prwe.settings.Logger.Debug("failed to translate metrics, exporting remaining metrics", zap.Error(err), zap.Int("translated", len(tsMap)))

If these are to be logged at the same level, we might as well make it a single logging statement. Is the number of metrics that failed translation available to record?

Contributor Author

@rapphil rapphil Jan 12, 2024

I like your idea of using a single line.

Is the number of metrics that failed translation available to record?

No, this is hidden inside the FromMetrics implementation. In theory we could use the number of OTel metrics minus len(tsMap), but FromMetrics also includes other time series, such as target_info, and this behaviour varies based on the flags passed to the function... 😒

We could also look at the number of errors embedded in the error returned by FromMetrics, but in my opinion that would also rely on assumptions about FromMetrics.

rapphil and others added 4 commits January 12, 2024 10:59
Co-authored-by: Anthony Mirabella <a9@aneurysm9.com>
Co-authored-by: bryan-aguilar <46550959+bryan-aguilar@users.noreply.github.com>
Signed-off-by: Raphael Silva <rapphil@gmail.com>
Signed-off-by: Raphael Silva <rapphil@gmail.com>
@rapphil rapphil requested a review from dashpole as a code owner January 12, 2024 21:47
@github-actions github-actions bot added the processor/resourcedetection (Resource detection processor) label Jan 12, 2024
Signed-off-by: Raphael Silva <rapphil@gmail.com>
Signed-off-by: Raphael Silva <rapphil@gmail.com>
@bryan-aguilar
Contributor

 --- a/exporter/prometheusremotewriteexporter/go.mod
+++ b/exporter/prometheusremotewriteexporter/go.mod
@@ -23,6 +23,7 @@ require (
 	go.opentelemetry.io/collector/consumer v0.93.1-0.20240129215828-1ed45ec12569
 	go.opentelemetry.io/collector/exporter v0.93.1-0.20240129215828-1ed45ec12569
 	go.opentelemetry.io/collector/pdata v1.0.2-0.20240129215828-1ed45ec12569
+	go.opentelemetry.io/otel v1.22.0
 	go.opentelemetry.io/otel/metric v1.22.0
 	go.opentelemetry.io/otel/trace v1.22.0
 	go.uber.org/multierr v1.11.0
@@ -75,7 +76,6 @@ require (
 	go.opentelemetry.io/collector/receiver v0.93.1-0.20240129215828-1ed45ec12569 // indirect
 	go.opentelemetry.io/collector/semconv v0.93.1-0.20240129215828-1ed45ec12569 // indirect
 	go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp v0.47.0 // indirect
-	go.opentelemetry.io/otel v1.22.0 // indirect
 	go.opentelemetry.io/otel/exporters/prometheus v0.45.0 // indirect
 	go.opentelemetry.io/otel/sdk v1.22.0 // indirect
 	go.opentelemetry.io/otel/sdk/metric v1.22.0 // indirect
go.mod/go.sum deps changes detected, please run "make gotidy" and commit the changes in this PR.

@bryan-aguilar bryan-aguilar added the ready to merge (Code review completed; ready to merge by maintainers) label Jan 30, 2024
Contributor

@codeboten codeboten left a comment

Please resolve the conflict and we can get this merged

Contributor

This PR was marked stale due to lack of activity. It will be closed in 14 days.

@github-actions github-actions bot added the Stale label Feb 28, 2024
@Aneurysm9 Aneurysm9 removed the Stale label Mar 1, 2024
@bryan-aguilar bryan-aguilar added and removed the ready to merge (Code review completed; ready to merge by maintainers) label Mar 1, 2024
@codeboten codeboten merged commit e5fc693 into open-telemetry:main Mar 13, 2024
142 checks passed
@github-actions github-actions bot added this to the next release milestone Mar 13, 2024
Labels
exporter/prometheusremotewrite, ready to merge (Code review completed; ready to merge by maintainers)