[exporter/prometheusremotewrite] Fix: Don't drop batch in case of failure to translate metrics #29729
Conversation
Don't return an error to the exporter helper in case of failures to translate from OpenTelemetry metrics to Prometheus metrics (which can happen for several reasons). Instead, log the error at warn level and try to send as much data as possible.
Signed-off-by: Raphael Silva <rapphil@gmail.com>
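For context, a minimal sketch of the control flow this change aims for, assuming the contrib translator's `FromMetrics` returns its partial results alongside the error; the surrounding names (`prwe.export`, `prwe.exporterSettings`) are illustrative, not the exporter's exact code:

```go
// Illustrative only: "log and keep going" in the push path.
tsMap, err := prometheusremotewrite.FromMetrics(md, prwe.exporterSettings)
if err != nil {
	// Returning this error would make the exporter helper treat the whole
	// batch as a permanent failure and drop it. Instead, log and continue
	// with whatever was successfully translated.
	prwe.settings.Logger.Warn("Failed to translate metrics", zap.Error(err))
	prwe.settings.Logger.Warn("Exporting remaining metrics", zap.Int("converted", len(tsMap)))
}
// prwe.export stands in for the exporter's actual send path.
return prwe.export(ctx, tsMap)
```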
Force-pushed from 44b6709 to ac29886
prwe.settings.Logger.Warn("Failed to translate metrics: %s", zap.Error(err)) | ||
prwe.settings.Logger.Warn("Exporting remaining %s metrics.", zap.Int("converted", len(tsMap))) |
Are there known cases where OTel -> Prometheus translation will fail? If so, this may be a noisy log line to emit at the warning level. Instead of logging at warn level, could we ensure the exporter generates and emits a metric that counts the number of failed translations?
This change will not increase the number of error messages: in its present state the component already produces an error message on failed translations, namely the permanent error logged by the exporter helper.
Please refer to the following current error message for an example of such a failure:
```
2022-10-19T11:27:07.375+0800 error exporterhelper/queued_retry.go:395 Exporting failed. The error is not retryable. Dropping data. {"kind": "exporter", "data_type": "metrics", "name": "prometheusremotewrite", "error": "Permanent error: invalid temporality and type combination;
```
I'd rather put this new feature in a different PR.
This PR was marked stale due to lack of activity. It will be closed in 14 days.
Signed-off-by: Raphael Silva <rapphil@gmail.com>
prwe.settings.Logger.Debug("failed to translate metrics %s", zap.Error(err)) | ||
prwe.settings.Logger.Debug("exporting remaining %s metrics", zap.Int("translated", len(tsMap))) |
prwe.settings.Logger.Debug("failed to translate metrics %s", zap.Error(err)) | |
prwe.settings.Logger.Debug("exporting remaining %s metrics", zap.Int("translated", len(tsMap))) | |
prwe.settings.Logger.Debug("failed to translate metrics, zap.Error(err)) | |
prwe.settings.Logger.Debug("exporting remaining metrics", zap.Int("count", len(tsMap))) |
The `%s` is for formatted strings. Zap uses structured logging instead.
I'm also wondering if we want to adjust these log lines to be clearer. I think they may confuse users. I think the message intent is (sketched below):
- Some metrics failed to translate. This was not a catastrophic failure.
- N metrics were successfully translated. The PRWE will attempt to export these.
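A possible wording matching that intent, as a sketch only (the exact messages and field names are up to the author; zap calls assumed):

```go
// Hypothetical wording; the point is conveying "partial failure, export continues".
prwe.settings.Logger.Debug("some metrics failed to translate; this is not a fatal error", zap.Error(err))
prwe.settings.Logger.Debug("exporting the successfully translated metrics", zap.Int("count", len(tsMap)))
```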
```go
}

func (p *prwTelemetryOtel) recordTranslationFailure(ctx context.Context) {
	p.failedTranslations.Add(ctx, 1)
```
Should these include the component ID as an attribute? It can be important to identify which component is failing when multiple exporters of the same type are configured.
good point. done!
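A sketch of what recording the component ID as an attribute could look like (assuming the otel-go metric API; the `otelAttrs` field is illustrative, populated from something like `attribute.String("exporter", set.ID.String())`):

```go
// Illustrative: tag every increment with the exporter instance that recorded it,
// so multiple prometheusremotewrite exporters can be told apart.
func (p *prwTelemetryOtel) recordTranslationFailure(ctx context.Context) {
	p.failedTranslations.Add(ctx, 1, metric.WithAttributes(p.otelAttrs...))
}
```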
```go
// TODO: create helper functions similar to the processor helper: BuildCustomMetricName
prefix := "exporter/" + metadata.Type + "/"

failedTranslations, _ := meter.Int64Counter(prefix+"failed_translations",
```
Don't ignore instrument creation errors. These should be propagated up the call stack if they can't be handled here.
I don't know why I thought it was ok to do this. fixed.
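A sketch of propagating the instrument-creation error instead of discarding it (the constructor name and settings type are assumptions):

```go
// Illustrative: surface instrument-creation failures to the caller instead of
// silently ignoring them with a blank identifier.
func newPRWTelemetryOtel(set exporter.CreateSettings) (*prwTelemetryOtel, error) {
	meter := set.MeterProvider.Meter("prometheusremotewrite")
	prefix := "exporter/" + metadata.Type + "/"
	failedTranslations, err := meter.Int64Counter(prefix + "failed_translations")
	if err != nil {
		return nil, err
	}
	return &prwTelemetryOtel{failedTranslations: failedTranslations}, nil
}
```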
```go
telemetry := newPRWTelemetry(set)
prwe, err := newPRWExporter(prwCfg, set, telemetry)
```
Should this be done inside `newPRWExporter()`? It's already receiving the settings object that `newPRWTelemetry()` needs, and the telemetry object isn't used outside of the new exporter function.
This was done because of the tests. I think it is OK to create a real telemetry object by default and then swap it for a mock one in the tests. I updated the code.
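One way this pattern can be structured, as a sketch under the assumption that the telemetry sits behind a small interface (the interface, constructor, and factory names here are illustrative):

```go
// Hypothetical shape: the factory wires up the real telemetry, while tests can
// construct the exporter with a mock that satisfies the same interface.
type prwTelemetry interface {
	recordTranslationFailure(ctx context.Context)
}

func createMetricsExporter(set exporter.CreateSettings, prwCfg *Config) (*prwExporter, error) {
	telemetry, err := newPRWTelemetryOtel(set)
	if err != nil {
		return nil, err
	}
	return newPRWExporter(prwCfg, set, telemetry)
}
```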
prwe.settings.Logger.Debug("failed to translate metrics %s", zap.Error(err)) | ||
prwe.settings.Logger.Debug("exporting remaining %s metrics", zap.Int("translated", len(tsMap))) |
prwe.settings.Logger.Debug("failed to translate metrics %s", zap.Error(err)) | |
prwe.settings.Logger.Debug("exporting remaining %s metrics", zap.Int("translated", len(tsMap))) | |
prwe.settings.Logger.Debug("failed to translate metrics, exporting remaining metrics", zap.Error(err), zap.Int("translated", len(tsMap))) |
If these are to be logged at the same level, we might as well make them a single logging statement. Is the number of metrics that failed translation available to record?
I like your idea of using a single line.
> Is the number of metrics that failed translation available to record?

No, that is hidden inside the `FromMetrics` function implementation. In theory we could use `number of OTel metrics - len(tsMap)`, but `FromMetrics` also emits other time series, such as `target_info`, and this behaviour varies based on the flags passed to the function.... 😒
We could also look into the number of errors embedded in the error returned by `FromMetrics`, but in my opinion that would also rely on assumptions about `FromMetrics`.
Co-authored-by: Anthony Mirabella <a9@aneurysm9.com>
Co-authored-by: bryan-aguilar <46550959+bryan-aguilar@users.noreply.github.com>
Signed-off-by: Raphael Silva <rapphil@gmail.com>
Please resolve the conflict and we can get this merged
This PR was marked stale due to lack of activity. It will be closed in 14 days.
Description: Don't drop a whole batch in case of a failure to translate from OTel to Prometheus. Instead, with this PR we try to send to Prometheus all the metrics that were properly translated and log a warning for the translation failures.
This PR also adds support for telemetry in this component, so that it is possible to inspect how the translation process is behaving and identify failed translations.
I opted not to include the number of time series that failed translation because I don't want to make assumptions about how the `FromMetrics` function works. Instead we just publish whether there was any failure during the translation process, along with the number of time series returned.
Link to tracking issue: #15281
Testing: Unit tests were added to cover the case of mixed metrics, where some succeed translation and some fail.
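A sketch of how such a mixed-batch test could be shaped (the mock telemetry and the `newTestExporter` helper are hypothetical; the failing metric uses the delta-sum case shown in the error message above):

```go
// Hypothetical test shape: one metric that translates and one that does not.
func TestPushMetricsPartialTranslation(t *testing.T) {
	md := pmetric.NewMetrics()
	sm := md.ResourceMetrics().AppendEmpty().ScopeMetrics().AppendEmpty()

	good := sm.Metrics().AppendEmpty()
	good.SetName("good_gauge")
	good.SetEmptyGauge().DataPoints().AppendEmpty().SetIntValue(1)

	bad := sm.Metrics().AppendEmpty()
	bad.SetName("bad_delta_sum")
	badSum := bad.SetEmptySum()
	badSum.SetAggregationTemporality(pmetric.AggregationTemporalityDelta) // known to fail translation
	badSum.DataPoints().AppendEmpty().SetIntValue(5)

	telemetry := &mockPRWTelemetry{}      // hypothetical mock
	prwe := newTestExporter(t, telemetry) // hypothetical helper

	// The push must succeed and the translation failure must be recorded.
	assert.NoError(t, prwe.PushMetrics(context.Background(), md))
	assert.Equal(t, 1, telemetry.failedTranslations)
}
```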