[exporter/prometheusremotewrite] Fix: Don't drop batch in case of failure to translate metrics #29729
Conversation
Don't return an error to the exporter helper in case of failures to translate from OpenTelemetry metrics to Prometheus metrics (which can happen for several reasons). Instead, log the error at warn level and try to send as much data as possible.
Signed-off-by: Raphael Silva <rapphil@gmail.com>
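For context, a minimal sketch of the control flow this change aims for, assuming the contrib translator's `FromMetrics` returns its partial results alongside the error; the surrounding names (`prwe.export`, `prwe.exporterSettings`) are illustrative, not the exporter's exact code:

```go
// Illustrative only: "log and keep going" in the push path.
tsMap, err := prometheusremotewrite.FromMetrics(md, prwe.exporterSettings)
if err != nil {
	// Returning this error would make the exporter helper treat the whole
	// batch as a permanent failure and drop it. Instead, log and continue
	// with whatever was successfully translated.
	prwe.settings.Logger.Warn("Failed to translate metrics", zap.Error(err))
	prwe.settings.Logger.Warn("Exporting remaining metrics", zap.Int("converted", len(tsMap)))
}
// prwe.export stands in for the exporter's actual send path.
return prwe.export(ctx, tsMap)
```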
Force-pushed from 44b6709 to ac29886
prwe.settings.Logger.Warn("Failed to translate metrics: %s", zap.Error(err)) | ||
prwe.settings.Logger.Warn("Exporting remaining %s metrics.", zap.Int("converted", len(tsMap))) |
Are there known cases where OTel -> Prometheus translation will fail? If so, this may be a noisy log line to emit at the warning level. Instead of logging at warn level, could we ensure the exporter generates and emits a metric that counts the number of failed translations?
This change will not increase the number of error messages: in its present state the component already produces an error message on failed translations, namely the permanent error logged by the exporter helper.
Please refer to the following current error message for an example of such a failure:
```
2022-10-19T11:27:07.375+0800 error exporterhelper/queued_retry.go:395 Exporting failed. The error is not retryable. Dropping data. {"kind": "exporter", "data_type": "metrics", "name": "prometheusremotewrite", "error": "Permanent error: invalid temporality and type combination;
```
I'd rather put this new feature in a different PR.
This PR was marked stale due to lack of activity. It will be closed in 14 days.
Signed-off-by: Raphael Silva <rapphil@gmail.com>
prwe.settings.Logger.Debug("failed to translate metrics %s", zap.Error(err)) | ||
prwe.settings.Logger.Debug("exporting remaining %s metrics", zap.Int("translated", len(tsMap))) |
prwe.settings.Logger.Debug("failed to translate metrics %s", zap.Error(err)) | |
prwe.settings.Logger.Debug("exporting remaining %s metrics", zap.Int("translated", len(tsMap))) | |
prwe.settings.Logger.Debug("failed to translate metrics, zap.Error(err)) | |
prwe.settings.Logger.Debug("exporting remaining metrics", zap.Int("count", len(tsMap))) |
The `%s` is for formatted strings. Zap uses structured logging instead.
I'm also wondering if we want to adjust these log lines to be clearer. I think they may confuse users. I think the message intent is (sketched below):
- Some metrics failed to translate. This was not a catastrophic failure.
- N metrics were successfully translated. The PRWE will attempt to export these.
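A possible wording matching that intent, as a sketch only (the exact messages and field names are up to the author; zap calls assumed):

```go
// Hypothetical wording; the point is conveying "partial failure, export continues".
prwe.settings.Logger.Debug("some metrics failed to translate; this is not a fatal error", zap.Error(err))
prwe.settings.Logger.Debug("exporting the successfully translated metrics", zap.Int("count", len(tsMap)))
```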
```go
}

func (p *prwTelemetryOtel) recordTranslationFailure(ctx context.Context) {
	p.failedTranslations.Add(ctx, 1)
```
Should these include the component ID as an attribute? It can be important to identify which component is failing when multiple exporters of the same type are configured.
good point. done!
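A sketch of what recording the component ID as an attribute could look like (assuming the otel-go metric API; the `otelAttrs` field is illustrative, populated from something like `attribute.String("exporter", set.ID.String())`):

```go
// Illustrative: tag every increment with the exporter instance that recorded it,
// so multiple prometheusremotewrite exporters can be told apart.
func (p *prwTelemetryOtel) recordTranslationFailure(ctx context.Context) {
	p.failedTranslations.Add(ctx, 1, metric.WithAttributes(p.otelAttrs...))
}
```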
```go
// TODO: create helper functions similar to the processor helper: BuildCustomMetricName
prefix := "exporter/" + metadata.Type + "/"

failedTranslations, _ := meter.Int64Counter(prefix+"failed_translations",
```
Don't ignore instrument creation errors. These should be propagated up the call stack if they can't be handled here.
I don't know why I thought it was ok to do this. fixed.
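A sketch of propagating the instrument-creation error instead of discarding it (the constructor name and settings type are assumptions):

```go
// Illustrative: surface instrument-creation failures to the caller instead of
// silently ignoring them with a blank identifier.
func newPRWTelemetryOtel(set exporter.CreateSettings) (*prwTelemetryOtel, error) {
	meter := set.MeterProvider.Meter("prometheusremotewrite")
	prefix := "exporter/" + metadata.Type + "/"
	failedTranslations, err := meter.Int64Counter(prefix + "failed_translations")
	if err != nil {
		return nil, err
	}
	return &prwTelemetryOtel{failedTranslations: failedTranslations}, nil
}
```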
```go
telemetry := newPRWTelemetry(set)
prwe, err := newPRWExporter(prwCfg, set, telemetry)
```
Should this be done inside `newPRWExporter()`? It's already receiving the settings object that `newPRWTelemetry()` needs, and the telemetry object isn't used outside of the new exporter function.
This was done because of the tests. I think it is OK to create a real telemetry object by default and then swap it for a mock one in the tests. I updated the code.
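One way this pattern can be structured, as a sketch under the assumption that the telemetry sits behind a small interface (the interface, constructor, and factory names here are illustrative):

```go
// Hypothetical shape: the factory wires up the real telemetry, while tests can
// construct the exporter with a mock that satisfies the same interface.
type prwTelemetry interface {
	recordTranslationFailure(ctx context.Context)
}

func createMetricsExporter(set exporter.CreateSettings, prwCfg *Config) (*prwExporter, error) {
	telemetry, err := newPRWTelemetryOtel(set)
	if err != nil {
		return nil, err
	}
	return newPRWExporter(prwCfg, set, telemetry)
}
```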
prwe.settings.Logger.Debug("failed to translate metrics %s", zap.Error(err)) | ||
prwe.settings.Logger.Debug("exporting remaining %s metrics", zap.Int("translated", len(tsMap))) |
prwe.settings.Logger.Debug("failed to translate metrics %s", zap.Error(err)) | |
prwe.settings.Logger.Debug("exporting remaining %s metrics", zap.Int("translated", len(tsMap))) | |
prwe.settings.Logger.Debug("failed to translate metrics, exporting remaining metrics", zap.Error(err), zap.Int("translated", len(tsMap))) |
If these are to be logged at the same level, we might as well make them a single logging statement. Is the number of metrics that failed translation available to record?
I like your idea of using a single line.
> Is the number of metrics that failed translation available to record?

No, that is hidden inside the `FromMetrics` function implementation. In theory we could use `number of OTel metrics - len(tsMap)`, but `FromMetrics` also emits other time series, such as `target_info`, and this behaviour varies based on the flags passed to the function.... 😒
We could also look into the number of errors embedded in the error returned by `FromMetrics`, but in my opinion that would also rely on assumptions about `FromMetrics`.
Co-authored-by: Anthony Mirabella <a9@aneurysm9.com>
Co-authored-by: bryan-aguilar <46550959+bryan-aguilar@users.noreply.github.com>
Signed-off-by: Raphael Silva <rapphil@gmail.com>
Please resolve the conflict and we can get this merged
This PR was marked stale due to lack of activity. It will be closed in 14 days.
Description: Don't drop a whole batch in case of a failure to translate from OTel to Prometheus. Instead, with this PR we try to send to Prometheus all the metrics that were properly translated and log a warning for the translation failures.
This PR also adds support for telemetry in this component, so that it is possible to inspect how the translation process is behaving and identify failed translations.
I opted not to include the number of time series that failed translation because I don't want to make assumptions about how the `FromMetrics` function works. Instead we just publish whether there was any failure during the translation process, along with the number of time series returned.
Link to tracking issue: #15281
Testing: Unit tests were added to cover the case of mixed metrics, where some succeed translation and some fail.
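A sketch of how such a mixed-batch test could be shaped (the mock telemetry and the `newTestExporter` helper are hypothetical; the failing metric uses the delta-sum case shown in the error message above):

```go
// Hypothetical test shape: one metric that translates and one that does not.
func TestPushMetricsPartialTranslation(t *testing.T) {
	md := pmetric.NewMetrics()
	sm := md.ResourceMetrics().AppendEmpty().ScopeMetrics().AppendEmpty()

	good := sm.Metrics().AppendEmpty()
	good.SetName("good_gauge")
	good.SetEmptyGauge().DataPoints().AppendEmpty().SetIntValue(1)

	bad := sm.Metrics().AppendEmpty()
	bad.SetName("bad_delta_sum")
	badSum := bad.SetEmptySum()
	badSum.SetAggregationTemporality(pmetric.AggregationTemporalityDelta) // known to fail translation
	badSum.DataPoints().AppendEmpty().SetIntValue(5)

	telemetry := &mockPRWTelemetry{}      // hypothetical mock
	prwe := newTestExporter(t, telemetry) // hypothetical helper

	// The push must succeed and the translation failure must be recorded.
	assert.NoError(t, prwe.PushMetrics(context.Background(), md))
	assert.Equal(t, 1, telemetry.failedTranslations)
}
```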