Add metric attribute conventions corresponding to span status? #3243

Closed
trask opened this issue Feb 21, 2023 · 8 comments
Labels
area:semantic-conventions Related to semantic conventions spec:metrics Related to the specification/metrics directory

Comments

@trask
Member

trask commented Feb 21, 2023

From @arminru's #2419 (comment):

The status, on the other hand, is indeed missing and I think it would make sense to introduce a generic metric attribute convention for this that matches the definitions for Span Status.

trask added the area:semantic-conventions and spec:metrics labels Feb 21, 2023
trask self-assigned this Feb 21, 2023
trask moved this to Blocker for HTTP semconv stability in Semantic Conventions + Instrumentation Stability WG Feb 21, 2023
@trask
Member Author

trask commented Feb 23, 2023

Discussed in the Semantic Convention WG: this only makes sense for some metrics and should probably be deferred to individual semantic conventions, which may or may not already have a dimension that encompasses "status", e.g. http.status_code in the HTTP semconv.

@AlexanderWert
Member

@lmolkova , @trask
I had another look. In ECS there's the field event.outcome, which represents exactly what we discussed in the Semantic Convention WG: a generic status that can have the values failure, success, or unknown.

@lmolkova
Contributor

lmolkova commented Jul 19, 2023

I'd like to revisit this and add more context on why having no general-purpose attribute is problematic:

Option 1: Domain-specific error code

Databases and messaging systems do not have cross-domain error codes. They would need to be standardized within one semconv and then we'd introduce

  • messaging.status_code
  • db.status_code
  • faas.status_code
  • dotnet.http.client.error_reason
  • etc

Pros:

  • it's possible to see failure-rate on a dashboard or add alerts for the specific domain

Cons:

  • metric is not specific enough. There are no system-specific error codes
  • there is little commonality of error codes within a specific domain

Option 2: General-purpose error code

We'll introduce one common attribute: error.code, event.outcome, status.code, or anything else.

Pros and cons are the same as in Option 1, but we do just one status enum, minimizing inconsistencies and cognitive load.

Option 3: No common error codes at all

Each extension defines a custom attribute, so we'll have

  • dotnet.http.client.error_reason
  • messaging.kafka.status_code
  • messaging.rabbitmq.error_reason
  • db.mssql.hresult
  • db.cosmosdb.status_code
  • etc

Pros:

  • metrics are very specific

Cons:

  • no common dashboards/alerts that include error rate

Proposal: Option 2+3

  1. introduce one general-purpose status: error.code, event.outcome, or status.code.
    • start with existing values for span status: OK, ERROR, no value (UNSET)
    • the list might slowly grow over time and include timeout, cancellation, etc.
    • we can allow custom values as long as practical cardinality is low.
  2. Recommend that individual semconv extensions define an additional tech-specific error code

Example:

  • status.code: ERROR
  • messaging.kafka.error_code: INVALID_FETCH_SIZE

Pros:

  • allows building common dashboards/alerts with failure rates for specific semconv
  • allows tech-specific metrics
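
A minimal sketch of what the Option 2+3 proposal could look like on a metric data point, written in plain Python rather than against any OpenTelemetry SDK. The attribute names status.code and messaging.kafka.error_code come from the example above; the helpers build_attributes and failure_rate, the in-memory counter, and the choice to model UNSET as "attribute absent" are assumptions made only for illustration. The point is that a generic dashboard can compute a failure rate from the common attribute alone, without knowing the tech-specific one.

```python
from collections import Counter
from typing import Dict, Optional, Tuple

# Generic status values, starting from the span status values named above.
# UNSET is modeled as "attribute absent" (an assumption of this sketch).
STATUS_OK = "OK"
STATUS_ERROR = "ERROR"

AttributeSet = Tuple[Tuple[str, str], ...]


def build_attributes(
    status: Optional[str],
    tech_specific: Optional[Dict[str, str]] = None,
) -> AttributeSet:
    """Return a hashable attribute set for one measurement."""
    attrs = dict(tech_specific or {})
    if status is not None:
        attrs["status.code"] = status
    return tuple(sorted(attrs.items()))


def failure_rate(points: Counter) -> float:
    """Share of measurements whose generic status is ERROR."""
    total = sum(points.values())
    failed = sum(
        count
        for attrs, count in points.items()
        if dict(attrs).get("status.code") == STATUS_ERROR
    )
    return failed / total if total else 0.0


# Hypothetical counter keyed by attribute set, standing in for a real metric.
points: Counter = Counter()
points[build_attributes(STATUS_OK)] += 8
points[
    build_attributes(
        STATUS_ERROR, {"messaging.kafka.error_code": "INVALID_FETCH_SIZE"}
    )
] += 2

print(failure_rate(points))
```

The tech-specific attribute stays available for drill-down, but failure_rate never reads it, which is what would let one dashboard or alert definition work across semconvs.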

@lmolkova
Contributor

lmolkova commented Jul 19, 2023

Adding more context on how this affects HTTP semconv stability:

.NET HttpClient is being instrumented with metrics. The native instrumentation can report the failure reason when no response was received (i.e. http.response.status_code is not set).

I expect there are other instrumentations like this, which can provide the reason (DNS issue, connection reset or refused, TLS issue, timeout or cancellation, etc). This information seems to be quite useful and each instrumentation should consider adding this information when it has it.

The .NET team is considering using exception.type or inventing their own attribute.

We do not have an attribute to report such a thing in HTTP metrics. Assuming we don't introduce one now, adding it later would be a breaking change.
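
To make the gap concrete, here is a rough Python sketch (not the .NET instrumentation, and not an OpenTelemetry API) of how an HTTP client instrumentation might derive a low-cardinality failure reason when no response arrived. The attribute key http.error_reason is a placeholder invented for this sketch, the reason strings are illustrative, and the _OTHER fallback borrows the value that appears later in this thread. It assumes Python 3.10+, where socket timeouts are raised as TimeoutError.

```python
import socket
import ssl
from typing import Dict, Optional, Union


def failure_reason(exc: BaseException) -> str:
    """Map an exception to a small, fixed set of reason strings."""
    if isinstance(exc, socket.gaierror):
        return "dns_error"
    if isinstance(exc, ssl.SSLError):
        return "tls_error"
    if isinstance(exc, TimeoutError):
        return "timeout"
    if isinstance(exc, ConnectionRefusedError):
        return "connection_refused"
    if isinstance(exc, ConnectionResetError):
        return "connection_reset"
    # Anything unrecognized collapses into one bucket to bound cardinality.
    return "_OTHER"


def http_metric_attributes(
    status_code: Optional[int],
    exc: Optional[BaseException],
) -> Dict[str, Union[int, str]]:
    """Build metric attributes for one HTTP client request."""
    attrs: Dict[str, Union[int, str]] = {}
    if status_code is not None:
        # A response was received, so the status code describes the outcome.
        attrs["http.response.status_code"] = status_code
    elif exc is not None:
        # No response: only the exception can explain the failure.
        attrs["http.error_reason"] = failure_reason(exc)  # placeholder key
    return attrs


print(http_metric_attributes(503, None))
print(http_metric_attributes(None, socket.gaierror()))
```

Without some agreed attribute for the second case, requests that fail before a response is received carry no failure dimension at all on the metric.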

@lmolkova
Contributor

Based on today's SIG discussion, I split the HTTP-specific issue out into open-telemetry/semantic-conventions#204 and am removing this one from the HTTP blockers.

@lmolkova
Contributor

lmolkova commented Aug 8, 2023

Option 4: no unification, one attribute

One attribute that has 2 values defined for success and unknown failure, but allows instrumentations to put any values (with low practical cardinality):

error.reason: - success
error.reason: _OTHER - unknown error
error.reason: java.lang.UnknownHostException
error.reason: dns_error
error.reason: timeout
error.reason: INVALID_FETCH_SIZE

Pros:

  • allows building common dashboards/alerts with failure rates per error
  • allows tech-specific metrics
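
Option 4 depends on "low practical cardinality" but nothing here says how that would be enforced. One possible approach, sketched in Python under stated assumptions, is for an instrumentation to accept arbitrary reason strings on the failure path and fold any overflow into the _OTHER value from the list above. ReasonLimiter and max_distinct are hypothetical names, and the cap of 20 is an arbitrary default.

```python
from typing import Optional, Set

OTHER = "_OTHER"  # "unknown error" value from the list above


class ReasonLimiter:
    """Collapse error reasons into _OTHER once too many distinct values appear."""

    def __init__(self, max_distinct: int = 20) -> None:
        self.max_distinct = max_distinct
        self._seen: Set[str] = set()

    def normalize(self, reason: Optional[str]) -> str:
        """Return the value to record for a failed operation."""
        if not reason:
            return OTHER  # failure with no known reason
        if reason in self._seen:
            return reason
        if len(self._seen) < self.max_distinct:
            self._seen.add(reason)
            return reason
        return OTHER  # cap reached: keep the attribute's cardinality bounded


limiter = ReasonLimiter(max_distinct=2)
for raw in ["timeout", "dns_error", "INVALID_FETCH_SIZE", "timeout", None]:
    print(raw, "->", limiter.normalize(raw))
```

A first-come cap like this is simple but order-dependent; a fixed allow-list per instrumentation would be more predictable at the cost of flexibility.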

@lmolkova
Contributor

lmolkova commented Aug 8, 2023

@AlexanderWert

In ECS there's the field event.outcome that actually represent exactly what we discussed in the Semantic Convention WG: A generic status that can have the values failure, success, unknown.

The event prefix is a bit confusing because the existing event namespace is used to describe OTel events. I wonder if error.code or error.type would be more descriptive?

@lmolkova
Contributor

I believe it was fixed in open-telemetry/semantic-conventions#205


4 participants