Library instrumentation: StatusCanonicalCode #306
Comments
I agree that having to define a somewhat arbitrary mapping to these status codes seems to be redundant work at best. The states that I think are interesting for a span are better described with something like:
Especially the fine distinction between a semantically failed operation and an operation that ended with some error indicator is tricky. For example, AFAIK we don't mark most non-2xx HTTP status codes as failed operations in Dynatrace. E.g. a 404 could just mean an application is checking if some entity exists and 404 would be the equivalent of a boolean |
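To illustrate the 404 point above, here is a minimal sketch of an existence check in which 404 is an expected answer rather than a failure. The function name `entity_exists` and the choice to raise on every other status are assumptions made for this example, not something proposed in the thread.

```python
def entity_exists(http_status: int) -> bool:
    """Interpret the HTTP status of a hypothetical existence-check request."""
    if 200 <= http_status < 300:
        return True
    if http_status == 404:
        # Not a failed operation here: the entity simply does not exist.
        return False
    # Any other status is treated as a genuinely failed operation.
    raise RuntimeError(f"unexpected HTTP status {http_status}")


print(entity_exists(200))
print(entity_exists(404))
```

Whether 404 counts as a failure depends on what the caller was trying to do, which is why a fixed status mapping in the instrumentation cannot decide it.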
In particular, I don't know which problem the StatusCanonicalCode enumeration is trying to solve. |
I guess the broad problem we want to solve with StatusCanonicalCode is grouping spans by the finite number of well-known error codes. Backend UX can do nice things on top of it, users can easily query The problems are:
I propose the following approach
(naming needs a lot of improvements) HTTP:
gRPC:
Internal spans:
|
I don't think we should have an Actually, that EDIT: Or maybe |
int statusCode is useful for a lot of things (HTTP, gRPC, Oracle error codes, MS HRESULT, you name it). Historically everyone checks range for HTTP code and with int it's much easier. With "ECONNREFUSED" it seems it has int value but users are free to not use it and provide common string representation in the description. My expectation was that exceptions will be covered with Logging API and using span status for it is an abuse. So your example could be Plus, in the future, there will be log with error severity that provides exception information. And you could even write it today with log4j or another common logging API, just without OTel support |
Typically strongly typed fields are needed when there are experiences for them in SDK or OTcollector. I think in case of status code we may have need for the @Oberon00 for your proposal I'd suggest we have a separate "error" type that can be reported independently from the |
The client/server indication lives in SpanKind; essentially all SpanKind.Client spans have only client errors. Success/failure may be OK, but it is not enough (404 is a perfect example). I agree statusCode can move to the attributes; this will make status even leaner. |
Of course ECONNREFUSED has a concrete and more or less stable integer value per POSIX implementation but Linux, BSD, MacOS, Windows (WSAECONNREFUSED) might each have different ones. The value is meaningless, the meaning is carried by "ECONNREFUSED". In HTTP it's the other way round, 404 identifies the error and "Not found" is merely a description.
An exception that merely occurred during an operation: Yes, just log it. An operation that was ended by throwing an exception should be handled differently though. Partially disagree with
E.g. a client receiving HTTP 500 would arguably fail with a server error.
I don't really understand how you would separate an error from a span? The error would describe the result of the operation described by the span, so it would naturally be a property of the span. But I think that errors could be handled through semantic conventions, beyond maybe the very basic success/failure boolean (or maybe 3 to max 4 clearly defined enum values). |
Separation from the Span is in the sense that Many systems have separate data types to represent errors and exceptions. And we can go with this. Or we can have |
Right, what is the problem with description carrying this
You don't know which server, or maybe it's your own circuit breaker. Arguably what you received is not necessarily what the server responded with.
I think there are different options here. Exceptions are huge, they may have additional information attached (local variables, or links to dumps!), so semantic conventions on a span will never be enough to cover all exception scenarios. Arguably exceptions are stored separately. I can easily imagine wanting to sample exceptions differently than spans, or to track them when you don't do any distributed tracing. Based on the discussion in #67, it looks like Event is a temporary API to work around missing logging. In the future users should be able to use logs OR events and semantically it will be the same. If we decide to follow this route (which I'm not a fan of), conventions on events would allow tracking exceptions. But still, status is not for them. |
My problem with that is that an integer errno like 53 is meaningless unless you know it originated e.g. from a FreeBSD system. On the other hand, the error names are a POSIX standard. So I'd really prefer something like a Span attribute |
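The argument about errno names versus values can be sketched with Python's standard `errno` module, which exposes both the platform-specific integer and the symbolic name. The attribute keys `error.name` and `error.description` are hypothetical and only stand in for whatever span attribute would be agreed on.

```python
import errno
import os


def describe_os_error(code: int) -> dict:
    """Build hypothetical span attributes from a platform-specific errno."""
    return {
        # The symbolic name is what carries the meaning across platforms.
        "error.name": errno.errorcode.get(code, "UNKNOWN"),
        # Human-readable text for the same code, as reported by the local OS.
        "error.description": os.strerror(code),
    }


attrs = describe_os_error(errno.ECONNREFUSED)
print(attrs["error.name"], "has local value", errno.ECONNREFUSED)
```

The printed integer differs between operating systems, while the name is the part a backend could group by.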
HTTP requests are also huge 😃 I think we only need these two properties to satisfy most use cases for exceptions:
Stack traces would be nice too, but there is often considerable overhead associated with them, both in gathering them (resolving symbols, ...) and of course sending them. EDIT: I agree that any information beyond the two basic properties type & message can be stitched together on demand and may be delivered as log or something else. |
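As a rough sketch of the "type and message" idea above: the two properties can be taken from any exception without touching the stack trace. The helper name and the attribute keys `exception.type` and `exception.message` are assumptions for illustration only.

```python
def exception_attributes(exc: BaseException) -> dict:
    """Reduce an exception to the two basic properties: type and message."""
    exc_type = type(exc)
    return {
        # Fully qualified type name, e.g. "builtins.ValueError".
        "exception.type": f"{exc_type.__module__}.{exc_type.__qualname__}",
        "exception.message": str(exc),
    }


try:
    int("not a number")
except ValueError as exc:
    print(exception_attributes(exc))
```

Stack traces, local variables and similar details are deliberately left out here; as noted above, they could be delivered separately as a log or something else.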
I would go a middle course here, as I stated above. |
This was moved to the next milestone. A few meeting notes from https://docs.google.com/document/d/1-bCYkN-DWJq4jw1ybaDZYYmx-WAe6HnwfWbkm8d57v8/edit#:
|
from the spec sig mtg today, looks like this one could be for the v0.5 milestone but sounds like more input would be needed desirably from the .net folks @bruno-garcia @reyang |
.NET OTel is going to be leveraging types from the BCL (i.e., the .NET standard library) that are general and can't have a property like status tied to a specific protocol. It will be good to separate error reporting and span status conversations, but at first, just having a simple span status, e.g.: Error reporting/information is a much larger discussion about formats and conventions. The relation with span status is that once it is /cc @andrewhsu |
I'm not sure I understand this right. Do you mean .NET has chosen to not implement the span status code API? EDIT: Seems it's there https://github.com/open-telemetry/opentelemetry-dotnet/blob/master/src/OpenTelemetry.Api/Trace/Status.cs |
@Oberon00 currently the plan is to use |
While this seems like a very sensible choice for .NET, I doubt you can call the result OpenTelemetry, as the API/SDK separation with ability for vendors to use their own SDK (incl Span) implementations is IMHO a central part of its promise. But I'm starting to go off-topic here. |
I've added this to the 05/19/2020 OpenTelemetry SIG: specifications meeting agenda. |
I believe that Status can be treated as a semantic convention, and that span attributes can be used to convey the Status property. Instead of a dedicated
On the other hand, I support the notion that the existing status codes are sufficiently and usefully generic. In the SIG call, one example was given: suppose the user has a "language exception". How should this map into a status code? Language exceptions can mean anything: some language exceptions will map to deadline exceeded, some to permission denied, some to not found, etc. It's fine to leave the default, which is "OK", which gives less information to the backend, but is "correct". Let's look at examples. What are some common Windows system call errors that you think do not map well into the canonical codes? |
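The "language exception" example from the SIG call can be sketched as follows. The enum is a small subset using the gRPC numeric values; which Python exception maps to which code is an assumption made here, and the `OK` fallback follows the suggestion above of leaving the default in place when nothing more specific is known.

```python
from enum import Enum


class CanonicalCode(Enum):
    """Subset of the canonical (gRPC-style) codes used in this sketch."""
    OK = 0
    DEADLINE_EXCEEDED = 4
    NOT_FOUND = 5
    PERMISSION_DENIED = 7


# Hypothetical mapping; an instrumentation would choose its own.
_EXCEPTION_TO_CODE = (
    (TimeoutError, CanonicalCode.DEADLINE_EXCEEDED),
    (PermissionError, CanonicalCode.PERMISSION_DENIED),
    (FileNotFoundError, CanonicalCode.NOT_FOUND),
    (KeyError, CanonicalCode.NOT_FOUND),
)


def code_for_exception(exc: BaseException,
                       default: CanonicalCode = CanonicalCode.OK) -> CanonicalCode:
    """Return the first matching code, or the default when nothing matches."""
    for exc_type, code in _EXCEPTION_TO_CODE:
        if isinstance(exc, exc_type):
            return code
    return default


print(code_for_exception(TimeoutError("took too long")))
print(code_for_exception(ValueError("no obvious mapping")))
```

The second call shows the cost of the default: an unmapped exception is reported as `OK`, which is less information for the backend than a dedicated failure state would give.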
(Also @tedsuo mentioned |
Next follow-up meeting: Thu 5/21/2020 9:00AM-10:00AM PST. Or join by phone (audio only). Find a local phone number: https://dialin.teams.microsoft.com/8551f4c1-bea3-441a-8738-69aa517a91c5?id=320785411 |
Span.Status Discussion Notes
|
Notes for Thu 5/21/2020 9:00AM-10:00AM PST meeting
Decision and follow ups
@bogdandrutu @noahfalk @pjanotti @tarekgh please review/comment and help to keep me honest 😄 |
Apologies it took longer than I expected to get to it. I've got a draft written up and I'm waiting to get a little feedback from my coworkers to see if it passes some basic scrutiny ; ) If they like it I'll stick it in the oteps repo as the next step. |
^ OTEP that addresses this issue: open-telemetry/oteps#123 "Add support for more expressive span status API" by @noahfalk |
Recording exception was addressed in #697. As there is a lot of controversy and different opinions about Span.status, there is a proposal to remove this field altogether before GA: #706. It will be much easier to add it back when/if we agree on it, than to support the current I propose to declare that this issue is superseded by #706 and should be closed. @open-telemetry/specs-approvers @open-telemetry/specs-trace-approvers If you agree, please close this issue. |
Since there were only two 👍 (although no 👎), I would leave that to @open-telemetry/technical-committee but it is probably uncontroversial by now. |
From the sig mtg today, discussed and sounds like this should be closed in favor of #706 |
StatusCanonicalCode is suitable for gRPC codes. Even when it comes to HTTP instrumentation, the real status appears in the attributes rather than Status. The soon-to-be-guidance for mapping will introduce redundancy by bucketing HTTP codes into a bit smaller subset of codes.
For other protocols (or logical operations), it seems everyone has to find the most appropriate status in the gRPC codes and map their error into one from the list.
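The redundancy described above can be made concrete with a small sketch of such a bucketing. The specific bucket choices below are assumptions for illustration and are not quoted from the mapping guidance; the point is only that several distinct HTTP codes collapse into one canonical code.

```python
# Hypothetical bucketing of HTTP status codes into gRPC-style code names.
_SPECIFIC = {
    401: "UNAUTHENTICATED",
    403: "PERMISSION_DENIED",
    404: "NOT_FOUND",
    429: "RESOURCE_EXHAUSTED",
    501: "UNIMPLEMENTED",
    503: "UNAVAILABLE",
    504: "DEADLINE_EXCEEDED",
}


def bucket_http_status(status: int) -> str:
    """Collapse an HTTP status code into one canonical code name."""
    if 200 <= status < 400:
        return "OK"
    if status in _SPECIFIC:
        return _SPECIFIC[status]
    if 400 <= status < 500:
        return "INVALID_ARGUMENT"
    if 500 <= status < 600:
        return "INTERNAL"
    return "UNKNOWN"


for status in (400, 409, 422):
    print(status, "->", bucket_http_status(status))
```

All three inputs end up in the same bucket, so the original HTTP code still has to travel in the span attributes for anyone who needs to tell them apart.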
The feedback I've got from the team inside Azure which did the instrumentation (hi @pakrym) is that StatusCanonicalCode is not a perfect match for arbitrary error codes agnostic to the protocol/logical operation.
So an Unknown error with an elaborate description is the option they go with, at least for now.
Nitpicking: aborted vs cancelled vs DeadlineExceeded and similar ambiguity. Things like Internal seem to be as good as unknown for logical operations.
So I wonder if we can revisit the choice of status code and try to come up with