Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Library instrumentation: StatusCanonicalCode #306

Closed
lmolkova opened this issue Oct 11, 2019 · 33 comments
Closed

Library instrumentation: StatusCanonicalCode #306

lmolkova opened this issue Oct 11, 2019 · 33 comments
Assignees
Labels
area:error-reporting Related to error reporting priority:p1 Highest priority level release:required-for-ga Must be resolved before GA release, or nice to have before GA spec:protocol Related to the specification/protocol directory spec:trace Related to the specification/trace directory
Milestone

Comments

@lmolkova
Copy link
Contributor

lmolkova commented Oct 11, 2019

StatusCanonicalCode is suitable for gRPC codes.

Even when it comes to HTTP instrumentation, the real status appears in the attributes rather than Status. The soon-to-be-guidance for mapping will introduce redundancy by bucketing HTTP codes into a bit smaller subset of codes.

For other protocols (or logical operations), it seems everyone has to find the most appropriate status in the gRPC codes and map their error into one from the list.

The feedback I've got from the team inside Azure which did the instrumentation (hi @pakrym), that StatusCanonicalCode is not the perfect match for arbitrary error codes agnostic to the protocol/logical operation.

So Unknown error with elaborate description is the options they go with, at least for now.

Nitpicking: aborted vs cancelled vs DeadlineExceeded and similar ambiguity. Things like Internal seems to be as good as unknown for logical operations.

So I wonder if we can revisit choice for status code and try to come up with

  • something that does not introduce redundancy for HTTP/any protocol
  • would make sense for all kinds of spans
  • will have a short and clear explanation when to use what
@Oberon00
Copy link
Member

I agree that having to define a somewhat arbitrary mapping to these status code seems to be redundant work at best. The states that I think are interesting for a span are better described with something like:

  1. Did the operation semantically fail or not?
  2. Was the operation left with an exception or not? (I guess this implies a semantic failure)
  3. What was the exception type/class? (generic info)
  4. What was the error message? ("generic" info)
  5. What was the technical error code or equivalent (operation-specific, e.g. HTTP status code, C errno, Oracle ORA error number, ...)?

Especially the fine distinction between a semantically failed operation and an operation that ended with some error indicator is tricky. For example AFAIK we don't mark most non 2xx-HTTP status codes as failed operations in Dynatrace. E.g. a 404 could just mean an application is checking if some entity exists and 404 would be the equivalent of a boolean false returned. I think we even allow users to configure in the Dynatrace Web UI, if for some particular URL (or class of URLs), a particular HTTP status code should be considered a failed operation.

@Oberon00
Copy link
Member

In particular, I don't know which problem the CannonicalStatusCode enumeration is trying to solve.

@lmolkova
Copy link
Contributor Author

lmolkova commented Oct 16, 2019

I guess the broad problem we want to solve with CannonicalStatusCode is grouping spans by the finite number of well-known error codes. Backend UX can do nice things on top of it, users can easily query

The problems are:

  • users want to know protocol error code regardless of OTel error codes subset
  • very hard to enforce the same mapping in the instrumentation for the same client library but in different languages for anything except grpc and possibly http
  • grouping heterogeneous things (HTTP requests and Oracle SQL queries) by same error code doesn't make a lot of sense

I propose the following approach

Status
  // protocol or library specific status code
  int statusCode = 0

  string description = null

  // optional OTel status code 
  // indicates if it's a success or failure
  Result result = Result.None

  enum Result
    // not set or hard to tell whether it's success or failure 
    // exporters may think about it as success or get into the business of parsing statusCode
    None, 
    
   Ok,

   Failure,
   
   // maybe we want to cover a handful of very common client errors
  Cancelled

(naming needs a lot of improvements)

HTTP:

  • status code will get out of attributes and move to the Status.statusCode
  • instrumentation can set Result to OK for 1xx-2xx, Failure for 5xx and keep None for 4xx (or explicitly 404, 409)
  • client errors will not have statusCodebut have Result

gRPC:

  • set statusCode
  • indicate success/failure/unknown in Result

Internal spans:

  • never touch statusCode
  • set Result and provide an optional description
  • detailed exceptions should be covered with Logging API in future (I assume) including type, stack, etc

@Oberon00
Copy link
Member

Oberon00 commented Oct 16, 2019

I don't think we should have an int statusCode.
For example, for OS errors, a string with the error code constant like ECONNREFUSED would be a better fit (POSIX only specifies the names not the values of the error codes, see https://pubs.opengroup.org/onlinepubs/9699919799.2013edition/basedefs/errno.h.html).

Actually, that ECONNREFUSED highlights another use case that is not well-supported by that statusCode field: If we had some generic convention for OS errors, you could do something like setting os.errcode = "ECONNREFUSED", os.errfunc = "connect" on an HTTP span and you would know that there was an OS-level error, not an HTTP-level error.

EDIT: Or maybe err.kind = "os", err.code = "ECONNREFUSED", err.os.func = "connect". Could be adapted to err.kind = "exception" err.code = "java.net.ConnectException", err.msg = "Connection to [::2]:14426 refused.". We need semantic conventions for errors and exceptions!

@lmolkova
Copy link
Contributor Author

lmolkova commented Oct 16, 2019

int statusCode is useful for a lot of things (HTTP, gRPC, Oracle error codes, MS HRESULT, you name it). Historically everyone checks range for HTTP code and with int it's much easier.

With "ECONNREFUSED" it seems it has int value but users are free to not use it and provide common string representation in the description.

My expectation was that exceptions will be covered with Logging API and using span status for it is an abuse.

So your example could be
statusCode = 111 (or 0), description = "ECONNREFUSED", Result = Failure

Plus, in the future, there will be log with error severity that provides exception information. And you could even write it today with log4j or another common logging API, just without OTel support

@SergeyKanzhelev SergeyKanzhelev added this to the Alpha v0.3 milestone Oct 16, 2019
@SergeyKanzhelev
Copy link
Member

Typically strongly typed fields are needed when there are experiences for them in SDK or OTcollector. I think in case of status code we may have need for the success/failure bit and maybe for client vs. server indication. statusCode can probably be moved to attributes.

@Oberon00 for your proposal I'd suggest we have a separate "error" type that can be reported independently from the Span. And it may carry all this information. Would it be what you need? Or you think this information is needed on a Span itself?

@lmolkova
Copy link
Contributor Author

client/server indication lives in SpanKind. essentially all SpanKind.Client have only client errors.

success/failure may be ok, but not enough (404 is a perfect example).

I agree statusCode can move to the attributes - this will make status ever leaner.

@Oberon00
Copy link
Member

@lmolkova

Of course ECONNREFUSED has a concrete and more or less stable integer value per POSIX implementation but Linux, BSD, MacOS, Windows (WSAECONNREFUSED) might each have different ones. The value is meaningless, the meaning is carried by "ECONNREFUSED". In HTTP it's the other way round, 404 identifies the error and "Not found" is merely a description.

My expectation was that exceptions will be covered with Logging API and using span status for it is an abuse.

An exception that merely occured during an operation: Yes, just log it. An operation that was ended by throwing an exception should be handled differently though.


@lmolkova

Partially disagree with

Client have only client errors.

E.g. a client receiving HTTP 500 would arguably fail with a server error.


@SergeyKanzhelev

@Oberon00 for your proposal I'd suggest we have a separate "error" type that can be reported independently from the Span. And it may carry all this information. Would it be what you need? Or you think this information is needed on a Span itself?

I don't really understand how you would separate an error from a span? The error would describe the result of the operation described by the span, so it would naturally be a property of the span. But I think that errors could be handled through semantic conventions, beyond maybe the very basic success/failure boolean (or maybe 3 to max 4 clearly defined enum values).

@SergeyKanzhelev
Copy link
Member

Separation from the Span is in the sense that Span semantics only tells whether it's success or failure. And there is something else defines all the properties of this failure.

Many systems has a separate data types to represent errors and exceptions. And we can go with this. Or we can have Event on Span with the specific semantic of an error.

@lmolkova
Copy link
Contributor Author

lmolkova commented Oct 17, 2019

Of course ECONNREFUSED has a concrete and more or less stable integer value per POSIX implementation but Linux, BSD, MacOS, Windows (WSAECONNREFUSED) might each have different ones. The value is meaningless, the meaning is carried by "ECONNREFUSED". In HTTP it's the other way round, 404 identifies the error and "Not found" is merely a description.

Right, what is the problem with description carrying this ECONNREFUSED as a string?

E.g. a client receiving HTTP 500 would arguably fail with a server error.

You don't know which server or maybe it's your own circuit breaker. Arguably what you received is not necessary what server responded with.

An exception that merely occured during an operation: Yes, just log it. An operation that was ended by throwing an exception should be handled differently though.

I think there are different options here. Exceptions are huge, they may have additional information attached (local variables, or links to dumps!), so semantic conversion on span will never be enough to cover all exceptions scenarios.

Arguably exceptions are stored separately. I can easily imagine how you want to sample exceptions differently than spans or track them when you don't do any distributed tracing.
So I agree with @SergeyKanzhelev that exceptions may be represented separately from Spans and correlation ids allow backends to stitch together lean spans with rich exceptions.

Based on this discussion #67, it looks like Event is a temp API to workaround missing logging. In future users should be able to use logs OR events and semanically it will be the same.

If we decide to follow this route (which I'm not a fan of) conventions on events would allow tracking exceptions. But still, status is not for them.

@Oberon00
Copy link
Member

Right, what is the problem with description carrying this ECONNREFUSED as a string?

My problem with that is that an integer errno like 53 is meaningless unless you know it originated e.g. from a FreeBSD system. On the other hand, the error names are a POSIX standard. So I'd really prefer something like a Span attribute error.errno that is set to a string "ECONNREFUSED". I don't see any added value in the integer number.
An error description would be what strerror returns for ECONNREFUSED, like "Connection refused."

@Oberon00
Copy link
Member

Oberon00 commented Oct 23, 2019

I think there are different options here. Exceptions are huge, they may have additional information attached (local variables, or links to dumps!), so semantic conversion on span will never be enough to cover all exceptions scenarios.

HTTP requests are also huge 😃 I think we only need these two properties to satisfy most use cases for exceptions:

  • Exception type (maybe as two properties type name + namespace)
  • Error message

Stack traces would be nice too, but there is often considerable overhead associated with them, both in gathering them (resolving symbols, ...) and of course sending them.

EDIT: I agree that any information beyond the two basic properties type & message can be stitched together on demand and may be delivered as log or something else.

@Oberon00
Copy link
Member

Separation from the Span is in the sense that Span semantics only tells whether it's success or failure. And there is something else defines all the properties of this failure.

I would go a middle course here, as I stated above.

@SergeyKanzhelev
Copy link
Member

this was moved to the next milestone. Few meeting notes from https://docs.google.com/document/d/1-bCYkN-DWJq4jw1ybaDZYYmx-WAe6HnwfWbkm8d57v8/edit#:

  • Status code based on gRPC
  • Sort of works for HTTP
  • But not very clear for other types of operations
  • Severity is a potentially useful concept
  • “Do you want someone to be paged on this?”
  • We may need to come up with our own codes and clarity of translation
  • Discuss in v0.4

@andrewhsu
Copy link
Member

from the spec sig mtg today, looks like this one could be for the v0.5 milestone but sounds like more input would be needed desirably from the .net folks @bruno-garcia @reyang

@pjanotti
Copy link

pjanotti commented May 15, 2020

.NET OTel is going to be leveraging types from the BCL (i.e., the .NET standard library) that are general and can't have a property like status tied to a specific protocol.

It will be good to separate error reporting and span status conversations, but at first, just having a simple span status, e.g.: Ok|Failed simplifies things. The application/library should set the Failed status since the criteria for that can vary, i.e., a failed child span should not automatically trigger the failure of the parent. This approach keeps simple the determination of the span status for the users.

Error reporting/information is a much larger discussion about formats and conventions. The relation with span status is that once it is Failed, there should be a way to indicate which specific error, between a sequence of errors, finally triggered the span status to be Failed. Typically this should be the last error, but there may be cases in which an error is logged after the condition that triggered the failure of the span (e.g., error trying to communicate about the failure itself).

/cc @andrewhsu

@Oberon00
Copy link
Member

Oberon00 commented May 18, 2020

I'm not sure I understand this right. Do you mean .NET has chosen to not implement the span status code API? EDIT: Seems it's there https://github.com/open-telemetry/opentelemetry-dotnet/blob/master/src/OpenTelemetry.Api/Trace/Status.cs

@pjanotti
Copy link

@Oberon00 currently the plan is to use Activity as a replacement for OTel Span and drop the old implementation - there is an ongoing discussion if there should be an OTel API wrapping it (at this moment I'm particularly in favor of having the wrapper). The BCL team modified Activity to better align with OTel but there is some differences Status is one of the most visible.

@Oberon00
Copy link
Member

While this seems like a very sensible choice for .NET, I doubt you can call the result OpenTelemetry, as the API/SDK separation with ability for vendors to use their own SDK (incl Span) implementations is IMHO a central part of its promise. But I'm starting to go off-topic here.

@reyang
Copy link
Member

reyang commented May 19, 2020

I've added this to the 05/19/2020 OpenTelemetry SIG: specifications meeting agenda.
Please join the discussion if you're interested.

@jmacd
Copy link
Contributor

jmacd commented May 19, 2020

I believe that Status can be treated as a semantic convention, and that span attributes can be used to convey the Status property. Instead of a dedicated SetStatus() API, we can use SetAttribute. I would recommend the use of a dictionary-typed value for the status, e.g.,

span.SetAttribute("span.status", { "message": "blah blah", "code": "DEADLINE_EXCEEDED" })

On the other hand, I support the notion that the existing status codes are sufficiently and usefully generic. In the SIG call, one example was given: suppose the user has a "language exception". How should this map into a status code? Language exceptions can mean anything: some language exceptions will map to deadline exceeded, some to permission denied, some to not found, etc. It's fine to leave the default, which is "OK", which gives less information to the backend, but is "correct".

Let's look at examples. What are some common Windows system call errors that you think do not map well into the canonical codes?

@jmacd
Copy link
Contributor

jmacd commented May 19, 2020

(Also @tedsuo mentioned SpanKind during this conversation. It's in the same boat. Why isn't it a semantic convention?)

@reyang
Copy link
Member

reyang commented May 19, 2020

Next follow up meeting - Thu 5/21/2020 9:00AM-10:00AM PST

Teams link: https://teams.microsoft.com/l/meetup-join/19%3ameeting_ZjgzMzI2NzItOWM3Zi00ZDQ4LThjNzQtOTJhY2U2NWVhMzAz%40thread.v2/0?context=%7b%22Tid%22%3a%2272f988bf-86f1-41af-91ab-2d7cd011db47%22%2c%22Oid%22%3a%2283ba88b7-f89d-4e39-86c5-39927960aca7%22%7d

Or join by phone (audio only)
+1 425-616-0754,,320785411# United States, Seattle
Phone Conference ID: 320 785 411#

Find local phone number: https://dialin.teams.microsoft.com/8551f4c1-bea3-441a-8738-69aa517a91c5?id=320785411

@reyang
Copy link
Member

reyang commented May 20, 2020

Span.Status Discussion Notes

  • Why do we want Status? How do we intend to use Status? Is GRPC code a good fit?

    • [Open Question] Do we intend to map exceptions (e.g. Java, .NET)?
      • Most folks believe that is NOT the current intention of Status.
      • @bogdandrutu entertained the idea of extending Status to cover these scenarios.
    • [Pros] Having a generic mapping seems to be super valuable for telemetry consumers on spans that represent communication protocols. It was unclear if the value extended elsewhere.
      • As it removes a lot of work from the consumer to do the mapping, although the responsibility is still on the consumer side (e.g. how to interpret status code and tell whether it is a real error or not).
    • [Cons] The generic mapping might not be what the customer would really want - for example, and HTTP 404 could either be treated as a success or failure, depending on the scenario/expectation.
    • [Cons] It is vague regarding the guidance for non-RPC scenario. For example, how to map a Win32 Error Code, POSIX error number or COM HRESULT? Customers might misuse it, which eventually makes it difficult to consume.
  • Should this be a 1st class member, or just a normal custom property plus semantic conventions?

    • [Pros] Encourages everyone to try their best to do the mapping, rather than pushing the entire responsibility to telemetry consumers.
    • [Cons] Makes it part of the interface without a clear semantic/guidance might lead to long term pain. @noahfalk expressed concerns that many users may ignore Status or map it arbitrarily in these cases.
    • [Open Question] we might need some meta rule helping us to decide whether a particular field should be 1st class member or just a key-value in the property bag.
  • Next steps:

    • Decide on whether we need something with the semantics of Status - it seems the current preference is YES.
    • Decide on whether Status should be a 1st class member of Span, or something else.
    • Decide on the mapping guidance and semantic, e.g. if GRPC code is good or we want something else.

@reyang
Copy link
Member

reyang commented May 21, 2020

Notes for Thu 5/21/2020 9:00AM-10:00AM PST meeting

Decision and follow ups

  • @noahfalk will create an otep collecting feedbacks and see if we can agree on a more generic design.
  • Agreed that Status should be a 1st class member of Span.
    • It is unlikely to hit the .NET 5 milestone, so .NET BCL will put Status as a custom property instead of a 1st class member field; Later .NET can make an additive change to promote Status to 1st class. Worst case we'll have chance to make the change in a year.
    • We will have a separate discussion in OpenTelemetry .NET SDK to see whether we can put a Status shim/extension.
  • @reyang suggested that we don't want to invent a superset of status code for everything, instead, trying to have a small set of actionable codes that are general enough.
  • @pjanotti suggested that we should keep both a general status and the verbose/specific error code/info.
  • @bogdandrutu proposed an idea of having Status(level, domain, attributes).
  • Clarified that level is NOT an indication of severity (e.g. VERBOSE vs. WARNING).

@bogdandrutu @noahfalk @pjanotti @tarekgh please review/comment and help to keep me honest 😄

@Oberon00
Copy link
Member

@reyang @noahfalk Any progress with the OTEP?

@noahfalk
Copy link

Apologies it took longer than I expected to get to it. I've got a draft written up and I'm waiting to get a little feedback from my coworkers to see if it passes some basic scrutiny ; ) If they like it I'll stick it in the oteps repo as the next step.

@carlosalberto carlosalberto added the release:required-for-ga Must be resolved before GA release, or nice to have before GA label Jul 2, 2020
@Oberon00
Copy link
Member

Oberon00 commented Jul 7, 2020

^ OTEP that addresses this issue: open-telemetry/oteps#123 "Add support for more expressive span status API" by @noahfalk

@iNikem
Copy link
Contributor

iNikem commented Jul 22, 2020

Recording exception was addressed in #697.

As there is a lot of controversy and different opinions about Span.status, there is a proposal to remove this field altogether before GA: #706. It will be much easier to add it back when/if we agree on it, than to support the current status which so many people are uncomfortable with.

I propose to declare that this issue is superseded by #706 and should be closed. @open-telemetry/specs-approvers @open-telemetry/specs-trace-approvers If you agree, please close this issue.

@iNikem
Copy link
Contributor

iNikem commented Jul 27, 2020

@open-telemetry/specs-approvers @open-telemetry/specs-trace-approvers If you agree, please close this issue.

@Oberon00
Copy link
Member

Since there were only two 👍 (althoug no 👎), I would leave that to @open-telemetry/technical-committee but it is probably uncontroversial by now.

@andrewhsu
Copy link
Member

From the sig mtg today, discussed and sounds like this should be closed in favor of #706

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area:error-reporting Related to error reporting priority:p1 Highest priority level release:required-for-ga Must be resolved before GA release, or nice to have before GA spec:protocol Related to the specification/protocol directory spec:trace Related to the specification/trace directory
Projects
None yet
Development

No branches or pull requests