-
Notifications
You must be signed in to change notification settings - Fork 182
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
HTTP Client spans - Span Status for 4xx errors - Change from MUST to SHOULD #1003
Comments
I think SemConv should be driven by the underlying spec, in this case by the RFC that defines the status codes, which says that 4xx are client side errors:
That is the only thing that globally applies to all HTTP requests. The use case that you raise (e.g. CRUD API not considering a 404 as an error) is I'd say a higher level interpretation of the status code - from the HTTP libraries point of view it's still an error, but the higher level application layer doesn't interpret it as an error, because it can properly handle it. But from HTTP's point of view it's still an error. I think this would be better served by filtering in the backend - if you list errors, and your application doesn't interpret 404s as errors, then you could just filter those out from the list of errors. The backend is a place where you have the option to defines which specific 4xx client side HTTP spans are not errors from your application's point of view, and then slice and dice traces accordingly. I don't the SemConv can accommodate this. |
I think using should vs must and having a default of error for implementations is a good starting point. However 4xx responses are incredibly gray when it comes to 'is that an error', and a lot of this comes from the use context. In the perspective of a rest client the api/client/developer may defined the error status as:
However in the context of a ui/browser everything except 401 would be considered an error. This request is to allow client span generators to add additional context to the span. Below is the Elastic APM interpretation of client spans and their default status, they allow for interpretation of a 4xx response not being an error.
https://github.com/elastic/apm/blob/main/specs/agents/tracing-instrumentation-http.md#outcome |
That spec is used by all Elastic APM Agents, and basically all set client spans with HTTP 4xx to
Yeah, I understand the intention - I just feel it'd make SemConv less useful and most instrumentations won't benefit. But that's just my opinion - let's see what others think. Are there any widely used libraries where you could have a custom interpretation (like the list you had above) of the different 4xx status codes? |
Regardless of the default, it won't work for some applications. 404 is an error in some cases and not an error in others, sometimes depending on the caller context. But I agree that a MUST is not right. Specialized or opinionated instrumentations should be able to use different classification of errors and users are encouraged to override status in the processors (UPDATE but currently impossible #1003 (comment) 🤦♀️). I'd support SHOULD and would consider it as a bug fix. It would also not change the behavior of any OTel HTTP instrumentations. @tj-cisco would you be able to send a PR? |
I've been following this conversation since this started as a feature request on the Perl OTel instrumentation I maintain. At the risk of sounding off-topic, I have a question about the expected consequences of this change. @lmolkova says that a change like the one proposed here would "not change the behaviour of any OTel HTTP instrumentation", and I agree: even with a change like this one, I don't think I'd have accepted a change like the one that originally lead to this conversation, at least not without some additional clarification on the spec. But this made me think twice:
I like this idea, but I'm not sure how the specification supports this. Unless I'm reading it incorrectly, any change in status would happen in the processor's So the processor can change the status, but the span will not be recording at that point, so that change will be discarded. Or did I get this wrong? I'd be happy to move this to a different thread if it makes things easier to manage. |
My take here - We can give users flexibility in the Status Code -> Span Status mapping. However, at a macro level -
That infers the following:
Overall I'm ok moving from MUST to SHOULD, but I still think what we have now is reasonable. |
As long as we're not changing the behavior of existing emitted telemetry, then this seems fine as a bug fix. |
@jjatria oh snap, you're absolutely right and there is no way to change span status in the processor. I think the best thread to discuss it in is open-telemetry/opentelemetry-specification#1089 |
@jsuereth it still seems to be problematic and we get A LOT of negative user feedback about this Azure/azure-sdk-for-net#29132 if users do something like The perfect solution would be that client libraries/application would be able to dynamically decide if inner HTTP span status is an error, but we're really far from it. TL;DR: The MUST is a problem because even client libs which do HTTP instrumentation and know 404 is not an error can't classify it as such. |
I'd expand on the comment from @jjatria about the already opened PR for perl: I totally acknowledge the use-case described here, and it's ok if we try to make SemConv allow for those use-cases. The biggest issue I see with just making it from If people agree that this is a potential risk we care about, then maybe a suggestion: We could slice the spec into 2 parts:
|
I confess I don't know much about the full downstream consequences of a span being marked as OK or as an error, but if the problem is as @lmolkova states that
then I feel like this is not a problem with the spans, but with the way the tooling around them interprets them. I think on this I side with @jsuereth in that
I don't think the MUST is a bug, but don't object to it being a SHOULD. However, I'm not convinced that making the change suggested here would have the consequences that motivate it. |
Is it our aim to support automated tooling that blindly shows error rates? It definitely should be our aim to provide clarity to confused users, but there might be other ways to do that besides putting a band-aid on HTTP semantic conventions. The problem is more general: an error might happen at any layer in an application, and it can be handled successfully on a different layer (or even be expected by that different layer). That should be clear from looking at a trace: in the example @lmolkova gave some HTTP operations finished with an error, while the parent Confusion on the user side might come from the notion that any errors happening in a trace are problematic. If we want to support this notion, than fixes like this are the logical consequence. I think solving this problem on the instrumentation level will only work for a few cases, as it's unrealistic to expect that the instrumented library can know whether an error is expected or handled by any application layer above it. |
I don't want to dismiss overwhelming user feedback. The retried connection drop during an operation is fine to mark as an error, 429 or 503 is fine too. What my example shows that some 'error' are not errors. The 404 during |
@gregkalapos does #281 address your concerns? |
In this case the documentation must be updated with rfc2119 in mind.
My PR here was the incorrect approach, if this change get accepted then I will create a way to pass a mapping of response codes and what the span status should be. I agree that blindly changing the all responses to be none errors is the wrong solution.
404 not being an error seems to be a fairly consistent theme here: elastic/apm-agent-python#1760 It seems that people are really not interested in seeing the noise in the reporting tools from none errors being reported as errors and the related alert fatiuge associated with them. |
I completely agree with what you say here. At the same time, I'd still remind that just by changing from
Not sure about that. From what I understand it's still not documented and there is no tooling around it forcing instrumentations to document when they are not following a I 💯% agree with @jjatria's comment:
My understanding is this: Some 4xxs are not always errors and in some cases, instrumentation has information about this which could be used to set or not set the span as an error (this I'd call edge-case which the spec could accommodate). In other cases (which I'd call base-case), the instrumentation does not have this info - so best we can do is to follow the HTTP RFC and here 4xx So why don't we just enable to use this information in the spec? |
You have 2 conflicting statements here, you can't use the word
This is what this whole issue is about, changing MUST to SHOULD will allow this. Alternatively it could be written in one of the 2 following ways:
IMHO having |
I think both of your proposals here are fine - both accommodate what you suggest here and don't contaminate the base case.
And would allow many other things - e.g. it'd also allow in a specific implementation to not set any of the 4xxs to error. My suggestion was to add the info into the spec that it's ok to not set 4xxs to errors IF the instrumentation has the knowledge that it's in fact not an error and not merge this into the base case. Let's see what others think - I think both of your proposals above are fine. |
The SHOULD implies that instrumentation would set status to error on 4xx. But if it has good reasons to set it to something else (keep it UNSET), it can do it too. So I feel simple change of MUST to SHOULD coneys the same message. I implemented it in the #1167, PTAL |
But then I'd prefer the spec says at least something about these good reasons. Right now (with #1167) it doesn't and I fear people will just interpret that sentence differently resulting in different implementations.
I'd not assume this. In jjatria/perl-opentelemetry#16 the But let's continue on the PR - maybe others will express their opinion as well. |
@gregkalapos I don't understand how we can keep MUST but immediately say that there are reasons where you don't follow it. https://datatracker.ietf.org/doc/html/rfc2119
If feels like the message you want to convey is "By default you must set span status to Error for for 4xx, but you can keep it unset if you have reasons to believe it's not an error". Based on the above definition of SHOULD, it can be simplified to "you SHOULD set status to Error for 4xx". Reading jjatria/perl-opentelemetry#16, I'm not sure that SHOULD would result in different default behavior. It seems the consensus there is to allow configurability and not change the default. |
I'd split it into 2 cases:
Or something along those lines - of course more polished. And under point 1 we could be even more specific (things you listed in the PR description). So these are 2 distinct cases -
For the 1. quote: as described above, I'd go a bit further and be specific on when you should not set it to error and what you can base that decision on. With that, I think this is more than just changing to Also another point: I think one goal of SemConv should be to have the same telemetry for the same situation in different languages. So the exact same library (e.g. a higher level HTTP lib that knows if something is an error or not) in language |
I think it's fairly common for span status to be context dependent, and the definition of the span status field reflects that (https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/trace/api.md#set-status). Typically it's the application developer who has this additional context to tune the span status, but I think the same applies when a library (e.g. a rest client) uses an underlying HTTP client (e.g. the library implements a boolean-valued exists function by calling the underlying HTTP client and converting 404 response to a return value of false). |
Ultimately, I can't give the clear guidance. The only guidance I can give is "4xx is an error on client. If you know it's not an error, don't report it as such." Which is a long version of SHOULD :) |
Area(s)
No response
Is your change request related to a problem? Please describe.
Currently the convention states that all 4xx errors in a client span
MUST
have their status set toERROR
. This introduces high telemetry span failure rates in REST APIs that use 4xx codes to represent things that are not failures in the context of telemetry. For example in a CRUD API a 404 response to a GET for a resource is a perfectly normal response code that is not indicative of an error but communicates to the client that the record they are looking for does not exist and and should create it if they require it to exist. Another common use is 429 to enforce rate limiting, this may or may not be considered an error depending on the API consumer implementation, it should be left to the API consumer to generate an error span if they consider rate limiting an error.Describe the solution you'd like
Change
MUST
toSHOULD
in the following sentence for client spans:Resulting in:
Describe alternatives you've considered
No response
Additional context
No response
The text was updated successfully, but these errors were encountered: