Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Inconsistent error status for gRPC and HTTP #3110

Closed
thoslin opened this issue Jan 18, 2023 · 18 comments · Fixed by #3333
Closed

Inconsistent error status for gRPC and HTTP #3110

thoslin opened this issue Jan 18, 2023 · 18 comments · Fixed by #3333
Assignees
Labels
area:semantic-conventions Related to semantic conventions spec:trace Related to the specification/trace directory triaged-accepted The issue is triaged and accepted by the OTel community, one can proceed with creating a PR proposal

Comments

@thoslin
Copy link

thoslin commented Jan 18, 2023

Hi, we found that the error status convention is different between gRPC and HTTP, for example, not found error in gRPC will set status to error, while in HTTP it is unset.

gPRC: https://opentelemetry.io/docs/reference/specification/trace/semantic_conventions/rpc/#grpc-status

The Span Status MUST be left unset for an OK gRPC status code, and set to Error for all others.

HTTP: https://opentelemetry.io/docs/reference/specification/trace/semantic_conventions/http/#status

For HTTP status codes in the 4xx range span status MUST be left unset in case of SpanKind.SERVER and MUST be set to Error in case of SpanKind.CLIENT.

This causes trouble for us. we have some gRPC services instrumented using the offical OpenTelemetry SDK following the specification, and it will set error status for not found errors during instrumentation. And as we're using Splunk APM service, which treats error status as server side errors https://docs.splunk.com/Observability/apm/apm-spans-traces/apm-errors.html#how-are-error-spans-detected, so we run into a situation where alerts will be triggered by not found errors.

We could work around this by tuning our alert rules to distinguish between gRPC and HTTP, but I'd like to check if it is possible to update the convention for gRPC to have more fine-grained handling as HTTP and not set not found error as error status?

@MrAlias
Copy link
Contributor

MrAlias commented Jan 18, 2023

@thoslin would something like this work?

Value Description Status
0 OK Unset
1 CANCELLED Unset
2 UNKNOWN Error
3 INVALID_ARGUMENT Unset
4 DEADLINE_EXCEEDED Error
5 NOT_FOUND Unset
6 ALREADY_EXISTS Unset
7 PERMISSION_DENIED Unset
8 RESOURCE_EXHAUSTED Unset
9 FAILED_PRECONDITION Unset
10 ABORTED Unset
11 OUT_OF_RANGE Unset
12 UNIMPLEMENTED Error
13 INTERNAL Error
14 UNAVAILABLE Error
15 DATA_LOSS Error
16 UNAUTHENTICATED Unset

@thoslin
Copy link
Author

thoslin commented Jan 20, 2023

@MrAlias this is exactly what i'm looking for. thanks. can we consider submitting a PR for this?

@Oberon00
Copy link
Member

Oberon00 commented Jan 20, 2023

RESOURCE_EXHAUSTED, for example, seems like an ERROR I would definitely like to notice in my tracing system by default (might correspond to HTTP 503 or 429 depending on context).
The above table would probably also be different for client and server (e.g. on client-side INVALID_ARGUMENT is definitely an error)

All in all, since gRPC has no client/server distinction baked into the error codes (like HTTP 4xx, 5xx), I would suggest to set everything as Error, except OK (unset), i.e. I think the current status quo is fine.

@arminru arminru added area:semantic-conventions Related to semantic conventions spec:trace Related to the specification/trace directory semconv:HTTP labels Jan 24, 2023
@trask trask assigned denisivan0v and unassigned SergeyKanzhelev Jan 30, 2023
@thoslin
Copy link
Author

thoslin commented Feb 1, 2023

@Oberon00 we can consider both client and server side error by extending that table. I think that's how convention for HTTP works. for example,

For HTTP status codes in the 4xx range span status MUST be left unset in case of SpanKind.SERVER and MUST be set to Error in case of SpanKind.CLIENT.

So we may have similar definition for INVALID_ARGUMENT, if it is SpanKind.SERVER, status is unset, but if it is SpanKind.CLIENT, then status is Error.

@thoslin
Copy link
Author

thoslin commented Feb 3, 2023

Does this work?

Value Description SpanKind.SERVER Status SpanKind.CLIENT Status
0 OK Unset Unset
1 CANCELLED Unset Unset
2 UNKNOWN Error Unset
3 INVALID_ARGUMENT Unset Error
4 DEADLINE_EXCEEDED Error Unset
5 NOT_FOUND Unset Error
6 ALREADY_EXISTS Unset Error
7 PERMISSION_DENIED Unset Error
8 RESOURCE_EXHAUSTED Error Unset
9 FAILED_PRECONDITION Unset Error
10 ABORTED Unset Unset
11 OUT_OF_RANGE Unset Error
12 UNIMPLEMENTED Unset Unset
13 INTERNAL Error Unset
14 UNAVAILABLE Error Unset
15 DATA_LOSS Error Unset
16 UNAUTHENTICATED Unset Error

@trask trask removed the semconv:HTTP label Feb 6, 2023
@pellared
Copy link
Member

pellared commented Feb 10, 2023

@thoslin Nice table 👍

I think that if SpanKind.SERVER Status is Error then SpanKind.SERVER Status should be Error as well. This would be alligned with the HTTP convention.

RESOURCE_EXHAUSTED may not be an error on the server side. From https://grpc.github.io/grpc/core/md_doc_statuscodes.html:

RESOURCE_EXHAUSTED 8 Some resource has been exhausted, perhaps a per-user quota, or perhaps the entire file system is out of space.

DEADLINE_EXCEEDED may be generated by the client. Per-user quota is more like a client-side error (at least in HTTP sem. conv.).

Here is my proposal:

Value Description SpanKind.SERVER Status SpanKind.CLIENT Status
0 OK Unset Unset
1 CANCELLED Unset Unset
2 UNKNOWN Error Error
3 INVALID_ARGUMENT Unset Error
4 DEADLINE_EXCEEDED Unset Error
5 NOT_FOUND Unset Error
6 ALREADY_EXISTS Unset Error
7 PERMISSION_DENIED Unset Error
8 RESOURCE_EXHAUSTED Unset Error
9 FAILED_PRECONDITION Unset Error
10 ABORTED Unset Error
11 OUT_OF_RANGE Unset Error
12 UNIMPLEMENTED Unset Error
13 INTERNAL Error Error
14 UNAVAILABLE Error Error
15 DATA_LOSS Error Error
16 UNAUTHENTICATED Unset Error

@fantapop
Copy link

The determination here will be a forcing function for how these codes are used in the wild. The majority of our code lines up with these but there are certainly differences in how my team has interpreted and used some of them. The fact that there are already 3 interpretations listed here in this ticket makes me think its not that cut and dry. I don't think it would be a huge lift to migrate to whatever is decided here but Is it the job of the open telemetry to be this forcing function? One other option would be to allow it to be customizable. Either way would be a step forward.

@pellared
Copy link
Member

pellared commented Feb 14, 2023

there are certainly differences in how my team has interpreted and used some of them

I was thinking about the same. NOT_FOUND is a good example. The fact that the client received NOT_FOUND does not mean that there is an error. However, the same problem is with HTTP semantic convention for 404 (Not Found) HTTP Status Code...

For HTTP status codes in the 4xx range span status MUST be left unset in case of SpanKind.SERVER and MUST be set to Error in case of SpanKind.CLIENT.

@fantapop
Copy link

FWIW, here are some places were maps to grpc codes exist:

https://github.com/googleapis/googleapis/blob/master/google/rpc/code.proto
https://chromium.googlesource.com/external/github.com/grpc/grpc/+/refs/tags/v1.21.4-pre1/doc/statuscodes.md

@pellared
Copy link
Member

@fantapop Thanks

this would make

Value Description SpanKind.SERVER Status SpanKind.CLIENT Status
0 OK Unset Unset
1 CANCELLED Unset Error
2 UNKNOWN Error Error
3 INVALID_ARGUMENT Unset Error
4 DEADLINE_EXCEEDED Error Error
5 NOT_FOUND Unset Error
6 ALREADY_EXISTS Unset Error
7 PERMISSION_DENIED Unset Error
8 RESOURCE_EXHAUSTED Unset Error
9 FAILED_PRECONDITION Unset Error
10 ABORTED Unset Error
11 OUT_OF_RANGE Unset Error
12 UNIMPLEMENTED Error Error
13 INTERNAL Error Error
14 UNAVAILABLE Error Error
15 DATA_LOSS Error Error
16 UNAUTHENTICATED Unset Error

@carlosalberto carlosalberto added the triaged-needmoreinfo The issue is triaged - the OTel community needs more information to decide label Feb 20, 2023
@carlosalberto
Copy link
Contributor

Let's discuss this in tomorrow's Spec meeting. We need to clarify at least the basic codes.

@pellared
Copy link
Member

pellared commented Mar 7, 2023

@carlosalberto @tigrannajaryan Is there any update on how we want to address this issue?

@thoslin
Copy link
Author

thoslin commented Mar 15, 2023

Is there any update on this?

@pellared
Copy link
Member

If there would be no objections, next week I am going to create a PR that aligns the gRPC sem. conv. to the latest table proposed here

@pellared
Copy link
Member

@carlosalberto Can you assign me?

@pellared
Copy link
Member

pellared commented Mar 22, 2023

I checked https://github.com/grpc/grpc-go and it occurs that sometimes the HTTP status codes <-> gRPC status codes mapping varies.

See

vs

https://github.com/googleapis/googleapis/blob/master/google/rpc/code.proto

TBH I find it strange when I read

Internal errors. This means that some invariants expected by the underlying system have been broken. This error code is reserved for serious errors.

Also https://github.com/grpc/grpc-go/blob/a02aae6168aa683a1f106e285ec10eb6bc1f6ded/internal/transport/http2_client.go#L1477. Why not codes.Unknown?

Also: https://pkg.go.dev/google.golang.org/grpc@v1.54.0/codes

However, I still think that the proposed table is the most pragmatic approximation we can have.

@pellared
Copy link
Member

PR created #3333

@tigrannajaryan tigrannajaryan added triaged-accepted The issue is triaged and accepted by the OTel community, one can proceed with creating a PR proposal and removed triaged-needmoreinfo The issue is triaged - the OTel community needs more information to decide labels Mar 22, 2023
@tigrannajaryan
Copy link
Member

Triaging this as "accepted" since I think this is an omission compared to http conventions.

carlosalberto pushed a commit that referenced this issue Apr 3, 2023
Fixes #3110

## Changes

The idea behind the PR is to make the spam statuses gRPC convention
similar to HTTP semantic conventions.
The gRPC statuses -> HTTP status codes mapping is not anywhere strictly
defined.
However, there is are some approximations which can be found:

-
grpc/grpc@bb04e07#diff-c94ff143c8f378e6925a985fa18528a8254a6d7fc34bc855e1de13f1e7f3e464
-
https://github.com/googleapis/googleapis/blob/master/google/rpc/code.proto

I got confused if we should treat `INTERNAL` as `Error` for
`SpanKind.SERVER` because of:

-
https://github.com/grpc/grpc/blob/master/doc/http-grpc-status-mapping.md
-
https://github.com/grpc/grpc-go/blob/a02aae6168aa683a1f106e285ec10eb6bc1f6ded/internal/transport/handler_server.go#L90C8-L92
-
https://github.com/grpc/grpc-go/blob/a02aae6168aa683a1f106e285ec10eb6bc1f6ded/internal/transport/http_util.go#L73

On the other hand, [the
description](https://github.com/grpc/grpc/blob/master/doc/statuscodes.md)
of `INTERNAL` says:

> Internal errors. This means that some invariants expected by the
underlying system have been broken. This error code is reserved for
serious errors.

Therefore, I decided to leave it as `Error`. Also because this is
backwards compatible (at least for this gRPC status).

---------

Co-authored-by: Tyler Yahn <MrAlias@users.noreply.github.com>
jsuereth pushed a commit to jsuereth/otel-semconv-test that referenced this issue Apr 19, 2023
Fixes open-telemetry/opentelemetry-specification#3110

## Changes

The idea behind the PR is to make the spam statuses gRPC convention
similar to HTTP semantic conventions.
The gRPC statuses -> HTTP status codes mapping is not anywhere strictly
defined.
However, there is are some approximations which can be found:

-
grpc/grpc@bb04e07#diff-c94ff143c8f378e6925a985fa18528a8254a6d7fc34bc855e1de13f1e7f3e464
-
https://github.com/googleapis/googleapis/blob/master/google/rpc/code.proto

I got confused if we should treat `INTERNAL` as `Error` for
`SpanKind.SERVER` because of:

-
https://github.com/grpc/grpc/blob/master/doc/http-grpc-status-mapping.md
-
https://github.com/grpc/grpc-go/blob/a02aae6168aa683a1f106e285ec10eb6bc1f6ded/internal/transport/handler_server.go#L90C8-L92
-
https://github.com/grpc/grpc-go/blob/a02aae6168aa683a1f106e285ec10eb6bc1f6ded/internal/transport/http_util.go#L73

On the other hand, [the
description](https://github.com/grpc/grpc/blob/master/doc/statuscodes.md)
of `INTERNAL` says:

> Internal errors. This means that some invariants expected by the
underlying system have been broken. This error code is reserved for
serious errors.

Therefore, I decided to leave it as `Error`. Also because this is
backwards compatible (at least for this gRPC status).

---------

Co-authored-by: Tyler Yahn <MrAlias@users.noreply.github.com>
jsuereth pushed a commit to open-telemetry/semantic-conventions that referenced this issue May 11, 2023
Fixes open-telemetry/opentelemetry-specification#3110

## Changes

The idea behind the PR is to make the spam statuses gRPC convention
similar to HTTP semantic conventions.
The gRPC statuses -> HTTP status codes mapping is not anywhere strictly
defined.
However, there is are some approximations which can be found:

-
grpc/grpc@bb04e07#diff-c94ff143c8f378e6925a985fa18528a8254a6d7fc34bc855e1de13f1e7f3e464
-
https://github.com/googleapis/googleapis/blob/master/google/rpc/code.proto

I got confused if we should treat `INTERNAL` as `Error` for
`SpanKind.SERVER` because of:

-
https://github.com/grpc/grpc/blob/master/doc/http-grpc-status-mapping.md
-
https://github.com/grpc/grpc-go/blob/a02aae6168aa683a1f106e285ec10eb6bc1f6ded/internal/transport/handler_server.go#L90C8-L92
-
https://github.com/grpc/grpc-go/blob/a02aae6168aa683a1f106e285ec10eb6bc1f6ded/internal/transport/http_util.go#L73

On the other hand, [the
description](https://github.com/grpc/grpc/blob/master/doc/statuscodes.md)
of `INTERNAL` says:

> Internal errors. This means that some invariants expected by the
underlying system have been broken. This error code is reserved for
serious errors.

Therefore, I decided to leave it as `Error`. Also because this is
backwards compatible (at least for this gRPC status).

---------

Co-authored-by: Tyler Yahn <MrAlias@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area:semantic-conventions Related to semantic conventions spec:trace Related to the specification/trace directory triaged-accepted The issue is triaged and accepted by the OTel community, one can proceed with creating a PR proposal
Projects
None yet
Development

Successfully merging a pull request may close this issue.