Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for partial success in an OTLP export response #2636

3 changes: 3 additions & 0 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
Expand Up @@ -54,6 +54,9 @@ release.

### OpenTelemetry Protocol

- Add support for partial success in an OTLP export response
([#2636](https://github.com/open-telemetry/opentelemetry-specification/pull/2636)).

### SDK Configuration

### Telemetry Schemas
Expand Down
132 changes: 110 additions & 22 deletions specification/protocol/otlp.md
Original file line number Diff line number Diff line change
Expand Up @@ -16,14 +16,18 @@ nodes such as collectors and telemetry backends.
* [OTLP/gRPC](#otlpgrpc)
+ [OTLP/gRPC Concurrent Requests](#otlpgrpc-concurrent-requests)
+ [OTLP/gRPC Response](#otlpgrpc-response)
- [Success](#success)
- [Partial Success](#partial-success)
- [Failures](#failures)
+ [OTLP/gRPC Throttling](#otlpgrpc-throttling)
+ [OTLP/gRPC Service and Protobuf Definitions](#otlpgrpc-service-and-protobuf-definitions)
+ [OTLP/gRPC Default Port](#otlpgrpc-default-port)
* [OTLP/HTTP](#otlphttp)
+ [OTLP/HTTP Request](#otlphttp-request)
+ [OTLP/HTTP Response](#otlphttp-response)
- [Success](#success)
- [Failures](#failures)
- [Success](#success-1)
- [Partial Success](#partial-success-1)
- [Failures](#failures-1)
- [Bad Data](#bad-data)
- [OTLP/HTTP Throttling](#otlphttp-throttling)
- [All Other Responses](#all-other-responses)
Expand All @@ -35,7 +39,7 @@ nodes such as collectors and telemetry backends.
- [Known Limitations](#known-limitations)
* [Request Acknowledgements](#request-acknowledgements)
+ [Duplicate Data](#duplicate-data)
* [Partial Success](#partial-success)
+ [Partial Success Retry](#partial-success-retry)
- [Future Versions and Interoperability](#future-versions-and-interoperability)
- [Glossary](#glossary)
- [References](#references)
Expand Down Expand Up @@ -145,16 +149,55 @@ was not delivered.

#### OTLP/gRPC Response

The server may respond with either a success or an error to the requests.
The response MUST be the appropriate serialized Protobuf message (see below for
the specific message to use in the [Success](#success),
[Partial Success](#partial-success) and [Failure](#failures) cases).

##### Success

The success response indicates telemetry data is successfully accepted by the
server.

The success response indicates telemetry data is successfully processed by the
server. If the server receives an empty request (a request that does not carry
If the server receives an empty request (a request that does not carry
any telemetry data) the server SHOULD respond with success.

Success response is returned via
[Export*ServiceResponse](https://github.com/open-telemetry/opentelemetry-proto)
message (`ExportTraceServiceResponse` for traces, `ExportMetricsServiceResponse`
for metrics, `ExportLogsServiceResponse` for logs).
On success, the server response MUST be a Protobuf-encoded
[Export<signal>ServiceResponse](https://github.com/open-telemetry/opentelemetry-proto)
message (`ExportTraceServiceResponse` for traces,
`ExportMetricsServiceResponse` for metrics and
`ExportLogsServiceResponse` for logs).

The server MUST leave the `partial_success` field unset
in case of a successful response.

##### Partial Success

If the request is only partially accepted
(i.e. when the server accepts only parts of the data and rejects the rest), the
server response MUST be a Protobuf-encoded
[Export<signal>ServiceResponse](https://github.com/open-telemetry/opentelemetry-proto)
message (`ExportTraceServiceResponse` for traces,
`ExportMetricsServiceResponse` for metrics and
`ExportLogsServiceResponse` for logs).

Additionally, the server MUST initialize the `partial_success` field
(`ExportTracePartialSuccess` message for traces,
`ExportMetricsPartialSuccess` message for metrics and
`ExportLogsPartialSuccess` message for logs), and it MUST set the respective
`accepted_spans`, `accepted_data_points` or `accepted_log_records` field with
the number of spans/data points/log records it accepted. In case the server
rejected everything, the `accepted_<signal>` field MUST be set to `0`.

The server SHOULD populate the `error_message` field with a human-readable
error message in English. The message should explain why the
server rejected parts of the data, and might offer guidance on how users
can address the issues.
The protocol does not attempt to define the structure of the error message.

The client MUST NOT retry the request when it receives a partial success
response where the `partial_success` is populated.

##### Failures

When an error is returned by the server it falls into 2 broad categories:
retryable and not-retryable:
Expand Down Expand Up @@ -382,8 +425,9 @@ numbers or strings are accepted when decoding.

#### OTLP/HTTP Response

Response body MUST be the appropriate serialized Protobuf message (see below for
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are we repeating such a large body of text that is mostly identical to OTLP/gRPC? This constitutes the semantic definition of the protocol, which is independent of the encoding / transport used.

Copy link
Member Author

@joaopgrassi joaopgrassi Jul 11, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought the same 😕. I had an idea about putting the "common" things in a separated section but I was concerned the PR would touch too many things and it would be harder to review.

Copy link
Member Author

@joaopgrassi joaopgrassi Jul 12, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are a few nuances, for ex, for HTTP, we indicate which Content-Type should be used for the response, compression, which status code to return. Such things don't apply to gRPC.. so I was not sure what makes sense. Having a combined section and doing "If HTTP, return xyz header" doesn't sound good. Another point is if someone is implementing a gRPC server, they don't have to read the HTTP stuff and vice-versa.

I'm open to improve it, but I'm not sure to what extent. It could also be a follow up to make the review easier? Open to suggestions :)

the specific message to use in the Success and Failure cases).
The response body MUST be the appropriate serialized Protobuf message (see below for
the specific message to use in the [Success](#success-1),
[Partial Success](#partial-success-1) and [Failure](#failures-1) cases).

The server MUST set "Content-Type: application/x-protobuf" header if the
response body is binary-encoded Protobuf payload. The server MUST set
Expand All @@ -397,13 +441,47 @@ header.

##### Success

On success the server MUST respond with `HTTP 200 OK`. Response body MUST be
Protobuf-encoded `ExportTraceServiceResponse` message for traces,
`ExportMetricsServiceResponse` message for metrics and
`ExportLogsServiceResponse` message for logs.
The success response indicates telemetry data is successfully accepted by the
server.

If the server receives an empty request (a request that does not carry
any telemetry data) the server SHOULD respond with success.

The server SHOULD respond with success no sooner than after successfully
decoding and validating the request.
On success, the server MUST respond with `HTTP 200 OK`. The response body MUST be
a Protobuf-encoded [Export<signal>ServiceResponse](https://github.com/open-telemetry/opentelemetry-proto)
message (`ExportTraceServiceResponse` for traces,
`ExportMetricsServiceResponse` for metrics and
`ExportLogsServiceResponse` for logs).

The server MUST leave the `partial_success` field unset
in case of a successful response.

##### Partial Success

If the request is only partially accepted
(i.e. when the server accepts only parts of the data and rejects the rest), the
server MUST respond with `HTTP 200 OK`. The response body MUST be
a Protobuf-encoded [Export<signal>ServiceResponse](https://github.com/open-telemetry/opentelemetry-proto)
message (`ExportTraceServiceResponse` for traces,
`ExportMetricsServiceResponse` for metrics and
`ExportLogsServiceResponse` for logs).

Additionally, the server MUST initialize the `partial_success` field
(`ExportTracePartialSuccess` message for traces,
`ExportMetricsPartialSuccess` message for metrics and
`ExportLogsPartialSuccess` message for logs), and it MUST set the respective
`accepted_spans`, `accepted_data_points` or `accepted_log_records` field with
the number of spans/data points/log records it accepted. In case the server
rejected everything, the `accepted_<signal>` field MUST be set to `0`.

The server MAY populate the `error_message` field with a human-readable
error message in English. The message should explain why the
server rejected parts of the data, and might offer guidance on how users
can address the issues.
The protocol does not attempt to define the structure of the error message.

The client MUST NOT retry the request when it receives a partial success
response where the `partial_success` is populated.

##### Failures

Expand Down Expand Up @@ -520,11 +598,21 @@ received yet. The client will typically choose to re-send such data to guarantee
delivery, which may result in duplicate data on the server side. This is a
deliberate choice and is considered to be the right tradeoff for telemetry data.

### Partial Success
#### Partial Success Retry

Each server has its particularities and its way of treating data. There
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems to be a bit verbose. I think we should specify more explicitly here:

  1. Clients, by default, will not attempt automatic retry.
  2. Servers should leverage partial success when they fully understand that resending the exact same bundle of telemetry would lead to another error (preventing retry loops).

The rest of this commentary is nice for WHY, but the WHAT should be called out explicitly.

can be many reasons why a given server only accepted parts of an
OTLP request (e.g. quota exceeded,
data not conforming with the server's standards, etc).

The protocol offers a basic way of communicating partial reception success
from the server to the client. Such partial success information contains
how many spans/data points/log records were accepted and a general error
message. With only such information it is not feasible to achieve any type
of automatic retry by clients.

The protocol does not attempt to communicate partial reception success from the
server to the client (i.e. when part of the data can be received by the server
and part of it cannot). Attempting to do so would complicate the protocol and
The protocol does not give guidance on how such partial success requests can be
retried by clients. Attempting to do so would complicate the protocol and
implementations significantly and is left out as a possible future area of work.

## Future Versions and Interoperability
Expand Down