Clarify that API users SHOULD use valid UTF-8 for exportable data #3421
Comments
Strings are UTF-8 in Protobuf.

So, bypassing SDK checks is not enough. We would be non-compliant with the Protobuf spec, and recipients may fail to decode the data. Given that this is part of the Stable spec, I don't think we can change string to bytes anymore.
Could we agree on a transformation to turn invalid UTF-8 into valid UTF-8?
Yes, I think these are valid options to consider for the SDK functionality.
I believe there is already an issue for this elsewhere in the spec. For this case, in the Go issue I argued we should limit strings not by size but by grapheme, so that they don't get cut off into invalid UTF-8 (or have the meaning of the UTF-8 string changed).
Forgot to mention: actually, in the Erlang SDK we don't truncate into invalid UTF-8, because we treat the length limit as the number of graphemes: https://github.com/open-telemetry/opentelemetry-erlang/blob/main/apps/opentelemetry/src/otel_attributes.erl#L105 The issue is that the spec says "character" and not "grapheme": https://github.com/open-telemetry/opentelemetry-specification/blob/ca593258953811eb33197976b8e565299b874853/specification/common/README.md#attribute-limits
I believe there's a question of what we actually want to have here. Since protobuf (with v3 syntax) is very strict about UTF-8, and we intend for OTLP/proto to be the standard, we are quite constrained. The Golang protobuf implementation rejects such data in both marshaling and unmarshaling.
I am opposed to both of these options. The validate-and-force-correction path is simply expensive, especially when the SDK actually repairs invalid data. The expect-errors path is much worse, because generally the entire batch of data is rejected with little clue as to what the actual error was (note that the error message in open-telemetry/opentelemetry-go#3021 does not include detail).

Therefore, what I actually want is for SDKs to allow invalid UTF-8. While users are instructed that they SHOULD NOT use invalid UTF-8, SDKs will be specified such that they SHOULD NOT themselves consider invalid UTF-8 to be an error. As mentioned in my earlier comment, I believe this can be accomplished by maintaining two variants of the protocol definitions. How this solves the problem:
Consumers then can accept data with invalid UTF-8, which I think is ultimately what we all want. I find myself wanting to extend the library guidelines on performance to address this situation. Where we currently write:
I want to add a statement like so:
In addition, I would then add that SDKs SHOULD treat attribute "character limits" as though the character set is US-ASCII, so that (as a performance concern) truncation logic need not parse UTF-8 code points or identify grapheme clusters.
Isn't every recipient a processing pipeline from this perspective? This would mean a breaking change for every backend that supports receiving OTLP today. I want to suggest an alternate statement:
So, dear user, you have been warned. Don't do it; if you do and your backend crashes, that's between you and your vendor. This statement allows you to make choices in the SDKs: if you want to pay the cost and validate, you can do that; if you don't care and want to pass the problem on to the recipients, you can do that too.
"SHOULD" is a good compromise (I'd be on the side of requiring valid UTF-8), but I'd argue that we should also state to users that valid UTF-8 will not be broken.
We already specify that any non-valid UTF-8 string MUST be encoded as a bytes_value. I'm against the chaos that would ensue with forcing every string to be raw bytes with no notion of encoding. I think the current specification around bytes_value (with no encoding flag) is already a bit rough.
We now have open issues in the spec to handle this in the API/SDK. Can we close this issue?
Closing. I assume nothing is needed in this repo. |
Could you please link the spec issues, for readers following this issue?
Re-opening, can't find the issue I was referring to. :-(
Moving this to spec repo since there is nothing to be done on the proto. |
I will elevate this topic again in the next Specification SIG meeting. Recently I observed the following interaction. A Rust SDK emitted a string-valued attribute containing invalid UTF-8. This was passed to an OTel Collector, whose OTLP receiver parsed it as valid OTLP data. (Note that gRPC receivers do not validate UTF-8; they assume the client does this.) The OTel Collector passed this data to a vendor-internal exporter, which is based on gRPC. That exporter tried to send the same data further into the backend, but it failed because of the invalid UTF-8. It could be worse: if the Collector has a batching step and the invalid UTF-8 data is combined with other valid data, the whole request will fail.

It has become very easy to lose telemetry because some participants require valid UTF-8 while others do not. What is the ideal outcome? I think it would be unpopular if we added a requirement that SDKs MUST validate UTF-8, because it is an unnecessary expense. At the same time, valid UTF-8 is stated as a requirement and is enforced by Google protobuf libraries. The compromise I want us to reach is that where there is known UTF-8 enforcement, there MUST be an effort to prevent telemetry data from being dropped.
I do see the connection with #3950, which appears equally unresolved. I am surprised to hear about vendors that accept OTLP data and are not prepared to accept byte-slice data, which is already part of the OTLP data model. I will move forward my proposal above, which would resolve #3950 by requiring strings to be valid UTF-8 AND requiring SDKs to support byte-slice data AND requiring SDKs and collectors to enforce validity where there is known enforcement.
To help focus this conversation: the specification does have language to discourage API users from entering raw byte-valued attributes. I believe that (a) we should require strings to be valid UTF-8, and (b) we should add "byte slice" as a primitive value, with no expectation of UTF-8 validity. We may use either this issue or #3950 for discussion.
Just a note: I think this is only essential for OTLP and not for the API. For instance, in Java and .NET (see opentelemetry-specification/specification/common/attribute-type-mapping.md, lines 104 to 113, at 7145a5d):
I think this should be rephrased to something like:
@pellared The document you linked is restricted to attribute values, and my issue here is about all the other strings too. Sure, other language runtimes have an expectation of UTF-16 -- and what should a .NET or Java library do if, by some unsafe mechanism, it ends up with an invalid UTF-16 string? OpenTelemetry's error-handling guidelines say it can't produce an error. I don't know how easy it is to produce invalid UTF-16 at runtime in Java or .NET, but it's fairly easy in Go and Rust.
In open-telemetry/opentelemetry-go#3021, a user reports having spans fail to export because of invalid UTF-8.
I believe the SDK should pass this data through to the receiver, regardless of UTF-8 correctness. I am not sure whether gRPC and the Google protobuf library used by each SDK offer an option to bypass UTF-8 checking, but if so, that would be one path forward if we agree.
Another option is to maintain (via sed, e.g.) copies of the protocol that replace string with bytes; SDKs could use these definitions instead to bypass UTF-8 checking as well.