-
Notifications
You must be signed in to change notification settings - Fork 224
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Differing Interpretations of Protobuf Spec in Java and Go SDKs #759
Comments
Nice issue. @n3wscott @lance @JemDay any comments?
Should JSON really be just "text" in protobuf if it's going to treat it as a JSON string on the other side? If it were treated as binary would that avoid this issue? |
I'm curious is this a protobuf-format encoding issue or a json-format encoding issue - in the example above from the Go SDK is that the general behavior when the data is JSON (which I believe is in conflict with the JSON format specification). What this does call-out is the need for defined inter-op or compliance tests for the various format implementations to ensure they do play well with one another. |
An example unit-test input for a Proto formatted event with a JSON data payload for the java-sdk is here, it's a little simplistic and could potentially warrant a complex example. Note: The unit-test files use the JSON encoding for a protobuf object NOT the CE JSON format encoding. |
This issue is specific to the protobuf-deserializer in Go. It has to do with the difference in behavior that exists when the data is contained in the // NOTE: There are some issues around missing data content type values that
// are still unresolved. It is an optional field and if unset then it is
// implied that the encoding used for the envelope was also used for the
// data. However, there is no mapping that exists between data content types
// and the envelope content types. For example, how would this system know
// that receiving an envelope in application/cloudevents+protobuf know that
// the implied data content type if missing is application/protobuf.
//
// It is also not clear what should happen if the data content type is unset
// but it is known that the data content type is _not_ the same as the
// envelope. For example, a JSON encoded data value would be stored within
// the BinaryData attribute of the protobuf formatted envelope. Protobuf
// data values, however, are _always_ stored as a protobuf encoded Any type
// within the ProtoData field. Any use of the BinaryData or TextData fields
// means the value is _not_ protobuf. If content type is not set then have
// no way of knowing what the data encoding actually is. Currently, this
// code does not address this and only loads explicitly set data content
// type values.
contentType := ""
if container.Attributes != nil {
attr := container.Attributes[datacontenttype]
if attr != nil {
if stattr, ok := attr.Attr.(*pb.CloudEventAttributeValue_CeString); ok {
contentType = stattr.CeString
}
}
}
switch dt := container.Data.(type) {
case *pb.CloudEvent_BinaryData:
e.DataEncoded = dt.BinaryData
// NOTE: If we use SetData then the current implementation always sets
// the Base64 bit to true. Direct assignment appears to be the only way
// to set non-base64 encoded binary data.
// if err := e.SetData(contentType, dt.BinaryData); err != nil {
// return nil, fmt.Errorf("failed to convert binary type (%s) data: %s", contentType, err)
// }
case *pb.CloudEvent_TextData:
if err := e.SetData(contentType, dt.TextData); err != nil {
return nil, fmt.Errorf("failed to convert text type (%s) data: %s", contentType, err)
} The package main
import (
"fmt"
"github.com/cloudevents/sdk-go/v2/event"
)
type myObject struct {
Hello string `json:"hello"`
}
func main() {
e := event.New()
e.SetSpecVersion("1.0")
e.SetType("com.type")
e.SetSource("/")
e.SetID("123")
e.SetData("application/json", &myObject{Hello: "world"})
fmt.Printf("%v\n", e)
} This yields the output:
However if you attempt to serialize a string, then it is interpreted as a JSON-String, which is what causes the behavior we are seeing here: package main
import (
"fmt"
"github.com/cloudevents/sdk-go/v2/event"
)
func main() {
e := event.New()
e.SetSpecVersion("1.0")
e.SetType("com.type")
e.SetSource("/")
e.SetID("123")
e.SetData("application/json", "{\"hello\":\"world\"}")
fmt.Printf("%v\n", e)
} This yields the output:
Here is where the differing interpretations I mentioned come into play. Any container.Data = &pb.CloudEvent_BinaryData{
BinaryData: e.Data(),
}
if e.DataContentType() == ContentTypeProtobuf {
anymsg := &anypb.Any{
TypeUrl: e.DataSchema(),
Value: e.Data(),
}
container.Data = &pb.CloudEvent_ProtoData{
ProtoData: anymsg,
} This means that the Go protobuf serializer/deserializer will always round-trip successfully. However, the Java-SDK seems to have interpreted the spec to mean that there should be a one-to-one mapping between // If it's a proto message we can handle that directly.
if (data instanceof ProtoCloudEventData) {
final ProtoCloudEventData protoData = (ProtoCloudEventData) data;
if (protoData.getMessage() != null) {
protoBuilder.setProtoData(Any.pack(protoData.getMessage()));
}
} else {
if (Objects.equals(dataContentType, PROTO_DATA_CONTENT_TYPE)) {
// This will throw if the data provided is not an Any. The protobuf CloudEvent spec requires proto data to be stored as
// an Any. I would be amenable to allowing people to also pass raw unwrapped protobufs, but it complicates the logic here.
// Perhpas that can be a follow up if there is a need.
Any dataAsAny = null;
try {
dataAsAny = Any.parseFrom(data.toBytes());
} catch (InvalidProtocolBufferException e) {
throw CloudEventRWException.newDataConversion(e, "byte[]", "com.google.protobuf.Any");
}
protoBuilder.setProtoData(dataAsAny);
} else if (isTextType(dataContentType)) {
protoBuilder.setTextDataBytes(ByteString.copyFrom(data.toBytes()));
} else {
ByteString byteString = ByteString.copyFrom(data.toBytes());
protoBuilder.setBinaryData(byteString);
}
} In the Java code, when converting into protobuf the decision of whether to use // The proto spec says all text data should go into the text field. It is really difficult to figure out every case
// of text data based on the media type though, so I am just going to check for some common cases.
private static boolean isTextType(String type) {
if (type == null) {
return false;
}
return type.startsWith("text/") || "application/json".equals(type) || "application/xml".equals(type);
} So this only pops up when sending protobuf CloudEvents between services written in different languages. You can tell from the comments though that the spec was interpreted differently by the authors of each. In the Go code we have:
Where as the Java code explicitly use the |
This does deserve some conversation as this is obviously not a good situation. I'm not a Go expert so I can't see where the specific "encoding format" is being applied - I'm curious what happens when you use a JSON format in those examples as they should produce the same output .. the JSON format encoding requires that the datacontentype be checked to see how the data is represented on the wire. IN essence the only real difference here is that the JSON format requires the SDK to look for JSON data and represent it a set way, and the Proto spec extends that test to cover all textual media types - which granted is a bit of a wider net to cast but base don the same basic set of checks. What you have shone a light on here is that the Java code has a short-coming as it's not checking for datacontenttype I feel an issue coming on... |
I don't know that that is explicitly needed. |
Thanks @zach-robinson for flagging this - I feel this is a subject for the working group to discuss and clarify behavior as appropriate. Putting textual data into a binary construct raises the concern around how character-encoding schemes should be communicated. My personal view is that the Go-SDK isn't adhering to the spirit of the protobuf format specification:
|
I agree with this. |
What does |
ok, I'll bite - from either http://json.org or the ECMA standard you'll find this type of language "JSON is a lightweight, text-based, language-independent syntax for defining data interchange formats" And I'd say text is "a sequence of characters". If you put a "sequence of characters" into a "binary property" then you'd also need a mechanism to indicate the character encoding schema that was used so that it is correctly interpreted - thus the whole preamble construct for XML documents. Granted JSON documents are always meant to be exchanged in UTF-8 so given a content-type that indicates JSON you should be able to reliably process it. The protobuf string scalar type is UTF-8 encoded so there's no danger of mismatch when the string data is demarshalled. |
And in closing ... I 'may' be incorrect with my encoding comment for JSON, the standard does allow for other unicode encoding schemes which then demands introspection : See Section |
True, JSON is text, but I think this line from the intro comment is interesting:
If text_data is |
Personally I'd say that if a JSON string with content The C# SDK has different ways of handling data in general - the
This doesn't explicitly document what is meant by "binary" and "text" in the first paragraph, but the implementation is currently "byte arrays count as binary; strings count as text". This means if you want to put a JSON object (or JSON string) into the event, you (the caller) have to set the Structured mode event handling does not use the I don't think this issue needs to delay the protobuf event format 1.0 ratification, but we may want to add clarifying guidelines for SDKs. (We may want to think about the bigger issue of just how we specify SDK conformance in general, beyond this specific issue.) |
@jskeet - I was with you until the words ".. including the double quotes.." My argument is that there are no quotes around a JSON value unless that value is a
And not
Essentially you MUST be able to take the content of |
@JemDay: But that "including the double quotes" was in the context of a JSON string value (emphasis mine):
An empty JSON object should be stored without the double quotes, yes. That is my answer to @duglin's question here:
In my view, if the text_data is I suspect we're in violent agreement, and the lack of clarity is due to ambiguity between "a string that contains JSON" and "a JSON string value". In C#, to demonstrate the difference: string json = "{}"; // This is just a string which could be parsed as a JSON object
JValue value = new JValue("{}"); // This is a JSON string value that happens to contain curly braces Are we now actually in agreement? |
@jskeet - I believe we are sir :-) |
Move to the sdk-go repo per the 2/17/2022 call. |
Any update on this issue? |
Apart from the confusion around a json string, I think this thread came to the wrong conclusion. I think the go sdk is doing this correctly. in the original question, it has this example: {
"specversion" : "1.0",
"type" : "com.example.someevent",
"source" : "/mycontext",
"subject": null,
"id" : "C234-1234-1234",
"time" : "2018-04-05T17:31:00Z",
"comexampleextension1" : "value",
"comexampleothervalue" : 5,
"datacontenttype" : "application/json",
"data" : {
"appinfoA" : "abc",
"appinfoB" : 123,
"appinfoC" : true
}
} Which claims the java impl puts the json output into the proto data string. But I think this is incorrect, the output of marshaling a json object is not a string, it is a byte array. Casting to a string changes the value and the parsers on the other side can't make an assumption about that cast to be able to undo the cast. It makes the following non-deterministic: "data" : {
"appinfoA" : "abc",
"appinfoB" : 123,
"appinfoC" : true
} vs "data" : "{ \"appinfoA\" : \"abc\", \"appinfoB\" : 123, \"appinfoC\" : true }" The proto TextData should be the string This is exactly the same problem with parsing a cloudevent in structured mode. The spec says you need send json in json format and not a string. {
"specversion" : "1.0",
"type" : "com.example.someevent",
"source" : "/mycontext",
"id" : "C234-1234-1234",
"data" : {
"this" : "is sent as binary and is a json object"
}
} {
"specversion" : "1.0",
"type" : "com.example.someevent",
"source" : "/mycontext",
"id" : "C234-1234-1234",
"data" : "{\"this\" : \"is a json string and not an object\"}"
} So, I think the go sdk is working as intended. The issue lies in the java sdk's misinterpretation of the output of the json marshaler. |
Has somebody tried verifying the behavior using the test files from the Java SDK.... you should be able to create a protobuf object from the proto-json (and that's a proto encoding NOT a CloudEvent encoding) representation and then demarshall it using the GO SDK. The various proto to wire-level representations are here : https://github.com/cloudevents/sdk-java/tree/master/formats/protobuf/src/test/resources/v1 The JSON data payload is very simple .. it just caries an empty JSON object. Ideally we should align on the wire-representation(s) and then ensure the SDKs handle them coherently. I can't make the call tomorrow (7/20) unfortunately, and apologies for the multiple edits on this - I blame autocorrect. |
@n3wscott wrote:
Scott - I'm wondering if comparing this to how things work for HTTP would help find the right answer. In HTTP, how do the SDKs deal with an HTTP body of:
if a) content-type==app/json Do both result in the same thing (json object vs just string) ?
certainly true for golang, but I'm not sure this is a true statement in general since I think most people would say that JSON is a string serialization of data, not a binary one. |
it would be an empty json object.
it would be the string "{}" I should add some calcification, the This is not exactly the case for the proto, it is received as a string, and you will have to explicitly choose to cast that string to a byte array to be something the json unmarshaller can act on. This is not something we can do programmatically without some introspection and assumption of the intent of the inbound string. |
In that case, wouldn't:
result in a JSON Object not a string as you suggest? (I edited my previous comment) |
The simpler answer for your original question:
it is invalid and the parser will throw an error. A smart parser will read in the body as the content type (string) and then will not know how to parse as json (bytes) |
That's exactly the expected behavior if you were to switch the datacontenttype in the Protobuf examples. I'm a bit lost now as to whether we're talking purely about JSON over HTTP or JSON over a Protobuf encoded CE :-( |
But you're assuming a JSON parser only takes binary data. That's not true for all languages. I'm looking at some Java samples and they all seem to assume it's a string - which is a fair assumption given JSON is a stringified format. I guess it depends on whether the use of text_data and binary_data should override the datacontenttype attribute or not. I was assuming it wouldn't. |
I am not so sure, because how do I send the string I think if the proto gives me a string, and the content type is app/json, I should see that as a json string, and not try to cast it back to a byte array and then parse it. As @JemDay is suggesting, if I get a proto text data as I am saying we should see the proto text data field as having explicit quotes around the body, and any json parsing will always result in the string of the text data contents. vs Assume if datacontenttype is app/json, assume the text data field is bytes without the outer quotes. I think this ends up in a broken world where if I do send a simple json string:
I could get either of these in the proto and they are equivalent: (after
|
based on previous comments, I think:
is the point of disagreement. When a string is serialized in json, (per json.org) it starts with quotes so I'm not sure why we would remove the quotes when we put it into proto - it's part of the business data/payload. W/o it you can't tell if
absence or presence of quotes around the
is that a string or a number? If we assume "missing quotes" it's a string. But then how can I represent a number? Am I forced to use binary_data ? Feels wrong. |
I would say it is the string "1.5" because it is formatted as a string. If you want that to be the number |
I think the larger misunderstanding here is thinking proto.text_data is equivalent to the body of http, and it is not. We get access to the proto.text_data after the wire encoding has been interpreted. text_data is a string. Proto has an extra step we should respect in the type system. |
proto.text_data is not meant to mean anything other than "it's a character sequence" so if that's needed to carry an empty JSON Object it would be two characters long So .. From an SDK perspective the Protobuf (and XML!!) sdks are able to say to their parent applications the 'data' of the CloudEvent was a character sequence and here it is ... if desired they can go one step further and say "based on the When I did the Protobuf spec and eventually the XML spec is was heavily influenced by the way the JSON spec had evolved to enable 3 types of data, binary, character-sequence, and JSON and ended up replicating that pattern, so in both of them you see binary-data, character-sequence-data, and format-native-data. To Scott's point there is no association or relationship to an HTTP body (or any other transport for that matter) - we should also think in terms of the CE Information Model and consider how that model maps into a 'format' and that is completely independent of AMQP/HTTP/KAFKA etc.. |
And to continue that point ... if the JSON Value is a string containing the word hello then the proto.text_data would be 7 characters long - |
That behavior I'd want to verify to ensure the Java SDK works like that, I don't believe I had a test-case for that. |
On the use of the
text_data
andbinary_data
fields in the protobuf-format spec, it says:The Java SDK's serializer always uses the
text_data
field when thedatacontenttype
isapplication/json
,application/xml
ortext/*
. Otherwise thebinary_data
field is used. However, in the Go SDK's deserializerbinary_data
is treated literally, whiletext_data
gets serialized using the appropriate serializer for thedatacontenttype
. When a CloudEvent with adatacontenttype
ofapplication/json
is serialized into protobuf using the Java SDK and then sent to a service using the Go SDK and deserialized there, this results in the JSON in thetext_data
field being interpreted as a JSON-String when it should be treated as a JSON-Object. So if I serialize this CloudEvent into protobuf using the Java SDK:and then send the resulting protobuf object to a service using the Go SDK and deserialize it there back into JSON format, the result is:
This seems to be unintended behavior, but I as far as I can tell, neither the Java nor the Go SDK implementations of the Protobuf format are technically wrong according to the way the spec is written, so I decided to raise this issue here to get clarification. Is this a bug in one or both of these libraries? Should the Protobuf format spec be more specific in how the
binary_data
andtext_data
fields should be used? Or is this behavior somehow intended?The text was updated successfully, but these errors were encountered: