Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CBOR byte string support #1129

Closed
daviddesberg opened this issue Jun 12, 2018 · 54 comments
Closed

CBOR byte string support #1129

daviddesberg opened this issue Jun 12, 2018 · 54 comments
Assignees
Labels
aspect: binary formats BSON, CBOR, MessagePack, UBJSON solution: proposed fix a fix for the issue has been proposed and waits for confirmation
Milestone

Comments

@daviddesberg
Copy link

Currently, the CBOR feature of the library only supports writing strings out as UTF-8 and doesn't support byte arrays as described in the CBOR standard. I'd like to implement COSE on top of this library, but in order to do that, byte string support is needed.

I'm happy to submit a pull request for this, but I wanted to check in advance if this is something you'd be open to

@nlohmann nlohmann added the aspect: binary formats BSON, CBOR, MessagePack, UBJSON label Jun 14, 2018
@nlohmann
Copy link
Owner

How would you store the data coming from CBOR?

@nlohmann
Copy link
Owner

Related: #757, #587.

@nlohmann nlohmann added the state: needs more info the author of the issue needs to provide more details label Jun 18, 2018
@daviddesberg
Copy link
Author

my thought was creation of some byte string type, leaving existing treatment of strings alone. i'll have time this week to show in more detail

@nlohmann
Copy link
Owner

So you mean the byte string's bytes should just be parsed and stored in a JSON string? What about encodings? Please also have a look at the issues I referenced above.

@nyovaya
Copy link

nyovaya commented Jul 4, 2018

@nlohmann How about adding the suffix _BIN or _BINARY to the key string and the string then contains the byte string as a base64 converted string?

@nlohmann
Copy link
Owner

nlohmann commented Jul 4, 2018

What about an array of binary strings? What about an array of values of which some are binary strings? How would roundtripping work?

@nlohmann
Copy link
Owner

nlohmann commented Jul 9, 2018

Any news on this?

@nlohmann
Copy link
Owner

@lusitanian @nifker Any input on this?

@nyovaya
Copy link

nyovaya commented Jul 19, 2018

My approach would be still using a prefix/suffix for the key and for the data itself base32768 or base64 - but I dont what performance base32768 has or usable it is.
An example would look like this:

BINARY_binarystring: "媒腻㐤┖ꈳ埳"

binarystring_BINARY: "XmFFzQtc"

The prefix/suffix is trimmed from the key string and the resulting string is used as key in CBOR.
The value is just decoded from base64/base32768 back into a binary string and will be the value in CBOR.

@nlohmann
Copy link
Owner

I do not like this approach, as it relies on the convention to use and respect such prefixes/suffixes.

@nyovaya
Copy link

nyovaya commented Jul 20, 2018

Couldn't imagine how else you want to detect that.

@nlohmann
Copy link
Owner

There are two aspects:

  1. Parsing CBOR to JSON (deserialization). There is currently no support for binary strings. It could be added, e.g. by either implementing encoders for base64 and storing the result in a string or by extending the SAX interface to pass std::vector<uint8_t> to the caller so the user can decide how to interpret the received bytes. The first approach has the benefit of being close to CBOR. The second approach has the benefit of being very generic and shifting the decoding work to the user, allowing the library to stay slim in this point.
  2. Creating CBOR from JSON (serialization). We follow the JSON types right now and map all strings to CBOR's UTF-8 strings. In order to tell the serializer to encode specific strings differently, one could think of an additional parameter to the serializer that contains a mapping from JSON Pointers to desired CBOR types; that is, allowing to explicitly override the default types where desired. This would be a lot of work, but doable in principle.

It would be nice to hear more alternatives to these two aspects. But since the library needs to be maintained and support is very time-consuming, I am against an approach that relies on non-standard assumptions or conventions that need documentation and may often lead to surprises.

I'm setting this issue to discussion, and I hope to see some ideas. I am currently not planning to work on this topic, so pull requests are, as always, more than welcome! As an additional challenge, we should not forget the library also implements MessagePack and UBJSON, so ideas that also have those binary formats in mind are appreciated.

@nlohmann nlohmann added state: please discuss please discuss the issue or vote for your favorite option and removed state: needs more info the author of the issue needs to provide more details labels Jul 21, 2018
@nlohmann
Copy link
Owner

nlohmann commented Aug 4, 2018

Any opinion on this?

@alexkarpenko
Copy link

alexkarpenko commented Aug 26, 2018

How about adding a data type for binary blobs?

std::vector<uint8_t> myData{0,1,2,3,4,5,6};
json j = {
  {"binary", json::bin(myData.begin(), myData.end())}
};

auto v_msgpack = json::to_msgpack(j);
json j2 = json::from_msgpack(v_msgpack);

assert(j2["binary"].is_binary() == true);
auto dat = j2["binary"].get<std::string>(); // OK. Returns a string that holds the bytes (json::bin can just be a tagged struct with a std::string member)

auto str = j.dump(4); // base64 encodes binary fields
auto j3 = json::parse(str);
assert(j3["binary"].is_binary() == false); // binary data serialized to json will not survive round trip. instead it's now a base64 encoded string

@nlohmann
Copy link
Owner

Hm. Nice idea. Any opinions on this?

@nlohmann nlohmann mentioned this issue Sep 18, 2018
@stale
Copy link

stale bot commented Oct 17, 2018

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the state: stale the issue has not been updated in a while and will be closed automatically soon unless it is updated label Oct 17, 2018
@stale stale bot closed this as completed Oct 24, 2018
@arrtchiu
Copy link

As someone who wants to use MessagePack with binary blobs, I quite like @alexkarpenko's suggestion of a separate type. It's nice and explicit and doesn't rely on developers sticking to conventions.

One point I'm not completely sold on is automatically converting it to base64 string when serialising to JSON. I personally consider adding binary to an object and then trying to serialise it into JSON a developer error that should probably be an exception (if you want to use binary you should serialise to a format that supports it), but I don't hold that opinion too strongly either.

Another potential reason against base64-encoding is.. well.. does this library already have a base64 encoder? We probably don't want to add one just for that! :)

I don't have the time or expertise to complete this myself but I'm open to the possibility of contributing financially to see this done.

@alexkarpenko
Copy link

I don't feel strongly about what to do when serializing binary blobs to JSON. Throwing an exception seems reasonable to me.

@mdenna-nviso
Copy link

We also would really like to have CBOR byte string support to be able to import large binary BLOBs without additional overhead.
The suggestion proposed by @alexkarpenko's seems simple and clean and would be ideal for me.
Regarding the export to JSON, an exception is the best way to go in my opinion. Binary blobs are not a native JSON type anyway, and this would avoid complicate workarounds...

@ayounes-nviso
Copy link

Is it possible to reopen this ticket please?

@nlohmann nlohmann reopened this Mar 11, 2019
@stale stale bot removed the state: stale the issue has not been updated in a while and will be closed automatically soon unless it is updated label Mar 11, 2019
@stale
Copy link

stale bot commented Jun 3, 2019

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the state: stale the issue has not been updated in a while and will be closed automatically soon unless it is updated label Jun 3, 2019
@mdenna-nviso
Copy link

Hi, any update on this topic? @OmnipotentEntity, how is your implementation going? I'm afraid I don't have time at the moment to understand the internals of nlohmann/json, I could maybe give a hand if there are some very specific parts to implement..

@stale stale bot removed the state: stale the issue has not been updated in a while and will be closed automatically soon unless it is updated label Jun 5, 2019
@OmnipotentEntity
Copy link
Contributor

Sorry, most of the last few weeks have been eaten by moving house. Once I get my computer set up again, I can upload what I have to my GitHub, if you want to take a crack at it. I'm still intending on finishing this, I'm just slow due to real life reasons.

@mdenna-nviso
Copy link

Great, thanks for feedback! No hurry on my side, when you have something I will be pleased to be a beta-tester.. :)

@OmnipotentEntity
Copy link
Contributor

Hi, I'm back, I finally got my office set up. Please be patient with me while I begin a new job.

I had a question about design. How do you want to handle serialization of a binary value to text JSON? There's three possibilities to my mind:

  1. Refuse to serialize
  2. Serialize to array of numbers or base64 string or something.
  3. Extend JSON with a binary type
  4. Do any combination of the above and allow the user to choose.

These have the following pros and cons to my mind:

Refuse to serialize. Pros: Logically makes sense. There is no valid JSON serialization that is reversible. Cons: Now we have a new throwing path that wasn't there. It may seem surprising that we can serialize to binary formats but not text. And this library was originally about serializing JSON.

Serialize to array of numbers. Pros: This is what some binary formats will require anyway. A good compromise. Cons: Not reversible.

Serialize to base64 string. Pros: Same as previous. And can devise a tag of some kind to automatically change format back to binary. Cons: Small chance that normal data matches the tagging system with surprising consequences.

Extend JSON. Pros: Reversible and unambiguous. Allowed to accept non-JSON spec conforming data per RFC. Cons: This extension would need to be essentially invented (as I cannot find any literature on this topic), and no one else will support it in the short term, and very few at best will support it in the medium to long term. Not allowed to generate non-JSON spec conforming data per RFC.

All of the above. Pros: Obviously if we do everything the user desires nothing can possibly be bad, right? Cons: Lots of work to implement. Need to decide default behavior. Documentation might become confusing, and raises the required knowledge for usage of the library.

I'm reaching the point where I'd have to make a decision on this. And while discussion on this topic has occurred in this thread. No real consensus was reached. So I'm looking for direction.

@znark
Copy link

znark commented Jun 28, 2019

I had a question about design. How do you want to handle serialization of a binary value to text JSON? There's three possibilities to my mind:

1. Refuse to serialize

Refuse to serialize (throw an exception?) by default, but add a hook where you can plug in your own, custom adapter which will produce JSON compatible datatypes out of the binary datatype.

Since not all binary data blobs are the same, it could be useful if the binary data container could, by itself, carry some meta information, such as a pointer to a specific callback function for converting that specific data blob to a JSON-compatible type, or a type hint as a string (relayed to the hook callback function as a parameter), or something to that effect.

Also, if the data is to be serialized as a named field in a JSON object, maybe the callback could be allowed to modify the name of the field in such cases. (Could be used e.g. for adding a suffix denoting the type as suggested by someone else above. Would allow for many different schemes.)

If you can think of a method of adding similar custom hooks for deserializing from plain JSON data to binary blobs (allowing round trips when serializing and deserializing binary data into a custom JSON-compatible representation), that would be most excellent as well, but maybe not as important at this stage.

@znark
Copy link

znark commented Jun 28, 2019

More food for thought: maybe you could use std::any for storing any kind of object in the in-memory JSON variant and have a protocol (define conversion functions etc.) for converting that to the generic bin datatype when applicable and a fallback for converting the type to standard JSON datatypes presentation if doing a plain JSON serialization.

@mdenna-nviso
Copy link

Hi OE, welcome back!
Hmmmm.... since the main focus of this library is JSON and binary data is not a native JSON data type, option 1 (exception) looks good enough IMHO at this stage.
Users wanting to serialize a json tree containing binary nodes can just replace them with whatever encoding they like before calling the serialization method.
Callback hooks are an elegant way to achieve the same automatically and without data duplication, but if this requires additional complexities in the library, maybe can be postponed until there is a real need.

@OmnipotentEntity
Copy link
Contributor

OmnipotentEntity commented Jun 28, 2019

@znark

"such as a pointer to a specific callback function for converting that specific data blob to a JSON-compatible type"

We absolutely cannot store a raw pointer in serialized data. The pointer to that function isn't guaranteed to be the same from invocation to invocation on the same computer, let alone different computers with different operating systems and different processor architectures. A function identifier might be used, but as this is a library, not an application, and this would need to be something provided by the application anyway, I fail to see what we would gain exactly.

Python's builtin json type does have this functionality (callback hooks); however, their json type is also read-only, and Python is itself a weakly typed language, so a polymorphic return type will not cause extra boilerplate.

Putting the ability to store arbitrary types into json in this library would be quite a bit of work and require answering a bunch of difficult questions:

  1. Should we allow only POD types? (Obviously not, but then...)
  2. What if the type is not copyable?
  3. What if the type is only moveable?
  4. How do we keep from having the unknown type screw with the noexcept status of many different functions?
  5. How do we report multiple different unknown types back to the user? The user will not always know a priori the proper type of the item, it could be one of several, and C++ has bad type introspection.

@mdenna-nviso my only real reservation about this approach is that it's a new throwing path. Meaning that current users of the library may run into an issue where code that had been working perfectly now throws. Currently the serializer only throws in two cases, both related to invalid UTF-8 strings, so it's safe to use without a catch if you either do not have strings, or you know the strings are valid UTF-8 (such as if you generated them yourself.) It's probably not a huge deal, but I didn't see you address it.

@nlohmann
Copy link
Owner

For parsing, the first step could be to extend the SAX interface with a binary function that could take a std::vector<std::uint8_t> or std::string or anything that can hold arbitrary binary data. Then the binary parsers (CBOR, MessagePack, ...) need to be adjusted to call this binary function with the binary data. With a default implementation of binary to throw a parse error, this would not break existing implementations.

A user who wants to use binary data would then need to implement the SAX interface including the binary function and decide what to do with the data. Maybe it would be helpful to also pass input_format_t so you'd know more about the origin of the binary data.

For writing binary data, the actual JSON value would need to be extended. It would never be filled by the standard JSON parser, and we may decide to throw if binary data is tried to be dumped. Once this exists, the binary writers (CBOR, MessagePack, ...) would need to write the respective encodings and we would be ready to roundtrip. We could even use the mechanism for arbitrary type conversion to let users define their own mappings from types to binary formats.

But extending the JSON value is a lot of work, because it is used everywhere. It's would not be impossible, just a long way...

What do you think?

@OmnipotentEntity
Copy link
Contributor

I was in the middle of extending the JSON value itself to have a binary type. I'd say I'm about 40-50% complete or so. I'd prefer to do it this way rather than use an intermediate representation, because typically an application would want to parse binary data into a format external to JSON. (My use case in particular are neural network weights, for instance.)

@nlohmann
Copy link
Owner

nlohmann commented Jul 1, 2019

Cool! How does the extension look like? Do you use a templated type?

@OmnipotentEntity
Copy link
Contributor

Yeah, default type is std::vector<std::uint8_t>.

@jpetkau
Copy link

jpetkau commented Jul 30, 2019

Throwing when serializing binary to json seems like a painful choice; it means losing the ability to easily switch formats, if you want to write binary data in formats that support it.

As an alternate suggestion, how about picking a serialization format (e.g. base64) for binary to json conversion, doing nothing special on load (so it doesn't roundtrip), but making it easy to parse base64 to binary in user-defined serializers?

More specifically,

  • add value_t::binary and a binary variant of json_value
  • Add a binary variant to sax_parse().
  • nlohmann::json to/from msgpack or CBOR roundtrips as binary
  • json::parse(string) always produces string, never binary

So far I think this is the same as agreed above and @OmnipotentEntity is implementing. Then:

  • dump() encodes binary as base64 encoded string
  • For API convenience: add a json::bin type (as suggested by @alexkarpenko above), which could just be std::vector<uint8_t> or some wrapper struct, but specifically not std::string.
  • get<json::bin>() returns a binary value as json::bin
  • get<json::bin>() parses string values as base64. (E.g. ("\"aGVsbG8=\""_json).get<json::bin>() == json::bin("hello", 5)

So nlohmann::json objects wouldn't roundtrip through json text (but that's expected since json doesn't support binary). They would roundtrip through CBOR and msgpack, and user-defined types could easily be made to roundtrip through all formats, and there would be no heuristics anywhere; the only conversion would be base64-encoding binary for json, and that would only happen when explicitly requested from C++ code.

@stale
Copy link

stale bot commented Aug 30, 2019

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the state: stale the issue has not been updated in a while and will be closed automatically soon unless it is updated label Aug 30, 2019
@stale stale bot closed this as completed Sep 6, 2019
@sijohans
Copy link

sijohans commented Oct 10, 2019

This would be a nice an useful feature. This library can be used for creating tests towards a c target that parses CBOR and the library allows for writing easy to understand code when creating the data that the target should handle.

For test the following changes have been made:

  • If a string start withs "0x" and the following data is a valid hex representation.Encode/decode it as CBOR bytes. Could also perhaps add a flags to the to_cbor()/from_cbor() to enable this feature.

The problem with this is the cases when sending an actual string that starts with "0x" and has a valid hex representation following. For the above mentioned use case I am fine with this corner case, but I understand that this might not be acceptable for the library in general.

Interesting is how the CBOR playground (http://cbor.me/) handles/presents this:

CBOR Representation
A1               # map(1)
   65            # text(5)
      6279746573 # "bytes"
   45            # bytes(5)
      0011223344 # "\x00\x11\"3D"

JSON Representation
{"bytes": h'0011223344'}

Can this issue be reopened?

@ecorm
Copy link

ecorm commented Oct 17, 2019

Check out the WAMP protocol for ideas on JSON-encoding a binary blob: https://wamp-proto.org/_static/gen/wamp_latest.html#binary-conversion-of-json-strings

The basic idea is to prepend a Base64-encoded string with a nul character.

@OmnipotentEntity
Copy link
Contributor

OmnipotentEntity commented Apr 11, 2020

For those still following this issue, #1662 addresses this issue, and is finally in a complete state, and is ready for review and merge.

@nlohmann nlohmann added solution: proposed fix a fix for the issue has been proposed and waits for confirmation and removed state: please discuss please discuss the issue or vote for your favorite option state: stale the issue has not been updated in a while and will be closed automatically soon unless it is updated labels Apr 16, 2020
@nlohmann nlohmann self-assigned this Apr 16, 2020
@nlohmann nlohmann added this to the Release 3.8.0 milestone Apr 16, 2020
@nlohmann
Copy link
Owner

The functionality was added by merging #1662.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
aspect: binary formats BSON, CBOR, MessagePack, UBJSON solution: proposed fix a fix for the issue has been proposed and waits for confirmation
Projects
None yet
Development

No branches or pull requests