CBOR byte string support #1129
How would you store the data coming from CBOR?
My thought was the creation of some byte string type, leaving the existing treatment of strings alone. I'll have time this week to show this in more detail.
So you mean the byte string's bytes should just be parsed and stored in a JSON string? What about encodings? Please also have a look at the issues I referenced above.
@nlohmann How about adding the suffix _BIN or _BINARY to the key string? The string would then contain the byte string as a base64-encoded string.
What about an array of binary strings? What about an array of values of which some are binary strings? How would roundtripping work?
Any news on this?
@lusitanian @nifker Any input on this?
My approach would still be using a prefix/suffix for the key, and base32768 or base64 for the data itself - but I don't know what performance base32768 has or how usable it is.
The prefix/suffix is trimmed from the key string, and the resulting string is used as the key in CBOR.
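For illustration, the suffix convention could be sketched as below. This is a standalone sketch of the suggestion, not an actual API; the `_BIN` suffix and the helper names are illustrative:

```cpp
#include <string>

// Hypothetical sketch: a key ending in "_BIN" marks its string value as
// base64-encoded binary; the suffix is trimmed before the key is written
// to CBOR.
bool has_binary_suffix(const std::string& key) {
    static const std::string suffix = "_BIN";
    return key.size() >= suffix.size() &&
           key.compare(key.size() - suffix.size(), suffix.size(), suffix) == 0;
}

std::string trim_binary_suffix(const std::string& key) {
    return has_binary_suffix(key)
               ? key.substr(0, key.size() - 4)  // drop "_BIN"
               : key;
}
```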
I do not like this approach, as it relies on the convention to use and respect such prefixes/suffixes.
I couldn't imagine how else you would detect that.
There are two aspects:
It would be nice to hear more alternatives to these two aspects. But since the library needs to be maintained and support is very time-consuming, I am against an approach that relies on non-standard assumptions or conventions that need documentation and may often lead to surprises. I'm setting this issue to discussion, and I hope to see some ideas. I am currently not planning to work on this topic, so pull requests are, as always, more than welcome! As an additional challenge, we should not forget the library also implements MessagePack and UBJSON, so ideas that also have those binary formats in mind are appreciated.
Any opinion on this?
How about adding a data type for binary blobs?

    std::vector<uint8_t> myData{0, 1, 2, 3, 4, 5, 6};
    json j = {
        {"binary", json::bin(myData.begin(), myData.end())}
    };

    auto v_msgpack = json::to_msgpack(j);
    json j2 = json::from_msgpack(v_msgpack);
    assert(j2["binary"].is_binary() == true);

    // OK: returns a string that holds the bytes
    // (json::bin can just be a tagged struct with a std::string member)
    auto dat = j2["binary"].get<std::string>();

    auto str = j.dump(4); // base64-encodes binary fields
    auto j3 = json::parse(str);

    // Binary data serialized to JSON will not survive a round trip;
    // instead it's now a base64-encoded string.
    assert(j3["binary"].is_binary() == false);
Hm. Nice idea. Any opinions on this?
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
As someone who wants to use MessagePack with binary blobs, I quite like @alexkarpenko's suggestion of a separate type. It's nice and explicit and doesn't rely on developers sticking to conventions.

One point I'm not completely sold on is automatically converting it to a base64 string when serialising to JSON. I personally consider adding binary to an object and then trying to serialise it into JSON a developer error that should probably be an exception (if you want to use binary, you should serialise to a format that supports it), but I don't hold that opinion too strongly either. Another potential reason against base64 encoding is... well... does this library already have a base64 encoder? We probably don't want to add one just for that! :)

I don't have the time or expertise to complete this myself, but I'm open to the possibility of contributing financially to see this done.
I don't feel strongly about what to do when serializing binary blobs to JSON. Throwing an exception seems reasonable to me.
We would also really like to have CBOR byte string support, to be able to import large binary BLOBs without additional overhead.
Is it possible to reopen this ticket, please?
Hi, any update on this topic? @OmnipotentEntity, how is your implementation going? I'm afraid I don't have time at the moment to understand the internals of nlohmann/json, but I could maybe give a hand if there are some very specific parts to implement.
Sorry, most of the last few weeks have been eaten by moving house. Once I get my computer set up again, I can upload what I have to my GitHub, if you want to take a crack at it. I'm still intending to finish this; I'm just slow due to real-life reasons.
Great, thanks for the feedback! No hurry on my side; when you have something, I will be pleased to be a beta-tester. :)
Hi, I'm back; I finally got my office set up. Please be patient with me while I begin a new job. I had a question about design: how do you want to handle serialization of a binary value to text JSON? There are several possibilities to my mind, with the following pros and cons:

- Refuse to serialize. Pros: Logically makes sense; there is no valid JSON serialization that is reversible. Cons: Now we have a new throwing path that wasn't there before. It may seem surprising that we can serialize to binary formats but not to text, and this library was originally about serializing JSON.
- Serialize to an array of numbers. Pros: This is what some binary formats will require anyway; a good compromise. Cons: Not reversible.
- Serialize to a base64 string. Pros: Same as the previous option, and we can devise a tag of some kind to automatically change the format back to binary. Cons: Small chance that normal data matches the tagging system, with surprising consequences.
- Extend JSON. Pros: Reversible and unambiguous; per the RFC, we are allowed to accept data that does not conform to the JSON spec. Cons: This extension would need to be essentially invented (I cannot find any literature on this topic), no one else will support it in the short term, and very few at best will support it in the medium to long term. Also, per the RFC, we are not allowed to generate data that does not conform to the JSON spec.
- All of the above. Pros: Obviously, if we do everything the user desires, nothing can possibly be bad, right? Cons: Lots of work to implement, a default behavior needs to be decided, and the documentation might become confusing, raising the knowledge required to use the library.

I'm reaching the point where I have to make a decision on this, and while discussion on this topic has occurred in this thread, no real consensus was reached. So I'm looking for direction.
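As a concrete illustration of the "serialize to an array of numbers" option above, a standalone sketch (not library code; the function name is illustrative):

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Dump binary bytes as a plain JSON array of integers. The result is
// valid JSON but not reversible: a parser cannot distinguish it from an
// ordinary array of numbers.
std::string bytes_to_json_array(const std::vector<std::uint8_t>& bytes) {
    std::string out = "[";
    for (std::size_t i = 0; i < bytes.size(); ++i) {
        if (i > 0) out += ',';
        out += std::to_string(bytes[i]);
    }
    out += ']';
    return out;
}
```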
Refuse to serialize (throw an exception?) by default, but add a hook where you can plug in your own custom adapter which produces JSON-compatible datatypes out of the binary datatype.

Since not all binary data blobs are the same, it could be useful if the binary data container could itself carry some meta information, such as a pointer to a specific callback function for converting that specific data blob to a JSON-compatible type, or a type hint as a string (relayed to the hook callback function as a parameter), or something to that effect. Also, if the data is to be serialized as a named field in a JSON object, maybe the callback could be allowed to modify the name of the field in such cases. (This could be used, e.g., for adding a suffix denoting the type, as suggested by someone else above, and would allow for many different schemes.)

If you can think of a method of adding similar custom hooks for deserializing from plain JSON data to binary blobs (allowing round trips when serializing and deserializing binary data into a custom JSON-compatible representation), that would be most excellent as well, but maybe not as important at this stage.
More food for thought: maybe you could use
Hi OE, welcome back! |
We absolutely cannot store a raw pointer in serialized data. The pointer to that function isn't guaranteed to be the same from invocation to invocation on the same computer, let alone on different computers with different operating systems and different processor architectures. A function identifier might be used, but as this is a library, not an application, this would need to be something provided by the application anyway, and I fail to see what we would gain exactly.

Python's builtin json module does have this functionality (callback hooks); however, their json type is also read-only, and Python is a dynamically typed language, so a polymorphic return type does not cause extra boilerplate. Putting the ability to store arbitrary types into JSON values in this library would be quite a bit of work and require answering a bunch of difficult questions:
@mdenna-nviso my only real reservation about this approach is that it's a new throwing path, meaning that current users of the library may run into an issue where code that had been working perfectly now throws. Currently the serializer only throws in two cases, both related to invalid UTF-8 strings, so it's safe to use without a catch if you either do not have strings, or you know the strings are valid UTF-8 (such as if you generated them yourself). It's probably not a huge deal, but I didn't see you address it.
For parsing, the first step could be to extend the SAX interface with an event for binary values. A user who wants to use binary data would then need to implement the SAX interface, including that new event.

For writing binary data, the actual JSON value would need to be extended. It would never be filled by the standard JSON parser, and we may decide to throw if one tries to dump binary data to text JSON.

But extending the JSON value is a lot of work, because it is used everywhere. It would not be impossible, just a long way... What do you think?
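A minimal sketch of what such a SAX extension might look like. The event name `binary` and the handler types are illustrative assumptions, not the library's actual interface:

```cpp
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Hypothetical SAX handler: alongside the existing events (null, boolean,
// number, string, ...), it gains a binary event that receives the raw
// bytes of a CBOR/MessagePack byte string.
struct sax_handler {
    virtual ~sax_handler() = default;
    virtual bool string(const std::string& /*val*/) { return true; }
    // New event: called once per byte string encountered by the parser.
    virtual bool binary(std::vector<std::uint8_t>&& /*bytes*/) { return true; }
};

// Example handler that collects every byte string it sees.
struct collect_binary : sax_handler {
    std::vector<std::vector<std::uint8_t>> blobs;
    bool binary(std::vector<std::uint8_t>&& bytes) override {
        blobs.push_back(std::move(bytes));
        return true;  // keep parsing
    }
};
```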
I was in the middle of extending the JSON value itself to have a binary type. I'd say I'm about 40-50% complete or so. I'd prefer to do it this way rather than use an intermediate representation, because typically an application would want to parse binary data into a format external to JSON. (My use case in particular is neural network weights, for instance.)
Cool! What does the extension look like? Do you use a templated type?
Yeah, default type is |
Throwing when serializing binary to JSON seems like a painful choice; it means losing the ability to easily switch formats if you want to write binary data in formats that support it. As an alternate suggestion, how about picking a serialization format (e.g. base64) for binary-to-JSON conversion, doing nothing special on load (so it doesn't round-trip), but making it easy to parse base64 to binary in user-defined serializers? More specifically,
So far I think this is the same as agreed above and what @OmnipotentEntity is implementing. Then:
So
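For concreteness, the base64 step in the "encode on dump" proposal could look like this standalone sketch (standard RFC 4648 alphabet; this is not code from the library):

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Encode raw bytes as base64 text, e.g. for embedding binary values in a
// text JSON dump. Pads with '=' as usual.
std::string base64_encode(const std::vector<std::uint8_t>& in) {
    static const char tbl[] =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
    std::string out;
    std::size_t i = 0;
    // Full 3-byte groups become 4 output characters.
    while (i + 3 <= in.size()) {
        std::uint32_t n = (in[i] << 16) | (in[i + 1] << 8) | in[i + 2];
        out += tbl[(n >> 18) & 63];
        out += tbl[(n >> 12) & 63];
        out += tbl[(n >> 6) & 63];
        out += tbl[n & 63];
        i += 3;
    }
    // Handle the 1- or 2-byte tail with padding.
    if (in.size() - i == 1) {
        std::uint32_t n = in[i] << 16;
        out += tbl[(n >> 18) & 63];
        out += tbl[(n >> 12) & 63];
        out += "==";
    } else if (in.size() - i == 2) {
        std::uint32_t n = (in[i] << 16) | (in[i + 1] << 8);
        out += tbl[(n >> 18) & 63];
        out += tbl[(n >> 12) & 63];
        out += tbl[(n >> 6) & 63];
        out += '=';
    }
    return out;
}
```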
This would be a nice and useful feature. This library can be used for creating tests against a C target that parses CBOR, and it allows for writing easy-to-understand code when creating the data the target should handle. For testing, the following changes have been made:
The problem with this is the case of sending an actual string that starts with "0x" and is followed by a valid hex representation. For the above-mentioned use case I am fine with this corner case, but I understand that this might not be acceptable for the library in general. It is interesting how the CBOR playground (http://cbor.me/) handles/presents this:
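The "0x" convention described above could be sketched as follows. This is a standalone illustration with made-up helper names; the ambiguity it cannot resolve is exactly the corner case mentioned: a genuine string like "0xFF" is indistinguishable from encoded binary.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Encode bytes as a hex string with a "0x" prefix.
std::string bytes_to_hex(const std::vector<std::uint8_t>& bytes) {
    static const char* digits = "0123456789ABCDEF";
    std::string out = "0x";
    for (std::uint8_t b : bytes) {
        out += digits[b >> 4];
        out += digits[b & 0x0F];
    }
    return out;
}

// Heuristic used on the way back: "0x" followed by an even number of
// valid hex digits is treated as binary.
bool looks_like_hex_binary(const std::string& s) {
    if (s.size() < 2 || s.compare(0, 2, "0x") != 0 || s.size() % 2 != 0)
        return false;
    return s.find_first_not_of("0123456789ABCDEFabcdef", 2) ==
           std::string::npos;
}
```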
Can this issue be reopened?
Check out the WAMP protocol for ideas on JSON-encoding a binary blob: https://wamp-proto.org/_static/gen/wamp_latest.html#binary-conversion-of-json-strings
The basic idea is to prepend a Base64-encoded string with a nul character.
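The WAMP scheme can be sketched as below. The function names are illustrative, and the base64 encoding of the payload is assumed to happen elsewhere; the point is only the NUL tag, which cannot appear in ordinary JSON-originated text, so the receiver can tell binary-carrying strings apart:

```cpp
#include <string>

// Tag an already-base64-encoded payload as binary by prepending '\0'
// (the WAMP convention for binary conversion of JSON strings).
std::string tag_as_binary(const std::string& base64_text) {
    return std::string(1, '\0') + base64_text;
}

// A string is binary-carrying iff it starts with the NUL tag.
bool is_tagged_binary(const std::string& s) {
    return !s.empty() && s.front() == '\0';
}
```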
For those still following this issue: #1662 addresses it, is finally in a complete state, and is ready for review and merge.
The functionality was added by merging #1662. |
Currently, the CBOR feature of the library only supports writing strings out as UTF-8 and doesn't support byte arrays as described in the CBOR standard. I'd like to implement COSE on top of this library, but in order to do that, byte string support is needed.
I'm happy to submit a pull request for this, but I wanted to check in advance if this is something you'd be open to.
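For reference, a CBOR byte string is major type 2 (RFC 7049): the initial byte is 0x40 plus the length for lengths up to 23, or 0x58/0x59/0x5A followed by a 1-, 2-, or 4-byte big-endian length. A standalone sketch of producing that framing (not the library's code):

```cpp
#include <cstdint>
#include <vector>

// Frame raw bytes as a CBOR byte string (major type 2).
std::vector<std::uint8_t> cbor_byte_string(
    const std::vector<std::uint8_t>& bytes) {
    std::vector<std::uint8_t> out;
    const std::size_t n = bytes.size();
    if (n < 24) {
        out.push_back(static_cast<std::uint8_t>(0x40 | n));
    } else if (n <= 0xFF) {
        out.push_back(0x58);  // 1-byte length follows
        out.push_back(static_cast<std::uint8_t>(n));
    } else if (n <= 0xFFFF) {
        out.push_back(0x59);  // 2-byte length follows
        out.push_back(static_cast<std::uint8_t>(n >> 8));
        out.push_back(static_cast<std::uint8_t>(n & 0xFF));
    } else {
        out.push_back(0x5A);  // 4-byte length follows
        for (int s = 24; s >= 0; s -= 8)
            out.push_back(static_cast<std::uint8_t>((n >> s) & 0xFF));
    }
    out.insert(out.end(), bytes.begin(), bytes.end());
    return out;
}
```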