Fix serde implementation #140
Conversation
In serde, a fixed-length byte array (i.e. `[u8; LENGTH]`) and a dynamic byte array (i.e. `&[u8]`) are two different data types. In the previous implementation, these two data types were mixed together, so deserialization would fail for certain encodings, like JSON. This commit fixes the issue by using the fixed-length byte array as the serde-internal data type for all data structures. This also comes with the benefit of reducing the size of the encoded data. We also add more tests to prevent future breakage.
Any comments?
These changes look good to me. See https://docs.rs/ed25519/1.0.1/src/ed25519/lib.rs.html#219-269 for comparison. That said, I think this might be considered a SemVer-breaking change, at least for the serializers.
It seems like you are changing from the bytes type in the serde data model to the tuple type: https://serde.rs/data-model.html In a lot of serialization formats, you will likely need some kind of type information for each member of a heterogeneous tuple, so I would imagine that means per-byte overhead. And you won't be able to copy all the bytes off the wire with something like memcpy. For something like CBOR, representing the bytes simply as bytes has no such overhead: https://tools.ietf.org/html/rfc7049#section-2.1 Is this actually going to be smaller than just using the serde bytes type?
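For reference, the CBOR compact path mentioned here is cheap to quantify. A sketch of the byte-string header rules from RFC 7049 §2.1 (major type 2; the length is packed into the initial byte when below 24, otherwise carried in 1, 2, 4, or 8 following bytes):

```rust
// Sketch of the CBOR byte-string header from RFC 7049 §2.1 (major type 2).
// Lengths below 24 are packed into the initial byte; larger lengths are
// carried in 1, 2, 4, or 8 following big-endian bytes.
fn cbor_bytes_header(len: u64) -> Vec<u8> {
    const MAJOR_BYTES: u8 = 2 << 5; // major type 2 in the high 3 bits: 0x40
    match len {
        0..=23 => vec![MAJOR_BYTES | len as u8],
        24..=0xFF => vec![MAJOR_BYTES | 24, len as u8],
        0x100..=0xFFFF => {
            let mut v = vec![MAJOR_BYTES | 25];
            v.extend_from_slice(&(len as u16).to_be_bytes());
            v
        }
        0x1_0000..=0xFFFF_FFFF => {
            let mut v = vec![MAJOR_BYTES | 26];
            v.extend_from_slice(&(len as u32).to_be_bytes());
            v
        }
        _ => {
            let mut v = vec![MAJOR_BYTES | 27];
            v.extend_from_slice(&len.to_be_bytes());
            v
        }
    }
}
```

So a 64-byte byte string costs only a 2-byte header (`0x58 0x40`), i.e. 66 bytes total, which is consistent with the measurement quoted later in this thread.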
I think the right approach might be to prefer the
Sorry, but that is not how serde works. AFAIK, serde doesn't encode type information at all, since it is available at compile time through the Rust type system. In fact, the implementation used in this PR is identical to the one serde uses internally for fixed-length arrays: https://github.com/serde-rs/serde/blob/b539cb45d7e1460c42aea8201e29e711f37cd198/serde/src/ser/impls.rs#L140-L168. The only reason we need to do the same thing again is that it only supports arrays up to length 32. (See serde-rs/serde#631. Big arrays will eventually be supported in serde once const generics are stabilized.)
The biggest problem with the current implementation is that it mixes fixed-length arrays with dynamic-length ones. serde_json is not the only format with this problem. In formats like postcard, fixed-length arrays are serialized directly, while dynamic-length arrays have a length value prepended to the actual data. As such, the existing implementation is completely broken, and I don't think there is a way to somehow tolerate two distinct data types.
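To illustrate the incompatibility, here is a toy encoder in the spirit of postcard's scheme (a sketch for illustration, not postcard's actual implementation): a fixed-length array is written verbatim, while a dynamic slice gets a LEB128-style varint length prefix, so data written one way cannot be read back the other way.

```rust
// Toy illustration of a postcard-like rule (not postcard itself):
// a fixed-length array is written verbatim, because its length is part
// of the type, while a dynamic slice gets a varint length prefix.
fn encode_fixed(data: &[u8; 32]) -> Vec<u8> {
    data.to_vec() // no prefix: the length 32 is recoverable from the type
}

fn encode_dynamic(data: &[u8]) -> Vec<u8> {
    let mut out = Vec::new();
    let mut n = data.len() as u64;
    loop {
        // LEB128: 7 payload bits per byte, high bit marks continuation
        let byte = (n & 0x7f) as u8;
        n >>= 7;
        if n == 0 {
            out.push(byte);
            break;
        }
        out.push(byte | 0x80);
    }
    out.extend_from_slice(data);
    out
}
```

A reader expecting one encoding will misinterpret the other: the prefix byte of the dynamic form would be consumed as key material by a fixed-length reader, and vice versa.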
Here's the serde documentation for `serialize_bytes`: https://docs.serde.rs/serde/trait.Serializer.html#tymethod.serialize_bytes
And the default implementation is like the code you are citing, but using seq instead of tuple. So it seems that this patch will prevent the serializer from applying any format-specific optimizations for byte slices, such as the special compact path for bytes in CBOR that I mentioned. Anyway, claims like this
can be easily tested. Here's my test code using
now hosted at this repo: https://github.com/garbageslam/ed25519-serde When I point at ed25519-dalek version 1.0, like so,
I get this result:
So I get 66 bytes on the wire for 64 bytes of key material: only 2 bytes of overhead in total. When I point at your revision from this pull request, like so:
I get this result:
So it's a significant regression, almost twice as long on the wire, as expected. I think this is because after your commit,
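The rough arithmetic supports this. Under RFC 7049, a 64-byte value encoded as a single byte string costs 2 header bytes, while the same value encoded as an array of 64 unsigned integers costs 2 bytes of array header plus 1 byte per element below 24 and 2 bytes per element from 24 to 255, so close to twice the size for random key material. A back-of-the-envelope sketch (ignoring any outer struct framing):

```rust
// Rough CBOR sizes (per RFC 7049) for a byte slice encoded two ways:
// as one byte string vs. as an array of small unsigned integers.
fn cbor_len_header(len: usize) -> usize {
    // header size for a given length argument (byte string or array);
    // lengths above u32 range are omitted for brevity in this sketch
    match len {
        0..=23 => 1,
        24..=0xFF => 2,
        0x100..=0xFFFF => 3,
        _ => 5,
    }
}

fn size_as_byte_string(data: &[u8]) -> usize {
    cbor_len_header(data.len()) + data.len()
}

fn size_as_int_array(data: &[u8]) -> usize {
    // array header, then each u8 as an unsigned integer:
    // values 0..=23 fit in the initial byte, 24..=255 need a second byte
    cbor_len_header(data.len())
        + data.iter().map(|&b| if b < 24 { 1 } else { 2 }).sum::<usize>()
}
```

For 64 bytes of mostly-large (i.e. random) values, the array form approaches 130 bytes against 66 for the byte-string form, matching the observed regression.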
This is true for some formats like bincode, but not for others. For self-describing formats like JSON or CBOR, there must be enough data in the on-the-wire representation to recover the shape of the original structure without the type info coming from Rust. Protobuf is another example of a format with a special type just for "bytes", distinct from "sequence of heterogeneous elements", because otherwise there would be significant overhead in the format.
You're probably right that the bincode representation is shorter after your patch, by a few bytes, because it doesn't need to write down the length; it can get it from the type system. And the JSON length is probably unchanged, because JSON has no special compact bytes representation. But many other formats will regress significantly.
The implementation we use, which tolerates both seq and bytes when deserializing, works with those formats. IMO this is a problem with serde itself. Bytes is supposed to be a primitive in the serde data model, but for data formats that don't actually have an unambiguous representation of bytes, you sometimes end up getting back a seq or something when you deserialize. Intuitively, I'd say the deserializer should be responsible for smoothing that out: if I expect to get bytes back, then it should try to coerce what it finds to bytes. Unfortunately, this responsibility seems to bleed out to types that implement `Deserialize`.
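The "tolerate both" idea can be sketched without serde's actual traits. Below is a simplified stand-in for a serde `Visitor` (hypothetical type and method names, not the real serde API): a 32-byte public-key visitor that accepts either a borrowed byte slice or an element-by-element sequence, so either wire shape deserializes successfully.

```rust
use std::convert::TryInto;

// Simplified stand-in for serde's Visitor idea (hypothetical, not the real
// serde API): a visitor for a 32-byte public key that accepts either a
// borrowed byte slice (`visit_bytes`) or an element-by-element sequence
// (`visit_seq`), so both wire shapes can be deserialized.
struct KeyVisitor;

impl KeyVisitor {
    fn visit_bytes(&self, v: &[u8]) -> Result<[u8; 32], String> {
        v.try_into()
            .map_err(|_| format!("expected 32 bytes, got {}", v.len()))
    }

    fn visit_seq<I: Iterator<Item = u8>>(&self, mut seq: I) -> Result<[u8; 32], String> {
        let mut out = [0u8; 32];
        for slot in out.iter_mut() {
            *slot = seq.next().ok_or("sequence too short")?;
        }
        if seq.next().is_some() {
            return Err("sequence too long".into());
        }
        Ok(out)
    }
}
```

In real serde code, a `Deserialize` impl would supply such a visitor via `deserialize_bytes`, and a format that produced a seq would still land in the `visit_seq` path.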
It's really any format that includes a length prefix, including CBOR. A length prefix is redundant in these cases, since all of these types know their lengths a priori (they're fixed).
Can you give an example of a format that would regress? For some background on the technique used in this PR, I'd suggest reading serde-rs/serde#631 (comment).
Of the list here: https://serde.rs/#data-formats I would expect these formats to regress, because I know they have special compact handling for bytes buffers, and I think you don't get it unless you use `serialize_bytes`. Some of these formats I don't know enough about to judge. Many other formats, like protobuf, also have special handling for compact representations of bytes; it's really typical. But there is no serde/protobuf serializer AFAIK, due to the issue of protobuf needing tags and serde not having that in the data model.
That's interesting; I guess we should ask dtolnay. My sense from reading the serde docs is that he thinks there is no one right way to map your data to the data model. All I'm saying is: if we are changing away from the current code because we think it will be more efficient, that doesn't seem to be the case. Intuitively,
I would also be interested to know if any format besides bincode actually benefits from this change. I would expect that for JSON there is no change in the encoded representation at all; it's just going to become an array of numbers either way. For most formats, even if the size is fixed at compile time, there has to be some kind of framing for the result to be a valid serialization in that format, so the number you think you are omitting by fixing it at compile time won't actually be omitted, except in extremely low-level things like bincode. I don't know of anything else like bincode in the whole serde ecosystem in this regard. At least, that is my prediction.
IIRC CBOR. You're saying it will regress though, so it'd be good if someone checked. My expectation is, in all cases, that it would be the same size or smaller. |
Yeah, it looked to me like CBOR got 60 bytes bigger. I don't understand how it's going to know to use the compact bytes path if you don't tell it via `serialize_bytes`. But if you expect that this should work, maybe there's a bug somewhere in serde_cbor.
@garbageslam Thanks for the info. I think the reason the size regressed is that CBOR needs to encode the size (not the type) of every element in the array. Can you check whether the same regression happens for public-key serialization in this PR? There, it encodes directly as

My main goal for this PR is to solve the breakage in the current serde implementation caused by mixing two data types. So another solution would be encoding all data types as dynamic byte arrays, i.e.
OK, I ran some code to double-check. It seems that the issues I encountered when using postcard actually come from postcard itself (jamesmunns/postcard#22), not from this crate. Therefore, the only remaining issues are those when

What do you think? I think this should solve all of the problems, including serialization size, without introducing breaking changes. I can send a new PR for this.
See #146 for the alternative solution. |
Note that this PR introduces breaking changes. However, the previous serde implementation was already partially broken.