Fix serde implementation #140

Closed
wants to merge 4 commits

Conversation

@xu-cheng (Contributor) commented Aug 7, 2020

In serde, the fixed-length byte array (i.e. `[u8; LENGTH]`) and the dynamically sized byte slice (i.e. `&[u8]`) are two different data types. The previous implementation mixed these two types together, so deserialization fails for certain encodings such as JSON.

This PR fixes the issue by using fixed-length byte arrays as the serde data type for all data structures. This also comes with the benefit of reducing the size of the encoded data.

We also add more tests to prevent future breakage.

Note that this PR introduces breaking changes. However, the previous serde implementation was already partially broken.
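
For reference, here is a minimal sketch of the approach (not the exact diff; the wrapper type is hypothetical): a fixed-length array is serialized through serde's tuple type, element by element, so formats that rely on the Rust type (e.g. bincode) do not need to write out the length.

```rust
use serde::ser::{Serialize, SerializeTuple, Serializer};

const PUBLIC_KEY_LENGTH: usize = 32;

// Hypothetical wrapper standing in for a key type such as PublicKey.
struct PublicKeyBytes([u8; PUBLIC_KEY_LENGTH]);

impl Serialize for PublicKeyBytes {
    fn serialize<S: Serializer>(&self, serializer: S) -> Result<S::Ok, S::Error> {
        // The length is known from the Rust type, so it is not encoded here;
        // each byte is emitted as its own tuple element.
        let mut tuple = serializer.serialize_tuple(PUBLIC_KEY_LENGTH)?;
        for byte in self.0.iter() {
            tuple.serialize_element(byte)?;
        }
        tuple.end()
    }
}
```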

@xu-cheng (Contributor, author) commented Sep 8, 2020

Any comment?

@tarcieri (Contributor) commented Sep 8, 2020

These changes look good to me. The ed25519 crate (which is used for the Signature type) does something similar:

https://docs.rs/ed25519/1.0.1/src/ed25519/lib.rs.html#219-269

That said, I think this might be considered a SemVer breaking change, at least to the serializers.

@cbeck88 (Contributor) commented Sep 18, 2020

> This PR fixes the issue by using fixed-length byte arrays as the serde data type for all data structures. This also comes with the benefit of reducing the size of the encoded data.

It seems like you are changing from the bytes type in the serde data model to the heterogeneous tuple type. Is that actually going to be smaller?

https://serde.rs/data-model.html

In a lot of serialization formats, you will likely have to have some kind of type information for each member of a heterogeneous tuple, so I would imagine that means per-byte overhead. And you won't be able to copy all the bytes off the wire using something like memcpy.

For something like CBOR, representing the bytes simply as bytes won't have any such overhead: https://tools.ietf.org/html/rfc7049#section-2.1

Is this actually going to be smaller than just using the serde bytes type?

@cbeck88 (Contributor) commented Sep 18, 2020

I think the right approach might be to prefer the bytes representation in serde, and tolerate the seq representation when deserializing, to help serde-json out. That's what we did here (unless I'm mistaken): https://github.com/mobilecoinofficial/mobilecoin/blob/e1792601050db0fdb4c5eb7429db7353f1c6e20c/util/repr-bytes/src/lib.rs#L265
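
For illustration, here is a minimal sketch of that pattern (not the repr-bytes code linked above) for a 64-byte value: keep serializing via serialize_bytes, but give the deserialization visitor a visit_seq fallback alongside visit_bytes.

```rust
use core::fmt;
use serde::de::{Error, SeqAccess, Visitor};

struct SignatureVisitor;

impl<'de> Visitor<'de> for SignatureVisitor {
    type Value = [u8; 64];

    fn expecting(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "64 bytes, as a byte string or a sequence of integers")
    }

    // The compact representation (CBOR byte strings, bincode, ...).
    fn visit_bytes<E: Error>(self, v: &[u8]) -> Result<Self::Value, E> {
        if v.len() != 64 {
            return Err(E::invalid_length(v.len(), &self));
        }
        let mut out = [0u8; 64];
        out.copy_from_slice(v);
        Ok(out)
    }

    // The fallback for formats that hand back a sequence of numbers (JSON).
    fn visit_seq<A: SeqAccess<'de>>(self, mut seq: A) -> Result<Self::Value, A::Error> {
        let mut out = [0u8; 64];
        for (i, byte) in out.iter_mut().enumerate() {
            *byte = seq
                .next_element()?
                .ok_or_else(|| A::Error::invalid_length(i, &self))?;
        }
        Ok(out)
    }
}
```

A Deserialize impl would then call `deserializer.deserialize_bytes(SignatureVisitor)` and rely on the seq fallback when the format has no native bytes type.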

@xu-cheng (Contributor, author) commented:

> In a lot of serialization formats, you will likely have to have some kind of type information for each member of a heterogeneous tuple, so I would imagine that means per-byte overhead. And you won't be able to copy all the bytes off the wire using something like memcpy.

Sorry, but that is not how serde works. AFAIK, serde doesn't encode the type information at all, since it is available at compile time through the Rust type system.

In fact, the implementation used in this PR is identical to the one serde uses internally for fixed-length arrays: https://github.com/serde-rs/serde/blob/b539cb45d7e1460c42aea8201e29e711f37cd198/serde/src/ser/impls.rs#L140-L168. The only reason we need to do the same thing again is that it only supports arrays up to length 32. (See serde-rs/serde#631. Big arrays will eventually be supported in serde once const generics are stabilized.)

> I think the right approach might be to prefer the bytes representation in serde, and tolerate the seq representation when deserializing, to help serde-json out.

The biggest problem in the current implementation is that it mixes the fixed-length array with the dynamically sized one. serde_json is not the only format with this problem. In formats like postcard, fixed-length arrays are serialized directly, while dynamically sized arrays have a length value prepended to the actual data. As such, the existing implementation is completely broken, and I don't think you can find a way to somehow tolerate two distinct data types.
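
As a quick illustration (a sketch, assuming a postcard version with the alloc feature enabled), the two types really do encode differently:

```rust
// Compares postcard's encoding of a fixed-length array vs. a byte slice.
fn main() {
    let fixed: [u8; 4] = [1, 2, 3, 4];
    let slice: &[u8] = &[1, 2, 3, 4];

    // Fixed-length arrays are written as-is.
    let a = postcard::to_allocvec(&fixed).unwrap();
    // Slices get a varint length prefix in front of the data.
    let b = postcard::to_allocvec(&slice).unwrap();

    assert_eq!(a.len(), 4); // just the four bytes
    assert_eq!(b.len(), 5); // one length byte, then the four bytes
}
```

A deserializer written for one of these layouts cannot read data produced with the other.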

@cbeck88 (Contributor) commented Sep 18, 2020

Here's the serde documentation for serialize_bytes:

https://docs.serde.rs/serde/trait.Serializer.html#tymethod.serialize_bytes

> Serialize a chunk of raw byte data.

> Enables serializers to serialize byte slices more compactly or more efficiently than other types of slices. If no efficient implementation is available, a reasonable implementation would be to forward to serialize_seq. If forwarded, the implementation looks usually just like this:
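
(For context, the forwarding implementation those docs refer to looks roughly like the following Serializer method; this is a fragment for illustration, not a complete program.)

```rust
fn serialize_bytes(self, v: &[u8]) -> Result<Self::Ok, Self::Error> {
    // Fall back to a plain sequence when the format has no bytes type.
    let mut seq = self.serialize_seq(Some(v.len()))?;
    for b in v {
        seq.serialize_element(b)?;
    }
    seq.end()
}
```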

And that fallback implementation is essentially the code you are writing here, but using seq instead of tuple.

So it seems that this patch will prevent the serializer from applying any optimizations specific to the data format for byte slices, such as the special compact path for bytes in CBOR that I mentioned.

Anyway, claims like this

> This also comes with the benefit of reducing the size of the encoded data.

can be easily tested.

Here's my test code using serde_cbor,

use ed25519_dalek::{Keypair, KEYPAIR_LENGTH};

static KEYPAIR_BYTES: [u8; KEYPAIR_LENGTH] = [
    239, 085, 017, 235, 167, 103, 034, 062,
    007, 010, 032, 146, 113, 039, 096, 174,
    003, 219, 232, 166, 240, 121, 167, 013,
    098, 238, 122, 116, 193, 114, 215, 213,
    175, 181, 075, 166, 224, 164, 140, 146,
    053, 120, 010, 037, 104, 094, 136, 225,
    249, 102, 171, 160, 097, 132, 015, 071,
    035, 056, 000, 074, 130, 168, 225, 071, ];

fn main() {
    let keypair = Keypair::from_bytes(&KEYPAIR_BYTES).unwrap();
    let cbor_bytes = serde_cbor::to_vec(&keypair).unwrap();
    println!("Encoded length: {}", cbor_bytes.len());
}

now hosted at this repo: https://github.com/garbageslam/ed25519-serde

When I point at ed25519-dalek version 1.0, like so,


[dependencies]
ed25519-dalek = { version = "1.0", features = ["serde"] }
serde_cbor = "0.10"

I get this result:

chris@chris-ThinkPad-X1-Carbon-7th:~/ed25519-serde$ cargo run
   Compiling ed25519-serde v0.1.0 (/home/chris/ed25519-serde)
    Finished dev [unoptimized + debuginfo] target(s) in 0.34s
     Running `target/debug/ed25519-serde`
Encoded length: 66

So I get 66 bytes on the wire for 64 bytes of key material: only 2 bytes of overhead in total.

When I point at your revision from this pull request, like so:

[dependencies]
ed25519-dalek = { git = "https://github.com/xu-cheng/ed25519-dalek", rev = "25ad90649da0482853a5b70b32ebfce0e03d995f", features = ["serde"] }
serde_cbor = "0.10"

I get this result:

chris@chris-ThinkPad-X1-Carbon-7th:~/ed25519-serde$ cargo run
    Updating git repository `https://github.com/xu-cheng/ed25519-dalek`
    Updating crates.io index
   Compiling byte-tools v0.3.1
   Compiling opaque-debug v0.2.3
   Compiling fake-simd v0.1.2
   Compiling generic-array v0.12.3
   Compiling block-padding v0.1.5
   Compiling digest v0.8.1
   Compiling block-buffer v0.7.3
   Compiling curve25519-dalek v2.1.0
   Compiling sha2 v0.8.2
   Compiling ed25519-dalek v1.0.0-pre.4 (https://github.com/xu-cheng/ed25519-dalek?rev=25ad90649da0482853a5b70b32ebfce0e03d995f#25ad9064)
   Compiling ed25519-serde v0.1.0 (/home/chris/ed25519-serde)
    Finished dev [unoptimized + debuginfo] target(s) in 3.08s
     Running `target/debug/ed25519-serde`
Encoded length: 122

So it's a significant regression, almost twice as long on the wire, as expected.

I think this is because, after your commit, serde_cbor cannot just copy all the bytes onto the wire anymore; it needs a special control signal for each individual element, because you are calling serialize_element in a loop instead of passing them all at once. It is better to allow the serializer to use what optimizations it has for byte slices, as explained in the documentation.
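
For contrast, here is a minimal sketch of the bytes-based path (roughly what the pre-PR implementation and the repr-bytes crate linked earlier do), which hands the whole slice to the serializer in a single call so a format like CBOR can emit one compact byte string:

```rust
use serde::Serializer;

// Sketch only: a helper with the signature expected by
// #[serde(serialize_with = "...")], also usable in a manual Serialize impl.
fn serialize_as_bytes<S: Serializer>(key: &[u8; 64], serializer: S) -> Result<S::Ok, S::Error> {
    serializer.serialize_bytes(&key[..])
}
```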

> AFAIK, serde doesn't encode the type information at all, since it is available at compile time through the Rust type system.

This is true for some formats like bincode, but not for others. For self-describing formats like JSON or CBOR, there must be enough data in the on-the-wire representation to recover the shape of the original structure without the type info coming from Rust. Protobuf is another example of a format with a special type just for "bytes", distinct from "sequence of heterogeneous elements", because there would be significant overhead in the format otherwise.

@cbeck88 (Contributor) commented Sep 18, 2020

You're probably right that the bincode representation is shorter after your patch, by like, a few bytes, because it doesn't need to write down the length, it can get it from the type system. And the JSON length is probably unchanged, because JSON has no special compact bytes representation. But many other formats will regress significantly.

@cbeck88 (Contributor) commented Sep 18, 2020

> The biggest problem in the current implementation is that it mixes the fixed-length array with the dynamically sized one. serde_json is not the only format with this problem. In formats like postcard, fixed-length arrays are serialized directly, while dynamically sized arrays have a length value prepended to the actual data. As such, the existing implementation is completely broken, and I don't think you can find a way to somehow tolerate two distinct data types.

The implementation we use, which tolerates both seq and bytes when deserializing, works with serde_json. I never tried it with postcard.

IMO this is a problem with serde itself. Bytes is supposed to be a primitive in the serde data model, but for data formats that don't actually have an unambiguous representation of bytes, you sometimes get back a seq or something when you deserialize. Intuitively I'd like to say that the deserializer should be responsible for smoothing that out -- if I expect to get bytes back, then it should try to coerce what it finds to bytes. Unfortunately, this responsibility seems to bleed out to types that implement Deserialize directly and use the serialize_bytes call when serializing. That doesn't mean that it's wrong or a bug to use serialize_bytes; it just means you need to add some fallbacks for when you don't get bytes back exactly, due to limitations of the format. serialize_bytes is the only way to get the correct serialization in formats like CBOR.

@tarcieri (Contributor) commented:

> You're probably right that the bincode representation is shorter after your patch, by like, a few bytes, because it doesn't need to write down the length, it can get it from the type system.

It's really any format that includes a length prefix, including CBOR. A length prefix is redundant in these cases, since all of these types know their lengths a priori (they're fixed).

> But many other formats will regress significantly.

Can you give an example of a format that would regress?

For some background on the technique used in this PR, I'd suggest reading:

- serde-rs/serde#631 (comment)
- https://github.com/est31/serde-big-array/blob/master/src/lib.rs

@cbeck88 (Contributor) commented Sep 18, 2020

Of the list here: https://serde.rs/#data-formats

I would expect these formats to regress:

- CBOR
- BSON
- MessagePack
- Avro
- Pickle

because I know they have special compact handling for byte buffers, and I think that if you don't use serialize_bytes and instead make a sequence of serialize_element calls, there's probably no way the serializer can figure out after the fact that it should have used that.

Some of these formats I don't know enough about to judge.

Many other formats, like protobuf, also have special handling for compact representations of bytes; it's really typical. But there is no serde/protobuf serializer AFAIK, due to protobuf needing field tags and serde not having that in its data model.

@cbeck88 (Contributor) commented Sep 18, 2020

> For some background on the technique used in this PR, I'd suggest reading:

That's interesting. I guess we should ask dtolnay.

My sense from reading the serde docs is that he thinks there is no one right way to map your data to the data model. All I'm saying is that if we are changing away from the current code because we think it will be more efficient, that doesn't seem to be the case. Intuitively, serialize_bytes is going to give you the best way to serialize bytes in whatever the target format is, because you are giving the serializer the most information.

@cbeck88 (Contributor) commented Sep 18, 2020

> This also comes with the benefit of reducing the size of the encoded data.

I would also be interested if there is any format besides bincode that actually benefits from this change. I would expect that for JSON there is no change in the encoded representation at all; it's just going to become an array of numbers either way. For most things, even if the size is fixed at compile time, there will have to be some kind of framing in order to be a valid serialization in the data format. So the number you think you are omitting by fixing it at compile time won't actually be omitted, except in extremely low-level things like bincode. I don't know that there is anything else like bincode in the whole serde ecosystem in this regard. At least that is my prediction.

@tarcieri (Contributor) commented Sep 18, 2020

> I would also be interested if there is any format besides bincode that actually benefits from this change.

IIRC CBOR. You're saying it will regress though, so it'd be good if someone checked.

My expectation is, in all cases, that it would be the same size or smaller.

@cbeck88 (Contributor) commented Sep 18, 2020

Yeah, it looked to me like CBOR got about 60 bytes bigger.

I don't understand how it's going to know to use the compact bytes representation if you don't call serialize_bytes. But if you expect that this should work, maybe there's a bug somewhere in serde_cbor.

@xu-cheng (Contributor, author) commented:

@garbageslam Thanks for the info. I think the reason the size regressed is that CBOR needs to encode a per-element header for every element in the array. Can you check whether the same regression happens for the public-key serialization in this PR, where it is encoded directly as `[u8; PUBLIC_KEY_LENGTH]`?
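
A rough size check is consistent with those numbers, assuming CBOR's standard integer encoding (values 0 through 23 take one byte, 24 through 255 take two) and the keypair being serialized as one flat 64-element array:

```rust
// Back-of-the-envelope check of the 66- vs. 122-byte figures above.
fn main() {
    let keypair_bytes: [u8; 64] = [
        239, 85, 17, 235, 167, 103, 34, 62, 7, 10, 32, 146, 113, 39, 96, 174,
        3, 219, 232, 166, 240, 121, 167, 13, 98, 238, 122, 116, 193, 114, 215, 213,
        175, 181, 75, 166, 224, 164, 140, 146, 53, 120, 10, 37, 104, 94, 136, 225,
        249, 102, 171, 160, 97, 132, 15, 71, 35, 56, 0, 74, 130, 168, 225, 71,
    ];

    // Array encoding: 2-byte array header, then 1 or 2 bytes per element.
    let as_array = 2 + keypair_bytes
        .iter()
        .map(|&b| if b < 24 { 1 } else { 2 })
        .sum::<usize>();

    // Byte-string encoding: 2-byte header, then the 64 raw bytes.
    let as_byte_string = 2 + keypair_bytes.len();

    println!("array: {} bytes, byte string: {} bytes", as_array, as_byte_string);
    // Prints: array: 122 bytes, byte string: 66 bytes
}
```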

My main goal for this PR is to fix the breakage in the current serde implementation caused by mixing the two data types. Another solution would be to encode all data types as a dynamically sized byte array, i.e. `&[u8]`. However, it is worth noting that the Signature type is currently encoded in the same way as in this PR, i.e. as a fixed-length byte array.

@xu-cheng (Contributor, author) commented Sep 18, 2020

OK, I ran some code to double-check. It seems that the issues I encountered when using postcard actually come from postcard itself (jamesmunns/postcard#22) rather than from this crate.

Therefore, the only remaining issues are those that occur when serde_json is used. I think the best solution would be to keep using `&[u8]` for serde and fix the serde_json case. There is actually an official serde crate, https://github.com/serde-rs/bytes, that we can use.
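
For example, here is a minimal sketch of what that could look like with serde_bytes (hypothetical wrapper struct, not the eventual diff); its deserializer already tolerates both the bytes and the seq representations:

```rust
use serde::{Deserialize, Serialize};

// Hypothetical type for illustration only.
#[derive(Serialize, Deserialize)]
struct Wrapper {
    // serde_bytes keeps the compact bytes representation when serializing,
    // and its visitor also accepts a sequence of numbers when deserializing
    // (which is what serde_json hands back).
    #[serde(with = "serde_bytes")]
    key: Vec<u8>,
}
```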

What do you think? I think this should solve all of the problems, including the serialization size, without introducing breaking changes. I can send a new PR for this.

@xu-cheng (Contributor, author) commented:

See #146 for the alternative solution.

@xu-cheng closed this Sep 22, 2020
@xu-cheng deleted the serde branch on September 22, 2020 at 01:50