feat: Introduce `max_serialized_size` function #209

mina86 · 2023-08-30T02:05:50Z

The new borsh::max_serialized_size helper function returns the length
of the longest possible encoding of given type. This may be used to
preallocate output buffers or calculate minimum size of a memory
regions needed to keep the values (e.g. in case of Solana accounts).

borsh/src/schema.rs

The new borsh::max_serialized_size helper function returns the length of the longest possible encoding of given type. This may be used to preallocate output buffers or calculate minimum size of a memory regions needed to keep the values (e.g. in case of Solana accounts).

borsh/src/schema_helpers.rs

dj8yfo · 2023-09-03T14:46:52Z

borsh/src/schema_helpers.rs

+        Ok(Definition::Sequence { elements }) => {
+            // Assume that sequence has MAX_LEN elements since that’s the most
+            // it can have.
+            let sz = max_serialized_size_impl(MAX_LEN, elements, defs)?;


Preallocating 4GB of memory for String or Vec<u8>, or HashMap<u64, u32> (it's actually 12 * 4 GB here) doesn't appear terribly useful, if one instead expects the type to casually contain 200 bytes, or 2MB.

Using a number of types like BoundedVec you mentioned instead might be better imo.
Thus there're 2 alternatives:

Add max_length to Definition::Sequence :

Sequence { max_length: u32, elements: Declaration },

which will usually be populated with u32::MAX, and with something less for types such as BoundedVec<T, L, U>.
In this variant it can be made to work with statically bounded types with current helper function max_serialized_size after small adjustments.

To allow to compute max size for types, whose limit may be set at runtime:

let mut rng = rand::thread_rng(); let up: u64 = rng.next_u64() % 100 + 1; let up1: usize = up as usize; // error[E0435]: attempt to use a non-constant value in a constant let vec: BoundedVec<u8, 0, up1> = BoundedVec::from_vec(vec![20, 20, 20]).unwrap();

pub fn max_serialized_size<T: BorshSchema>() -> Option<usize> may be reverted back to be a
BorshSchema method, but with &self parameter, unused in default implementation:

fn max_serialized_size(&self) -> Option<usize> { ... }

Also this variant allows a hypothetical VarInt compute its own maximum size without need to add a separate Definition::Varint (VarInt type would have a "varint" Declaration and empty Definition):

Varint { min_bytes: u8, max_bytes: u8 },

@frol what do you think?

Side note: i've noticed that derived implementation of serde for BoundedVec<T, L, U> doesn't attempt to check the imposed length bounds at all.

For the second point, this would require changing schema so that rather than having a mapping from Declaration to Definition, there would be a mapping from Declaration to &dyn BorshSchema and then additional fn definition(&self) -> Option<Definition> method. Or Definition would need to be changed to be a trait with definition(&self) -> Option<Definition>; and max_serialized_size(&self) -> Option<usize> methods.

However, problem with that approach is that this isn’t serialisable. If the idea behind BorshSchema is to be able to read encoded data without having the definition of the types, it’s potentially pointless to rely on traits and such to get the max size information.

Maybe I’m overthinking this, but I’m imagining something like:

pub enum Definition { /// A sequence of elements of the same type. Sequence { /// How many bytes does the length prefix occupy. /// /// Zero means this is a fixed-length array or a custom encoding schema /// which doesn’t follow typical borsh conventions. In the former case, /// `min_length` and `max_length` will be set to the same value. In the /// latter case, it may be impossible to deserialise the data without /// additional knowledge of the declared type. length_width: u8, /// Minimum length of the sequence. Usually `0` for dynamically-sized /// sequences. min_length: u64, /// Maximum length of the sequence. Usually `2**length_width - 1`. max_length: u64, /// Type of the elements of the sequence. elements: Declaration, }, /// A fixed-size tuple with the length known at the compile time and the /// elements of different types. Tuple { /// Types of elements of the tuple. elements: Vec<Declaration>, }, /// A tagged union, a.k.a enum. Tagged-unions have variants with associated /// structures. Enum { /// Width of the discriminant tag. tag_width: u8, /// Possible variants of the enumeration. variants: Vec<(VariantName, Declaration)>, }, /// A structure, structurally similar to a tuple. Struct { /// Named fields of the structure with their types. /// /// Tuple-like structures are modelled as a `Tuple`. fields: Vec<(FieldName, Declaration)>, }, }

With that

[u8; 10] would be Sequence { length_width: 0, min_length: 10, max_length: 10, elements: "u8" },

Vec<u8> would be Sequence { length_width: 4, min_length: 0, max_length: u32::MAX, elements: "u8" },

SmallVec<u8> would be Sequence { length_width: 1, min_length: 0, max_length: 255, elements: "u8" },

BoundVec<1, 10, u8> would be Sequence { length_width: 4, min_length: 1, max_length: 10, elements: "u8" } and

VarInt<u32> would be Sequence { length_width: 0, min_length: 1, max_length: 5, elements: "u8" }.

And the tag_width addition to Enum would future-proof the spec for cases where users want to use more than 256 enum variants.

Sequence { length_width: u8, min_length: u64, max_length: u64, elements: Declaration } doesn't look that bad, especially due to being able to accommodate VarInt.

I would still not remove Definition::Array, though Sequence { length_width: 0, min_length: 10, max_length: 10, elements: "u8" } is fine to be understood too.

tag_width for Enum is redundant imo, as Enum is supposed to be human-readable, and if one has an Enum with over-than-u8-range variants, or over-than-u16-range variants, he/she probably has to factor something out, to make the definition of Enum human readable again.

I like explicitness, so I like the proposed extended Definition

borsh/src/schema_helpers.rs

dj8yfo · 2023-09-07T20:00:32Z

borsh/src/schema.rs

@@ -333,7 +333,7 @@ where
 #[cfg(hash_collections)]
 pub mod hashes {
    //!
-    //! Module defines [BorshSchema](crate::schema::BorshSchema) implementation for  
+    //! Module defines [BorshSchema](crate::schema::BorshSchema) implementation for
    //! [HashMap](std::collections::HashMap)/[HashSet](std::collections::HashSet).
    use crate::BorshSchema;



replace Definition in this file with the one proposed (or, strictly speaking, very close to the one proposed)

pub enum Definition { /// A sequence of elements of the same type. Sequence { /// How many bytes does the length prefix occupy. /// /// Zero means this is a fixed-length array or a custom encoding schema /// which doesn’t follow typical borsh conventions. In the former case, /// `min_length` and `max_length` will be set to the same value. In the /// latter case, it may be impossible to deserialise the data without /// additional knowledge of the declared type. length_width: u8, /// Minimum length of the sequence. Usually `0` for dynamically-sized /// sequences. min_length: u64, /// Maximum length of the sequence. Usually `2**length_width - 1`. max_length: u64, /// Type of the elements of the sequence. elements: Declaration, }, /// A fixed-size tuple with the length known at the compile time and the /// elements of different types. Tuple { /// Types of elements of the tuple. elements: Vec<Declaration>, }, /// A tagged union, a.k.a enum. Tagged-unions have variants with associated /// structures. Enum { /// Width of the discriminant tag. tag_width: u8, /// Possible variants of the enumeration. variants: Vec<(VariantName, Declaration)>, }, /// A structure, structurally similar to a tuple. Struct { fields: Fields, }, }

adjust max_serialized_size implementation accordingly.

A couple of questions about this:

First of all, what’s even the point of Fields? Specifically, why do
we bother encoding unnamed struct as a struct rather than as a tuple.
This only complicates clients of the code since they have two
identical cases to consider separately.

And secondly, (while I would even go as far as get rid of field names,
but baring that) why not get rid of Struct and instead use:

Tuple { /// Types of elements of the tuple. elements: Vec<(Option<FieldName>, Declaration)>, },

@mina86 let's address only Array, Enum and Sequence in this pr, as these are where new details are being added, the proposed changes to Struct aren't adding new information, they are just rearranging existing representation in a slightly different manner, albeit (arguably) in a more correct one.
We can try to include another pr with Definition::Struct/Tuple changes into 1.0.0 too.

@mina86 on second thought, it might be better to accept this pr as is.
And then issue 3 separate prs:

remove Definition::Array , replace it with new extended Definition::Sequence + a test for a local BoundedVec<T, L, U> max_serialized_size

add tag_width to Definition::Enum

make the change so that tuple structs become modelled as a Tuple

/// A fixed-size tuple with the length known at the compile time and the /// elements of different types. Tuple { /// Types of elements of the tuple. elements: Vec<Declaration>, }, /// A structure, structurally similar to a tuple. Struct { /// Named fields of the structure with their types. /// /// Tuple-like structures are modelled as a `Tuple`. fields: Vec<(FieldName, Declaration)>, },

Thus it will be easier for everyone to digest it in 4 separate chunks and do additional clarification per each point if needed. I completely skipped the part with point 3. on first pass.
What do you think?

I’m already working on the other things as separate PRs. Whether this PR should be accepted now or later I have no opinion.

borsh/src/schema_helpers.rs

…x found lints

frol

@mina86 Thanks for contributing it! Looks good to be merged.

@dj8yfo Please, merge when it makes sense (to minimize merge conflicts with other PRs, if any)

…` and `&str`

mina86 requested review from frol and dj8yfo as code owners August 30, 2023 02:05

mina86 force-pushed the b branch 3 times, most recently from 0b3f3d8 to 68a965a Compare August 30, 2023 14:54

dj8yfo reviewed Aug 30, 2023

View reviewed changes

borsh/src/schema.rs Outdated Show resolved Hide resolved

dj8yfo reviewed Aug 30, 2023

View reviewed changes

borsh/src/schema.rs Outdated Show resolved Hide resolved

mina86 force-pushed the b branch from 68a965a to 91ab622 Compare August 31, 2023 00:12

mina86 changed the title ~~Introduce BorshSchema::max_serialized_size method~~ Introduce borsh::max_serialized_size function Aug 31, 2023

mina86 force-pushed the b branch from 91ab622 to 86e3bde Compare August 31, 2023 00:38

mina86 requested a review from dj8yfo August 31, 2023 05:25

mina86 force-pushed the b branch from 86e3bde to 8ae9f6c Compare August 31, 2023 14:19

dj8yfo mentioned this pull request Sep 3, 2023

BorshSchema vs custom serialisation #211

Closed

dj8yfo reviewed Sep 3, 2023

View reviewed changes

mina86 added 2 commits September 6, 2023 04:52

wip

dd48929

fix test

c08a862

dj8yfo requested changes Sep 7, 2023

View reviewed changes

dj8yfo changed the title ~~Introduce borsh::max_serialized_size function~~ feat!: Introduce max_serialized_size function; enhance schema::Definition::{Sequence, Enum}; remove schema::Definition::Array Sep 7, 2023

dj8yfo changed the title ~~feat!: Introduce max_serialized_size function; enhance schema::Definition::{Sequence, Enum}; remove schema::Definition::Array~~ feat: Introduce max_serialized_size function Sep 8, 2023

dj8yf0μl added 2 commits September 8, 2023 22:08

build: revert externing alloc crate when std is available; fix test

fc87b14

chore(ci): add modules, gated by schema feature to clippy check; fi…

8f3f6ea

…x found lints

dj8yfo force-pushed the b branch from 75ee284 to 8f3f6ea Compare September 8, 2023 19:30

Merge branch 'master'

c477dc3

dj8yfo approved these changes Sep 8, 2023

View reviewed changes

switch from generic to operating on BorshSchemaContainer

cbe83ab

frol approved these changes Sep 9, 2023

View reviewed changes

frol and others added 2 commits September 9, 2023 08:21

chore: addressed clippy warning

cc12f26

chore: change BorshSchemaContainer api to accept both `&Declaration…

98b2b76

…` and `&str`

dj8yfo merged commit 5ae2d81 into near:master Sep 9, 2023
7 checks passed

frol mentioned this pull request Sep 9, 2023

chore: release #217

Merged

mina86 deleted the b branch September 9, 2023 14:57

dj8yfo mentioned this pull request Sep 10, 2023

Add Definition::Enum::tag_width field #215

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: Introduce `max_serialized_size` function #209

feat: Introduce `max_serialized_size` function #209

mina86 commented Aug 30, 2023 •

edited by dj8yfo

Loading

dj8yfo Sep 3, 2023 •

edited

Loading

mina86 Sep 6, 2023 •

edited

Loading

dj8yfo Sep 6, 2023

frol Sep 7, 2023

dj8yfo Sep 7, 2023 •

edited

Loading

mina86 Sep 8, 2023

dj8yfo Sep 8, 2023

dj8yfo Sep 8, 2023

mina86 Sep 8, 2023

frol left a comment

feat: Introduce max_serialized_size function #209

feat: Introduce max_serialized_size function #209

Conversation

mina86 commented Aug 30, 2023 • edited by dj8yfo Loading

dj8yfo Sep 3, 2023 • edited Loading

Choose a reason for hiding this comment

mina86 Sep 6, 2023 • edited Loading

Choose a reason for hiding this comment

dj8yfo Sep 6, 2023

Choose a reason for hiding this comment

frol Sep 7, 2023

Choose a reason for hiding this comment

dj8yfo Sep 7, 2023 • edited Loading

Choose a reason for hiding this comment

mina86 Sep 8, 2023

Choose a reason for hiding this comment

dj8yfo Sep 8, 2023

Choose a reason for hiding this comment

dj8yfo Sep 8, 2023

Choose a reason for hiding this comment

mina86 Sep 8, 2023

Choose a reason for hiding this comment

frol left a comment

Choose a reason for hiding this comment

feat: Introduce `max_serialized_size` function #209

feat: Introduce `max_serialized_size` function #209

mina86 commented Aug 30, 2023 •

edited by dj8yfo

Loading

dj8yfo Sep 3, 2023 •

edited

Loading

mina86 Sep 6, 2023 •

edited

Loading

dj8yfo Sep 7, 2023 •

edited

Loading