Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Introduce max_serialized_size function #209

Merged
merged 9 commits into from
Sep 9, 2023
Merged

Conversation

mina86
Copy link
Contributor

@mina86 mina86 commented Aug 30, 2023

The new borsh::max_serialized_size helper function returns the length
of the longest possible encoding of given type. This may be used to
preallocate output buffers or calculate minimum size of a memory
regions needed to keep the values (e.g. in case of Solana accounts).

borsh/src/schema.rs Outdated Show resolved Hide resolved
borsh/src/schema.rs Outdated Show resolved Hide resolved
@mina86 mina86 changed the title Introduce BorshSchema::max_serialized_size method Introduce borsh::max_serialized_size function Aug 31, 2023
@mina86 mina86 requested a review from dj8yfo August 31, 2023 05:25
The new borsh::max_serialized_size helper function returns the length
of the longest possible encoding of given type.  This may be used to
preallocate output buffers or calculate minimum size of a memory
regions needed to keep the values (e.g. in case of Solana accounts).
borsh/src/schema_helpers.rs Outdated Show resolved Hide resolved
borsh/src/schema_helpers.rs Outdated Show resolved Hide resolved
Ok(Definition::Sequence { elements }) => {
// Assume that sequence has MAX_LEN elements since that’s the most
// it can have.
let sz = max_serialized_size_impl(MAX_LEN, elements, defs)?;
Copy link
Collaborator

@dj8yfo dj8yfo Sep 3, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Preallocating 4GB of memory for String or Vec<u8>, or HashMap<u64, u32> (it's actually 12 * 4 GB here) doesn't appear terribly useful, if one instead expects the type to casually contain 200 bytes, or 2MB.

Using a number of types like BoundedVec you mentioned instead might be better imo.
Thus there're 2 alternatives:

  1. Add max_length to Definition::Sequence :
    Sequence { max_length: u32, elements: Declaration },

which will usually be populated with u32::MAX, and with something less for types such as BoundedVec<T, L, U>.
In this variant it can be made to work with statically bounded types with current helper function max_serialized_size after small adjustments.

  1. To allow to compute max size for types, whose limit may be set at runtime:
    let mut rng = rand::thread_rng();
    let up: u64 = rng.next_u64() % 100 + 1;
    let up1: usize = up as usize;
    // error[E0435]: attempt to use a non-constant value in a constant
    let vec: BoundedVec<u8, 0, up1> = BoundedVec::from_vec(vec![20, 20, 20]).unwrap();

pub fn max_serialized_size<T: BorshSchema>() -> Option<usize> may be reverted back to be a
BorshSchema method, but with &self parameter, unused in default implementation:

    fn max_serialized_size(&self) -> Option<usize> {
    ...
    }

Also this variant allows a hypothetical VarInt compute its own maximum size without need to add a separate Definition::Varint (VarInt type would have a "varint" Declaration and empty Definition):

    Varint { min_bytes: u8, max_bytes: u8 },

@frol what do you think?

Side note: i've noticed that derived implementation of serde for BoundedVec<T, L, U> doesn't attempt to check the imposed length bounds at all.

Copy link
Contributor Author

@mina86 mina86 Sep 6, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the second point, this would require changing schema so that rather than having a mapping from Declaration to Definition, there would be a mapping from Declaration to &dyn BorshSchema and then additional fn definition(&self) -> Option<Definition> method. Or Definition would need to be changed to be a trait with definition(&self) -> Option<Definition>; and max_serialized_size(&self) -> Option<usize> methods.

However, problem with that approach is that this isn’t serialisable. If the idea behind BorshSchema is to be able to read encoded data without having the definition of the types, it’s potentially pointless to rely on traits and such to get the max size information.

Maybe I’m overthinking this, but I’m imagining something like:

pub enum Definition {
    /// A sequence of elements of the same type.
    Sequence {
        /// How many bytes does the length prefix occupy.
        ///
        /// Zero means this is a fixed-length array or a custom encoding schema
        /// which doesn’t follow typical borsh conventions.  In the former case,
        /// `min_length` and `max_length` will be set to the same value.  In the
        /// latter case, it may be impossible to deserialise the data without
        /// additional knowledge of the declared type.
        length_width: u8,

        /// Minimum length of the sequence.  Usually `0` for dynamically-sized
        /// sequences.
        min_length: u64,

        /// Maximum length of the sequence.  Usually `2**length_width - 1`.
        max_length: u64,

        /// Type of the elements of the sequence.
        elements: Declaration,
    },

    /// A fixed-size tuple with the length known at the compile time and the
    /// elements of different types.
    Tuple {
        /// Types of elements of the tuple.
        elements: Vec<Declaration>,
    },

    /// A tagged union, a.k.a enum. Tagged-unions have variants with associated
    /// structures.
    Enum {
        /// Width of the discriminant tag.
        tag_width: u8,

        /// Possible variants of the enumeration.
        variants: Vec<(VariantName, Declaration)>,
    },

    /// A structure, structurally similar to a tuple.
    Struct {
        /// Named fields of the structure with their types.
        ///
        /// Tuple-like structures are modelled as a `Tuple`.
        fields: Vec<(FieldName, Declaration)>,
    },
}

With that

  • [u8; 10] would be Sequence { length_width: 0, min_length: 10, max_length: 10, elements: "u8" },
  • Vec<u8> would be Sequence { length_width: 4, min_length: 0, max_length: u32::MAX, elements: "u8" },
  • SmallVec<u8> would be Sequence { length_width: 1, min_length: 0, max_length: 255, elements: "u8" },
  • BoundVec<1, 10, u8> would be Sequence { length_width: 4, min_length: 1, max_length: 10, elements: "u8" } and
  • VarInt<u32> would be Sequence { length_width: 0, min_length: 1, max_length: 5, elements: "u8" }.

And the tag_width addition to Enum would future-proof the spec for cases where users want to use more than 256 enum variants.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sequence { length_width: u8, min_length: u64, max_length: u64, elements: Declaration } doesn't look that bad, especially due to being able to accommodate VarInt.

I would still not remove Definition::Array, though Sequence { length_width: 0, min_length: 10, max_length: 10, elements: "u8" } is fine to be understood too.

tag_width for Enum is redundant imo, as Enum is supposed to be human-readable, and if one has an Enum with over-than-u8-range variants, or over-than-u16-range variants, he/she probably has to factor something out, to make the definition of Enum human readable again.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like explicitness, so I like the proposed extended Definition

borsh/src/schema_helpers.rs Show resolved Hide resolved
borsh/src/schema_helpers.rs Outdated Show resolved Hide resolved
borsh/src/schema_helpers.rs Outdated Show resolved Hide resolved
@@ -333,7 +333,7 @@ where
#[cfg(hash_collections)]
pub mod hashes {
//!
//! Module defines [BorshSchema](crate::schema::BorshSchema) implementation for
//! Module defines [BorshSchema](crate::schema::BorshSchema) implementation for
//! [HashMap](std::collections::HashMap)/[HashSet](std::collections::HashSet).
use crate::BorshSchema;

Copy link
Collaborator

@dj8yfo dj8yfo Sep 7, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

replace Definition in this file with the one proposed (or, strictly speaking, very close to the one proposed)

pub enum Definition {
    /// A sequence of elements of the same type.
    Sequence {
        /// How many bytes does the length prefix occupy.
        ///
        /// Zero means this is a fixed-length array or a custom encoding schema
        /// which doesn’t follow typical borsh conventions.  In the former case,
        /// `min_length` and `max_length` will be set to the same value.  In the
        /// latter case, it may be impossible to deserialise the data without
        /// additional knowledge of the declared type.
        length_width: u8,

        /// Minimum length of the sequence.  Usually `0` for dynamically-sized
        /// sequences.
        min_length: u64,

        /// Maximum length of the sequence.  Usually `2**length_width - 1`.
        max_length: u64,

        /// Type of the elements of the sequence.
        elements: Declaration,
    },

    /// A fixed-size tuple with the length known at the compile time and the
    /// elements of different types.
    Tuple {
        /// Types of elements of the tuple.
        elements: Vec<Declaration>,
    },

    /// A tagged union, a.k.a enum. Tagged-unions have variants with associated
    /// structures.
    Enum {
        /// Width of the discriminant tag.
        tag_width: u8,

        /// Possible variants of the enumeration.
        variants: Vec<(VariantName, Declaration)>,
    },

    /// A structure, structurally similar to a tuple.
    Struct {
        fields: Fields,
    },
}

adjust max_serialized_size implementation accordingly.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A couple of questions about this:

First of all, what’s even the point of Fields? Specifically, why do
we bother encoding unnamed struct as a struct rather than as a tuple.
This only complicates clients of the code since they have two
identical cases to consider separately.

And secondly, (while I would even go as far as get rid of field names,
but baring that) why not get rid of Struct and instead use:

    Tuple {
        /// Types of elements of the tuple.
        elements: Vec<(Option<FieldName>, Declaration)>,
    },

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mina86 let's address only Array, Enum and Sequence in this pr, as these are where new details are being added, the proposed changes to Struct aren't adding new information, they are just rearranging existing representation in a slightly different manner, albeit (arguably) in a more correct one.
We can try to include another pr with Definition::Struct/Tuple changes into 1.0.0 too.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mina86 on second thought, it might be better to accept this pr as is.
And then issue 3 separate prs:

  1. remove Definition::Array , replace it with new extended Definition::Sequence + a test for a local BoundedVec<T, L, U> max_serialized_size
  2. add tag_width to Definition::Enum
  3. make the change so that tuple structs become modelled as a Tuple
    /// A fixed-size tuple with the length known at the compile time and the
    /// elements of different types.
    Tuple {
        /// Types of elements of the tuple.
        elements: Vec<Declaration>,
    },
    /// A structure, structurally similar to a tuple.
    Struct {
        /// Named fields of the structure with their types.
        ///
        /// Tuple-like structures are modelled as a `Tuple`.
        fields: Vec<(FieldName, Declaration)>,
    },

Thus it will be easier for everyone to digest it in 4 separate chunks and do additional clarification per each point if needed. I completely skipped the part with point 3. on first pass.
What do you think?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I’m already working on the other things as separate PRs. Whether this PR should be accepted now or later I have no opinion.

borsh/src/schema_helpers.rs Outdated Show resolved Hide resolved
@dj8yfo dj8yfo changed the title Introduce borsh::max_serialized_size function feat!: Introduce max_serialized_size function; enhance schema::Definition::{Sequence, Enum}; remove schema::Definition::Array Sep 7, 2023
@dj8yfo dj8yfo changed the title feat!: Introduce max_serialized_size function; enhance schema::Definition::{Sequence, Enum}; remove schema::Definition::Array feat: Introduce max_serialized_size function Sep 8, 2023
Copy link
Collaborator

@frol frol left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mina86 Thanks for contributing it! Looks good to be merged.

@dj8yfo Please, merge when it makes sense (to minimize merge conflicts with other PRs, if any)

@dj8yfo dj8yfo merged commit 5ae2d81 into near:master Sep 9, 2023
7 checks passed
@frol frol mentioned this pull request Sep 9, 2023
@mina86 mina86 deleted the b branch September 9, 2023 14:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants