Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

wasmparser: Add no-hash-maps crate feature & support #1517

Closed
Robbepop opened this issue Apr 25, 2024 · 12 comments · Fixed by #1521
Closed

wasmparser: Add no-hash-maps crate feature & support #1517

Robbepop opened this issue Apr 25, 2024 · 12 comments · Fixed by #1521

Comments

@Robbepop
Copy link
Collaborator

Robbepop commented Apr 25, 2024

This issue follows #1493 and is about the implementation proposals introduced in bytecodealliance/wasmtime#8341 (comment) and detailed in bytecodealliance/wasmtime#8341 (comment).

My personal questions about this before I start with the implementation:

  • What should the crate feature be called?
  • When will btreemap-based data structures be enabled?
    • If #[cfg(feature = "x")]: Simple design but may be sub-optimal in cases where std is enabled.
    • If #[cfg(and(feature = "x", not(feature = "std"))]: Less transparent design but uses more efficient hashmaps whenever possible. For Wasmi this would be fine I suppose. I think this is the better design since you only really needs to avoid hashmaps for no_std mode since under std mode the crate has a proper random source.

Currently the type aliases can be found in this file:
https://github.com/bytecodealliance/wasm-tools/blob/main/crates/wasmparser/src/map.rs

So if I understood correctly the PR for btreemap-based data structures shall introduce a new type for each IndexMap and HashMap that either wraps hashbrown::HashMap or BTreeMap or indexmap::IndexMap or our custom solution, right?

Is this required so that we can have unified trait bounds that require both T: Ord + Hash to make uses always compile whatever mode is enabled?

Also we may want to rename the HashMap and HashSet type aliases to just Map and Set since then they are not necessarily hash based.

@Robbepop Robbepop changed the title wasmparser: Add btreemap-mode wasmparser: Add btreemap-mode Apr 25, 2024
@alexcrichton
Copy link
Member

Thanks for opening an issue for this!

How about no-hash-maps for a name? It'd be an off-by-default feature and when enabled wasmparser would switch all its maps to a btree map?

To ensure that this works well with Cargo features we'll need to change the type aliases for Index{Map,Set} to being a struct Index{Map,Set} along with all the iterator types and such in the module. Internally it'd then #[cfg] between the two implementations.

Renaming Hash{Map,Set} sounds good to me too!

@Robbepop
Copy link
Collaborator Author

Robbepop commented Apr 26, 2024

How about no-hash-maps for a name? It'd be an off-by-default feature and when enabled wasmparser would switch all its maps to a btree map?

So to answer my question; would no-hash-maps even switch all its maps to btree maps if at the same time the std feature was enabled which would provide a proper random source to all hash maps thus solving the underlying problem of why we want to also support btree maps in the first place?

To ensure that this works well with Cargo features we'll need to change the type aliases for Index{Map,Set} to being a struct Index{Map,Set} along with all the iterator types and such in the module. Internally it'd then #[cfg] between the two implementations.

So are you saying that the current Index{Map,Set} type aliases defined in https://github.com/bytecodealliance/wasm-tools/blob/main/crates/wasmparser/src/map.rs can no longer be type aliases that point to different types depending on their #[cfg] flags but must be proper struct types instead? Just out of curiosity: why could type aliases be problematic with crate features? I assume that the HashMap and HashSet type aliases can remain type aliases or will they need the same treatment?

@Robbepop
Copy link
Collaborator Author

While implementing the initial parts of this issue I think I just answered one of the questions myself.

We indeed want proper types wrapping either Hash{Map,Set} or BTree{Map,Set} and provide a unified API or otherwise we'd have to take care of conditional uses throughout the entire crate. Also we want to expose unified Entry and iterator types.

@alexcrichton
Copy link
Member

Yeah cargo features are required to be additive to work and in this case that's not what's being toggled so we need an opaque new type to be able to switch the implementation under the hood.

For the feature that was my thinking, it's possible to just trees and std just on an opt-in basis. Not being able to use hash maps I think is relatively niche so I think it's ok to just kinda do whatever works here. Can always change it in the future if necessary

@Robbepop
Copy link
Collaborator Author

Robbepop commented Apr 26, 2024

Okay.

I will open a draft PR soon so we can comment there to have some code to work/review.

@Robbepop
Copy link
Collaborator Author

Robbepop commented Apr 26, 2024

High-level API Proposal

Currently the map submodule looks like this:

image

My proposal is to make it look like this:

image

Where each submodule of the collections module, namely index_map, index_set, map and set, has its own set of definitions for Entry, Iter, IterMut etc.

  • The hash submodule contains RandomState and RandomStateHasher.
  • The IndexMap, IndexSet, Map and Set type definitions at the wasmparser::collections scope are just re-exports of the aforementioned submodules.

This design also opens doors for future data structures that are often based on hash maps such as string interners.

@alexcrichton what do you think about this high-level API design?

@Robbepop Robbepop changed the title wasmparser: Add btreemap-mode wasmparser: Add no-hash-maps crate feature & support Apr 26, 2024
@alexcrichton
Copy link
Member

That seems reasonable to me yeah, thanks!

@Robbepop

This comment was marked as outdated.

@Robbepop

This comment was marked as resolved.

@Shnatsel
Copy link

Shnatsel commented May 6, 2024

Speaking of which, why doesn't wasmparser always use BTreeMap in #[no_std] mode? Are there good reasons to keep hashbrown + ahash (and all the associated dependencies, see #1528) as the default #[no_std] option? Especially since the #[no_std] randomness source seems flimsy.

If wasmparser simply used BTreeMap when the std feature is disabled and HashMap with the std feature enabled, then #1532 wouldn't be necessary. And this would also eliminate external dependencies with unsafe code in them, making the crate easier to review and easier to adopt in enterprise environments.

@alexcrichton
Copy link
Member

There's some more discussion of that on bytecodealliance/wasmtime#8341, but my personal reasoning is that I'd rather not rock the boat as I don't fully understand the impact of such a change. I don't feel like there's a great sense for precisely what the attack vector is and/or how feasible it is to exploit something in no_std mode vs not. In the abstract I would like to keep no_std as similar to "std mode" as possible since "std mode" is much more heavily tested and divergences are where bugs typically lie.

I've wanted to do #1532 for a long time anyway, and while I understand that the more dependencies the less merrier I basically don't personally feel equipped to say whether this is a change that should be made or not.

@Robbepop
Copy link
Collaborator Author

Robbepop commented May 11, 2024

@Shnatsel @alexcrichton I use very similar data structures in Wasmi via its wasmi_collections crate and saw wastly better performance when using the btree-map based types (15-25%). I think the explanation for this is that hash-maps scale better but btree-maps perform better for small data sets (and consume less memory) which are very common for Wasmi. However, even heavy loads such as compiling ffmpeg.wasm were faster with the no-hash-maps feature enabled.

Another explanation for the performance difference could be that in std mode the "slowish" SIP hasher is used, so maybe in no_std mode using ahash the performance difference may be less drastic.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants