Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: implement RPO hash using AVX2 instructions #234

Merged
merged 1 commit into from
Jan 4, 2024

Conversation

gswirski
Copy link
Contributor

@gswirski gswirski commented Dec 8, 2023

It improves performance of the RPO hash function by leveraging AVX2 instructions on compatible x86_64 hardware (code based on plonky2: https://github.com/0xPolygonZero/plonky2/blob/main/plonky2/src/hash/arch/x86_64/poseidon_goldilocks_avx2_bmi2.rs). On AWS' Zen4 instance (c7a), I'm measuring 40% improvement against the baseline commit c86bdc6.

To leverage AVX2 implementation, code needs to be compiled with RUSTFLAGS="-C target-feature=+avx2". Despite availability of AVX512 with larger register sizes, I wasn't able to leverage it for any additional speedups.

Below RUSTFLAGS="-C target-feature=+avx2" cargo bench improvement against the baseline cargo bench.

photo_2023-12-08 20 15 16

Checklist before requesting a review

  • Repo forked and branch created from next according to naming convention.
  • Commit messages and codestyle follow conventions.
  • Relevant issues are linked in the PR description.
  • Tests added for new functionality.
  • Documentation/comments updated according to changes.

@bobbinth
Copy link
Contributor

Thank you! Great results! A couple of preliminary comments/questions:

  • Could you apply these changes to the next branch? The structure of the code there has changed a bit - but I think it shouldn't cause any issues for this PR.
  • Would it make sense to add avx512 feature (similar to how we did it with sve) so that we have the option to disable the optimizations on platforms where with AVX512 support?
  • I'm curious, how does this approach compare to the one used by Polygon Hermez here?

@gswirski
Copy link
Contributor Author

  1. Sure thing! I’ll rebase onto next.

  2. I defer to you, but my recommendation would be to get rid of these features (including sve). We already have a way of controlling optimizations:

    • enable: RUSTFLAGS="-C target-feature=+avx2”
    • disable: RUSTFLAGS="-C target-feature=-sve”
    • get the best version possible:RUSTFLAGS=“-C target-cpu=native”

    Adding more flags can cause people to miss these important optimizations and even lead to bit rot (e.g. we accidentally stop executing tests for optimized versions).

    As a side note, this PR only uses avx2 which has much better compatibility with chips on the market (compared to avx512).

  3. I have not seen Hermez implementation before. They do actually use AVX512 and its larger 512-bit registers - I’ll benchmark their code to see if they do better than this PR.

@bobbinth
Copy link
Contributor

Thank you!

Re point 2: I agree with you - let's not introduce the extra feature (and I'll remove the sve feature later on in a different PR).

@gswirski
Copy link
Contributor Author

I rebased this PR onto next, still need a few days to dive into Hermez.

Copy link
Contributor

@bobbinth bobbinth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! Thank you! I left a couple of comments inline - mostly about code organization.

still need a few days to dive into Hermez.

Yeah - curious what you think. My understanding is that they were able to get something like 3x - 4x speed-up for Poseidon. Though, they context for hashing is somewhat different, I believe.

src/hash/rescue/mod.rs Outdated Show resolved Hide resolved
src/arch/x86_64_avx2/mod.rs Outdated Show resolved Hide resolved
@gswirski
Copy link
Contributor Author

I looked into Hermez and their AVX2 implementation looks almost identical to the implementation in this PR. Their drastic improvement compared to scalar is most likely due to a worse scalar/baseline implementation, but I didn't go as far as reimplementing Poseidon using our Rust primitives or reimplementing RPO using their C++ primitives.

Hermez does score a ~20% win by using AVX512 but it is because Poseidon can make better use of huge registers. Their AVX512 interleaves 3 multiplications of 8-element u64 vectors. Since we only operate on 12 elements, throughput benefits are lost to higher latency of AVX512 operations.

Copy link
Contributor

@bobbinth bobbinth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! And thank you fro the explanation! I left one minor comment inline. After it is addressed, we can merge.

Comment on lines 47 to 48
#[cfg(target_feature = "avx2")]
pub mod x86_64_avx2;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: does this need to be public?

Copy link

sonarcloud bot commented Jan 4, 2024

Quality Gate Passed Quality Gate passed

Kudos, no new issues were introduced!

0 New issues
0 Security Hotspots
No data about Coverage
No data about Duplication

See analysis details on SonarCloud

Copy link
Contributor

@bobbinth bobbinth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All looks good! Thank you!

@bobbinth bobbinth merged commit 862ccf5 into 0xPolygonMiden:next Jan 4, 2024
10 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants