
Optimize bind_least_significant_variable #319

Merged · 5 commits · Sep 18, 2024

Conversation

plafer
Contributor

@plafer plafer commented Sep 17, 2024

This PR removes the slice bounds checks in MultiLinearPoly::bind_least_significant_variable.

In the serial case, we see a 7-9% improvement in the bind-variable benchmark, and a 2.5-3.5% improvement in sum-check performance. Parallel performance doesn't change, presumably because the algorithm is bounded by threading overhead rather than by the time each thread spends looping over its chunk.

The LogUp-GKR benchmark also didn't change, since bind_least_significant_variable was not a bottleneck there. However, I see this PR as a proof of concept, and will apply similar bounds-check removals to any hot loop I find in a subsequent PR.

Below are the serial bind-variable and sum-check benchmark results:

bind variable //262144  time:   [219.29 µs 225.73 µs 231.25 µs]
                        change: [-9.5833% -6.9170% -3.6918%] (p = 0.00 < 0.05)
                        Performance has improved.
bind variable //1048576 time:   [877.21 µs 907.05 µs 923.14 µs]
                        change: [-12.616% -8.9228% -4.7734%] (p = 0.00 < 0.05)
                        Performance has improved.

Sum-check prover high degree//18
                        time:   [26.296 ms 26.391 ms 26.504 ms]
                        change: [-3.3216% -2.7946% -2.2863%] (p = 0.00 < 0.05)
                        Performance has improved.
Sum-check prover high degree//20
                        time:   [100.89 ms 101.80 ms 102.65 ms]
                        change: [-3.6253% -2.6530% -1.6493%] (p = 0.00 < 0.05)
                        Performance has improved.

Sum-check prover plain//18
                        time:   [8.1104 ms 8.1457 ms 8.2036 ms]
                        change: [-4.6356% -3.6520% -2.6455%] (p = 0.00 < 0.05)
                        Performance has improved.
Sum-check prover plain//20
                        time:   [32.213 ms 32.604 ms 32.995 ms]
                        change: [-3.9200% -2.2554% -0.5909%] (p = 0.03 < 0.05)
                        Change within noise threshold.

Contributor

@Al-Kindi-0 Al-Kindi-0 left a comment

Looks great!
What would the take home message be from this optimization?

Collaborator

@irakliyk irakliyk left a comment

Very nice! Thank you!

@plafer
Contributor Author

plafer commented Sep 17, 2024

What would the take home message be from this optimization?

Rust inserts a bounds check on every slice indexing operation for which it cannot prove that the index will be in bounds. For example, this will not result in any bounds checks being inserted

for i in 0..v.len() {
  v[i] = 3;
}

since i clearly ranges from 0 to v.len() - 1. While Rust can sometimes figure out on its own that an indexing expression will never be out of bounds (e.g. v[i - 1] in a loop from 1 to v.len()), it was not able to figure out that our self.evaluations[i << 1] and self.evaluations[(i << 1) + 1] would never be out of bounds, so it inserted bounds checks.
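To make this concrete, here is a minimal, self-contained sketch of the pattern (not the actual winterfell code; it uses plain u64 instead of field elements, and the function names bind_indexed/bind_iter are made up for illustration). The first version indexes with i << 1, which the compiler cannot prove in-bounds, so it emits a bounds check per access; the second expresses the same computation via chunks_exact(2), where each chunk is provably of length 2, so no checks are needed.

```rust
// Hypothetical sketch: binding the least significant variable of a
// multilinear polynomial over u64 values (real code would use field
// elements with modular arithmetic).

// Raw indexing: `evals[i << 1]` and `evals[(i << 1) + 1]` each get a
// bounds check, because the compiler can't relate `i << 1` to `evals.len()`.
fn bind_indexed(evals: &[u64], r: u64) -> Vec<u64> {
    let mut out = vec![0u64; evals.len() / 2];
    for i in 0..out.len() {
        out[i] = evals[i << 1] + r * (evals[(i << 1) + 1] - evals[i << 1]);
    }
    out
}

// Iterator version: `chunks_exact(2)` guarantees each `pair` has length 2,
// so `pair[0]` and `pair[1]` need no runtime checks.
fn bind_iter(evals: &[u64], r: u64) -> Vec<u64> {
    evals
        .chunks_exact(2)
        .map(|pair| pair[0] + r * (pair[1] - pair[0]))
        .collect()
}

fn main() {
    // evaluations of a 2-variable multilinear poly; bind with r = 3
    let evals = [1u64, 5, 2, 8];
    assert_eq!(bind_indexed(&evals, 3), bind_iter(&evals, 3));
    println!("{:?}", bind_iter(&evals, 3)); // [13, 20]
}
```

The iterator form is one common way to sidestep the checks; unsafe get_unchecked is the other, at the cost of a safety obligation.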

A good tool for checking this is cargo-show-asm, which lets you inspect the generated assembly of a function you specify. I'd recommend the bounds-check-cookbook for a good introduction to the subject. Otherwise, you can also just run a relevant benchmark and see if it improved.

Bounds checks are bad for 2 reasons:

  • they insert a useless branch into the code
  • they can prevent auto-vectorization (i.e. automatic generation of SIMD instructions by the compiler)

In our case, both versions of the code vectorized the loads of self.evaluations[i << 1] and self.evaluations[(i << 1) + 1]; the only difference between the two generated assemblies was the removal of 2 branches (due to bounds checks) in the optimized version. Since this is a hot loop, I suppose the branches ended up costing a nontrivial amount. However, I'm hoping to find cases elsewhere where inserted bounds checks also prevent the use of vectorized loads/stores, in which case we should see an even bigger performance boost.

Note that since Goldilocks works with "big" u64 values, the vectorization opportunities are typically limited to bundling two loads/stores together (ldp and stp on aarch64), because the vector registers on e.g. aarch64 are 128 bits wide. But this is basically the basis of the argument for using 32 bit-based fields :).
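The lane arithmetic behind that last point can be spelled out in a trivial sketch (illustrative only; the constant and function names are made up, and this computes register lanes, it is not SIMD code itself):

```rust
// How many field elements fit in one 128-bit vector register (e.g. a
// NEON register on aarch64), as a function of the element width.
const VECTOR_REGISTER_BITS: usize = 128;

fn lanes(element_bits: usize) -> usize {
    VECTOR_REGISTER_BITS / element_bits
}

fn main() {
    // Goldilocks-style u64 elements: only pairs (ldp/stp) fit.
    assert_eq!(lanes(64), 2);
    // A 32-bit field doubles the number of elements per register.
    assert_eq!(lanes(32), 4);
}
```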

@irakliyk irakliyk merged commit 8b95c8a into facebook:logup-gkr Sep 18, 2024
8 checks passed
@Al-Kindi-0
Contributor

Makes sense, thank you for the great explanation!
