
Optimize bind_least_significant_variable #319

Merged · 5 commits · Sep 18, 2024

Conversation

plafer
Contributor

@plafer plafer commented Sep 17, 2024

This PR removes the slice bounds checks in MultiLinearPoly::bind_least_significant_variable.

In the serial case, we see a 7-9% improvement in the bind-variable benchmark, and a 2.5-3.5% improvement in sum-check performance. Parallel performance doesn't change, presumably because the algorithm is bounded by threading overhead rather than by the time each thread spends looping over its chunk.

The LogUp-GKR benchmark also didn't change, since bind_least_significant_variable was not a bottleneck there. However, I see this PR as a proof of concept, and will apply similar bounds-check removals to any hot loop I find in a subsequent PR.

Below are the serial bind-variable and sum-check benchmark results:

bind variable //262144  time:   [219.29 µs 225.73 µs 231.25 µs]
                        change: [-9.5833% -6.9170% -3.6918%] (p = 0.00 < 0.05)
                        Performance has improved.
bind variable //1048576 time:   [877.21 µs 907.05 µs 923.14 µs]
                        change: [-12.616% -8.9228% -4.7734%] (p = 0.00 < 0.05)
                        Performance has improved.

Sum-check prover high degree//18
                        time:   [26.296 ms 26.391 ms 26.504 ms]
                        change: [-3.3216% -2.7946% -2.2863%] (p = 0.00 < 0.05)
                        Performance has improved.
Sum-check prover high degree//20
                        time:   [100.89 ms 101.80 ms 102.65 ms]
                        change: [-3.6253% -2.6530% -1.6493%] (p = 0.00 < 0.05)
                        Performance has improved.

Sum-check prover plain//18
                        time:   [8.1104 ms 8.1457 ms 8.2036 ms]
                        change: [-4.6356% -3.6520% -2.6455%] (p = 0.00 < 0.05)
                        Performance has improved.
Sum-check prover plain//20
                        time:   [32.213 ms 32.604 ms 32.995 ms]
                        change: [-3.9200% -2.2554% -0.5909%] (p = 0.03 < 0.05)
                        Change within noise threshold.

Contributor

@Al-Kindi-0 Al-Kindi-0 left a comment

Looks great!
What would the take home message be from this optimization?

Collaborator

@irakliyk irakliyk left a comment

Very nice! Thank you!

@plafer
Contributor Author

plafer commented Sep 17, 2024

What would the take home message be from this optimization?

Rust inserts a bounds check on every slice indexing operation for which it cannot prove that the index will be in bounds. For example, this will not result in any bounds checks being inserted

for i in 0..v.len() {
  v[i] = 3;
}

since i clearly ranges from 0 to v.len() - 1. While Rust can sometimes figure out on its own that an indexing expression will never be out of bounds (e.g. v[i - 1] in a loop from 1 to v.len()), it was not able to figure out that our self.evaluations[i << 1] and self.evaluations[(i << 1) + 1] would never be out of bounds, so it inserted bounds checks.
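To make this concrete, here is a minimal, self-contained sketch of the pattern (not the actual winterfell code; it uses plain u64 instead of field elements, and the function names bind_indexed/bind_iter are made up for illustration). The first version indexes with i << 1, which the compiler cannot prove in-bounds, so it emits a bounds check per access; the second expresses the same computation via chunks_exact(2), where each chunk is provably of length 2, so no checks are needed.

```rust
// Hypothetical sketch: binding the least significant variable of a
// multilinear polynomial over u64 values (real code would use field
// elements with modular arithmetic).

// Raw indexing: `evals[i << 1]` and `evals[(i << 1) + 1]` each get a
// bounds check, because the compiler can't relate `i << 1` to `evals.len()`.
fn bind_indexed(evals: &[u64], r: u64) -> Vec<u64> {
    let mut out = vec![0u64; evals.len() / 2];
    for i in 0..out.len() {
        out[i] = evals[i << 1] + r * (evals[(i << 1) + 1] - evals[i << 1]);
    }
    out
}

// Iterator version: `chunks_exact(2)` guarantees each `pair` has length 2,
// so `pair[0]` and `pair[1]` need no runtime checks.
fn bind_iter(evals: &[u64], r: u64) -> Vec<u64> {
    evals
        .chunks_exact(2)
        .map(|pair| pair[0] + r * (pair[1] - pair[0]))
        .collect()
}

fn main() {
    // evaluations of a 2-variable multilinear poly; bind with r = 3
    let evals = [1u64, 5, 2, 8];
    assert_eq!(bind_indexed(&evals, 3), bind_iter(&evals, 3));
    println!("{:?}", bind_iter(&evals, 3)); // [13, 20]
}
```

The iterator form is one common way to sidestep the checks; unsafe get_unchecked is the other, at the cost of a safety obligation.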

A good tool for checking this is cargo-show-asm, which lets you inspect the generated assembly of a function you specify. I'd recommend the bounds-check-cookbook for a good introduction to the subject. Otherwise, you can also just run a relevant benchmark and see if it improved.

Bounds checks are bad for 2 reasons:

  • they insert a useless branch into the code
  • they can prevent auto-vectorization (i.e. automatic generation of SIMD instructions by the compiler)

In our case, both versions of the code vectorized the loads of self.evaluations[i << 1] and self.evaluations[(i << 1) + 1]; the only difference between the two generated assemblies was the removal of 2 branches (due to bounds checks) in the optimized version. Since this is a hot loop, I suppose the branches ended up costing a nontrivial amount. However, I'm hoping to find cases elsewhere where inserted bounds checks also prevent the use of vectorized loads/stores, in which case we should see an even bigger performance boost.

Note that since Goldilocks works with "big" u64 values, the vectorization opportunities are typically limited to bundling two loads/stores together (ldp and stp on aarch64), because the vector registers on e.g. aarch64 are 128 bits wide. But this is basically the basis of the argument for using 32 bit-based fields :).
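The lane arithmetic behind that last point can be spelled out in a trivial sketch (illustrative only; the constant and function names are made up, and this computes register lanes, it is not SIMD code itself):

```rust
// How many field elements fit in one 128-bit vector register (e.g. a
// NEON register on aarch64), as a function of the element width.
const VECTOR_REGISTER_BITS: usize = 128;

fn lanes(element_bits: usize) -> usize {
    VECTOR_REGISTER_BITS / element_bits
}

fn main() {
    // Goldilocks-style u64 elements: only pairs (ldp/stp) fit.
    assert_eq!(lanes(64), 2);
    // A 32-bit field doubles the number of elements per register.
    assert_eq!(lanes(32), 4);
}
```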

@irakliyk irakliyk merged commit 8b95c8a into facebook:logup-gkr Sep 18, 2024
8 checks passed
@Al-Kindi-0
Contributor

Makes sense, thank you for the great explanation!
