Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[opt] Optimizing Shoup's MulMod in Apple M1/M2 chip. #17

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

fionser
Copy link

@fionser fionser commented Feb 4, 2024

Math behind Shoup's MulMod trick:

  1. z = mul_h64(x * y_prime)
  2. r = x * y - z * p // mod 2^64 implicitly
  3. (Optional) reduce r from [0, 2p) to [0, p)

We found that the the multiplication in Step 2 is faster when doing in u128 than in u64 under Apple M1/M2.
This findings are only for the u64 prime, and is not working for u32 prime.

Here is the benchmarks.
SPEC: MacBook Pro 2022, Apple M2, 16 GB RAM. macOS 14.2.1

Running benches/ntt.rs (target/release/deps/ntt-7052c42fe5c957b6)
fwd-64-4611686018427322369-4096
                        time:   [14.998 µs 15.067 µs 15.166 µs]
                        change: [-27.234% -25.010% -23.685%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 19 outliers among 100 measurements (19.00%)
  5 (5.00%) low severe
  5 (5.00%) high mild
  9 (9.00%) high severe

inv-64-4611686018427322369-4096
                        time:   [14.614 µs 14.729 µs 14.895 µs]
                        change: [-31.587% -30.396% -29.322%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 13 outliers among 100 measurements (13.00%)
  2 (2.00%) high mild
  11 (11.00%) high severe

fwd-64-9223372036853661697-4096
                        time:   [15.615 µs 15.655 µs 15.705 µs]
                        change: [-36.233% -35.386% -34.617%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 15 outliers among 100 measurements (15.00%)
  2 (2.00%) high mild
  13 (13.00%) high severe

inv-64-9223372036853661697-4096
                        time:   [15.353 µs 15.380 µs 15.414 µs]
                        change: [-35.897% -35.730% -35.557%] (p = 0.00 < 0.05)
                        Performance has improved.

@tlepoint
Copy link

@sarah-ek Is this an optimization you would consider for concrete-ntt?

@sarah-quinones
Copy link

sorry for the late response. i think it would be simpler to maintain if you make the aarch64 version of the code universal and replace the old one. since the scalar path is going to be rare enough on x86 (only if avx2 is not available) that it doesn't change much

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants