[opt] Optimizing Shoup's MulMod in Apple M1/M2 chip. #17

fionser · 2024-02-04T15:10:03Z

Math behind Shoup's MulMod trick:

1. z = mul_h64(x * y_prime) // high 64 bits of the 128-bit product, with y_prime = floor(y * 2^64 / p)
2. r = x * y - z * p // mod 2^64 implicitly
3. (Optional) reduce r from [0, 2p) to [0, p)
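
As a minimal sketch of the three steps above (not the code in this PR), the following Rust functions assume `p < 2^63` and `y < p`, and that `y_prime` is precomputed as `floor(y * 2^64 / p)`. The names `shoup_precompute` and `mul_mod_shoup` are hypothetical and chosen only for illustration.

```rust
/// Precompute y' = floor(y * 2^64 / p) for a fixed multiplicand y < p.
/// (Hypothetical helper; the name is not taken from concrete-ntt.)
fn shoup_precompute(y: u64, p: u64) -> u64 {
    debug_assert!(y < p);
    // y < p guarantees the quotient fits in 64 bits.
    (((y as u128) << 64) / (p as u128)) as u64
}

/// Compute x * y mod p using the precomputed y'.
/// Assumes p < 2^63 so that the intermediate r in [0, 2p) fits in a u64.
fn mul_mod_shoup(x: u64, y: u64, y_prime: u64, p: u64) -> u64 {
    // Step 1: z = high 64 bits of x * y'.
    let z = ((x as u128 * y_prime as u128) >> 64) as u64;
    // Step 2: r = x * y - z * p, computed mod 2^64 with wrapping u64 arithmetic.
    let r = x.wrapping_mul(y).wrapping_sub(z.wrapping_mul(p));
    // Step 3: optional final reduction from [0, 2p) to [0, p).
    if r >= p { r - p } else { r }
}
```

Because the true value of `x * y - z * p` lies in `[0, 2p)`, the low 64 bits alone are enough to recover it, which is why Step 2 can be done with wrapping multiplications.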

We found that the multiplication in Step 2 is faster when done in u128 than in u64 on Apple M1/M2.
This finding applies only to u64 primes; it does not carry over to u32 primes.
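
To make the comparison concrete, here is a hedged sketch of what the two forms of Step 2 might look like. It is one reading of the description above, not the actual diff. The function names `mul_mod_shoup_u64` and `mul_mod_shoup_u128` are hypothetical, and both assume `p < 2^63`, `y < p`, and `y_prime = floor(y * 2^64 / p)`.

```rust
/// Step 2 done with wrapping u64 multiplications (low 64 bits only).
fn mul_mod_shoup_u64(x: u64, y: u64, y_prime: u64, p: u64) -> u64 {
    let z = ((x as u128 * y_prime as u128) >> 64) as u64;
    let r = x.wrapping_mul(y).wrapping_sub(z.wrapping_mul(p));
    if r >= p { r - p } else { r }
}

/// Step 2 done with full u128 products, truncated to u64 afterwards.
/// The result is identical because the exact difference lies in [0, 2p).
fn mul_mod_shoup_u128(x: u64, y: u64, y_prime: u64, p: u64) -> u64 {
    let z = ((x as u128 * y_prime as u128) >> 64) as u64;
    let r = (x as u128 * y as u128).wrapping_sub(z as u128 * p as u128) as u64;
    if r >= p { r - p } else { r }
}
```

Both variants return the same value for valid inputs. Any speed difference depends on the code the compiler generates for the target, so it should be measured rather than assumed.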

Here are the benchmarks.
SPEC: MacBook Pro 2022, Apple M2, 16 GB RAM. macOS 14.2.1

Running benches/ntt.rs (target/release/deps/ntt-7052c42fe5c957b6)
fwd-64-4611686018427322369-4096
                        time:   [14.998 µs 15.067 µs 15.166 µs]
                        change: [-27.234% -25.010% -23.685%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 19 outliers among 100 measurements (19.00%)
  5 (5.00%) low severe
  5 (5.00%) high mild
  9 (9.00%) high severe

inv-64-4611686018427322369-4096
                        time:   [14.614 µs 14.729 µs 14.895 µs]
                        change: [-31.587% -30.396% -29.322%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 13 outliers among 100 measurements (13.00%)
  2 (2.00%) high mild
  11 (11.00%) high severe

fwd-64-9223372036853661697-4096
                        time:   [15.615 µs 15.655 µs 15.705 µs]
                        change: [-36.233% -35.386% -34.617%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 15 outliers among 100 measurements (15.00%)
  2 (2.00%) high mild
  13 (13.00%) high severe

inv-64-9223372036853661697-4096
                        time:   [15.353 µs 15.380 µs 15.414 µs]
                        change: [-35.897% -35.730% -35.557%] (p = 0.00 < 0.05)
                        Performance has improved.

tlepoint · 2024-03-26T02:52:48Z

@sarah-ek Is this an optimization you would consider for concrete-ntt?

sarah-quinones · 2024-07-22T08:12:29Z

Sorry for the late response. I think it would be simpler to maintain if you make the aarch64 version of the code universal and replace the old one, since the scalar path is going to be rare enough on x86 (only used if AVX2 is not available) that it doesn't change much.

[opt] Optimizing Shoup's MulMod in Apple M1/M2 chip.

344a6b9

cla-bot bot added the cla-signed label Feb 4, 2024

fionser mentioned this pull request Feb 4, 2024

Use concrete-ntt for NTT operations tlepoint/fhe.rs#243

Merged
