
Use concrete-ntt for NTT operations #243

Merged
merged 2 commits into main from tl/concrete on Feb 7, 2024

Conversation

tlepoint (Owner) commented Feb 4, 2024

We allow the use of concrete-ntt in the fhe-math crate via the feature flags:

  • concrete-ntt
  • concrete-ntt-nightly

Benchmarks to be added to this PR.

codecov bot commented Feb 4, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (7be5ece) 93.71% compared to head (6f3d47e) 93.71%.

Additional details and impacted files
@@           Coverage Diff           @@
##             main     #243   +/-   ##
=======================================
  Coverage   93.71%   93.71%           
=======================================
  Files          42       42           
  Lines        8194     8194           
=======================================
  Hits         7679     7679           
  Misses        515      515           


tlepoint (Owner, Author) commented Feb 4, 2024

Apple M2 on battery:

$ cargo criterion ntt --features concrete-ntt
   Compiling fhe-math v0.1.0-beta.8 (/Users/tancrede/git/fhe.rs/crates/fhe-math)
   Compiling fhe v0.1.0-beta.8 (/Users/tancrede/git/fhe.rs/crates/fhe)
    Finished bench [optimized] target(s) in 8.12s

running 0 tests

test result: ok. 0 passed; 0 failed; 0 ignored; 0 measured; 4 filtered out; finished in 0.00s

ntt/forward/1024/62     time:   [4.2295 µs 4.2310 µs 4.2335 µs]                                
                        change: [+31.699% +32.158% +32.656%] (p = 0.00 < 0.05)
                        Performance has regressed.
ntt/forward_vt/1024/62  time:   [4.2292 µs 4.2308 µs 4.2333 µs]                                   
                        change: [+30.739% +31.144% +31.576%] (p = 0.00 < 0.05)
                        Performance has regressed.
ntt/backward/1024/62    time:   [4.9529 µs 4.9564 µs 4.9596 µs]                                 
                        change: [+42.015% +42.382% +42.770%] (p = 0.00 < 0.05)
                        Performance has regressed.
ntt/backward_vt/1024/62 time:   [4.9462 µs 4.9509 µs 4.9564 µs]                                    
                        change: [+41.606% +41.982% +42.352%] (p = 0.00 < 0.05)
                        Performance has regressed.
ntt/forward/1024/16     time:   [4.2616 µs 4.2904 µs 4.3288 µs]                                
                        change: [+33.286% +34.570% +36.245%] (p = 0.00 < 0.05)
                        Performance has regressed.
ntt/forward_vt/1024/16  time:   [4.4043 µs 4.5029 µs 4.6164 µs]                                   
                        change: [+35.090% +38.045% +43.124%] (p = 0.00 < 0.05)
                        Performance has regressed.
ntt/backward/1024/16    time:   [4.9739 µs 4.9961 µs 5.0263 µs]                                 
                        change: [+43.092% +43.963% +44.946%] (p = 0.00 < 0.05)
                        Performance has regressed.
ntt/backward_vt/1024/16 time:   [4.9532 µs 4.9581 µs 4.9631 µs]                                    
                        change: [+41.480% +42.032% +42.502%] (p = 0.00 < 0.05)
                        Performance has regressed.
ntt/forward/4096/62     time:   [19.723 µs 19.757 µs 19.802 µs]                                
                        change: [+35.091% +36.112% +37.347%] (p = 0.00 < 0.05)
                        Performance has regressed.
ntt/forward_vt/4096/62  time:   [19.873 µs 19.990 µs 20.126 µs]                                   
                        change: [+34.851% +35.493% +36.214%] (p = 0.00 < 0.05)
                        Performance has regressed.
ntt/backward/4096/62    time:   [22.864 µs 22.901 µs 22.951 µs]                                 
                        change: [+44.177% +44.813% +45.415%] (p = 0.00 < 0.05)
                        Performance has regressed.
ntt/backward_vt/4096/62 time:   [22.846 µs 22.856 µs 22.872 µs]                                    
                        change: [+44.354% +44.766% +45.210%] (p = 0.00 < 0.05)
                        Performance has regressed.
ntt/forward/4096/16     time:   [19.775 µs 19.871 µs 19.986 µs]                                
                        change: [+35.310% +36.557% +38.172%] (p = 0.00 < 0.05)
                        Performance has regressed.
ntt/forward_vt/4096/16  time:   [19.678 µs 19.689 µs 19.702 µs]                                   
                        change: [+33.242% +33.690% +34.139%] (p = 0.00 < 0.05)
                        Performance has regressed.
ntt/backward/4096/16    time:   [22.840 µs 22.849 µs 22.861 µs]                                 
                        change: [+43.660% +44.123% +44.560%] (p = 0.00 < 0.05)
                        Performance has regressed.
ntt/backward_vt/4096/16 time:   [22.892 µs 22.960 µs 23.032 µs]                                    
                        change: [+44.630% +45.081% +45.544%] (p = 0.00 < 0.05)
                        Performance has regressed.

fionser (Collaborator) commented Feb 4, 2024

Actually, I want to modify concrete-ntt a little. Specifically, the inverse NTT operation: the multiplication by n^{-1} can be merged into the last layer of the iNTT (see the sketch below).
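
For reference, here is a rough sketch of that trick with hypothetical helper names (this is not concrete-ntt's or fhe.rs's actual code): in the last Gentleman-Sande layer of the inverse NTT, the separate scaling pass by n^{-1} is folded into the butterfly by pre-multiplying the last-layer twiddle by n^{-1}.

// Regular last layer:         x' = x + y             y' = (x - y) * w
// Folded normalization:       x' = (x + y) * n_inv   y' = (x - y) * (w * n_inv)
// where n_inv = n^{-1} mod p and w_times_n_inv = w * n_inv mod p are precomputed.
// Inputs x, y are assumed to be in [0, p); reductions use u128 for clarity, not speed.
fn last_layer_inv_butterfly(p: u64, x: u64, y: u64, n_inv: u64, w_times_n_inv: u64) -> (u64, u64) {
    let add = ((x as u128 + y as u128) % p as u128) as u64;
    let sub = ((x as u128 + p as u128 - y as u128) % p as u128) as u64;
    (
        ((add as u128 * n_inv as u128) % p as u128) as u64,
        ((sub as u128 * w_times_n_inv as u128) % p as u128) as u64,
    )
}

This replaces the post-iNTT pass of n multiplications by n^{-1} with n/2 extra multiplications in the final layer, since the other n/2 multiplications reuse the pre-scaled twiddle.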

tlepoint (Owner, Author) commented Feb 4, 2024

Intel(R) Xeon(R) Platinum 8275CL CPU @ 3.00GHz:

$ cargo criterion ntt --features concrete-ntt
   Compiling fhe-math v0.1.0-beta.8 (/data/dev/fhe.rs/crates/fhe-math)
   Compiling fhe v0.1.0-beta.8 (/data/dev/fhe.rs/crates/fhe)
    Finished bench [optimized] target(s) in 4.60s
Gnuplot not found, using plotters backend
Gnuplot not found, using plotters backend
Gnuplot not found, using plotters backend

running 0 tests

test result: ok. 0 passed; 0 failed; 0 ignored; 0 measured; 4 filtered out; finished in 0.00s

Gnuplot not found, using plotters backend
ntt/forward/1024/62     time:   [7.9211 µs 7.9222 µs 7.9231 µs]                                
                        change: [-52.194% -52.185% -52.178%] (p = 0.00 < 0.05)
                        Performance has improved.
ntt/forward_vt/1024/62  time:   [7.8962 µs 7.9078 µs 7.9179 µs]                                   
                        change: [-48.513% -48.461% -48.413%] (p = 0.00 < 0.05)
                        Performance has improved.
ntt/backward/1024/62    time:   [8.9256 µs 8.9387 µs 8.9510 µs]                                 
                        change: [-22.332% -22.265% -22.208%] (p = 0.00 < 0.05)
                        Performance has improved.
ntt/backward_vt/1024/62 time:   [8.8733 µs 8.8781 µs 8.8836 µs]                                    
                        change: [-18.803% -18.765% -18.725%] (p = 0.00 < 0.05)
                        Performance has improved.
ntt/forward/1024/16     time:   [7.4251 µs 7.5024 µs 7.6175 µs]                                
                        change: [-54.565% -54.342% -53.987%] (p = 0.00 < 0.05)
                        Performance has improved.
ntt/forward_vt/1024/16  time:   [7.5126 µs 7.5287 µs 7.5420 µs]                                   
                        change: [-49.884% -49.787% -49.698%] (p = 0.00 < 0.05)
                        Performance has improved.
ntt/backward/1024/16    time:   [8.3443 µs 8.3462 µs 8.3478 µs]                                 
                        change: [-27.461% -27.448% -27.436%] (p = 0.00 < 0.05)
                        Performance has improved.
ntt/backward_vt/1024/16 time:   [8.3488 µs 8.3499 µs 8.3512 µs]                                    
                        change: [-23.645% -23.601% -23.574%] (p = 0.00 < 0.05)
                        Performance has improved.
ntt/forward/4096/62     time:   [33.557 µs 33.562 µs 33.566 µs]                                
                        change: [-55.338% -54.399% -53.716%] (p = 0.00 < 0.05)
                        Performance has improved.
ntt/forward_vt/4096/62  time:   [34.956 µs 35.053 µs 35.149 µs]                                   
                        change: [-48.215% -48.066% -47.919%] (p = 0.00 < 0.05)
                        Performance has improved.
ntt/backward/4096/62    time:   [40.084 µs 40.110 µs 40.129 µs]                                 
                        change: [-25.304% -25.241% -25.178%] (p = 0.00 < 0.05)
                        Performance has improved.
ntt/backward_vt/4096/62 time:   [39.784 µs 39.793 µs 39.804 µs]                                    
                        change: [-22.524% -22.509% -22.493%] (p = 0.00 < 0.05)
                        Performance has improved.
ntt/forward/4096/16     time:   [35.116 µs 35.187 µs 35.243 µs]                                
                        change: [-50.887% -50.801% -50.726%] (p = 0.00 < 0.05)
                        Performance has improved.
ntt/forward_vt/4096/16  time:   [33.574 µs 33.584 µs 33.598 µs]                                   
                        change: [-49.150% -49.135% -49.119%] (p = 0.00 < 0.05)
                        Performance has improved.
ntt/backward/4096/16    time:   [38.478 µs 38.485 µs 38.492 µs]                                 
                        change: [-31.876% -29.802% -28.464%] (p = 0.00 < 0.05)
                        Performance has improved.
ntt/backward_vt/4096/16 time:   [38.505 µs 38.513 µs 38.521 µs]                                    
                        change: [-25.007% -24.990% -24.972%] (p = 0.00 < 0.05)
                        Performance has improved.
$ cargo +nightly criterion ntt --features concrete-ntt-nightly

ntt/forward/1024/62     time:   [4.0195 µs 4.0297 µs 4.0412 µs]                                
                        change: [-49.195% -49.092% -48.961%] (p = 0.00 < 0.05)
                        Performance has improved.
ntt/forward_vt/1024/62  time:   [3.8602 µs 3.8623 µs 3.8644 µs]                                   
                        change: [-51.222% -51.168% -51.109%] (p = 0.00 < 0.05)
                        Performance has improved.
ntt/backward/1024/62    time:   [4.6452 µs 4.6559 µs 4.6658 µs]                                 
                        change: [-48.186% -48.072% -47.969%] (p = 0.00 < 0.05)
                        Performance has improved.
ntt/backward_vt/1024/62 time:   [4.4434 µs 4.4449 µs 4.4475 µs]                                    
                        change: [-49.954% -49.869% -49.726%] (p = 0.00 < 0.05)
                        Performance has improved.
ntt/forward/1024/16     time:   [4.0979 µs 4.0988 µs 4.0997 µs]                                
                        change: [-45.527% -45.114% -44.823%] (p = 0.00 < 0.05)
                        Performance has improved.
ntt/forward_vt/1024/16  time:   [3.8746 µs 3.8767 µs 3.8787 µs]                                   
                        change: [-48.519% -48.427% -48.325%] (p = 0.00 < 0.05)
                        Performance has improved.
ntt/backward/1024/16    time:   [4.6727 µs 4.6841 µs 4.6956 µs]                                 
                        change: [-43.972% -43.862% -43.766%] (p = 0.00 < 0.05)
                        Performance has improved.
ntt/backward_vt/1024/16 time:   [4.4763 µs 4.4767 µs 4.4772 µs]                                    
                        change: [-46.397% -46.388% -46.378%] (p = 0.00 < 0.05)
                        Performance has improved.
ntt/forward/4096/62     time:   [21.465 µs 21.498 µs 21.522 µs]                                
                        change: [-35.987% -35.904% -35.851%] (p = 0.00 < 0.05)
                        Performance has improved.
ntt/forward_vt/4096/62  time:   [20.704 µs 20.710 µs 20.717 µs]                                   
                        change: [-41.233% -41.067% -40.901%] (p = 0.00 < 0.05)
                        Performance has improved.
ntt/backward/4096/62    time:   [23.186 µs 23.228 µs 23.253 µs]                                 
                        change: [-42.044% -41.976% -41.908%] (p = 0.00 < 0.05)
                        Performance has improved.
ntt/backward_vt/4096/62 time:   [23.653 µs 23.666 µs 23.679 µs]                                    
                        change: [-40.539% -40.501% -40.463%] (p = 0.00 < 0.05)
                        Performance has improved.
ntt/forward/4096/16     time:   [18.732 µs 18.736 µs 18.741 µs]                                
                        change: [-46.796% -46.713% -46.621%] (p = 0.00 < 0.05)
                        Performance has improved.
ntt/forward_vt/4096/16  time:   [18.746 µs 18.755 µs 18.764 µs]                                   
                        change: [-44.181% -44.154% -44.126%] (p = 0.00 < 0.05)
                        Performance has improved.
ntt/backward/4096/16    time:   [20.750 µs 20.759 µs 20.772 µs]                                 
                        change: [-46.058% -45.980% -45.892%] (p = 0.00 < 0.05)
                        Performance has improved.
ntt/backward_vt/4096/16 time:   [21.054 µs 21.056 µs 21.058 µs]                                    
                        change: [-45.338% -45.325% -45.312%] (p = 0.00 < 0.05)
                        Performance has improved.

tlepoint (Owner, Author) commented Feb 4, 2024

@fionser I was going to ask you to review. I fully support what you propose; it would be great to merge the normalization into concrete-ntt 💯

tlepoint requested a review from fionser on February 4, 2024 at 01:38
fionser (Collaborator) commented Feb 4, 2024

@tlepoint Does the benchmark mean that concrete-ntt is slower?

I see, the nightly concrete-ntt supports AVX512, which should be faster than the non-nightly one.

tlepoint (Owner, Author) commented Feb 4, 2024

It's slower on Apple M1/M2, but it's faster on Intel with AVX2/AVX512.

fionser (Collaborator) commented Feb 4, 2024

It surprised me that the concrete-ntt implementation is 30% slower on Apple M1/M2, given that concrete-ntt is quite similar to the pointer-based implementation in fhe.rs when n <= 1024.

tlepoint (Owner, Author) commented Feb 4, 2024

It surprised me that the concrete-ntt implementation is 30% slower on Apple M1/M2, given that concrete-ntt is quite similar to the pointer-based implementation in fhe.rs when n <= 1024.

This would need to be fully investigated, but a possible reason is that this library uses the "Shoup" representation for the NTT precomputed values and limits moduli to 62 bits at most, which speeds up the base arithmetic operations.
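
For context, the "Shoup" representation mentioned here is the standard precomputation sketched below (a minimal sketch, not fhe.rs's actual code): for a fixed operand b and a modulus p < 2^62, one precomputes b_shoup = floor(b * 2^64 / p), which later lets a * b mod p be computed with a couple of word-sized multiplications, as in the snippets further down.

// Standard Shoup precomputation (sketch): b_shoup = floor(b * 2^64 / p),
// valid for b < p and p < 2^62.
pub fn shoup_precompute(b: u64, p: u64) -> u64 {
    debug_assert!(b < p && p < (1u64 << 62));
    (((b as u128) << 64) / (p as u128)) as u64
}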

fionser (Collaborator) commented Feb 4, 2024

Actually, concrete-ntt also uses Shoup's mulmod trick and Harvey's butterfly (when the prime modulus is less than 62 bits), which is what fhe.rs does.
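
For reference, a rough sketch of Harvey's lazy forward butterfly in its textbook form (not concrete-ntt's exact code): values are kept lazily reduced in [0, 4p), which is why the modulus must stay below 2^62, and the twiddle multiplication uses the Shoup precomputation w_shoup = floor(w * 2^64 / p).

fn harvey_forward_butterfly(p: u64, x: u64, y: u64, w: u64, w_shoup: u64) -> (u64, u64) {
    // Bring x back into [0, 2p) before the additions.
    let x = if x >= 2 * p { x - 2 * p } else { x };
    // t = w * y mod p, lazily in [0, 2p), via a Shoup multiplication.
    let q = ((y as u128 * w_shoup as u128) >> 64) as u64;
    let t = y.wrapping_mul(w).wrapping_sub(q.wrapping_mul(p));
    // Both outputs stay in [0, 4p); a final pass reduces them into [0, p).
    (x + t, x + 2 * p - t)
}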

fionser (Collaborator) commented Feb 4, 2024

I find that my optimization experience does not carry over to Rust on Apple :(

// Shoup's mulmod using two double-width (u128) multiplications (i.e., what fhe.rs does)
pub const fn lazy_mul_shoup(&self, a: u64, b: u64, b_shoup: u64) -> u64 {
    let q = ((a as u128) * (b_shoup as u128)) >> 64;
    ((a as u128) * (b as u128) - q * (self.p as u128)) as u64
}
ntt/forward/4096/62     time:   [14.730 µs 14.774 µs 14.845 µs]
                        change: [-26.838% -26.447% -26.049%] (p = 0.00 < 0.05)
// Shoup's mulmod using only one double-width multiplication plus wrapping u64
// multiplications, which is expected to be faster (i.e., what concrete-ntt does).
// However, this version turns out to be slower on the M2.
pub const fn lazy_mul_shoup(&self, a: u64, b: u64, b_shoup: u64) -> u64 {
    let q = (((a as u128) * (b_shoup as u128)) >> 64) as u64;
    let qp = q.wrapping_mul(self.p);
    a.wrapping_mul(b).wrapping_sub(qp)
}
ntt/forward/4096/62     time:   [20.120 µs 20.150 µs 20.187 µs] 

This also matches the benchmarks of concrete-ntt itself. Thus, I would say the performance regression from fhe.rs to concrete-ntt is due to this mulmod function.

It would be nice if @sarah-ek could give us a hint about why such a thing happens (I am a newbie to Rust).

tlepoint (Owner, Author) commented Feb 4, 2024

I am not sure what happens either (but I can reproduce the same slowdown when using your code). It is supposed to be compiled the same way:

Is concrete-ntt faster if you work with u128 there?

fionser (Collaborator) commented Feb 4, 2024

I am not sure what happens either (but I can reproduce the same slowdown when using your code). It is supposed to be compiled the same way:

Is concrete-ntt faster if you work with u128 there?

Yes, concrete-ntt is faster when doing both multiplications in u128 on Apple M2 (see the sketch below).
Check out this MR
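
For concreteness, a standalone sketch of the variant being described (a hypothetical free function, not the actual change in the MR): both products are carried out as u128 multiplications, as in fhe.rs, instead of one u128 product followed by wrapping u64 arithmetic.

// Assumes b < p, p < 2^62, and b_shoup = floor(b * 2^64 / p); the result is
// lazily reduced in [0, 2p).
pub const fn lazy_mul_shoup_u128(p: u64, a: u64, b: u64, b_shoup: u64) -> u64 {
    let q = ((a as u128) * (b_shoup as u128)) >> 64;
    ((a as u128) * (b as u128) - q * (p as u128)) as u64
}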

tlepoint merged commit 1c3c16d into main Feb 7, 2024
9 checks passed
tlepoint deleted the tl/concrete branch February 7, 2024 03:32