
Use concrete-ntt for NTT operations #243

Merged
merged 2 commits into main from tl/concrete on Feb 7, 2024

Conversation

tlepoint (Owner) commented Feb 4, 2024

We allow the use of concrete-ntt in the fhe-math crate via the feature flags:

  • concrete-ntt
  • concrete-ntt-nightly

Benchmarks to be added to this PR.

codecov bot commented Feb 4, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (7be5ece) 93.71% compared to head (6f3d47e) 93.71%.

Additional details and impacted files
@@           Coverage Diff           @@
##             main     #243   +/-   ##
=======================================
  Coverage   93.71%   93.71%           
=======================================
  Files          42       42           
  Lines        8194     8194           
=======================================
  Hits         7679     7679           
  Misses        515      515           


tlepoint (Owner, Author) commented Feb 4, 2024

Apple M2 on battery:

$ cargo criterion ntt --features concrete-ntt
   Compiling fhe-math v0.1.0-beta.8 (/Users/tancrede/git/fhe.rs/crates/fhe-math)
   Compiling fhe v0.1.0-beta.8 (/Users/tancrede/git/fhe.rs/crates/fhe)
    Finished bench [optimized] target(s) in 8.12s

running 0 tests

test result: ok. 0 passed; 0 failed; 0 ignored; 0 measured; 4 filtered out; finished in 0.00s

ntt/forward/1024/62     time:   [4.2295 µs 4.2310 µs 4.2335 µs]                                
                        change: [+31.699% +32.158% +32.656%] (p = 0.00 < 0.05)
                        Performance has regressed.
ntt/forward_vt/1024/62  time:   [4.2292 µs 4.2308 µs 4.2333 µs]                                   
                        change: [+30.739% +31.144% +31.576%] (p = 0.00 < 0.05)
                        Performance has regressed.
ntt/backward/1024/62    time:   [4.9529 µs 4.9564 µs 4.9596 µs]                                 
                        change: [+42.015% +42.382% +42.770%] (p = 0.00 < 0.05)
                        Performance has regressed.
ntt/backward_vt/1024/62 time:   [4.9462 µs 4.9509 µs 4.9564 µs]                                    
                        change: [+41.606% +41.982% +42.352%] (p = 0.00 < 0.05)
                        Performance has regressed.
ntt/forward/1024/16     time:   [4.2616 µs 4.2904 µs 4.3288 µs]                                
                        change: [+33.286% +34.570% +36.245%] (p = 0.00 < 0.05)
                        Performance has regressed.
ntt/forward_vt/1024/16  time:   [4.4043 µs 4.5029 µs 4.6164 µs]                                   
                        change: [+35.090% +38.045% +43.124%] (p = 0.00 < 0.05)
                        Performance has regressed.
ntt/backward/1024/16    time:   [4.9739 µs 4.9961 µs 5.0263 µs]                                 
                        change: [+43.092% +43.963% +44.946%] (p = 0.00 < 0.05)
                        Performance has regressed.
ntt/backward_vt/1024/16 time:   [4.9532 µs 4.9581 µs 4.9631 µs]                                    
                        change: [+41.480% +42.032% +42.502%] (p = 0.00 < 0.05)
                        Performance has regressed.
ntt/forward/4096/62     time:   [19.723 µs 19.757 µs 19.802 µs]                                
                        change: [+35.091% +36.112% +37.347%] (p = 0.00 < 0.05)
                        Performance has regressed.
ntt/forward_vt/4096/62  time:   [19.873 µs 19.990 µs 20.126 µs]                                   
                        change: [+34.851% +35.493% +36.214%] (p = 0.00 < 0.05)
                        Performance has regressed.
ntt/backward/4096/62    time:   [22.864 µs 22.901 µs 22.951 µs]                                 
                        change: [+44.177% +44.813% +45.415%] (p = 0.00 < 0.05)
                        Performance has regressed.
ntt/backward_vt/4096/62 time:   [22.846 µs 22.856 µs 22.872 µs]                                    
                        change: [+44.354% +44.766% +45.210%] (p = 0.00 < 0.05)
                        Performance has regressed.
ntt/forward/4096/16     time:   [19.775 µs 19.871 µs 19.986 µs]                                
                        change: [+35.310% +36.557% +38.172%] (p = 0.00 < 0.05)
                        Performance has regressed.
ntt/forward_vt/4096/16  time:   [19.678 µs 19.689 µs 19.702 µs]                                   
                        change: [+33.242% +33.690% +34.139%] (p = 0.00 < 0.05)
                        Performance has regressed.
ntt/backward/4096/16    time:   [22.840 µs 22.849 µs 22.861 µs]                                 
                        change: [+43.660% +44.123% +44.560%] (p = 0.00 < 0.05)
                        Performance has regressed.
ntt/backward_vt/4096/16 time:   [22.892 µs 22.960 µs 23.032 µs]                                    
                        change: [+44.630% +45.081% +45.544%] (p = 0.00 < 0.05)
                        Performance has regressed.

fionser (Collaborator) commented Feb 4, 2024

Actually, I want to modify concrete-ntt a little. Specifically, the inverse NTT operation: the multiplication by n^{-1} can be merged into the last layer of the iNTT (see the sketch below).
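
For reference, here is a rough sketch of that trick with hypothetical helper names (this is not concrete-ntt's or fhe.rs's actual code): in the last Gentleman-Sande layer of the inverse NTT, the separate scaling pass by n^{-1} is folded into the butterfly by pre-multiplying the last-layer twiddle by n^{-1}.

// Regular last layer:         x' = x + y             y' = (x - y) * w
// Folded normalization:       x' = (x + y) * n_inv   y' = (x - y) * (w * n_inv)
// where n_inv = n^{-1} mod p and w_times_n_inv = w * n_inv mod p are precomputed.
// Inputs x, y are assumed to be in [0, p); reductions use u128 for clarity, not speed.
fn last_layer_inv_butterfly(p: u64, x: u64, y: u64, n_inv: u64, w_times_n_inv: u64) -> (u64, u64) {
    let add = ((x as u128 + y as u128) % p as u128) as u64;
    let sub = ((x as u128 + p as u128 - y as u128) % p as u128) as u64;
    (
        ((add as u128 * n_inv as u128) % p as u128) as u64,
        ((sub as u128 * w_times_n_inv as u128) % p as u128) as u64,
    )
}

This replaces the post-iNTT pass of n multiplications by n^{-1} with n/2 extra multiplications in the final layer, since the other n/2 multiplications reuse the pre-scaled twiddle.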

tlepoint (Owner, Author) commented Feb 4, 2024

Intel(R) Xeon(R) Platinum 8275CL CPU @ 3.00GHz:

$ cargo criterion ntt --features concrete-ntt
   Compiling fhe-math v0.1.0-beta.8 (/data/dev/fhe.rs/crates/fhe-math)
   Compiling fhe v0.1.0-beta.8 (/data/dev/fhe.rs/crates/fhe)
    Finished bench [optimized] target(s) in 4.60s
Gnuplot not found, using plotters backend
Gnuplot not found, using plotters backend
Gnuplot not found, using plotters backend

running 0 tests

test result: ok. 0 passed; 0 failed; 0 ignored; 0 measured; 4 filtered out; finished in 0.00s

Gnuplot not found, using plotters backend
ntt/forward/1024/62     time:   [7.9211 µs 7.9222 µs 7.9231 µs]                                
                        change: [-52.194% -52.185% -52.178%] (p = 0.00 < 0.05)
                        Performance has improved.
ntt/forward_vt/1024/62  time:   [7.8962 µs 7.9078 µs 7.9179 µs]                                   
                        change: [-48.513% -48.461% -48.413%] (p = 0.00 < 0.05)
                        Performance has improved.
ntt/backward/1024/62    time:   [8.9256 µs 8.9387 µs 8.9510 µs]                                 
                        change: [-22.332% -22.265% -22.208%] (p = 0.00 < 0.05)
                        Performance has improved.
ntt/backward_vt/1024/62 time:   [8.8733 µs 8.8781 µs 8.8836 µs]                                    
                        change: [-18.803% -18.765% -18.725%] (p = 0.00 < 0.05)
                        Performance has improved.
ntt/forward/1024/16     time:   [7.4251 µs 7.5024 µs 7.6175 µs]                                
                        change: [-54.565% -54.342% -53.987%] (p = 0.00 < 0.05)
                        Performance has improved.
ntt/forward_vt/1024/16  time:   [7.5126 µs 7.5287 µs 7.5420 µs]                                   
                        change: [-49.884% -49.787% -49.698%] (p = 0.00 < 0.05)
                        Performance has improved.
ntt/backward/1024/16    time:   [8.3443 µs 8.3462 µs 8.3478 µs]                                 
                        change: [-27.461% -27.448% -27.436%] (p = 0.00 < 0.05)
                        Performance has improved.
ntt/backward_vt/1024/16 time:   [8.3488 µs 8.3499 µs 8.3512 µs]                                    
                        change: [-23.645% -23.601% -23.574%] (p = 0.00 < 0.05)
                        Performance has improved.
ntt/forward/4096/62     time:   [33.557 µs 33.562 µs 33.566 µs]                                
                        change: [-55.338% -54.399% -53.716%] (p = 0.00 < 0.05)
                        Performance has improved.
ntt/forward_vt/4096/62  time:   [34.956 µs 35.053 µs 35.149 µs]                                   
                        change: [-48.215% -48.066% -47.919%] (p = 0.00 < 0.05)
                        Performance has improved.
ntt/backward/4096/62    time:   [40.084 µs 40.110 µs 40.129 µs]                                 
                        change: [-25.304% -25.241% -25.178%] (p = 0.00 < 0.05)
                        Performance has improved.
ntt/backward_vt/4096/62 time:   [39.784 µs 39.793 µs 39.804 µs]                                    
                        change: [-22.524% -22.509% -22.493%] (p = 0.00 < 0.05)
                        Performance has improved.
ntt/forward/4096/16     time:   [35.116 µs 35.187 µs 35.243 µs]                                
                        change: [-50.887% -50.801% -50.726%] (p = 0.00 < 0.05)
                        Performance has improved.
ntt/forward_vt/4096/16  time:   [33.574 µs 33.584 µs 33.598 µs]                                   
                        change: [-49.150% -49.135% -49.119%] (p = 0.00 < 0.05)
                        Performance has improved.
ntt/backward/4096/16    time:   [38.478 µs 38.485 µs 38.492 µs]                                 
                        change: [-31.876% -29.802% -28.464%] (p = 0.00 < 0.05)
                        Performance has improved.
ntt/backward_vt/4096/16 time:   [38.505 µs 38.513 µs 38.521 µs]                                    
                        change: [-25.007% -24.990% -24.972%] (p = 0.00 < 0.05)
                        Performance has improved.
$ cargo +nightly criterion ntt --features concrete-ntt-nightly

ntt/forward/1024/62     time:   [4.0195 µs 4.0297 µs 4.0412 µs]                                
                        change: [-49.195% -49.092% -48.961%] (p = 0.00 < 0.05)
                        Performance has improved.
ntt/forward_vt/1024/62  time:   [3.8602 µs 3.8623 µs 3.8644 µs]                                   
                        change: [-51.222% -51.168% -51.109%] (p = 0.00 < 0.05)
                        Performance has improved.
ntt/backward/1024/62    time:   [4.6452 µs 4.6559 µs 4.6658 µs]                                 
                        change: [-48.186% -48.072% -47.969%] (p = 0.00 < 0.05)
                        Performance has improved.
ntt/backward_vt/1024/62 time:   [4.4434 µs 4.4449 µs 4.4475 µs]                                    
                        change: [-49.954% -49.869% -49.726%] (p = 0.00 < 0.05)
                        Performance has improved.
ntt/forward/1024/16     time:   [4.0979 µs 4.0988 µs 4.0997 µs]                                
                        change: [-45.527% -45.114% -44.823%] (p = 0.00 < 0.05)
                        Performance has improved.
ntt/forward_vt/1024/16  time:   [3.8746 µs 3.8767 µs 3.8787 µs]                                   
                        change: [-48.519% -48.427% -48.325%] (p = 0.00 < 0.05)
                        Performance has improved.
ntt/backward/1024/16    time:   [4.6727 µs 4.6841 µs 4.6956 µs]                                 
                        change: [-43.972% -43.862% -43.766%] (p = 0.00 < 0.05)
                        Performance has improved.
ntt/backward_vt/1024/16 time:   [4.4763 µs 4.4767 µs 4.4772 µs]                                    
                        change: [-46.397% -46.388% -46.378%] (p = 0.00 < 0.05)
                        Performance has improved.
ntt/forward/4096/62     time:   [21.465 µs 21.498 µs 21.522 µs]                                
                        change: [-35.987% -35.904% -35.851%] (p = 0.00 < 0.05)
                        Performance has improved.
ntt/forward_vt/4096/62  time:   [20.704 µs 20.710 µs 20.717 µs]                                   
                        change: [-41.233% -41.067% -40.901%] (p = 0.00 < 0.05)
                        Performance has improved.
ntt/backward/4096/62    time:   [23.186 µs 23.228 µs 23.253 µs]                                 
                        change: [-42.044% -41.976% -41.908%] (p = 0.00 < 0.05)
                        Performance has improved.
ntt/backward_vt/4096/62 time:   [23.653 µs 23.666 µs 23.679 µs]                                    
                        change: [-40.539% -40.501% -40.463%] (p = 0.00 < 0.05)
                        Performance has improved.
ntt/forward/4096/16     time:   [18.732 µs 18.736 µs 18.741 µs]                                
                        change: [-46.796% -46.713% -46.621%] (p = 0.00 < 0.05)
                        Performance has improved.
ntt/forward_vt/4096/16  time:   [18.746 µs 18.755 µs 18.764 µs]                                   
                        change: [-44.181% -44.154% -44.126%] (p = 0.00 < 0.05)
                        Performance has improved.
ntt/backward/4096/16    time:   [20.750 µs 20.759 µs 20.772 µs]                                 
                        change: [-46.058% -45.980% -45.892%] (p = 0.00 < 0.05)
                        Performance has improved.
ntt/backward_vt/4096/16 time:   [21.054 µs 21.056 µs 21.058 µs]                                    
                        change: [-45.338% -45.325% -45.312%] (p = 0.00 < 0.05)
                        Performance has improved.

tlepoint (Owner, Author) commented Feb 4, 2024

@fionser I was going to ask you to review. I fully support what you propose; it would be great to merge the normalization into concrete-ntt 💯

tlepoint requested a review from fionser on February 4, 2024 at 01:38
fionser (Collaborator) commented Feb 4, 2024

@tlepoint Does the benchmark mean that concrete-ntt is slower?

I see, the nightly concrete-ntt supports AVX512, which should be faster than the non-nightly one.

tlepoint (Owner, Author) commented Feb 4, 2024

It's slower on Apple M1/M2, but it's faster on Intel with AVX2/AVX512.

fionser (Collaborator) commented Feb 4, 2024

It surprised me that the concrete-ntt implementation is 30% slower on Apple M1/M2, given that concrete-ntt is quite similar to the pointer-based implementation in fhe.rs when n <= 1024.

tlepoint (Owner, Author) commented Feb 4, 2024

It surprised me that the concrete-ntt implementation is 30% slower on Apple M1/M2, given that concrete-ntt is quite similar to the pointer-based implementation in fhe.rs when n <= 1024.

This would need to be fully investigated, but a possible reason is that this library uses the "Shoup" representation for the NTT precomputed values and limits moduli to 62 bits at most, which speeds up the base arithmetic operations.
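
For context, the "Shoup" representation mentioned here is the standard precomputation sketched below (a minimal sketch, not fhe.rs's actual code): for a fixed operand b and a modulus p < 2^62, one precomputes b_shoup = floor(b * 2^64 / p), which later lets a * b mod p be computed with a couple of word-sized multiplications, as in the snippets further down.

// Standard Shoup precomputation (sketch): b_shoup = floor(b * 2^64 / p),
// valid for b < p and p < 2^62.
pub fn shoup_precompute(b: u64, p: u64) -> u64 {
    debug_assert!(b < p && p < (1u64 << 62));
    (((b as u128) << 64) / (p as u128)) as u64
}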

fionser (Collaborator) commented Feb 4, 2024

Actually, concrete-ntt also uses Shoup's mulmod trick and Harvey's butterfly (when the prime modulus is less than 62 bits), which is what fhe.rs does.
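
For reference, a rough sketch of Harvey's lazy forward butterfly in its textbook form (not concrete-ntt's exact code): values are kept lazily reduced in [0, 4p), which is why the modulus must stay below 2^62, and the twiddle multiplication uses the Shoup precomputation w_shoup = floor(w * 2^64 / p).

fn harvey_forward_butterfly(p: u64, x: u64, y: u64, w: u64, w_shoup: u64) -> (u64, u64) {
    // Bring x back into [0, 2p) before the additions.
    let x = if x >= 2 * p { x - 2 * p } else { x };
    // t = w * y mod p, lazily in [0, 2p), via a Shoup multiplication.
    let q = ((y as u128 * w_shoup as u128) >> 64) as u64;
    let t = y.wrapping_mul(w).wrapping_sub(q.wrapping_mul(p));
    // Both outputs stay in [0, 4p); a final pass reduces them into [0, p).
    (x + t, x + 2 * p - t)
}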

fionser (Collaborator) commented Feb 4, 2024

I find that my optimization experience does not carry over to Rust on Apple :(

// Shoup's mulmod using two double-width (u128) multiplications (i.e., what fhe.rs does)
pub const fn lazy_mul_shoup(&self, a: u64, b: u64, b_shoup: u64) -> u64 {
    let q = ((a as u128) * (b_shoup as u128)) >> 64;
    ((a as u128) * (b as u128) - q * (self.p as u128)) as u64
}
ntt/forward/4096/62     time:   [14.730 µs 14.774 µs 14.845 µs]
                        change: [-26.838% -26.447% -26.049%] (p = 0.00 < 0.05)
// Shoup's mulmod using only one double-width multiplication plus wrapping u64
// multiplications, which is expected to be faster (i.e., what concrete-ntt does).
// However, this version turns out to be slower on the M2.
pub const fn lazy_mul_shoup(&self, a: u64, b: u64, b_shoup: u64) -> u64 {
    let q = (((a as u128) * (b_shoup as u128)) >> 64) as u64;
    let qp = q.wrapping_mul(self.p);
    a.wrapping_mul(b).wrapping_sub(qp)
}
ntt/forward/4096/62     time:   [20.120 µs 20.150 µs 20.187 µs] 

This also matches the benchmarks of concrete-ntt itself. Thus, I would say the performance regression from fhe.rs to concrete-ntt is due to this mulmod function.

It would be nice if @sarah-ek could give us a hint about why such a thing happens (I am a newbie to Rust).

tlepoint (Owner, Author) commented Feb 4, 2024

I am not sure what happens either (but I can reproduce the same slowdown when using your code). It is supposed to be compiled the same way:

Is concrete-ntt faster if you work with u128 there?

fionser (Collaborator) commented Feb 4, 2024

I am not sure what happens either (but I can reproduce the same slowdown when using your code). It is supposed to be compiled the same way:

Is concrete-ntt faster if you work with u128 there?

Yes, concrete-ntt is faster when doing both multiplications in u128 on Apple M2 (see the sketch below).
Check out this MR
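
For concreteness, a standalone sketch of the variant being described (a hypothetical free function, not the actual change in the MR): both products are carried out as u128 multiplications, as in fhe.rs, instead of one u128 product followed by wrapping u64 arithmetic.

// Assumes b < p, p < 2^62, and b_shoup = floor(b * 2^64 / p); the result is
// lazily reduced in [0, 2p).
pub const fn lazy_mul_shoup_u128(p: u64, a: u64, b: u64, b_shoup: u64) -> u64 {
    let q = ((a as u128) * (b_shoup as u128)) >> 64;
    ((a as u128) * (b as u128) - q * (p as u128)) as u64
}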

tlepoint merged commit 1c3c16d into main Feb 7, 2024
9 checks passed
tlepoint deleted the tl/concrete branch February 7, 2024 03:32