SABER NEON (aarch64) implementation by Duc Tri Nguyen (CERG, George Mason University) with improvements.
- CPU max freq: 1200 MHz (no overclock)
- ARMv8 ASIMD instructions
- RAM: 1 GB
- Speculative: Yes
- Out-of-order: No
- Branch prediction: Unknown
- Compiler flags:
  - Reference code: gcc -Wall -Wextra -O3 -fno-tree-vectorize -fomit-frame-pointer -mcpu=cortex-a53 -mtune=native -g3
  - NEON GCC: gcc -Wall -Wextra -O3 -fomit-frame-pointer -march=native -mtune=native -g3
  - NEON Clang: clang -march=native -O3 -fomit-frame-pointer -fwrapv -Qunused-arguments
- Compiler version:
  - gcc version 10.1.0, aarch64-unknown-linux-gnu
  - clang version 10.0.0, aarch64-unknown-linux-gnu
- Report result:
  - Unit is clock cycles at 1200 MHz.
| Implementation | Keygen | Encap | Decap |
|---|---|---|---|
| Reference | 455,672 | 626,820 | 786,213 |
| NEON GCC | 170,241 | 198,130 | 215,307 |
| NEON Clang | 160,567 | 183,897 | 199,918 |
| Ref/NEON GCC | 2.7 | 3.2 | 3.7 |
| Ref/NEON Clang | 2.8 | 3.4 | 3.9 |

| Implementation | Keygen | Encap | Decap |
|---|---|---|---|
| Reference | 946,380 | 1,209,930 | 1,453,019 |
| NEON GCC | 306,544 | 354,892 | 386,246 |
| NEON Clang | 279,335 | 322,512 | 350,027 |
| Ref/NEON GCC | 3.1 | 3.4 | 3.8 |
| Ref/NEON Clang | 3.4 | 3.8 | 4.2 |

| Implementation | Keygen | Encap | Decap |
|---|---|---|---|
| Reference | 1,619,635 | 1,971,942 | 2,303,857 |
| NEON GCC | 487,703 | 557,291 | 605,773 |
| NEON Clang | 445,264 | 504,956 | 548,119 |
| Ref/NEON GCC | 3.3 | 3.5 | 3.8 |
| Ref/NEON Clang | 3.6 | 3.9 | 4.2 |
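The Ref/NEON rows are the reference cycle count divided by the NEON cycle count, rounded to one decimal. A minimal Python sketch that reproduces them for the first of the three tables; the dictionary layout and the helper name `speedup` are only for illustration and are not part of this repository:

```python
# Cycle counts copied from the first table (1200 MHz).
cycles = {
    "Reference":  {"Keygen": 455_672, "Encap": 626_820, "Decap": 786_213},
    "NEON GCC":   {"Keygen": 170_241, "Encap": 198_130, "Decap": 215_307},
    "NEON Clang": {"Keygen": 160_567, "Encap": 183_897, "Decap": 199_918},
}

def speedup(baseline, optimized):
    """Return baseline/optimized per operation, rounded to one decimal."""
    return {op: round(baseline[op] / optimized[op], 1) for op in baseline}

for name in ("NEON GCC", "NEON Clang"):
    print(f"Ref/{name}:", speedup(cycles["Reference"], cycles[name]))
```

The same helper can be applied to the other two tables by swapping in their cycle counts.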
Update August 27 2020
By applying the techniques described in paper [1], the NEON implementation improves significantly. The reference code improves mostly because of the newer compiler version.
Compiler flags:
- Reference1: -Wall -Wextra -O3 -fomit-frame-pointer -march=native -g3 -fno-tree-vectorize
- Reference2: -Wall -Wextra -O3 -fomit-frame-pointer -march=native -g3
- Compiler settings:
  - gcc version 10.2.0 (Gentoo 10.2.0 p1)
  - clang version 10.0.1 aarch64-unknown-linux-gnu
- Compiler flags: -Wall -Wextra -O3 -fomit-frame-pointer -mtune=native -mcpu=native -lcrypto -fwrapv -w -g3 -lpapi
- Report result:
  - Unit is clock cycles at 1200 MHz.
| Implementation | Keygen | Encap | Decap |
|---|---|---|---|
| Reference1 | 441,148 | 603,767 | 755,280 |
| Reference2 | 230,080 | 288,346 | 335,560 |
| NEON GCC | 158,233 | 183,854 | 197,265 |
| NEON Clang | 149,218 | 170,212 | 181,233 |

| Implementation | Keygen | Encap | Decap |
|---|---|---|---|
| Reference1 | 908,543 | 1,158,383 | 1,390,112 |
| Reference2 | 440,565 | 534,134 | 610,619 |
| NEON GCC | 266,603 | 306,626 | 327,156 |
| NEON Clang | 246,387 | 282,507 | 301,024 |

| Implementation | Keygen | Encap | Decap |
|---|---|---|---|
| Reference1 | 1,552,956 | 1,888,738 | 2,202,832 |
| Reference2 | 721,688 | 852,713 | 962,702 |
| NEON GCC | 400,885 | 457,888 | 490,390 |
| NEON Clang | 373,043 | 419,959 | 449,171 |
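To put a number on the improvement from this update, the Keygen cycle counts of the first table can be compared before and after. A rough sketch; the helper name `reduction_percent` is made up for illustration, and the Reference versus Reference1 comparison assumes the two builds are comparable even though their flags differ slightly:

```python
def reduction_percent(before, after):
    """Percentage of cycles saved going from 'before' to 'after'."""
    return (before - after) / before * 100

# Keygen, first table: (earlier measurement, August 27 2020 measurement).
keygen = {
    "Reference -> Reference1": (455_672, 441_148),
    "NEON GCC":                (170_241, 158_233),
    "NEON Clang":              (160_567, 149_218),
}

for name, (before, after) in keygen.items():
    print(f"{name}: {reduction_percent(before, after):.2f}% fewer cycles")
```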
Unit is microseconds at 1200 MHz (the cycle counts above divided by the clock frequency).

| Implementation | Keygen | Encap | Decap |
|---|---|---|---|
| Reference1 | 367.6 | 503.1 | 629.4 |
| Reference2 | 191.7 | 240.2 | 279.6 |
| NEON GCC | 131.8 | 153.2 | 164.3 |
| NEON Clang | 124.3 | 141.8 | 151.0 |

| Implementation | Keygen | Encap | Decap |
|---|---|---|---|
| Reference1 | 757.1 | 965.3 | 1158.4 |
| Reference2 | 367.1 | 445.1 | 508.8 |
| NEON GCC | 222.1 | 255.5 | 272.6 |
| NEON Clang | 205.3 | 235.4 | 250.8 |

| Implementation | Keygen | Encap | Decap |
|---|---|---|---|
| Reference1 | 1294.1 | 1573.9 | 1835.6 |
| Reference2 | 601.4 | 710.5 | 802.2 |
| NEON GCC | 334.0 | 381.5 | 408.6 |
| NEON Clang | 310.8 | 349.9 | 374.3 |
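The fractional values in the three tables directly above appear to be the earlier cycle counts divided by the 1200 MHz clock, i.e. run times in microseconds. A small sketch of that conversion, checked against the Reference1 row of the first table; the function name is hypothetical:

```python
def cycles_to_microseconds(cycles, freq_mhz):
    """1 MHz is one cycle per microsecond, so cycles / MHz gives microseconds."""
    return cycles / freq_mhz

# Reference1 cycle counts (Keygen, Encap, Decap) from the first cycle table.
for cycles in (441_148, 603_767, 755_280):
    us = cycles_to_microseconds(cycles, 1200)
    print(f"{cycles:>9,} cycles -> {us:.1f} us")
```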
Benchmark on Raspberry Pi 4:
- CPU max freq: 1500 MHz (no overclock)
- ARMv8 ASIMD instructions
- RAM: 8 GB
- Speculative: Yes
- Out-of-order: Yes
- Branch prediction: Yes

Use the same settings as the Cortex-A53 update of August 27 2020.

Compiler flags:
- Reference1: -Wall -Wextra -O3 -fomit-frame-pointer -march=native -g3 -fno-tree-vectorize
- Reference2: -Wall -Wextra -O3 -fomit-frame-pointer -march=native -g3
- Compiler settings:
  - gcc version 10.2.0 (GCC)
  - clang version 10.0.1 aarch64-unknown-linux-gnu
- Compiler flags: -Wall -Wextra -O3 -fomit-frame-pointer -mtune=native -mcpu=native -lcrypto -fwrapv -w -g3 -lpapi
- Report result:
  - Unit is microseconds at 1500 MHz (no overclock).
| Implementation | Keygen | Encap | Decap |
|---|---|---|---|
| Reference1 | 248.5 | 339.3 | 423.9 |
| Reference2 | 102.7 | 121.6 | 133.8 |
| NEON GCC | 80.2 | 88.3 | 90.7 |
| NEON Clang | 80.5 | 89.9 | 92.1 |

| Implementation | Keygen | Encap | Decap |
|---|---|---|---|
| Reference1 | 512.3 | 654.8 | 785.2 |
| Reference2 | 184.4 | 220.2 | 240.8 |
| NEON GCC | 127.1 | 144.0 | 149.0 |
| NEON Clang | 129.8 | 146.0 | 151.8 |

| Implementation | Keygen | Encap | Decap |
|---|---|---|---|
| Reference1 | 878.3 | 1069.7 | 1247.8 |
| Reference2 | 299.2 | 348.5 | 377.0 |
| NEON GCC | 189.2 | 212.3 | 222.1 |
| NEON Clang | 190.7 | 216.6 | 226.7 |
What happens if the Pi 4 is overclocked to 1900 MHz? Here are the results: a ~25% increase in CPU frequency leads to a ~25% gain in performance.
| Implementation | Keygen | Encap | Decap |
|---|---|---|---|
| Reference1 | 195.8 | 267.1 | 334.4 |
| Reference2 | 80.3 | 95.5 | 106.2 |
| NEON GCC | 63.2 | 68.7 | 70.5 |
| NEON Clang | 64.7 | 72.6 | 74.2 |

| Implementation | Keygen | Encap | Decap |
|---|---|---|---|
| Reference1 | 404.9 | 517.4 | 621.4 |
| Reference2 | 146.2 | 172.3 | 192.2 |
| NEON GCC | 100.0 | 113.0 | 116.8 |
| NEON Clang | 102.0 | 115.8 | 120.1 |

| Implementation | Keygen | Encap | Decap |
|---|---|---|---|
| Reference1 | 692.1 | 842.8 | 984.0 |
| Reference2 | 232.9 | 271.3 | 301.7 |
| NEON GCC | 150.5 | 169.3 | 176.8 |
| NEON Clang | 149.9 | 170.2 | 178.2 |
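The overclocking claim can be checked by dividing a stock timing by the matching overclocked timing and comparing the result with the clock ratio. A quick sketch using the Keygen values of the first Pi 4 table in each group; it assumes the stock measurements were taken at the 1500 MHz listed above, and the helper name `gain` is illustrative only:

```python
# Keygen timings from the first Pi 4 table at each clock setting.
stock = {"Reference1": 248.5, "NEON GCC": 80.2}        # assumed 1500 MHz
overclocked = {"Reference1": 195.8, "NEON GCC": 63.2}  # 1900 MHz

def gain(before, after):
    """Speedup factor: how many times faster the 'after' run is."""
    return before / after

freq_ratio = 1900 / 1500
for name in stock:
    print(f"{name}: {gain(stock[name], overclocked[name]):.3f}x "
          f"(clock ratio {freq_ratio:.3f}x)")
```

If the workload is compute-bound, the two factors should be close, which is what the tables suggest.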
I’m actually surprised that clang performs worse than GCC on an out-of-order CPU, e.g. Cortex-A72. It’s useful to know that clang can schedule instructions better than GCC on an in-order CPU, e.g. Cortex-A53.
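The size of that difference can be estimated from the NEON Keygen rows. A short sketch, assuming the first table of the August 27 2020 Cortex-A53 results and the first Raspberry Pi 4 (Cortex-A72) table at stock clock; the function name is hypothetical:

```python
def clang_vs_gcc_percent(gcc, clang):
    """Positive: the Clang build is slower than GCC; negative: it is faster."""
    return (clang - gcc) / gcc * 100

# NEON Keygen, first table of each platform.
a53 = clang_vs_gcc_percent(gcc=158_233, clang=149_218)  # cycle counts
a72 = clang_vs_gcc_percent(gcc=80.2, clang=80.5)        # table values, stock clock
print(f"Cortex-A53: {a53:+.1f}%  Cortex-A72: {a72:+.1f}%")
```

Because the function returns a relative difference, it does not matter that the two platforms are reported in different units.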
Feel free to create a pull request.
Is SABER faster than NTRU? Want a comparison?
You can find my NEON implementation of NTRU in another repo: https://github.com/cothan/ntru