Faster base conversion #580
base: develop
Yep, feared that ;-) |
f68379e to 2118eed
Took me a bit, but the test for the largest possible input for MP_64BIT ran successfully.
Input number generated with Pari/GP
Largest possible number for 60-bit limbs would be
This result was not produced with the develop branch (that would have taken weeks) but with some additional fast algorithms: TC 2-5, FFT (FHT), Newton division (no NTT yet). TC only goes up to 5 because the lower cutoff of FFT was so close to TC-5 that I called it good enough for this test. Multi-threading is done with a handful of simple OpenMP directives, so there is some room for improvement. The test ran on an older i7-2600 with 4 cores/8 threads.
Memory use was not measured exactly but was about 6.5 GiB (Pari/GP uses less, but it uses fully filled limbs).
For
For
Ah, I see.
The speed test will be reinstated once cutoff tuning is implemented.
Other limb sizes: There is also a spot in the code in
There are not a lot of differences here; a fixed cutoff at 500 bits (rounded up to the next limb size) would make the most sense for both: as a general cutoff and as the cutoff between the built-in high short product and full multiplication. As said before: it is already a lot of code for a "simple" radix conversion!
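To illustrate what "500 bits rounded to the next limb size" works out to, here is a minimal sketch. The helper name `cutoff_in_limbs` is hypothetical and not part of the library, and apart from the 60-bit limbs mentioned in this thread the limb widths in the usage note are assumptions.

```c
#include <assert.h>

/* Hypothetical helper (not library code): smallest number of limbs that
   holds at least `bits` bits when every limb carries `limb_bits` bits. */
static int cutoff_in_limbs(int bits, int limb_bits)
{
   assert(bits >= 0);
   assert(limb_bits > 0);
   /* integer ceiling division */
   return (bits + limb_bits - 1) / limb_bits;
}
```

With 60-bit limbs, `cutoff_in_limbs(500, 60)` gives 9 limbs; assuming 28-bit limbs it would give 18.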
What? Ah,
Line 29 in
&& (b->used > (2 * MP_MUL_KARATSUBA_CUTOFF))
Great! *sigh*
It is already a lot of code. Not so much for reading, but quite a lot for writing. So I'd like to stop here. No more cutoff branches (only bases 16 and 64 would be of interest anyway), and no more internal branches except for radices of the form
So, if there are no complaints, I'd like to give it a wet wipe and call it done. OK?
Absolutely OK :) Those graphs look good. Do I understand correctly that the "slow path" is the old version?
They are the old ones, yes, but with the added string positioning, which shouldn't add much runtime because it is outside of the loops. Yes, my ability to give the children good names is still highly underdeveloped, sorry ;-)
a1417cf to eeed7bd
As always: when you thought you had it all ... ;-)
2b6a2f1 to 5ece04f
Hold your horses, please. I just saw that I forgot to add the printing part in
No, I don't know how that happened (the rotation of
You might try
Tuning does not work in
Sure, make a separate PR. Did you check whether the amalgamated version of the library results in different tuning parameters?
No, I was just happy that it works in the first place! ;-) But I'll take a look, of course.
Good question. I'll take a look. (I think we would need the full triplet tune->profile->tune)
Nearly as much as in the cereals aisle at Walmart ;-) But I found a problem with
#define MP_DEFAULT_MUL_KARATSUBA_CUTOFF 115
#define MP_DEFAULT_SQR_KARATSUBA_CUTOFF 152
#define MP_DEFAULT_MUL_TOOM_CUTOFF 139
#define MP_DEFAULT_SQR_TOOM_CUTOFF 212
#define MP_DEFAULT_MUL_TOOM_4_CUTOFF 218
#define MP_DEFAULT_SQR_TOOM_4_CUTOFF 256
#define MP_DEFAULT_MUL_TOOM_5_CUTOFF 254
#define MP_DEFAULT_SQR_TOOM_5_CUTOFF 256
#define MP_DEFAULT_MUL_TOOM_6_CUTOFF 256
#define MP_DEFAULT_SQR_TOOM_6_CUTOFF 256
#define MP_DEFAULT_MUL_FFT_CUTOFF 608
#define MP_DEFAULT_SQR_FFT_CUTOFF 708
Which is clearly nonsense. (256 limbs is the COMBA size on this machine, btw.) I mean: my implementation of FFT is fast, but I'm pretty sure not that fast ;-)
#define MP_DEFAULT_MUL_KARATSUBA_CUTOFF 109
#define MP_DEFAULT_SQR_KARATSUBA_CUTOFF 138
#define MP_DEFAULT_MUL_TOOM_CUTOFF 149
#define MP_DEFAULT_SQR_TOOM_CUTOFF 266
#define MP_DEFAULT_MUL_TOOM_4_CUTOFF 880
#define MP_DEFAULT_SQR_TOOM_4_CUTOFF 2952
#define MP_DEFAULT_MUL_TOOM_5_CUTOFF 982
#define MP_DEFAULT_SQR_TOOM_5_CUTOFF 927
#define MP_DEFAULT_MUL_TOOM_6_CUTOFF 1251
#define MP_DEFAULT_SQR_TOOM_6_CUTOFF 1083
/* Results with steps of 100 and (1000) resp. */
#define MP_DEFAULT_MUL_FFT_CUTOFF 24208 (47008)
#define MP_DEFAULT_SQR_FFT_CUTOFF 12308 (43008)
That looks a bit more reasonable. There is not much difference up to
Why is
(git blame --line-porcelain etc/tune_it.sh; git blame --line-porcelain etc/tune.c ) \
| sed -n 's/^author //p' | sort | uniq -c | sort -rn
Ah, I forgot, never mind ;-) The outlier
And because I know you like a pretty picture or two:
Ok, the FFT cutoffs from the beginning were not a lie ;-) TC-7 is in the works (some coefficients are too large even for 32-bit) and TC-8 has a bug in the implementation (the PARI/GP script works, though). With TC-9 and above the coefficients get larger than 64 bits.
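To make the role of these cutoffs concrete, here is a minimal sketch of how a multiplication routine might choose an algorithm from the operand size in limbs. The enum, the struct, and `choose_mul` are hypothetical and only mirror the macro names above. Two further assumptions: a cutoff means "use this algorithm at or above that many limbs", and the largest matching cutoff wins. The table uses the multiplication values from the second tuning run (FFT with steps of 100).

```c
#include <assert.h>

/* Hypothetical names for illustration; not the library's internal API. */
typedef enum {
   ALG_BASECASE, ALG_KARATSUBA, ALG_TOOM3, ALG_TOOM4, ALG_TOOM5, ALG_TOOM6, ALG_FFT
} mul_alg;

typedef struct {
   int karatsuba, toom3, toom4, toom5, toom6, fft;
} mul_cutoffs;

/* Multiplication cutoffs from the second tuning run quoted above. */
static const mul_cutoffs tuned_mul = { 109, 149, 880, 982, 1251, 24208 };

/* Pick the algorithm for operands with `limbs` limbs (size of the smaller
   operand). Checked from the top so the largest matching cutoff wins. */
static mul_alg choose_mul(int limbs, const mul_cutoffs *c)
{
   assert(limbs >= 0);
   if (limbs >= c->fft)       return ALG_FFT;
   if (limbs >= c->toom6)     return ALG_TOOM6;
   if (limbs >= c->toom5)     return ALG_TOOM5;
   if (limbs >= c->toom4)     return ALG_TOOM4;
   if (limbs >= c->toom3)     return ALG_TOOM3;
   if (limbs >= c->karatsuba) return ALG_KARATSUBA;
   return ALG_BASECASE;
}
```

With the first set of values, where several cutoffs collapse to 256, the windows for TC-4 to TC-6 would shrink to a few dozen limbs or vanish entirely, which is one way to see why that result looked wrong.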
Me and my big mouth! ;-) I see that
The rest of the TODOs in this PR are mostly optimizations that need external work (e.g. short products) or small things that are not function-related and hence not urgent. I will shove my mop through it once more and wrap it up for good now. I know my tendency for "featuritis" too well ;-).
ea9a964 to a492616
This PR includes both variations to implement the Schönhage trick: the standard way and the method proposed by Lars Helmström to compute the reciprocals.
The default is the normal way. Switch to the second method which uses a round of Newton-Raphson (N-R-method) with
make "CFLAGS=-DMP_TO_RADIX_USE_NEWTON_RAPHSON " test
Found no significant difference in speed, but YMMV, as always.
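For readers unfamiliar with the N-R method: one Newton-Raphson round turns a k-bit approximation of a reciprocal into one that is good to roughly 2k bits. The sketch below shows that step on 64-bit integers instead of big integers. The function name `newton_step` and the fixed-point scaling are assumptions for illustration, not the code in this PR.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical illustration of one Newton-Raphson round for a reciprocal.
   Input:  r ~ 2^k / d       (for example r = floor(2^k / d))
   Output: r * (2^(k+1) - d*r) ~ 2^(2k) / d
   If r = (2^k / d) * (1 - e), the result is (2^(2k) / d) * (1 - e^2),
   so the relative error is squared. */
static uint64_t newton_step(uint64_t d, uint64_t r, unsigned k)
{
   uint64_t two = UINT64_C(1) << (k + 1);
   assert(k <= 30);      /* keeps the product below 2^64 */
   assert(d > 0);
   assert(r <= two);     /* r must be a k-bit scale approximation */
   assert(d * r <= two); /* no underflow in the subtraction */
   return r * (two - d * r);
}
```

For d = 1000 and k = 16 the seed is floor(65536 / 1000) = 65, and one round gives 4294680, compared to the exact floor(2^32 / 1000) = 4294967.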
The cutoffs are all regulated by MP_RADIX_BARRETT_START_MULTIPLICATOR, no finer resolution for now. Default is MP_RADIX_BARRETT_START_MULTIPLICATOR=10. There is not much difference but there is:
(Timings are read/write combined)
The timing used in demo/test.c to check if the faster method is actually faster is switched off if it runs in Valgrind.
There are a lot of edge cases, so there is a lot to test. On the upper end with MP_64BIT and base 10, the N-R-method works well up to 2^(2^27), the normal way up to 2^(2^30). (Up to MP_MAX_DIGIT_COUNT - 4 in reading; writing is still running.)
More information in the comments in the code.
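As a rough picture of the divide-and-conquer idea behind the Schönhage trick, here is a small sketch that converts a `uint64_t` to decimal by repeatedly splitting at 10^(2^k). All names are hypothetical, and plain `/` and `%` stand in for the big-integer division that the PR speeds up with precomputed reciprocals.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* 10^(2^k) for k = 0..4; 10^16 still fits into 64 bits. */
static const uint64_t pow10_tab[5] = {
   UINT64_C(10), UINT64_C(100), UINT64_C(10000),
   UINT64_C(100000000), UINT64_C(10000000000000000)
};

/* Writes exactly 2^(k+1) zero-padded decimal digits of n to out
   (one digit for k = -1). Requires n < 10^(2^(k+1)). */
static void to_dec_rec(uint64_t n, int k, char *out)
{
   size_t half;
   if (k < 0) {
      assert(n < 10);
      out[0] = (char)('0' + (int)n);
      return;
   }
   half = (size_t)1 << k;
   to_dec_rec(n / pow10_tab[k], k - 1, out);        /* upper 2^k digits */
   to_dec_rec(n % pow10_tab[k], k - 1, out + half); /* lower 2^k digits */
}

/* Converts n to a NUL-terminated decimal string; buf needs 33 bytes. */
static void to_decimal(uint64_t n, char *buf)
{
   char tmp[32];
   size_t skip = 0;
   to_dec_rec(n, 4, tmp);
   /* strip leading zeros but keep the last digit */
   while ((skip < 31) && (tmp[skip] == '0')) {
      skip++;
   }
   memcpy(buf, tmp + skip, 32 - skip);
   buf[32 - skip] = '\0';
}
```

The lower half is always zero-padded to its full width, which is what makes the two halves independent of each other; only the final result gets its leading zeros stripped.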