-
Notifications
You must be signed in to change notification settings - Fork 214
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Investigate speedups for __udivti3 and __udivdi3 #368
Comments
🎉 The author of the LLVM PR @danlark1 has also expressed interest in making a PR for Rust: https://news.ycombinator.com/item?id=23888517 |
Also, the original PR was https://reviews.llvm.org/D81809 and benchmarks https://reviews.llvm.org/D83027 |
See also #332 |
I believe this PR fully (and nicely) covers the suggested in my article optimizations with some extensions where the difference between high bits is big enough. The future thing to test might be a https://gmplib.org/~tege/division-paper.pdf |
cc @AaronKutch |
Hello danlark1, I saw your website a few days back and thought I might want to contact you, but could not find an email link on any accounts I saw. I'm glad you're looking at this optimization problem. If I understand the GMP division paper correctly, it uses a reciprocal based method to do 128 bit by 64 bit divisions. I tried using a reciprocal based method for short division cases like 128 bit by 32 bit or less divisions, but I could only ever barely improve over using hardware division based trivial short division in one case. I tried a few variations interpolating between all short divisions and only one 64/32 bit division to construct the reciprocal, and found that for every short division I removed, it added about 4-5 multiplications of the same size. The one case where it improved was probably because the multiplications could be done in parallel on a CPU with many multiplier units. The speed of Goldschmidt hardware division is based on the principle of being able to perform the numerator and denominator multiplications in parallel. The part of the GMP paper where they get to omit computing the most significant word may be the trick to make it faster in software. It mentions a speedup of 30% on processors of the day, but does this hold true in 2020? |
I updated the blog with the direct link to my info, sorry it was hidden :). Feel free to write me at kutdanila at gmail dot com.
This might be absolutely true, I also got some inconsistent results from the GMP paper.
Okay, looking forward to it. |
@danlark1 It doesn't seem that anyone is able to help me with the two remaining LLVM related division performance problems. I would make a PR to LLVM to fix them, but I am very inexperienced with LLVM and don't know the root causes to begin with. Since you have already submitted some division related changes to LLVM, I think you can help me with the basics at least. One is a problem with leading zeros calculations (see this comment). Another problem is that LLVM appears to only use For example:
compiles down to assembly
If I try to manually call
LLVM converts it to a call to I tested also on a |
Regarding the division. LLVM in the middle-end has a div/rem pass which replaces the remainder with a multiplication and subtraction. Some backends like x86-64 for integer division combine them in the back-end for 16,32,64 bits in one instruction, likely it was introduced for x86-64 here and evolved since. So, simply middle end does not know about __udivmodti4 and divrem is not instrumented. I tried to fix it a bit but failed in a short period of time. I claim this is an LLVM bug and should be properly addressed. Yet, I currently don't know how difficult it will be to fix this. For combining divrem, you should look somewhere here. Regarding leading_zeros, looking into it. |
I see a |
Hi, I believe The patch for AVR removed the stack issues for division rather than the div/rem optimization. |
Closing as per @AaronKutch 's comment above. Thanks everyone! |
There seems to be room for optimization of the __udivti3 function. I guess the improvement would also apply to __udivdi3. See these links:
https://danlark.org/2020/06/14/128-bit-division/
https://gmplib.org/~tege/division-paper.pdf
https://reviews.llvm.org/D83547
The text was updated successfully, but these errors were encountered: