Investigate speedups for __udivti3 and __udivdi3 #368

Closed · est31 opened this issue Jul 19, 2020 · 13 comments
@est31 (Member) commented Jul 19, 2020

There seems to be room for optimization of the __udivti3 function. I guess the improvement would also apply to __udivdi3. See these links:

@est31 (Member, Author) commented Jul 19, 2020

🎉 The author of the LLVM PR @danlark1 has also expressed interest in making a PR for Rust: https://news.ycombinator.com/item?id=23888517

@danlark1

Also, the original PR was https://reviews.llvm.org/D81809, and the benchmarks are at https://reviews.llvm.org/D83027

@Amanieu (Member) commented Jul 19, 2020

See also #332

@danlark1

> See also #332

I believe this PR fully (and nicely) covers the optimizations suggested in my article, with some extensions for cases where the difference between the high bits is big enough. A future thing to test might be https://gmplib.org/~tege/division-paper.pdf
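
For reference, the core 2-by-1 step from that paper looks roughly like this (a sketch of its Algorithm 4 as I read it, not taken from any shipped implementation; it assumes a normalized divisor `d` with its top bit set, a precomputed reciprocal `v = floor((2^128 - 1) / d) - 2^64`, and `u1 < d` so the quotient fits in one word):

```rust
/// 2-by-1 division step in the style of the paper's Algorithm 4.
/// Divides the 128-bit value (u1 << 64) | u0 by `d`, given the
/// precomputed reciprocal `v`. Preconditions: d >= 2^63 and u1 < d.
fn div_2by1(u1: u64, u0: u64, d: u64, v: u64) -> (u64, u64) {
    // Quotient estimate: two multiplications instead of a hardware division.
    let q = (v as u128) * (u1 as u128) + (((u1 as u128) << 64) | (u0 as u128));
    let mut q1 = ((q >> 64) as u64).wrapping_add(1);
    let q0 = q as u64;
    let mut r = u0.wrapping_sub(q1.wrapping_mul(d));
    // The estimate is close enough that two conditional fixups suffice.
    if r > q0 {
        q1 = q1.wrapping_sub(1);
        r = r.wrapping_add(d);
    }
    if r >= d {
        q1 += 1;
        r -= d;
    }
    (q1, r)
}

fn main() {
    let d: u64 = (1u64 << 63) | 12345; // normalized: top bit set
    let v = (u128::MAX / d as u128) as u64; // reciprocal, computed once per divisor
    let u: u128 = 0x1234_5678_9abc_def0_1111_2222_3333_4444;
    let (q, r) = div_2by1((u >> 64) as u64, u as u64, d, v);
    assert_eq!((q as u128, r as u128), (u / d as u128, u % d as u128));
}
```

The reciprocal is computed once per divisor, so the per-word cost is a couple of multiplications plus adds and compares rather than a hardware division.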

@Amanieu (Member) commented Jul 19, 2020

cc @AaronKutch

@AaronKutch (Contributor)

Hello danlark1, I saw your website a few days back and thought I might want to contact you, but could not find an email link on any of the accounts I saw. I'm glad you're looking at this optimization problem.

If I understand the GMP division paper correctly, it uses a reciprocal-based method to do 128-bit by 64-bit divisions. I tried using a reciprocal-based method for short division cases (128-bit by 32-bit-or-less divisions), but in only one case could I ever barely improve over trivial short division based on hardware division (sketched below). I tried a few variations interpolating between using all short divisions and using only one 64/32-bit division to construct the reciprocal, and found that for every short division I removed, about 4-5 multiplications of the same size were added. The one case where it improved was probably because the multiplications could be done in parallel on a CPU with many multiplier units; the speed of Goldschmidt hardware division rests on the same principle of performing the numerator and denominator multiplications in parallel. The part of the GMP paper where they get to omit computing the most significant word may be the trick that makes it faster in software. The paper mentions a speedup of 30% on processors of the day, but does that still hold in 2020?
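
To make "trivial short division" concrete, here is a minimal hand-written sketch of that baseline (illustrative only, not the actual specialized_div_rem code): one hardware 64/32 division per 32-bit digit of the numerator.

```rust
/// Schoolbook short division of a 128-bit numerator by a 32-bit divisor,
/// one hardware 64/32 division per 32-bit digit (most significant first).
/// Panics if `div == 0`, like the built-in operators.
fn short_div_rem_u128_u32(duo: u128, div: u32) -> (u128, u32) {
    let div = div as u64;
    let mut rem: u64 = 0;
    let mut quo: u128 = 0;
    for i in (0..4).rev() {
        // `partial` always fits: rem < div <= u32::MAX, so rem << 32 | digit < div << 32.
        let partial = (rem << 32) | (((duo >> (32 * i)) as u32) as u64);
        quo |= ((partial / div) as u128) << (32 * i);
        rem = partial % div;
    }
    (quo, rem as u32)
}
```

The reciprocal variants replace the `partial / div` steps with multiplications by a precomputed approximate reciprocal, which is where the extra 4-5 multiplications per removed division show up.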
One important performance artifact I found on an Intel CPU I benchmarked is that performing 64/32 divisions instead of 128/64 divisions where possible leads to a significant performance increase, despite using the same underlying assembly instruction (see the parts of asymmetric.rs that start with "Some x86_64 CPUs have bad hardware implementations..."). Some x86_64 CPUs definitely have optimization problems surrounding div.
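
The kind of dispatch involved looks roughly like this (my own illustration of the idea, not the actual asymmetric.rs code, which uses inline assembly and handles more cases):

```rust
/// Route to a narrower hardware division when the operands permit.
fn div_rem_u64(duo: u64, div: u64) -> (u64, u64) {
    if duo <= u32::MAX as u64 && div <= u32::MAX as u64 {
        // LLVM lowers u32 division to the 32-bit form of `div`,
        // which avoids the slow path on the affected CPUs.
        (((duo as u32) / (div as u32)) as u64, ((duo as u32) % (div as u32)) as u64)
    } else {
        (duo / div, duo % div)
    }
}
```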
Also, note that specialized_div_rem 0.4.0, which I am using to develop these algorithms, isn't actually published on crates.io yet; the most up-to-date version is on my GitHub master branch. I'm currently working on a cleanup of my PR, which I should submit soon.

@danlark1

> Hello danlark1, I saw your website a few days back and thought I might want to contact you, but could not find an email link on any of the accounts I saw.

I updated the blog with a direct link to my info, sorry it was hidden :). Feel free to write to me at kutdanila at gmail dot com.

> […] The part of the GMP paper where they get to omit computing the most significant word may be the trick that makes it faster in software. The paper mentions a speedup of 30% on processors of the day, but does that still hold in 2020?

This might well be true; I also got some inconsistent results from the GMP paper.

> Also, note that specialized_div_rem 0.4.0, which I am using to develop these algorithms, isn't actually published on crates.io yet; the most up-to-date version is on my GitHub master branch. I'm currently working on a cleanup of my PR, which I should submit soon.

Okay, looking forward to it.

@AaronKutch (Contributor)

@danlark1 It doesn't seem that anyone is able to help me with the two remaining LLVM-related division performance problems. I would make a PR to LLVM to fix them, but I am very inexperienced with LLVM and don't know the root causes to begin with. Since you have already submitted some division-related changes to LLVM, I think you can at least help me with the basics.

One problem is with leading zeros calculations (see this comment). The other is that LLVM appears to only use the __div{s,d,t}i3 functions instead of also using the __divmod{s,d,t}i4 and __mod{s,d,t}i3 functions. Whenever a remainder is needed, LLVM uses multiplications to derive the remainder from the quotient, but this is wasteful because my __divmod... and __mod... functions can often get the remainder in fewer operations. In fact, the way I have currently set up my algorithms leads to the remainder always being calculated, but LLVM always throws it away even when it is useful.

For example:

```rust
pub fn test(a: u128, b: u128) -> (u128, u128) {
    let quo = a / b;
    let rem = a % b;
    (quo, rem)
}
```

compiles down to

```assembly
aaron_test_lib::test:
 push    r15
 push    r14
 push    rsi
 push    rdi
 push    rbx
 sub     rsp, 64
 mov     rdi, qword ptr [rsp + 144]
 mov     rax, r9
 or      rax, rdi
 je      .LBB0_2
 mov     rsi, r9
 mov     r14, r8
 mov     rbx, rdx
 mov     r15, rcx
 mov     qword ptr [rsp + 48], rdx
 mov     qword ptr [rsp + 32], r9
 mov     qword ptr [rsp + 56], r8
 mov     qword ptr [rsp + 40], rdi
 lea     rcx, [rsp + 48]
 lea     rdx, [rsp + 32]
 call    __udivti3
 pshufd  xmm1, xmm0, 78
 movq    rcx, xmm1
 movq    rax, xmm0
 imul    rdi, rax
 mul     rsi
 add     rdx, rdi
 imul    rcx, rsi
 add     rcx, rdx
 sub     rbx, rax
 sbb     r14, rcx
 movdqu  xmmword ptr [r15], xmm0
 mov     qword ptr [r15 + 16], rbx
 mov     qword ptr [r15 + 24], r14
 mov     rax, r15
 add     rsp, 64
 pop     rbx
 pop     rdi
 pop     rsi
 pop     r14
 pop     r15
 ret
.LBB0_2:
 lea     rcx, [rip + str.0]
 lea     r8, [rip + __unnamed_1]
 mov     edx, 25
 call    core::panicking::panic
 ud2
```

If I try to manually call __udivmodti4:

```rust
pub fn test(a: u128, b: u128) -> (u128, u128) {
    let rem = &mut 0;
    let quo = compiler_builtins::int::udiv::__udivmodti4(a, b, Some(rem));
    (quo, *rem)
}
```

LLVM converts it to a call to __udivti3, which I did not know it was capable of until now.

I also tested on a riscv32 target and found that, strangely, the same trick is not being pulled for any division size.

@danlark1 commented Aug 5, 2020

Regarding the division: LLVM's middle end has a div/rem pass that replaces the remainder with a multiplication and a subtraction (sketched below). Some backends, such as x86-64, combine integer division and remainder into one instruction in the back end for 16, 32, and 64 bits; this was likely introduced for x86-64 here and has evolved since.
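
In source terms, the rewrite amounts to something like this (an illustrative Rust rendering of the transform, not the actual LLVM IR):

```rust
// What the div/rem pass effectively does: keep one division and
// recover the remainder with a multiply and a subtract.
fn div_rem(a: u128, b: u128) -> (u128, u128) {
    let quo = a / b;        // lowers to a single __udivti3 call
    let rem = a - quo * b;  // no __umodti3/__udivmodti4 call emitted
    (quo, rem)
}
```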

So, simply put, the middle end does not know about __udivmodti4, and divrem is not instrumented. I tried to fix it a bit but failed within a short period of time.

I claim this is an LLVM bug and that it should be properly addressed. Yet I currently don't know how difficult it will be to fix.

For combining divrem, you should look somewhere here.

Regarding leading_zeros, I am looking into it.

@AaronKutch (Contributor) commented Aug 7, 2020

I see a SDIVREM_I128 call being referenced here. Currently, there is no signed counterpart to the __udivmodti4 function in compiler-rt or compiler-builtins. Is this something that can cause optimization problems around i128? Amanieu found that support used to exist, but it was removed for some reason (would adding it back break AVR?). Could I add a __divmodti4 function to compiler-builtins to fix the asymmetry, or does support need to be enabled somewhere first?
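
For what it's worth, such a signed counterpart can be expressed as a thin wrapper over unsigned divmod; a minimal sketch (my own illustration, not the actual compiler-builtins implementation; the u128 `/` and `%` here would themselves lower to __udivmodti4):

```rust
/// Signed 128-bit divmod in terms of unsigned divmod: the quotient is
/// negative iff the operand signs differ, and the remainder takes the
/// sign of the dividend (Rust/C truncating division semantics).
/// For i128::MIN / -1 this wraps to i128::MIN.
fn divmodti4(a: i128, b: i128) -> (i128, i128) {
    let uq = a.unsigned_abs() / b.unsigned_abs();
    let ur = a.unsigned_abs() % b.unsigned_abs();
    let quo = if (a < 0) != (b < 0) { (uq as i128).wrapping_neg() } else { uq as i128 };
    let rem = if a < 0 { (ur as i128).wrapping_neg() } else { ur as i128 };
    (quo, rem)
}
```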

@danlark1 commented Aug 8, 2020

Hi, I believe __divmodti4 did not exist in gcc prior to version 7, and llvm somewhat failed to catch up in time; there is a patch for it. I don't think it causes problems right now; the problem is that LLVM just does not have this optimization implemented.

The patch for AVR removed the stack issues for division rather than the div/rem optimization.

@AaronKutch (Contributor)

@danlark1 I have merged PR #332. The only remaining problem is the div/rem pass issue, which I think should be tracked on the Rust side here. Also, did you find any easy fixes for the leading_zeros problem?

@est31 (Member, Author) commented Sep 12, 2020

Closing as per @AaronKutch's comment above. Thanks everyone!

est31 closed this as completed on Sep 12, 2020