-
Notifications
You must be signed in to change notification settings - Fork 213
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Greatly improve division performance for u128 and other cases #332
Conversation
I forgot to run rustfmt. However, I get the error
when I try to run |
Thanks for the PR! Unfortunately due to the way this repository is set up with CI and how it integrates into the compiler this crate is unable to take on any dependencies. Could the code be copied into this crate? |
42a8a95
to
b03ce24
Compare
Do I run my rustfmt nightly directly on the files? What's the cause of i686 failing? |
// `debug-assertions = true` | ||
|
||
// This replicates `carrying_mul` (rust-lang rfc #2417). LLVM correctly optimizes this | ||
// to use a widening multiply to 128 bits. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But cg_clif won't, causing an infinite loop.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also is it optimized correctly on archs other than x86?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I presume cg_clif
is the cranelift backend. I would understand that cranelift can't optimize it into a single instruction, but what do you mean by infinite loop?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Never mind, thought this function was for multiplication, in which case the wrapping_mul
would end up calling this function again.
I presume cg_clif is the cranelift backend.
Indeed
For a summary from some assembly references: Architectures with both a widening mul and asymmetric div: Architectures with only a widening mul, we only need to check that LLVM is using the widening intructions correctly: MIPS is interesting and I think we should just use the default division hierarchy of I found a reference nested inside the manual for some IBM assembler that mentions concatenated registers for a division operation, but it is for power and not powerpc if I am reading it correctly. I cannot even find something on widening multiplication for powerpc64. I could not find widening multiplication for webassembly anywhere, the closest I got to a assembly reference is this. I do not know what to do in the case of Wasm. That's all the notable architectures. I know for certain it is worth it to do the inline assembly and use |
Is there some tests we already have or could have to test that widening multiplications optimizations work on notable architectures? It is the most important factor in the performance of some algorithms. Maybe we should move widening_mul rfc #2417 forward with some kind of solution that is optimized whether we work with LLVM or the cranelift backends? |
wasm32 made it but i686 is still failing |
According to https://stackoverflow.com/questions/3378616/the-application-failed-to-initialize-properly-0xc000007b this exit code corresponds to |
I remembered encountering an error before with |
The error occurs when cargo wants to know the rustc version. It is unrelated to the changes in this PR. |
It gave me the error when I ran
but I remember getting the more obscure error with |
I believe that was a transient error at the time and have requeued checks. |
Any updates on this? |
Sorry this fell off my radar. This unfortunately though is a pretty huge PR to a lot of code I did not write myself nor do I fully understand. I can't quite tell what's a functional change and what's just a refactoring. I'm sort of naively expecting a fast path to crop up here or there with inline asm to do various things, but it doesn't look like this PR does that. Would it be possible to pare this down to not contain organizational changes, or if there are organizational changes separate them out into commits for easier review? |
The code I am replacing is a very slow binary long division algorithm that was presumably a MVP to get Also, does the
I do use inline asm to optimize for x86_64. The only other place that I see inline asm being useful is for binary long division. The Also,
|
Nevermind about the formatting problem, it isn't caused by |
My guess: your |
I have no idea why this is, but trying to run If I run
This corresponds to these lines:
I have no idea what the purpose of these, and why a If I change my directory to the testcrate, |
Clippy is showing 55 warnings. Can I modernize |
I was accidentally bringing in a dependency on a panic function. All the checks are passing now. |
@alexcrichton I think my algorithms should be looked at by some numerics minded person. Do you know anyone around the LLVM project or Rust project that might might provide some input? I wonder if any of my work should be upstreamed or there is something I missed. Also, there are some questions in this thread that haven't been answered yet. I have a fuzz tester specifically designed for division algorithms that should be put somewhere. Should it be put in |
Unfortunately I do not know of folks to help review this myself, but I agree it would be good to get more review before landing. I think adding CI is fine so long as it's not too burdensome. |
I haven't been able to work on this until now because of college, but now I can. Do I need to do a full compiler bootstrap to test performance myself? I am wondering about the performance impact of flags like |
I am done with my round of changes. I think I have resolved all issues except for the one where I should use
It generates an insane amount of magic number juggling: assembly
If I instead manually split the integer with the equivalent function:
assembly
It generates optimal code. A similar problem is happening on RISC-V and will probably still exist even after the effects from my PR make it into nightly. The problem is certainly in LLVM, and I found the |
I published |
I don't think any changes to testing are required. |
I finally found a potentially good way to reimplement my algorithms in terms of generics, that involves replacing the current |
Let's merge this for now. |
I guess we should close this issue down? |
@leonardo-m You can close it if you feel that the performance improvements are enough to resolve the issue. |
When do the performance changes make it into nightly? The performance does not seem to have improved yet. |
It used to be a git submodule, but nowadays it's pushed to crates.io apparently, from where the compiler pulls it, see https://github.com/rust-lang/compiler-builtins/blob/master/PUBLISHING.md So the answer is: whenever a new version is published on crates.io, and rustc updates it. I guess you could try making PRs for both. |
Wait, there is apparently no automation, so you can't do it yourself, but need to ask a maintainer to do it. |
I just realized that I need the tweak the inlining more after benchmarking the div only function versus divmod in my |
See also: rust-lang/rust#54867 |
Update `compiler_builtins` to 0.1.36 So, the libc build with cargo's `build-std` feature emits a lot of warnings like: ``` warning: a method with this name may be added to the standard library in the future --> /home/runner/.cargo/registry/src/github.com-1ecc6299db9ec823/compiler_builtins-0.1.35/src/int/udiv.rs:98:23 | 98 | q = n << (<$ty>::BITS - sr); | ^^^^^^^^^^^ ... 268 | udivmod_inner!(n, d, rem, u128) | ------------------------------- in this macro invocation | = warning: once this method is added to the standard library, the ambiguity may cause an error or change in behavior! = note: for more information, see issue rust-lang#48919 <rust-lang/issues/48919> = help: call with fully qualified syntax `Int::BITS(...)` to keep using the current method = help: add `#![feature(int_bits_const)]` to the crate attributes to enable `num::<impl u128>::BITS` = note: this warning originates in a macro (in Nightly builds, run with -Z macro-backtrace for more info) ``` (You can find the full log in https://github.com/rust-lang/libc/runs/1283695796?check_suite_focus=true for example.) 0.1.36 contains rust-lang/compiler-builtins#332 so this version should remove this warning. cc rust-lang/libc#1942
Update `compiler_builtins` to 0.1.36 So, the libc build with cargo's `build-std` feature emits a lot of warnings like: ``` warning: a method with this name may be added to the standard library in the future --> /home/runner/.cargo/registry/src/github.com-1ecc6299db9ec823/compiler_builtins-0.1.35/src/int/udiv.rs:98:23 | 98 | q = n << (<$ty>::BITS - sr); | ^^^^^^^^^^^ ... 268 | udivmod_inner!(n, d, rem, u128) | ------------------------------- in this macro invocation | = warning: once this method is added to the standard library, the ambiguity may cause an error or change in behavior! = note: for more information, see issue rust-lang#48919 <rust-lang/issues/48919> = help: call with fully qualified syntax `Int::BITS(...)` to keep using the current method = help: add `#![feature(int_bits_const)]` to the crate attributes to enable `num::<impl u128>::BITS` = note: this warning originates in a macro (in Nightly builds, run with -Z macro-backtrace for more info) ``` (You can find the full log in https://github.com/rust-lang/libc/runs/1283695796?check_suite_focus=true for example.) 0.1.36 contains rust-lang/compiler-builtins#332 so this version should remove this warning. cc rust-lang/libc#1942
This cleans up
udiv.rs
andsdiv.rs
by replacing their implementations with my optimized division functions from myspecialized-div-rem
crate. On x86_64, you can expect u128 division performance improvements of 300% to 1100% depending on what ranges of inputs you use. I am also sure that division performance is improved on 32 bit platforms, but someone needs to benchmark it just in case.