Greatly improve division performance for u128 and other cases #332

AaronKutch · 2019-12-18T20:39:20Z

This cleans up udiv.rs and sdiv.rs by replacing their implementations with my optimized division functions from my specialized-div-rem crate. On x86_64, you can expect u128 division performance improvements of 300% to 1100% depending on what ranges of inputs you use. I am also sure that division performance is improved on 32 bit platforms, but someone needs to benchmark it just in case.

AaronKutch · 2019-12-18T20:45:09Z

I forgot to run rustfmt. However, I get the error

error: couldn't read \\?\C:\...GitHub\compiler-builtins\src\..\libm\src\math\mod.rs: The filename, directory name, or volume label syntax is incorrect. (os error 123)
 --> \\?\C:\Users\Allen\Documents\GitHub\compiler-builtins\src\math.rs:3:5
  |
3 | mod libm;
  |     ^^^^

when I try to run cargo fmt

alexcrichton · 2019-12-18T23:38:05Z

Thanks for the PR!

Unfortunately due to the way this repository is set up with CI and how it integrates into the compiler this crate is unable to take on any dependencies. Could the code be copied into this crate?

AaronKutch · 2019-12-19T01:04:05Z

Do I run my rustfmt nightly directly on the files? What's the cause of i686 failing?

src/int/specialized_div_rem/mod.rs

bjorn3 · 2019-12-19T19:02:54Z

src/int/specialized_div_rem/trifecta.rs

+            // `debug-assertions = true`
+
+            // This replicates `carrying_mul` (rust-lang rfc #2417). LLVM correctly optimizes this
+            // to use a widening multiply to 128 bits.


But cg_clif won't, causing an infinite loop.

Also is it optimized correctly on archs other than x86?

I presume cg_clif is the cranelift backend. I would understand that cranelift can't optimize it into a single instruction, but what do you mean by infinite loop?

Never mind, thought this function was for multiplication, in which case the wrapping_mul would end up calling this function again.

I presume cg_clif is the cranelift backend.

Indeed

src/int/specialized_div_rem/mod.rs

AaronKutch · 2019-12-19T20:45:04Z

For a summary from some assembly references:

Architectures with both a widening mul and asymmetric div:
x86-64: 128 bit mulq and divq
x86: 64 bit mul and div

Architectures with only a widening mul, where we only need to check that LLVM is using the widening instructions correctly:
riscv32, riscv64
armv7, armv8

MIPS is interesting and I think we should just use the default division hierarchy of _binary_long <- _delegate <- _trifecta and let LLVM do whatever it wants and not worry about the performance of that case.

I found a reference nested inside the manual for some IBM assembler that mentions concatenated registers for a division operation, but it is for power and not powerpc if I am reading it correctly. I cannot even find something on widening multiplication for powerpc64.

I could not find widening multiplication for WebAssembly anywhere; the closest I got to an assembly reference is this. I do not know what to do in the case of Wasm.

That's all the notable architectures. I know for certain it is worth it to do the inline assembly and use _asymmetric for u128 on x86_64, but I do not know if it is worth it to do it for anything else.
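To make the term concrete, here is a minimal sketch of what an "asymmetric" division computes, assuming the x86-64 div shape of a 128 bit dividend, a 64 bit divisor, and a 64 bit quotient and remainder. The name asymmetric_div_rem_model is hypothetical, and the body is a portable model built on ordinary u128 arithmetic, not the inline assembly from this PR (inside compiler-builtins itself, u128 division could not be used this way).

```rust
/// Portable model of an asymmetric division: 128-bit dividend, 64-bit
/// divisor, 64-bit quotient and remainder (the shape of x86-64 `div`).
/// Returns `None` where the hardware instruction would fault: a zero
/// divisor, or a quotient that does not fit in 64 bits.
pub fn asymmetric_div_rem_model(duo: u128, div: u64) -> Option<(u64, u64)> {
    if div == 0 {
        return None;
    }
    // The quotient fits in 64 bits exactly when the high half of the
    // dividend is smaller than the divisor.
    if (duo >> 64) as u64 >= div {
        return None;
    }
    let quo = duo / (div as u128);
    let rem = duo % (div as u128);
    Some((quo as u64, rem as u64))
}
```

The overflow check is the reason a full u128 by u128 division cannot simply be handed to the instruction; a surrounding algorithm has to arrange for the high half to be smaller than the divisor first.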

AaronKutch · 2019-12-19T21:11:07Z

Are there tests that we already have, or could add, to check that widening multiplication optimizations work on notable architectures? It is the most important factor in the performance of some algorithms. Maybe we should move widening_mul rfc #2417 forward with some kind of solution that is optimized whether we work with the LLVM or the cranelift backend?
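As a rough illustration of the pattern in question, the sketch below writes a 64 by 64 to 128 bit multiply by casting up to u128, next to a reference built from 32 bit halves. Both function names are hypothetical. Comparing the two only checks correctness; whether a backend lowers the first form to a single widening multiply would still have to be confirmed by inspecting the generated assembly.

```rust
/// Full 64x64 -> 128 bit multiplication, returned as (low, high) halves.
pub fn widening_mul_u64(a: u64, b: u64) -> (u64, u64) {
    // Casting both operands up first means the product cannot overflow.
    let wide = (a as u128) * (b as u128);
    (wide as u64, (wide >> 64) as u64)
}

/// The same product built only from 32-bit halves, as a test reference.
pub fn widening_mul_u64_reference(a: u64, b: u64) -> (u64, u64) {
    let (a_lo, a_hi) = (a & 0xFFFF_FFFF, a >> 32);
    let (b_lo, b_hi) = (b & 0xFFFF_FFFF, b >> 32);
    // Each partial product of two 32-bit values fits in a u64.
    let lo_lo = a_lo * b_lo;
    let hi_lo = a_hi * b_lo;
    let lo_hi = a_lo * b_hi;
    let hi_hi = a_hi * b_hi;
    // Sum the middle column; its upper bits carry into the high half.
    let mid = (lo_lo >> 32) + (hi_lo & 0xFFFF_FFFF) + (lo_hi & 0xFFFF_FFFF);
    let low = (lo_lo & 0xFFFF_FFFF) | (mid << 32);
    let high = hi_hi + (hi_lo >> 32) + (lo_hi >> 32) + (mid >> 32);
    (low, high)
}
```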

AaronKutch · 2019-12-19T21:52:16Z

wasm32 made it but i686 is still failing

bjorn3 · 2019-12-20T11:06:35Z

error: process didn't exit successfully: rustc -vV (exit code: 0xc000007b)

According to https://stackoverflow.com/questions/3378616/the-application-failed-to-initialize-properly-0xc000007b this exit code corresponds to STATUS_INVALID_IMAGE_FORMAT.

AaronKutch · 2019-12-20T14:10:56Z

I remembered encountering an error before with error: process didn't exit successfully: rustc -vV, and it had to do with a windows-gnu toolchain that I was running through MSYS2. The PATH variable inside MSYS2 had not been set to include important DLLs like libstdc++-6.dll. There was a popup window on the first run that mentioned it, but afterwards only the obscure error: process didn't exit successfully: rustc -vV showed. How is compiler-builtins introduced? Are the symbols brought in via a DLL-like mechanism, and is it missing these?

bjorn3 · 2019-12-20T14:20:48Z

The error occurs when cargo wants to know the rustc version. It is unrelated to the changes in this PR.

AaronKutch · 2019-12-20T14:25:06Z

It gave me the error when I ran rustc --version
edit: I just reproduced it, and actually the error message is:

$ rustc --version
C:/Users/.../.cargo/bin/rustc.exe: error while loading shared libraries: ?: cannot open shared object file: No such file or directory

but I remember getting the more obscure error with rustc -vV somewhere

alexcrichton · 2020-01-06T22:58:02Z

I believe that was a transient error at the time and have requeued checks.

AaronKutch · 2020-01-27T02:47:15Z

Any updates on this?

alexcrichton · 2020-01-28T08:53:34Z

Sorry this fell off my radar. This unfortunately though is a pretty huge PR to a lot of code I did not write myself nor do I fully understand. I can't quite tell what's a functional change and what's just a refactoring.

I'm sort of naively expecting a fast path to crop up here or there with inline asm to do various things, but it doesn't look like this PR does that. Would it be possible to pare this down to not contain organizational changes, or if there are organizational changes separate them out into commits for easier review?

AaronKutch · 2020-01-28T17:34:15Z

The code I am replacing is a very slow binary long division algorithm that was presumably an MVP to get u128 stabilized. I completely replaced the functional part. I can't really factor out more because the old algorithm had code in different modules and inside the extern functions, and the new algorithm is self-contained but depends on the whole subalgorithm chain working at once. The old blocks of extern functions also had a weird ordering and inconsistent docs, which I fixed.
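For readers unfamiliar with the term, binary long division is the shift-and-subtract method sketched below. This is a generic textbook version written for illustration, not the code that was replaced nor the _binary_long function from this PR; the name binary_long_div_rem is hypothetical.

```rust
/// Shift-and-subtract (restoring) binary long division on u128.
/// Returns (quotient, remainder). Panics if `div` is zero.
pub fn binary_long_div_rem(duo: u128, div: u128) -> (u128, u128) {
    assert!(div != 0, "division by zero");
    let mut quo: u128 = 0;
    let mut rem: u128 = 0;
    // Walk the dividend from the most significant bit down.
    for i in (0..128).rev() {
        // Bring down the next dividend bit. `rem` is a remainder of at most
        // 127 already-processed bits here, so the shift cannot drop a set bit.
        rem = (rem << 1) | ((duo >> i) & 1);
        if rem >= div {
            // The divisor fits at this position: subtract it and
            // record a 1 in the matching quotient bit.
            rem -= div;
            quo |= 1u128 << i;
        }
    }
    (quo, rem)
}
```

The loop always runs 128 iterations with a data-dependent branch in each, which is why this style of algorithm tends to be slow compared with approaches that lean on hardware division of smaller widths.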

Also, does the #[maybe_use_optimized_c_shim] attribute replace my code with different code? I am curious about the performance differences. I have test functions and benchmarking functions, but where should I put those?

I'm sort of naively expecting a fast path to crop up here or there with inline asm to do various things, but it doesn't look like this PR does that.

I do use inline asm to optimize for x86_64. The only other place that I see inline asm being useful is for binary long division. The _binary_long algorithm I have is at least faster than the original even without assembly. I don't know if it is worth it to add assembly.

Also, cargo fmt is not working for me. It gives me the error

Error: Decoding config file failed:
invalid type: integer `2018`, expected string for key `edition`
Please check your config file.

AaronKutch · 2020-01-28T17:50:23Z

Never mind about the formatting problem, it isn't caused by compiler-builtins. I have run rustup to update to the latest nightly, but I am still getting the error on all my crates. Where is the mentioned configuration file? I haven't modified RUSTUP_HOME or done anything special.

amosonn · 2020-02-01T22:45:34Z

My guess: your Cargo.toml (at the crate root) has for some reason edition = 2018 instead of edition = "2018".

AaronKutch · 2020-02-02T03:20:32Z

I have no idea why this is, but trying to run cargo fmt on a crate that is under .../Desktop/ (Windows 10) will always cause that error. I simply moved the crate to .../Documents/GitHub and it started producing the original error messages.

If I run cargo fmt at the root of the crate, I get:

error: couldn't read \\?\C:\Users\Aaron Kutch\Documents\GitHub\compiler-builtins\src\..\libm\src\math\mod.rs: The filename, directory name, or volume label syntax is incorrect. (os error 123)
 --> \\?\C:\Users\Aaron Kutch\Documents\GitHub\compiler-builtins\src\math.rs:3:5
  |
3 | mod libm;
  |     ^^^^

This corresponds to these lines:

#[allow(dead_code)]
#[path = "../libm/src/math/mod.rs"]
mod libm;

I have no idea what the purpose of these lines is, or why #[allow(dead_code)] is needed. There is no documentation around them.

If I change my directory to the testcrate, cargo fmt runs fine.
If I change my directory to src, cargo fmt fails to find targets.
If I try to run rustfmt on src, it appears to hang. Is this a bug?
If I run rustfmt on src/int/mod.rs, it finally works.

AaronKutch · 2020-02-02T03:26:43Z

Clippy is showing 55 warnings. Can I modernize compiler-builtins, or does it need to stay 2015 edition for backwards compatibility?

AaronKutch · 2020-02-02T05:47:58Z

I was accidentally bringing in a dependency on a panic function. All the checks are passing now.

AaronKutch · 2020-02-07T15:35:35Z

@alexcrichton I think my algorithms should be looked at by some numerics minded person. Do you know anyone around the LLVM project or Rust project that might provide some input? I wonder if any of my work should be upstreamed or if there is something I missed.

Also, there are some questions in this thread that haven't been answered yet. I have a fuzz tester specifically designed for division algorithms that should be put somewhere. Should it be put in compiler-builtins or somewhere else for integration testing?

alexcrichton · 2020-02-12T18:12:02Z

Unfortunately I do not know of folks to help review this myself, but I agree it would be good to get more review before landing. I think adding CI is fine so long as it's not too burdensome.

AaronKutch · 2020-05-04T03:49:42Z

I haven't been able to work on this until now because of college, but now I can. Do I need to do a full compiler bootstrap to test performance myself? I am wondering about the performance impact of flags like maybe_use_optimized_c_shim and want to see which is faster.

AaronKutch · 2020-07-31T19:23:02Z

I am done with my round of changes. I think I have resolved all issues except for the one where I should use LargeInt for splitting up integers. I think I will go with the position that I will keep my code style as is, and follow up PRs can be made if someone doesn't like the style.
I have rebased on my recently merged PR improving RISC-V leading_zeros. However, there is still a serious performance problem with u128::leading_zeros on 64 bit targets and u64::leading_zeros on 32 bit targets. If I look at the assembly produced by cargo rustc --release --target=thumbv8m.base-none-eabi -- --emit asm for this function:

pub fn u64_lz(x: u64) -> u32 {
    x.leading_zeros()
}

It generates an insane amount of magic number juggling:

assembly

_ZN14aaron_test_lib6u64_lz17h1c6fdfc79fede3f3E:
	.fnstart
	.save	{r4, r5, r7, lr}
	push	{r4, r5, r7, lr}
	.setfp	r7, sp, #8
	add	r7, sp, #8
	mov	r5, r1
	mov	r4, r0
	mov	r0, r1
	bl	__clzsi2
	cbnz	r5, .LBB0_2
	lsrs	r0, r4, #1
	orrs	r0, r4
	lsrs	r1, r0, #2
	orrs	r1, r0
	lsrs	r0, r1, #4
	orrs	r0, r1
	lsrs	r1, r0, #8
	orrs	r1, r0
	lsrs	r0, r1, #16
	orrs	r0, r1
	mvns	r0, r0
	lsrs	r1, r0, #1
	movw	r2, #21845
	movt	r2, #21845
	ands	r2, r1
	subs	r0, r0, r2
	movw	r1, #13107
	movt	r1, #13107
	lsrs	r2, r0, #2
	ands	r0, r1
	ands	r2, r1
	adds	r0, r0, r2
	lsrs	r1, r0, #4
	adds	r0, r0, r1
	movw	r1, #3855
	movt	r1, #3855
	ands	r1, r0
	movw	r0, #257
	movt	r0, #257
	muls	r0, r1, r0
	lsrs	r0, r0, #24
	adds	r0, #32
.LBB0_2:
	pop	{r4, r5, r7, pc}

If I instead manually split the integer with the equivalent function:

pub fn u64_lz(x: u64) -> u32 {
    let mut x = x;
    let mut z = 0;
    if ((x >> 32) as u32) == 0 {
        z += 32;
    } else {
        x >>= 32;
    }
    z + (x as u32).leading_zeros()
}

assembly


_ZN14aaron_test_lib6u64_lz17h1c6fdfc79fede3f3E:
	.fnstart
	.save	{r4, r6, r7, lr}
	push	{r4, r6, r7, lr}
	.setfp	r7, sp, #8
	add	r7, sp, #8
	rsbs	r2, r1, #0
	adcs	r2, r1
	cbz	r1, .LBB0_2
	mov	r0, r1
.LBB0_2:
	lsls	r4, r2, #5
	cbz	r0, .LBB0_4
	bl	__clzsi2
	adds	r0, r0, r4
	pop	{r4, r6, r7, pc}
.LBB0_4:
	movs	r0, #32
	adds	r0, r0, r4
	pop	{r4, r6, r7, pc}

It generates optimal code. A similar problem is happening on RISC-V and will probably still exist even after the effects from my PR make it into nightly. The problem is certainly in LLVM, and I found the LegalizerHelper::narrowScalarCTLZ function that appears to do the job in llvm/lib/CodeGen/GlobalISel/LegalizerHelper.cpp. There is also DAGTypeLegalizer::ExpandIntRes_CTLZ. However, they seem to do what my function is doing. I don't know what causes the problem. Does someone more familiar with LLVM know how to fix this?
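The same manual split can be written for the u128 case mentioned above. This is only a sketch of the idea with a hypothetical name, u128_lz; whether it actually produces better code than u128::leading_zeros on a given 64 bit target has not been checked here and would need the same kind of assembly comparison.

```rust
/// Leading zeros of a u128 computed from its two 64-bit halves.
pub fn u128_lz(x: u128) -> u32 {
    let hi = (x >> 64) as u64;
    if hi == 0 {
        // The upper half is all zeros, so count those 64 bits and
        // continue in the lower half.
        64 + (x as u64).leading_zeros()
    } else {
        hi.leading_zeros()
    }
}
```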

AaronKutch · 2020-08-14T20:40:11Z

I published specialized-div-rem 1.0.0 and made minor documentation adjustments. I'm confident enough to use unreachable_unchecked for divisions by zero. It will save time on the critical path. The only extra thing I want to add is __divmodti4. Even though LLVM does not have it yet, it should be included since GCC has it. Does this require changes to the testing setup?
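A minimal sketch of the unreachable_unchecked pattern, assuming a caller that guarantees a nonzero divisor. The function name u64_div_rem_unchecked is hypothetical and the body uses the native operators rather than any of the algorithms from this PR.

```rust
use core::hint::unreachable_unchecked;

/// Division that tells the optimizer the divisor is never zero.
///
/// # Safety
/// The caller must guarantee `div != 0`; otherwise behavior is undefined.
pub unsafe fn u64_div_rem_unchecked(duo: u64, div: u64) -> (u64, u64) {
    if div == 0 {
        // SAFETY: ruled out by the caller's contract. This lets the
        // compiler drop the zero check and the panic path entirely.
        unsafe { unreachable_unchecked() }
    }
    (duo / div, duo % div)
}
```

The trade-off is that a zero divisor becomes undefined behavior instead of a panic, so the guarantee has to hold at every call site.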

Amanieu · 2020-08-15T15:53:26Z

I don't think any changes to testing are required.

AaronKutch · 2020-08-31T21:31:51Z

I finally found a potentially good way to reimplement my algorithms in terms of generics that involves replacing the current LargeInt trait with two more flexible traits (see issue #367). Unfortunately, I have very little free time at the moment, and it will probably be a few weeks before I can accumulate enough free time to finish this. Is there anything else preventing merging this PR as is? I don't know if it is better to merge now or wait a potentially long time.

Amanieu · 2020-09-03T19:52:34Z

Let's merge this for now.

leonardo-m · 2020-09-11T17:43:24Z

I guess we should close this issue down?
rust-lang/rust#39078

Amanieu · 2020-09-11T19:45:05Z

@leonardo-m You can close it if you feel that the performance improvements are enough to resolve the issue.

AaronKutch · 2020-09-12T03:06:31Z

When do the performance changes make it into nightly? The performance does not seem to have improved yet.

est31 · 2020-09-12T03:16:39Z

It used to be a git submodule, but nowadays it's pushed to crates.io apparently, from where the compiler pulls it, see https://github.com/rust-lang/compiler-builtins/blob/master/PUBLISHING.md

So the answer is: whenever a new version is published on crates.io, and rustc updates it. I guess you could try making PRs for both.

est31 · 2020-09-12T03:18:13Z

Wait, there is apparently no automation, so you can't do it yourself, but need to ask a maintainer to do it.

AaronKutch · 2020-09-12T17:26:06Z

I just realized that I need to tweak the inlining more after benchmarking the div only function versus divmod in my specialized-div-rem crate. I thought I remembered doing this before and only getting a slight perf increase, but it looks like the difference is actually about 2x, probably due to the function call overhead.

leonardo-m · 2020-09-12T17:37:38Z

See also: rust-lang/rust#54867

Update `compiler_builtins` to 0.1.36

So, the libc build with cargo's `build-std` feature emits a lot of warnings like:

```
warning: a method with this name may be added to the standard library in the future
   --> /home/runner/.cargo/registry/src/github.com-1ecc6299db9ec823/compiler_builtins-0.1.35/src/int/udiv.rs:98:23
    |
98  |     q = n << (<$ty>::BITS - sr);
    |               ^^^^^^^^^^^
...
268 |     udivmod_inner!(n, d, rem, u128)
    |     ------------------------------- in this macro invocation
    |
    = warning: once this method is added to the standard library, the ambiguity may cause an error or change in behavior!
    = note: for more information, see issue rust-lang#48919 <rust-lang/issues/48919>
    = help: call with fully qualified syntax `Int::BITS(...)` to keep using the current method
    = help: add `#![feature(int_bits_const)]` to the crate attributes to enable `num::<impl u128>::BITS`
    = note: this warning originates in a macro (in Nightly builds, run with -Z macro-backtrace for more info)
```

(You can find the full log in https://github.com/rust-lang/libc/runs/1283695796?check_suite_focus=true for example.)

0.1.36 contains rust-lang/compiler-builtins#332 so this version should remove this warning.

cc rust-lang/libc#1942

Set the MSRV to 1.63

AaronKutch force-pushed the issue-265 branch from 42a8a95 to b03ce24 on December 19, 2019 00:57

AaronKutch mentioned this pull request Dec 19, 2019

Better rustfmt Robbepop/apint#51

Merged

bjorn3 reviewed Dec 19, 2019

src/int/specialized_div_rem/mod.rs (outdated)

bjorn3 reviewed Dec 19, 2019

AaronKutch mentioned this pull request Jan 14, 2020

Inline assembly rust-lang/rfcs#2850

Closed

AaronKutch force-pushed the issue-265 branch from 0d276ab to 71aba93 on January 28, 2020 17:34

AaronKutch force-pushed the issue-265 branch from 74192a9 to c9b42ce on February 2, 2020 05:39

AaronKutch force-pushed the issue-265 branch from c9b42ce to 30b6b45 on May 4, 2020 03:46

AaronKutch added 2 commits July 28, 2020 13:46

Remove erroneous aapcs_on_arm and add maybe_use_optimized_c_shim

752ab52

Remove unused code

6aef025

AaronKutch force-pushed the issue-265 branch from ae16fc2 to 16fe7ae on July 28, 2020 18:56

AaronKutch added 3 commits August 14, 2020 15:31

Use specialized-div-rem 1.0.0 for division algorithms

1621c6d

Change inlining to favor three underlying division functions

0e6d75d

Use unreachable_unchecked

bc06465

AaronKutch force-pushed the issue-265 branch from 16fe7ae to bc06465 on August 14, 2020 20:32

AaronKutch mentioned this pull request Aug 16, 2020

#[aapcs_on_arm] is applied to two functions with no floating point #373

Closed

Add __divmodti4

26fe6ff

Amanieu merged commit 1220e67 into rust-lang:master Sep 3, 2020

AaronKutch mentioned this pull request Sep 12, 2020

There exists significantly faster division algorithms for certain CPUs #265

Closed

alexcrichton mentioned this pull request Oct 5, 2020

Division tweaks #380

Merged

JohnTitor mentioned this pull request Oct 21, 2020

Update compiler_builtins to 0.1.36 rust-lang/rust#78209

Merged

AaronKutch deleted the issue-265 branch March 30, 2021 04:50

nicholasbishop mentioned this pull request Jun 20, 2021

u128 division not working on x86_64-unknown-uefi rust-lang/rust#86494

Closed

tgross35 added a commit to tgross35/compiler-builtins that referenced this pull request Feb 23, 2025

Merge pull request rust-lang#332 from tgross35/msrv-test

6325b92

Set the MSRV to 1.63
