Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Greatly improve division performance for u128 and other cases #332

Merged
merged 8 commits into from
Sep 3, 2020

Conversation

AaronKutch
Copy link
Contributor

This cleans up udiv.rs and sdiv.rs by replacing their implementations with my optimized division functions from my specialized-div-rem crate. On x86_64, you can expect u128 division performance improvements of 300% to 1100% depending on what ranges of inputs you use. I am also sure that division performance is improved on 32 bit platforms, but someone needs to benchmark it just in case.

@AaronKutch
Copy link
Contributor Author

I forgot to run rustfmt. However, I get the error

error: couldn't read \\?\C:\...GitHub\compiler-builtins\src\..\libm\src\math\mod.rs: The filename, directory name, or volume label syntax is incorrect. (os error 123)
 --> \\?\C:\Users\Allen\Documents\GitHub\compiler-builtins\src\math.rs:3:5
  |
3 | mod libm;
  |     ^^^^

when I try to run cargo fmt

@alexcrichton
Copy link
Member

Thanks for the PR!

Unfortunately due to the way this repository is set up with CI and how it integrates into the compiler this crate is unable to take on any dependencies. Could the code be copied into this crate?

@AaronKutch
Copy link
Contributor Author

Do I run my rustfmt nightly directly on the files? What's the cause of i686 failing?

// `debug-assertions = true`

// This replicates `carrying_mul` (rust-lang rfc #2417). LLVM correctly optimizes this
// to use a widening multiply to 128 bits.
Copy link
Member

@bjorn3 bjorn3 Dec 19, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But cg_clif won't, causing an infinite loop.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also is it optimized correctly on archs other than x86?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I presume cg_clif is the cranelift backend. I would understand that cranelift can't optimize it into a single instruction, but what do you mean by infinite loop?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Never mind, thought this function was for multiplication, in which case the wrapping_mul would end up calling this function again.

I presume cg_clif is the cranelift backend.

Indeed

src/int/specialized_div_rem/mod.rs Outdated Show resolved Hide resolved
@AaronKutch
Copy link
Contributor Author

AaronKutch commented Dec 19, 2019

For a summary from some assembly references:

Architectures with both a widening mul and asymmetric div:
x86-64: 128 bit mulq and divq
x86: 64 bit mul and div

Architectures with only a widening mul, we only need to check that LLVM is using the widening intructions correctly:
riscv32, riscv64
armv7, armv8

MIPS is interesting and I think we should just use the default division hierarchy of _binary_long <- _delegate <- _trifecta and let LLVM do whatever it wants and not worry the performance of that case.

I found a reference nested inside the manual for some IBM assembler that mentions concatenated registers for a division operation, but it is for power and not powerpc if I am reading it correctly. I cannot even find something on widening multiplication for powerpc64.

I could not find widening multiplication for webassembly anywhere, the closest I got to a assembly reference is this. I do not know what to do in the case of Wasm.

That's all the notable architectures. I know for certain it is worth it to do the inline assembly and use _asymmetric for u128 on x86_64, but I do not know if it is worth it to do it for anything else.

@AaronKutch
Copy link
Contributor Author

AaronKutch commented Dec 19, 2019

Is there some tests we already have or could have to test that widening multiplications optimizations work on notable architectures? It is the most important factor in the performance of some algorithms. Maybe we should move widening_mul rfc #2417 forward with some kind of solution that is optimized whether we work with LLVM or the cranelift backends?

@AaronKutch
Copy link
Contributor Author

wasm32 made it but i686 is still failing

@bjorn3
Copy link
Member

bjorn3 commented Dec 20, 2019

error: process didn't exit successfully: rustc -vV (exit code: 0xc000007b)

According to https://stackoverflow.com/questions/3378616/the-application-failed-to-initialize-properly-0xc000007b this exit code corresponds to STATUS_INVALID_IMAGE_FORMAT.

@AaronKutch
Copy link
Contributor Author

I remembered encountering an error before with error: process didn't exit successfully: rustc -vV, and it had to do with a windows-gnu toolchain that I was running through MSYS2. the PATH variable inside MSYS2 had not been set to include important dlls like libstdc++-6.dll. There was a popup window on the first run that mentioned it, but afterwards only the obscure error: process didn't exit successfully: rustc -vV showed. How is compiler-builtins introduced? Are the symbols brought in via a dll like mechanism, and it is missing these?

@bjorn3
Copy link
Member

bjorn3 commented Dec 20, 2019

The error occurs when cargo wants to know the rustc version. It is unrelated to the changes in this PR.

@AaronKutch
Copy link
Contributor Author

AaronKutch commented Dec 20, 2019

It gave me the error when I ran rustc --version
edit: I just reproduced it, and actually the error message is:

$ rustc --version
C:/Users/.../.cargo/bin/rustc.exe: error while loading shared libraries: ?: cannot open shared object file: No such file or directory

but I remember getting the more obscure error with rustc -vV somewhere

@alexcrichton
Copy link
Member

I believe that was a transient error at the time and have requeued checks.

@AaronKutch
Copy link
Contributor Author

Any updates on this?

@alexcrichton
Copy link
Member

Sorry this fell off my radar. This unfortunately though is a pretty huge PR to a lot of code I did not write myself nor do I fully understand. I can't quite tell what's a functional change and what's just a refactoring.

I'm sort of naively expecting a fast path to crop up here or there with inline asm to do various things, but it doesn't look like this PR does that. Would it be possible to pare this down to not contain organizational changes, or if there are organizational changes separate them out into commits for easier review?

@AaronKutch
Copy link
Contributor Author

The code I am replacing is a very slow binary long division algorithm that was presumably a MVP to get u128 stabilized. I completely replaced the functional part. I can't really factor out more because the old algorithm had code in different modules and inside the extern functions, and the new algorithm is self contained but depends on the whole subalgorithm chain working at once. The old blocks of extern functions also had a weird ordering and inconsistent docs which I fixed.

Also, does the #[maybe_use_optimized_c_shim] attribute replace my code with different code? I am curious about the performance differences. I have test functions and benchmarking functions, but where should I put those?

I'm sort of naively expecting a fast path to crop up here or there with inline asm to do various things, but it doesn't look like this PR does that.

I do use inline asm to optimize for x86_64. The only other place that I see inline asm being useful is for binary long division. The _binary_long algorithm I have is at least faster than the original even without assembly. I don't know if it is worth it to add assembly

Also, cargo fmt is not working for me. It gives me the error

Error: Decoding config file failed:
invalid type: integer `2018`, expected string for key `edition`
Please check your config file.

@AaronKutch
Copy link
Contributor Author

Nevermind about the formatting problem, it isn't caused by compiler-builtins. I have run rustup to update to the latest nightly, but still getting the error on all my crates. Where is the mentioned configuration file? I haven't modified RUSTUP_HOME or done anything special.

@amosonn
Copy link

amosonn commented Feb 1, 2020

My guess: your Cargo.toml (at the crate root) has for some reason edition = 2018 instead of edition = "2018".

@AaronKutch
Copy link
Contributor Author

I have no idea why this is, but trying to run cargo fmt on a crate that is under .../Desktop/ (Windows 10) will always cause that error. I simply moved the crate to .../Documents/GitHub and it started producing the original error messages.

If I run cargo fmt at the root of the crate, I get:

error: couldn't read \\?\C:\Users\Aaron Kutch\Documents\GitHub\compiler-builtins\src\..\libm\src\math\mod.rs: The filename, directory name, or volume label syntax is incorrect. (os error 123)
 --> \\?\C:\Users\Aaron Kutch\Documents\GitHub\compiler-builtins\src\math.rs:3:5
  |
3 | mod libm;
  |     ^^^^

This corresponds to these lines:

#[allow(dead_code)]
#[path = "../libm/src/math/mod.rs"]
mod libm;

I have no idea what the purpose of these, and why a #[allow(dead_code)] is needed. There is no documentation around them.

If I change my directory to the testcrate, cargo fmt runs fine.
If I change my directory to src, cargo fmt fails to find targets.
If I try to run rustfmt on src, it appears to hang. Is this a bug?
If I run rustfmt on src/int/mod.rs, it finally works.

@AaronKutch
Copy link
Contributor Author

Clippy is showing 55 warnings. Can I modernize compiler-builtins, or does it need to stay 2015 edition for backwards compatibility?

@AaronKutch
Copy link
Contributor Author

I was accidentally bringing in a dependency on a panic function. All the checks are passing now.

@AaronKutch
Copy link
Contributor Author

@alexcrichton I think my algorithms should be looked at by some numerics minded person. Do you know anyone around the LLVM project or Rust project that might might provide some input? I wonder if any of my work should be upstreamed or there is something I missed.

Also, there are some questions in this thread that haven't been answered yet. I have a fuzz tester specifically designed for division algorithms that should be put somewhere. Should it be put in compiler-builtins or somewhere else for integration testing?

@alexcrichton
Copy link
Member

Unfortunately I do not know of folks to help review this myself, but I agree it would be good to get more review before landing. I think adding CI is fine so long as it's not too burdensome.

@AaronKutch
Copy link
Contributor Author

I haven't been able to work on this until now because of college, but now I can. Do I need to do a full compiler bootstrap to test performance myself? I am wondering about the performance impact of flags like maybe_use_optimized_c_shim and want to see which is faster.

@AaronKutch
Copy link
Contributor Author

I am done with my round of changes. I think I have resolved all issues except for the one where I should use LargeInt for splitting up integers. I think I will go with the position that I will keep my code style as is, and follow up PRs can be made if someone doesn't like the style.
I have rebased on my recently merged PR improving RISC-V leading_zeros. There is still a serious performance problem with u128::leading_zeros on 64 bit targets and u64::leading_zeros on 32 bit targets however. If I look at the assembly for cargo rustc --release --target=thumbv8m.base-none-eabi -- --emit asm

pub fn u64_lz(x: u64) -> u32 {
    x.leading_zeros()
}

It generates an insane amount of magic number juggling:

assembly
_ZN14aaron_test_lib6u64_lz17h1c6fdfc79fede3f3E:
	.fnstart
	.save	{r4, r5, r7, lr}
	push	{r4, r5, r7, lr}
	.setfp	r7, sp, #8
	add	r7, sp, #8
	mov	r5, r1
	mov	r4, r0
	mov	r0, r1
	bl	__clzsi2
	cbnz	r5, .LBB0_2
	lsrs	r0, r4, #1
	orrs	r0, r4
	lsrs	r1, r0, #2
	orrs	r1, r0
	lsrs	r0, r1, #4
	orrs	r0, r1
	lsrs	r1, r0, #8
	orrs	r1, r0
	lsrs	r0, r1, #16
	orrs	r0, r1
	mvns	r0, r0
	lsrs	r1, r0, #1
	movw	r2, #21845
	movt	r2, #21845
	ands	r2, r1
	subs	r0, r0, r2
	movw	r1, #13107
	movt	r1, #13107
	lsrs	r2, r0, #2
	ands	r0, r1
	ands	r2, r1
	adds	r0, r0, r2
	lsrs	r1, r0, #4
	adds	r0, r0, r1
	movw	r1, #3855
	movt	r1, #3855
	ands	r1, r0
	movw	r0, #257
	movt	r0, #257
	muls	r0, r1, r0
	lsrs	r0, r0, #24
	adds	r0, #32
.LBB0_2:
	pop	{r4, r5, r7, pc}

If I instead manually split the integer with the equivalent function:

pub fn u64_lz(x: u64) -> u32 {
    let mut x = x;
    let mut z = 0;
    if ((x >> 32) as u32) == 0 {
        z += 32;
    } else {
        x >>= 32;
    }
    z + (x as u32).leading_zeros()
}
assembly

_ZN14aaron_test_lib6u64_lz17h1c6fdfc79fede3f3E:
	.fnstart
	.save	{r4, r6, r7, lr}
	push	{r4, r6, r7, lr}
	.setfp	r7, sp, #8
	add	r7, sp, #8
	rsbs	r2, r1, #0
	adcs	r2, r1
	cbz	r1, .LBB0_2
	mov	r0, r1
.LBB0_2:
	lsls	r4, r2, #5
	cbz	r0, .LBB0_4
	bl	__clzsi2
	adds	r0, r0, r4
	pop	{r4, r6, r7, pc}
.LBB0_4:
	movs	r0, #32
	adds	r0, r0, r4
	pop	{r4, r6, r7, pc}

It generates optimal code. A similar problem is happening on RISC-V and will probably still exist even after the effects from my PR make it into nightly. The problem is certainly in LLVM, and I found the LegalizerHelper::narrowScalarCTLZ function that appears to do the job in llvm/lib/CodeGen/GlobalISel/LegalizerHelper.cpp. There is also DAGTypeLegalizer::ExpandIntRes_CTLZ. However, they seem to do what my function is doing. I don't know what causes the problem. Does someone more familiar with LLVM know how to fix this?

@AaronKutch
Copy link
Contributor Author

I published specialized-div-rem 1.0.0 and made minor documentation adjustments. I'm confident enough to use unreachable_unchecked for divisions by zero. It will save time on the critical path. The only extra thing I want to add is __divmodti3. Even though LLVM does not have it yet, it should be included since GCC has it. Does this require changes to the testing setup?

@Amanieu
Copy link
Member

Amanieu commented Aug 15, 2020

I don't think any changes to testing are required.

@AaronKutch
Copy link
Contributor Author

I finally found a potentially good way to reimplement my algorithms in terms of generics, that involves replacing the current LargetInt trait with two more flexible traits (see issue #367). Unfortunately, I have very little free time at the moment, and it will probably be a few weeks before I can accumulate enough free time to finish this. Is there anything else preventing merging this PR as is? I don't know if it is better to merge now or wait a potentially long time.

@Amanieu Amanieu merged commit 1220e67 into rust-lang:master Sep 3, 2020
@Amanieu
Copy link
Member

Amanieu commented Sep 3, 2020

Let's merge this for now.

@leonardo-m
Copy link

I guess we should close this issue down?
rust-lang/rust#39078

@Amanieu
Copy link
Member

Amanieu commented Sep 11, 2020

@leonardo-m You can close it if you feel that the performance improvements are enough to resolve the issue.

@AaronKutch
Copy link
Contributor Author

When do the performance changes make it into nightly? The performance does not seem to have improved yet.

@est31
Copy link
Member

est31 commented Sep 12, 2020

It used to be a git submodule, but nowadays it's pushed to crates.io apparently, from where the compiler pulls it, see https://github.com/rust-lang/compiler-builtins/blob/master/PUBLISHING.md

So the answer is: whenever a new version is published on crates.io, and rustc updates it. I guess you could try making PRs for both.

@est31
Copy link
Member

est31 commented Sep 12, 2020

Wait, there is apparently no automation, so you can't do it yourself, but need to ask a maintainer to do it.

@AaronKutch
Copy link
Contributor Author

I just realized that I need the tweak the inlining more after benchmarking the div only function versus divmod in my specialized-div-rem crate. I thought I remembered doing this before and only getting a slight perf increase, but it looks like the difference is actually about 2x, probably due to the function call overhead.

@leonardo-m
Copy link

See also: rust-lang/rust#54867

@alexcrichton alexcrichton mentioned this pull request Oct 5, 2020
Dylan-DPC-zz pushed a commit to Dylan-DPC-zz/rust that referenced this pull request Oct 22, 2020
Update `compiler_builtins` to 0.1.36

So, the libc build with cargo's `build-std` feature emits a lot of warnings like:
```
 warning: a method with this name may be added to the standard library in the future
   --> /home/runner/.cargo/registry/src/github.com-1ecc6299db9ec823/compiler_builtins-0.1.35/src/int/udiv.rs:98:23
    |
98  |             q = n << (<$ty>::BITS - sr);
    |                       ^^^^^^^^^^^
...
268 |         udivmod_inner!(n, d, rem, u128)
    |         ------------------------------- in this macro invocation
    |
    = warning: once this method is added to the standard library, the ambiguity may cause an error or change in behavior!
    = note: for more information, see issue rust-lang#48919 <rust-lang/issues/48919>
    = help: call with fully qualified syntax `Int::BITS(...)` to keep using the current method
    = help: add `#![feature(int_bits_const)]` to the crate attributes to enable `num::<impl u128>::BITS`
    = note: this warning originates in a macro (in Nightly builds, run with -Z macro-backtrace for more info)
```

(You can find the full log in https://github.com/rust-lang/libc/runs/1283695796?check_suite_focus=true for example.)

0.1.36 contains rust-lang/compiler-builtins#332 so this version should remove this warning.

cc rust-lang/libc#1942
JohnTitor added a commit to JohnTitor/rust that referenced this pull request Oct 25, 2020
Update `compiler_builtins` to 0.1.36

So, the libc build with cargo's `build-std` feature emits a lot of warnings like:
```
 warning: a method with this name may be added to the standard library in the future
   --> /home/runner/.cargo/registry/src/github.com-1ecc6299db9ec823/compiler_builtins-0.1.35/src/int/udiv.rs:98:23
    |
98  |             q = n << (<$ty>::BITS - sr);
    |                       ^^^^^^^^^^^
...
268 |         udivmod_inner!(n, d, rem, u128)
    |         ------------------------------- in this macro invocation
    |
    = warning: once this method is added to the standard library, the ambiguity may cause an error or change in behavior!
    = note: for more information, see issue rust-lang#48919 <rust-lang/issues/48919>
    = help: call with fully qualified syntax `Int::BITS(...)` to keep using the current method
    = help: add `#![feature(int_bits_const)]` to the crate attributes to enable `num::<impl u128>::BITS`
    = note: this warning originates in a macro (in Nightly builds, run with -Z macro-backtrace for more info)
```

(You can find the full log in https://github.com/rust-lang/libc/runs/1283695796?check_suite_focus=true for example.)

0.1.36 contains rust-lang/compiler-builtins#332 so this version should remove this warning.

cc rust-lang/libc#1942
@AaronKutch AaronKutch deleted the issue-265 branch March 30, 2021 04:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants