Suboptimal debug codegen in rand #43299
Closed
alexcrichton opened this issue Jul 17, 2017 · 8 comments
Labels
A-codegen Area: Code generation
C-enhancement Category: An issue proposing an enhancement or a PR with one.
I-compiletime Issue: Problems and improvements with respect to compile times.
T-compiler Relevant to the compiler team, which will review and decide on the PR/issue.

Comments

@alexcrichton (Member)

I found in one of my crates that, when compiling in debug mode (cargo build), the largest function in the whole executable looked like it was this one. That looked pretty innocuous, so I dug a little deeper. It turns out that function creates over 1000 alloca instructions, counted using this program as find.rs while compiling this source:

$ rustc +nightly -V
rustc 1.20.0-nightly (b2c070787 2017-07-13)
$ rustc +nightly foo.rs --emit llvm-ir --crate-type lib -Z time-passes && ./find < ./foo.ll
1137 allocas
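The find.rs helper itself isn't reproduced in this thread; as a rough, hypothetical stand-in, it only needs to count alloca instructions in the LLVM IR piped in on stdin, along these lines:

```rust
// Hypothetical stand-in for the `find` helper mentioned above: counts
// `alloca` instructions in LLVM IR read from stdin.
use std::io::{self, BufRead};

fn main() {
    let stdin = io::stdin();
    let count = stdin
        .lock()
        .lines()
        .filter_map(Result::ok)
        // `alloca` instructions look like `%x = alloca i64, align 8`.
        .filter(|line| line.contains("= alloca "))
        .count();
    println!("{} allocas", count);
}
```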

Trying to winnow it down further, this source contains 337 allocas:

$ rustc +nightly foo.rs --emit llvm-ir --crate-type lib -Z time-passes && ./find < ./foo.ll
337 allocas

For comparison, the equivalent C++ program (I think? my C++ isn't very good) has only 86 allocas:

$ clang++ -emit-llvm -S foo.cpp -std=c++11 && ./find < ./foo.ll
86 allocas

Interestingly enough, the C++ program compiles in ~0.035s whereas the Rust program compiles in ~0.2s, over 4x slower. The major passes in Rust were 0.07s in translation, 0.03s in expansion, and 0.01s in item-bodies checking and LLVM passes. As to where the other 0.08s went in -Z time-passes, I'm not sure!

@alexcrichton alexcrichton added the A-codegen Area: Code generation label Jul 17, 2017
@alexcrichton (Member, Author)

Also, it looks like if the inner block is repeated in place, Clang doesn't emit any more allocas (it's always 86), whereas Rust's number of alloca instructions grows linearly.

@aturon aturon added the I-compiletime Issue: Problems and improvements with respect to compile times. label Jul 17, 2017
@kennytm (Member) commented Jul 17, 2017

Changing a = a - e to a -= e (and all 24 operations) increases that to 433 allocas 😕
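For context, the two forms being compared look roughly like this; the function and variable names are illustrative rather than taken from the linked program, and the comment about SubAssign forwarding to Sub reflects how Wrapping's compound-assignment operators were implemented at the time:

```rust
use std::num::Wrapping;

// Form used in the original test program.
fn step_binop(a: &mut Wrapping<u64>, e: Wrapping<u64>) {
    *a = *a - e; // a single call to Sub::sub
}

// kennytm's variant using the compound-assignment operator.
fn step_assign(a: &mut Wrapping<u64>, e: Wrapping<u64>) {
    // SubAssign::sub_assign on Wrapping forwards to Sub::sub internally,
    // so an unoptimised build gets an extra call layer (and extra
    // temporaries, hence more allocas) per operation.
    *a -= e;
}

fn main() {
    let mut a = Wrapping(1u64);
    step_binop(&mut a, Wrapping(2));
    step_assign(&mut a, Wrapping(3));
    println!("{}", a.0);
}
```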

@oyvindln (Contributor) commented Jul 18, 2017

From a glance at the assembly, I suspect Range::next() might be one of the issues here (though there are probably several things). The function uses mem::swap to save the incremented counter value back into the iterator and return the previous value of the counter.

Now, mem::swap is simply a shim that calls ptr::swap_nonoverlapping, which ultimately calls the massive ptr::swap_nonoverlapping_bytes function. For larger types this is used to swap efficiently using SIMD, though for swapping two primitive numbers it's a bit of overkill. This will get optimised out in release mode, but in debug mode it results in a lot of redundant cruft. I presume swap is used since ranges could theoretically be used with something that is not a primitive number, though maybe the next function (and anything else currently using swap) should be specialised to avoid using swap for types where it's not needed. Even in release mode this seems like something that will create a lot of redundant work for the LLVM optimiser.

EDIT: This doesn't seem to be the main source of the alloca instructions inside the loops, but it's probably still worth looking into.
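For illustration, the pattern being described looks roughly like this; it is a paraphrase of the idea, not the actual libcore source:

```rust
use std::mem;

// Simplified stand-in for the Range<u64>::next described above.
fn next(start: &mut u64, end: u64) -> Option<u64> {
    if *start < end {
        // Compute the incremented counter, then mem::swap it into place so
        // the old value can be returned. In a debug build the swap bottoms
        // out in ptr::swap_nonoverlapping and drags in a lot of extra IR.
        let mut n = *start + 1;
        mem::swap(&mut n, start);
        Some(n)
    } else {
        None
    }
}

// For Copy types like the integers the swap is unnecessary; a plain copy
// sidesteps all of that machinery.
fn next_specialised(start: &mut u64, end: u64) -> Option<u64> {
    if *start < end {
        let n = *start;
        *start = n + 1;
        Some(n)
    } else {
        None
    }
}

fn main() {
    let (mut a, mut b) = (0u64, 0u64);
    assert_eq!(next(&mut a, 3), next_specialised(&mut b, 3));
    println!("{} {}", a, b);
}
```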

@oyvindln (Contributor) commented Jul 20, 2017

So there are some differences between the C++ version you posted and the Rust version.

Firstly, there is the loop. In the C++ version, the loop is over an array of 4 elements, rather than a range iterator as in Rust. See this. It looks as though it avoids using iterators entirely in this case (I don't see any calls to begin()/end() in the output), which makes it a bit cleaner than the Rust version, which has to iterate by calling next().
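As a rough illustration (not code from either linked program), the difference in looping style looks something like this in Rust terms:

```rust
// Illustrative comparison of the two looping styles being discussed.
fn sum_range_iterator(data: &[u64; 4]) -> u64 {
    let mut total = 0u64;
    // Each iteration goes through Range<usize>::next, which in a debug
    // build is a real call with its own temporaries.
    for i in 0..data.len() {
        total = total.wrapping_add(data[i]);
    }
    total
}

fn sum_manual_index(data: &[u64; 4]) -> u64 {
    let mut total = 0u64;
    let mut i = 0;
    // A manual index loop is closer to what the C++ range-for over a plain
    // array lowers to: no iterator object and no next() call per iteration.
    while i < data.len() {
        total = total.wrapping_add(data[i]);
        i += 1;
    }
    total
}

fn main() {
    let data = [1u64, 2, 3, 4];
    assert_eq!(sum_range_iterator(&data), sum_manual_index(&data));
}
```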

Secondly, there are some differences in the functions on the wrapping type. In Rust the methods on Wrapping are marked #[inline(always)], so they get inlined even when not optimizing. In the Rust version this results in redundant extra temporary copies being created (i.e. more alloca instructions) for every invocation of these methods. (This also explains the massive increase when manually unrolling.) In the C++ version, the temporaries are created inside the method call instead, which results in the IR having fewer alloca instructions.
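A minimal sketch of the effect being described, using a hypothetical wrapper type rather than the real std::num::Wrapping definition:

```rust
// Hypothetical wrapper mirroring std::num::Wrapping<u64>.
#[derive(Clone, Copy)]
struct Wrap(u64);

impl Wrap {
    // With #[inline(always)] the body, along with its temporaries, is pasted
    // into every caller even in a debug build, so each call site contributes
    // its own allocas.
    #[inline(always)]
    fn add_inlined(self, rhs: Wrap) -> Wrap {
        Wrap(self.0.wrapping_add(rhs.0))
    }

    // Without the attribute a debug build keeps this as an out-of-line call,
    // so the temporaries live in one copy of the function rather than at
    // every call site (closer to what the C++ version does).
    fn add_called(self, rhs: Wrap) -> Wrap {
        Wrap(self.0.wrapping_add(rhs.0))
    }
}

fn main() {
    let (a, b) = (Wrap(u64::MAX), Wrap(2));
    println!("{} {}", a.add_inlined(b).0, a.add_called(b).0);
}
```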

I made an altered C++ version that gives results closer to the Rust version, by force-inlining the method calls and calling wrapping add/sub through a function as in Rust (I didn't bother altering the loop iteration as that would be a bit more complex):

https://gist.github.com/oyvindln/cc65fc6c479d3347708be972798d49f7

EDIT: Fixed this (the other value in Wrapping::shr/shl is ANDed with this in the Rust version too, but I'm not sure what value pub const u64: u32 = i64; results in). It looks like the value is masked twice with the same value in the Rust code.

This gives me 286 allocas, which is a fair bit closer to the Rust version. Interestingly, the wrapping_{add/sub} functions have two allocas in C++ and only one in Rust. Maybe there is some basic processing that could be done during this inlining to avoid all the redundant copies here.

Another interesting observation is that both the Rust and C++ versions treat the wrapping type as though it is an aggregate, and thus access the inner value through a pointer into the struct. Maybe #[repr(transparent)] could help here to let LLVM know the struct can be treated as a u64, avoiding the getelementptr needed to access the value.
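A sketch of that suggestion, assuming a hypothetical wrapper type; at the time of this comment #[repr(transparent)] was still only a proposal, and whether it actually removes the getelementptr in debug builds is speculative:

```rust
// With #[repr(transparent)] the wrapper is guaranteed the same layout and
// ABI as its single field, so the backend can treat it as a plain u64
// rather than a one-field aggregate addressed through getelementptr.
#[repr(transparent)]
#[derive(Clone, Copy)]
struct Wrap(u64);

fn unwrap_value(w: Wrap) -> u64 {
    w.0
}

fn main() {
    println!("{}", unwrap_value(Wrap(42)));
}
```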

@alexcrichton (Member, Author)

Oh nice find on the #[inline(always)]!

I removed the annotations and the example dropped from 337 allocas to 129 allocas (yay!). Unfortunately, though, Rust still shows linear growth as the block is repeated while C++ still doesn't :(

@alexcrichton (Member, Author)

I've opened #43367 to remove inline(always) annotations.

@oyvindln (Contributor) commented Jul 21, 2017

Removing the redundant masking operation in Wrapping::shl/shr (which call wrapping_shl/shr, which already mask anyhow) would probably also help avoid one alloca for each of those functions; at least it did when tested with a local copy of them (I didn't try with the standard library itself).
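A paraphrase of the double-masking pattern being described (not the exact libcore source):

```rust
// The shift operator impl masks the shift amount, then calls wrapping_shl,
// which masks it again with the same value.
fn shl_masked_twice(value: u64, shift: u32) -> u64 {
    const SHIFT_MASK: u32 = 64 - 1;  // shift_max for u64
    let masked = shift & SHIFT_MASK; // first mask, in the operator impl
    value.wrapping_shl(masked)       // wrapping_shl applies the mask again
}

// Dropping the outer mask gives the same result with one operation (and,
// per the comment above, one alloca per call in a debug build) less.
fn shl_masked_once(value: u64, shift: u32) -> u64 {
    value.wrapping_shl(shift)
}

fn main() {
    assert_eq!(shl_masked_twice(1, 65), shl_masked_once(1, 65));
}
```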

@Mark-Simulacrum Mark-Simulacrum added the C-enhancement Category: An issue proposing an enhancement or a PR with one. label Jul 28, 2017
@jonas-schievink jonas-schievink added the T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. label Jan 31, 2020
@Mark-Simulacrum (Member)

We appear to generate a constant 46 allocas today in debug on stable for this program, including when repeating the body of the loop many times. I think this is fixed. The original program (random generator) produces a total of 103 allocas without filtering as well, so I think we're doing much better in this space.
