Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bad codegen for array::map-based code #102202

Closed
newpavlov opened this issue Sep 23, 2022 · 15 comments
Closed

Bad codegen for array::map-based code #102202

newpavlov opened this issue Sep 23, 2022 · 15 comments
Labels
A-codegen Area: Code generation C-bug Category: This is a bug. I-slow Issue: Problems and improvements with respect to performance of generated code.

Comments

@newpavlov
Copy link
Contributor

array::map results in a bad codegen compared to manually unrolled code. This issue can be demonstrated by the following code: https://rust.godbolt.org/z/T7bnTE398

Relevant issue: #86912

@newpavlov
Copy link
Contributor Author

A more straightforward implementation of array::map makes codegen identical: https://rust.godbolt.org/z/a9csPcnob But it's unclear how to properly make it panic-safe in regards to f, i.e. we need a way to properly drop created Us after f has panicked.

@the8472
Copy link
Member

the8472 commented Sep 23, 2022

We could special-case based on needs_drop::<U>?

@veber-alex
Copy link
Contributor

There is another issue here, the simple map function produces very bad assembly on nightly.

@newpavlov
Copy link
Contributor Author

@the8472
No, I think we simply need something like ReadBuf to track number of initialized elements. With no-drop types the additional code should get eliminated.

@veber-alex
It looks like a clear nightly regression. I will open an issue for it.

@the8472
Copy link
Member

the8472 commented Sep 23, 2022

I think we simply need something like ReadBuf to track number of initialized elements

The Guard type does just that.
The thing is that the surrounding code is super-generic and relies heavily on the optimizer. That it runs on two loop conditions (0..N and iter.next()) probably doesn't help, LLVM is bad at optimizing that kind of thing.

@newpavlov
Copy link
Contributor Author

I wrote variation of the Guard and it properly compiles into expected assembly. Dunno what exactly makes the difference.

@Rageking8
Copy link
Contributor

@rustbot label +A-codegen +C-bug

@rustbot rustbot added A-codegen Area: Code generation C-bug Category: This is a bug. labels Sep 24, 2022
@InBetweenNames
Copy link

What would it take to get @newpavlov's variation upstreamed? When dealing with vectorized code, it would be really nice to not have to unroll everything manually and use just use map.

@memoryruins
Copy link
Contributor

Testing with rustc 1.67.0-nightly (e1d819583 2022-12-05), the initial example using <[T; N]>::map in foo2 produces the same codegen as the manually unrolled code. Whichever change improved its codegen has not reached beta yet.

@nikic nikic added the I-slow Issue: Problems and improvements with respect to performance of generated code. label Dec 7, 2022
@HadrienG2
Copy link

HadrienG2 commented Mar 2, 2023

Another performance pitfall of array::map is that it isn't marked #[inline], and sometimes (think codegen-units != 1), the compiler will take this opportunity to fail to inline it, resulting in disaster when the mapping is very simple and at the bottom of a hot loop.

This wouldn't be so bad if it only concerned array::map, which performance-conscious people should use sparingly anyway (overly enthusiastic use tends to create lots of data movement to and from the stack, which LLVM can have a hard time optimizing). But the problem as far as I am concerned is that array::map is also currently a building block of array::from_fn, which is pretty much the standard way to safely initialize arrays of inhomogeneous values and unknown length in current Rust. And not being able to use that in performance-sensitive code is a much greater loss.

I think either array::map should be made #[inline], or the implementation of array::from_fn should be changed to refrain from relying on array::map.

@newpavlov
Copy link
Contributor Author

The code gets compiled into expected assembly on beta, so I think this issue can be closed? Or should we wait for 1.68 to be released?

@HadrienG2
Can you provide a small example which would demonstrate this issue? I thought that because of monomorphization (array::map is generic over both N and F) #[inline] on generic functions is generally redundant.

@HadrienG2
Copy link

HadrienG2 commented Mar 3, 2023

@newpavlov Monomorphization does not help you if two functions (which may be e.g. two instances of the same generic function) end up calling array::from_fn with the same callable and the same size. I'll roll out a little godbolt later, am on a phone right now.

@HadrienG2
Copy link

After struggling a while to reproduce this with godbolt, I realized that I couldn't because godbolt uses --emit-asm and --emit-asm forces codegen-units = 1, whereas this is a problem that requires multiple codegen units to reproduce.

So here's something that works locally:

  1. Create a tiny lib with this code
#[inline(always)]
fn round_trip(input: [usize; 4]) -> [usize; 4] {
    let mut iter = input.into_iter();
    std::array::from_fn(|_| iter.next().unwrap())
}

pub fn make_array(input: [usize; 4]) -> usize {
    round_trip(input)[1]
}

pub fn make_array_again(input: [usize; 4]) -> usize {
    round_trip(input)[0]
}
  1. Compile with multiple codegen units (I reproduced this with 16, but you probably need much less).

Although round_trip is correctly inlined, no inlining of the array::map inside array::from_fn is performed.

@HadrienG2
Copy link

I suspect the issue here might be that codegen-units splits code into compilation units based on how complex it is before optimization, whereas in the case of iterator-based code (or more generally from_fn or map with callables that use sufficiently complex optimizations), the code is pretty complex before optimization but will become very simple afterwards.

Which might explain why codegen-units tends to break "zero-cost abstractions" in my code unless inline is used very liberally in their implementation: these abstractions only turn into efficient code after optimization passes.

Resolving this true underlying issue that keeps biting me over and over again would be a largely separate issue that would require a mechanism to feed back information about compiled machine code to the compiler frontend so that in release mode + incremental builds, it can take more informed decisions about upfront splitting into codegen units. Think PGO, but without execution. That's likely very complex to implement though, so for now #[inline] on everything that should be a zero cost abstraction + transitively called functions is a Good Enough Hack.

@newpavlov
Copy link
Contributor Author

I will close this issue, since the particular problem referenced by the OP got resolved on Beta (I assume by a newer LLVM version). But in some situations codegen for array::map can be still quite suboptimal as discussion here and the stack usage issue demonstrate.

@HadrienG2
I think you should open a separate issue with request to add inline attributes on the methods.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-codegen Area: Code generation C-bug Category: This is a bug. I-slow Issue: Problems and improvements with respect to performance of generated code.
Projects
None yet
Development

No branches or pull requests

9 participants