
Iter with step_by(2) performs slowly #59281

Closed
wooleyra opened this issue Mar 18, 2019 · 17 comments
Labels
I-slow Issue: Problems and improvements with respect to performance of generated code. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue.

Comments

@wooleyra

wooleyra commented Mar 18, 2019

Comparing the performance between these similar functions, the one that uses the iterator with the step_by(2) function performs significantly slower than the one using a traditional "while loop".

fn is_prime_iter(num: u32) -> bool {
    if num == 2 {
        return true;
    }
    if num % 2 == 0 {
        return false;
    }
    let mut is_prime = true;
    for i in (3..num).step_by(2) {
        if num % i == 0 {
            is_prime = false;
            break;
        }
    }
    is_prime
}

fn is_prime_loop(num: u32) -> bool {
    if num == 2 {
        return true;
    }
    if num % 2 == 0 {
        return false;
    }
    let mut is_prime = true;
    let mut i = 3;
    while i < num {
        if num % i == 0 {
            is_prime = false;
            break;
        }
        i += 2;
    }
    is_prime
}

Running the experimental cargo bench --all-targets produced the following results for me using the nightly-x86_64-unknown-linux-gnu toolchain (1.35.0-nightly):

// benchmark value: 15485867
test tests::bench_is_prime_iter  ... bench:  26,813,870 ns/iter (+/- 337,054)
test tests::bench_is_prime_loop ... bench:  21,607,357 ns/iter (+/- 242,028)
@estebank estebank added I-slow Issue: Problems and improvements with respect to performance of generated code. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. labels Mar 18, 2019
@hellow554
Contributor

hellow554 commented Mar 19, 2019

Interestingly enough, if you use all instead of a manual if/break it gets even slightly faster:

# cargo bench
    Finished release [optimized] target(s) in 0.01s
     Running target/release/deps/foo-8f4575d6c09c1e6c

running 3 tests
test allbar ... bench:      89,120 ns/iter (+/- 2,359)
test iter   ... bench:     109,650 ns/iter (+/- 3,717)
test loop   ... bench:      92,457 ns/iter (+/- 2,685)

(Playground)
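For reference, since the actual code is behind the Playground link, the all-based variant presumably looks something like this sketch (the function name is mine):

```rust
// Sketch of the `all`-based variant (assumed; the original is behind the
// Playground link). `all` uses internal iteration: the iterator drives the
// loop itself instead of being stepped by a manual if/break, which often
// optimizes better for tiny loop bodies.
fn is_prime_all(num: u32) -> bool {
    if num == 2 {
        return true;
    }
    if num % 2 == 0 {
        return false;
    }
    // Prime iff no odd i in [3, num) divides num (same logic as the
    // original functions, including their treatment of 0 and 1).
    (3..num).step_by(2).all(|i| num % i != 0)
}
```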

@scottmcm
Member

For tiny loop bodies like this it's not unusual for internal iteration to be faster -- chain is the same.

@nikic
Contributor

nikic commented Mar 25, 2019

Godbolt: https://rust.godbolt.org/z/LRTDre

Also related: #57517

@steveklabnik
Member

Triage: no change

@jdahlstrom

jdahlstrom commented Oct 24, 2021

I ran into this while implementing the inner loop of a rasterizer. Internal iteration or not, step_by generates quite inefficient assembly even when iterating over a simple integer range [1] or a slice. In the case of an inclusive range, the difference to a while loop is quite extreme! [2]

I find it a bit unfortunate that there's no obvious equivalent of the "classic" FOR i = a TO b STEP c (i.e. for(int i = a; i < b; i += c)) loop, given that step_by is not zero-cost and a while loop doesn't express the intent as clearly.

Edit: apparently there was a fix proposed, but it had to be reverted :( [3]

Footnotes

  1. https://rust.godbolt.org/z/5hYsr3rWf

  2. https://rust.godbolt.org/z/WqE7xW6aM

  3. https://internals.rust-lang.org/t/more-about-step-by/12436/14

@scottmcm
Member

scottmcm commented Oct 24, 2021

@jdahlstrom Make sure you compare equivalent semantics. foo_while in your 2nd footnote is an infinite loop if passed u32::MAX, whereas the RangeInclusive+step_by version will properly terminate.

Of course, the "classic" loop of for(int i = a; i < b; i += c) is UB in those scenarios, which is part of why it's so fast 🙃
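The semantic difference is easy to demonstrate at a smaller integer width (u8 here, so this sketch runs instantly):

```rust
// step_by over an inclusive range terminates correctly even when the
// range reaches the type's maximum value.
fn main() {
    let odds = (1u8..=u8::MAX).step_by(2).count();
    assert_eq!(odds, 128); // 1, 3, ..., 255

    // The naive while-loop equivalent never exits, because `i <= u8::MAX`
    // is always true, and `i += 2` overflows at i == 255 (a panic in
    // debug builds, silent wrapping in release):
    //
    //     let mut i = 1u8;
    //     while i <= u8::MAX { i += 2; } // never terminates
}
```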

@jdahlstrom

jdahlstrom commented Oct 25, 2021

@scottmcm True, and I appreciate that Rust does the right thing there, but it still seems unfortunate that the penalty for doing the right thing is so great in the common case (where u32::MAX is often not in the function's "practical" domain – doubly so if we're talking usize instead!). And unfortunately, LLVM isn't smart enough to generate better code even if we make the special case(s) impossible by adding an appropriate assert! or by iterating over a range of a wider type (i.e. 0..max as u64 in the example) :(

(On the other hand, widening the induction variable of the while version in the specific case of the example actually makes LLVM ditch the loop altogether and utilize a variation of the max*(max-1)/2 formula instead!)

@scottmcm
Member

scottmcm commented Oct 25, 2021

> And unfortunately LLVM isn't smart enough to generate better code even if we make the special case(s) impossible by adding an appropriate assert!

This is a really interesting one. It looks like (https://rust.godbolt.org/z/15W97roTo) LLVM has trouble turning the uadd.with.overflow into just a normal addition, even though it should know the value is in range because of the assert.

Might be worth experimenting with that a bit to see whether there's a reduced example that could be an A-llvm issue.

The "obvious" version it does just fine (<https://rust.godbolt.org/z/4jhj6WTdG>), so there's something confounding going on here.

@jonasmalacofilho

jonasmalacofilho commented Oct 25, 2021

Note: this comment has been edited to fix an incorrect assumption – namely, that it was rustc, not LLVM, that was optimizing the optimized() version.


> The "obvious" version it does just fine (https://rust.godbolt.org/z/4jhj6WTdG) so there's something confounding going on here.

It seems that the "obvious" version is optimized by rustc, not LLVM, and that this depends on the compiler also knowing (in the function body and before the assert) the value being added to x.

/// Optimized as long as the compiler knows:
/// - (a) the value of y;
/// - and (b) that x <= u32::MAX - y.
/// Note that the order of (a) and (b) also matters.
pub fn optimized(x: u32, y: u32) -> u32 {
    assert!(y == 42); // cannot be moved down
    assert!(x <= u32::MAX - y);

    x.checked_add(y).unwrap()
}

/// The check for overflow is not yet optimized away, even though
/// x <= u32::MAX - y.
pub fn not_optimized_yet(x: u32, y: u32) -> u32 {
    assert!(x <= u32::MAX - y);

    x.checked_add(y).unwrap()
}

https://rust.godbolt.org/z/rWvrqjME1

(I used checked_add instead of overflowing_add, but I don't think that matters.)

@scottmcm
Member

> It seems that the "obvious" version is optimized by rustc, not LLVM

Is it? In https://rust.godbolt.org/z/Yov3GfTqc the MIR still contains the call to core::num::<impl u32>::overflowing_add.

@jonasmalacofilho

> Is it? In https://rust.godbolt.org/z/Yov3GfTqc the MIR still contains the call to core::num::<impl u32>::overflowing_add.

My reasoning was based on my optimized example not calling @llvm.uadd.with.overflow.i32 in the LLVM IR. But yes, both the optimized and non-optimized examples contain calls to core::num::<impl u32>::checked_add() in the MIR, and the same happens in your optimized example.

But isn't LLVM IR the boundary between rustc and LLVM?

@scottmcm
Member

> But isn't LLVM IR the boundary between rustc and LLVM?

rustc emits LLVM-IR, yes, but if optimizations are on then the --emit=llvm-ir will show the IR after LLVM's optimization passes have run.

You can pass opt-level=0 to see the often-horrifying IR that rustc directly produces.

@jonasmalacofilho

> rustc emits LLVM-IR, yes, but if optimizations are on then the --emit=llvm-ir will show the IR after LLVM's optimization passes have run.

Oh, I didn't know that : \

> You can pass opt-level=0 to see the often-horrifying IR that rustc directly produces.

A bit off topic (and please feel free to hide this comment), but wouldn't that also affect optimizations performed by rustc?

For example, I notice that with opt-level=0 the subtraction in assert!(x <= u32::MAX - y); apparently gets compiled with overflow checking, like in debug builds.

Would -C llvm-args=-opt-bisect-limit=0 be better to see the IR generated by rustc at the desired opt-level?

@scottmcm
Member

> apparently gets compiled with overflow checking, like in debug builds.

You can pass -C debug-assertions=n to change that one. I'm not sure what MIR opts are configured by optimization level; one can always pass -Z mir-opt-level=1 to turn those back on. (Most of them are off even in optimized builds, though, IIRC.)

The zero-optimization-fuel approach is interesting. I don't know if all the LLVM optimizations actually hook into that properly. If they do it sounds like an interesting approach, though.

@nikic
Contributor

nikic commented Oct 25, 2021

> > rustc emits LLVM-IR, yes, but if optimizations are on then the --emit=llvm-ir will show the IR after LLVM's optimization passes have run.
>
> Oh, I didn't know that : \
>
> > You can pass opt-level=0 to see the often-horrifying IR that rustc directly produces.
>
> A bit off topic (and please feel free to hide this comment), but wouldn't that also affect optimizations performed by rustc?
>
> For example, I notice that with opt-level=0 the subtraction in assert!(x <= u32::MAX - y); apparently gets compiled with overflow checking, like in debug builds.
>
> Would -C llvm-args=-opt-bisect-limit=0 be better to see the IR generated by rustc at the desired opt-level?

-C no-prepopulate-passes can be used for this purpose.

@jrmuizel
Contributor

jrmuizel commented Feb 7, 2022

Comparing checked_add in a loop vs step_by shows there's something special going on with step_by beyond the checked_add:
https://rust.godbolt.org/z/79aTac847
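The comparison presumably boils down to something like the following sketch (the code on godbolt isn't inlined above; function names and exact shapes here are mine). Both functions sum the even numbers below n:

```rust
// Hypothetical reconstruction of the comparison: a manual loop that
// steps with checked_add versus the step_by adapter.
pub fn sum_checked_add(n: u32) -> u32 {
    let mut sum = 0u32;
    let mut i = 0u32;
    while i < n {
        sum = sum.wrapping_add(i);
        // Step manually, with the same overflow guard step_by must perform.
        match i.checked_add(2) {
            Some(next) => i = next,
            None => break,
        }
    }
    sum
}

pub fn sum_step_by(n: u32) -> u32 {
    (0..n).step_by(2).fold(0u32, |s, i| s.wrapping_add(i))
}
```

The point of the comparison is that even when the manual loop pays for the same overflow check, step_by historically still produced worse code, so the checked addition alone doesn't explain the gap.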

@the8472
Member

the8472 commented Sep 3, 2023

#111850 fixed this.

https://rust.godbolt.org/z/h1M73Tf34

The iter and loop assembly look a bit different but in both cases there are just 3 branches in the loop:

  • panic
  • loop exit
  • backedge

A big improvement compared to the rustc 1.33 version, which had tons of branches.

The benchmarks from #59281 (comment) look identical now:

test allbar ... bench:     133,624 ns/iter (+/- 1,487)
test iter   ... bench:     133,673 ns/iter (+/- 1,271)
test r#loop ... bench:     133,629 ns/iter (+/- 1,067)

So I think this can be closed.

@the8472 the8472 closed this as completed Sep 3, 2023