
Iter with step_by(2) performs slowly #59281

Closed
wooleyra opened this issue Mar 18, 2019 · 17 comments
Labels
I-slow Issue: Problems and improvements with respect to performance of generated code. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue.

Comments

@wooleyra

wooleyra commented Mar 18, 2019

Comparing the performance between these similar functions, the one that uses the iterator with the step_by(2) function performs significantly slower than the one using a traditional "while loop".

fn is_prime_iter(num: u32) -> bool {
    if num == 2 {
        return true;
    }
    if num % 2 == 0 {
        return false;
    }
    let mut is_prime = true;
    for i in (3..num).step_by(2) {
        if num % i == 0 {
            is_prime = false;
            break;
        }
    }
    is_prime
}

fn is_prime_loop(num: u32) -> bool {
    if num == 2 {
        return true;
    }
    if num % 2 == 0 {
        return false;
    }
    let mut is_prime = true;
    let mut i = 3;
    while i < num {
        if num % i == 0 {
            is_prime = false;
            break;
        }
        i += 2;
    }
    is_prime
}

Running the experimental cargo bench --all-targets produced the following results for me using the nightly-x86_64-unknown-linux-gnu toolchain (1.35.0-nightly):

// benchmark value: 15485867
test tests::bench_is_prime_iter  ... bench:  26,813,870 ns/iter (+/- 337,054)
test tests::bench_is_prime_loop ... bench:  21,607,357 ns/iter (+/- 242,028)
@estebank estebank added I-slow Issue: Problems and improvements with respect to performance of generated code. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. labels Mar 18, 2019
@hellow554
Contributor

hellow554 commented Mar 19, 2019

Interestingly enough, if you use all instead of a manual if/break it gets even slightly faster:

# cargo bench
    Finished release [optimized] target(s) in 0.01s
     Running target/release/deps/foo-8f4575d6c09c1e6c

running 3 tests
test allbar ... bench:      89,120 ns/iter (+/- 2,359)
test iter   ... bench:     109,650 ns/iter (+/- 3,717)
test loop   ... bench:      92,457 ns/iter (+/- 2,685)

(Playground)
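For reference, since the actual code is behind the Playground link, the all-based variant presumably looks something like this sketch (the function name is mine):

```rust
// Sketch of the `all`-based variant (assumed; the original is behind the
// Playground link). `all` uses internal iteration: the iterator drives the
// loop itself instead of being stepped by a manual if/break, which often
// optimizes better for tiny loop bodies.
fn is_prime_all(num: u32) -> bool {
    if num == 2 {
        return true;
    }
    if num % 2 == 0 {
        return false;
    }
    // Prime iff no odd i in [3, num) divides num (same logic as the
    // original functions, including their treatment of 0 and 1).
    (3..num).step_by(2).all(|i| num % i != 0)
}
```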

@scottmcm
Member

For tiny loop bodies like this it's not unusual for internal iteration to be faster -- chain is the same.

@nikic
Contributor

nikic commented Mar 25, 2019

Godbolt: https://rust.godbolt.org/z/LRTDre

Also related: #57517

@steveklabnik
Member

Triage: no change

@jdahlstrom

jdahlstrom commented Oct 24, 2021

I ran into this while implementing the inner loop of a rasterizer. Internal iteration or not, step_by generates quite inefficient assembly even when iterating over a simple integer range [1] or a slice. In the case of an inclusive range, the difference to a while loop is quite extreme! [2]

I find it a bit unfortunate that there's no obvious equivalent of the "classic" FOR i = a TO b STEP c (i.e. for(int i = a; i < b; i += c)) loop, given that step_by is not zero-cost and a while loop doesn't express the intent as clearly.

Edit: apparently there was a fix proposed, but it had to be reverted :( [3]

Footnotes

  1. https://rust.godbolt.org/z/5hYsr3rWf

  2. https://rust.godbolt.org/z/WqE7xW6aM

  3. https://internals.rust-lang.org/t/more-about-step-by/12436/14

@scottmcm
Member

scottmcm commented Oct 24, 2021

@jdahlstrom Make sure you compare equivalent semantics. foo_while in your 2nd footnote is an infinite loop if passed u32::MAX, whereas the RangeInclusive+step_by version will properly terminate.

Of course, the "classic" loop of for(int i = a; i < b; i += c) is UB in those scenarios, which is part of why it's so fast 🙃
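The semantic difference is easy to demonstrate at a smaller integer width (u8 here, so this sketch runs instantly):

```rust
// step_by over an inclusive range terminates correctly even when the
// range reaches the type's maximum value.
fn main() {
    let odds = (1u8..=u8::MAX).step_by(2).count();
    assert_eq!(odds, 128); // 1, 3, ..., 255

    // The naive while-loop equivalent never exits, because `i <= u8::MAX`
    // is always true, and `i += 2` overflows at i == 255 (a panic in
    // debug builds, silent wrapping in release):
    //
    //     let mut i = 1u8;
    //     while i <= u8::MAX { i += 2; } // never terminates
}
```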

@jdahlstrom

jdahlstrom commented Oct 25, 2021

@scottmcm True, and I appreciate that Rust does the right thing there, but it still seems unfortunate that the penalty for doing the right thing is so great in the common case (where u32::MAX is often not in the function's "practical" domain – doubly so if we're talking usize instead!). And unfortunately, LLVM isn't smart enough to generate better code even if we make the special case(s) impossible by adding an appropriate assert! or by iterating over a range of a wider type (i.e. 0..max as u64 in the example) :(

(On the other hand, widening the induction variable of the while version in the specific case of the example actually makes LLVM ditch the loop altogether and utilize a variation of the max*(max-1)/2 formula instead!)

@scottmcm
Member

scottmcm commented Oct 25, 2021

> And unfortunately LLVM isn't smart enough to generate better code even if we make the special case(s) impossible by adding an appropriate assert!

This is a really interesting one. It looks like (https://rust.godbolt.org/z/15W97roTo) LLVM has trouble turning the uadd.with.overflow into just a normal addition, even though it should know the value is in range because of the assert.

Might be worth experimenting with that a bit to see whether there's a reduced example that could be an A-llvm issue.

The "obvious" version it does just fine (<https://rust.godbolt.org/z/4jhj6WTdG>), so there's something confounding going on here.

@jonasmalacofilho

jonasmalacofilho commented Oct 25, 2021

Note: this comment has been edited to fix an incorrect assumption – namely, that it was rustc, not LLVM, that was optimizing the optimized() version.


> The "obvious" version it does just fine (https://rust.godbolt.org/z/4jhj6WTdG) so there's something confounding going on here.

It seems that the "obvious" version is optimized by rustc, not LLVM, and that this depends on the compiler also knowing (in the function body and before the assert) the value being added to x.

/// Optimized as long as the compiler knows:
/// - (a) the value of y;
/// - and (b) that x <= u32::MAX - y.
/// Note that the order of (a) and (b) also matters.
pub fn optimized(x: u32, y: u32) -> u32 {
    assert!(y == 42); // cannot be moved down
    assert!(x <= u32::MAX - y);

    x.checked_add(y).unwrap()
}

/// The check for overflow is not yet optimized away, even though
/// x <= u32::MAX - y.
pub fn not_optimized_yet(x: u32, y: u32) -> u32 {
    assert!(x <= u32::MAX - y);

    x.checked_add(y).unwrap()
}

https://rust.godbolt.org/z/rWvrqjME1

(I used checked_add instead of overflowing_add, but I don't think that matters.)

@scottmcm
Member

> It seems that the "obvious" version is optimized by rustc, not LLVM

Is it? In https://rust.godbolt.org/z/Yov3GfTqc the MIR still contains the call to core::num::<impl u32>::overflowing_add.

@jonasmalacofilho

> Is it? In https://rust.godbolt.org/z/Yov3GfTqc the MIR still contains the call to core::num::<impl u32>::overflowing_add.

My reasoning was based on my optimized example not calling @llvm.uadd.with.overflow.i32 in the LLVM IR. But yes, both the optimized and non-optimized examples contain calls to core::num::<impl u32>::checked_add() in the MIR, and the same happens in your optimized example.

But isn't LLVM IR the boundary between rustc and LLVM?

@scottmcm
Member

> But isn't LLVM IR the boundary between rustc and LLVM?

rustc emits LLVM-IR, yes, but if optimizations are on then the --emit=llvm-ir will show the IR after LLVM's optimization passes have run.

You can pass opt-level=0 to see the often-horrifying IR that rustc directly produces.

@jonasmalacofilho

> rustc emits LLVM-IR, yes, but if optimizations are on then the --emit=llvm-ir will show the IR after LLVM's optimization passes have run.

Oh, I didn't know that : \

> You can pass opt-level=0 to see the often-horrifying IR that rustc directly produces.

A bit off topic (and please feel free to hide this comment), but wouldn't that also affect optimizations performed by rustc?

For example, I notice that with opt-level=0 the subtraction in assert!(x <= u32::MAX - y); apparently gets compiled with overflow checking, like in debug builds.

Would -C llvm-args=-opt-bisect-limit=0 be better to see the IR generated by rustc at the desired opt-level?

@scottmcm
Member

> apparently gets compiled with overflow checking, like in debug builds.

You can pass -C debug-assertions=n to change that one. I'm not sure what MIR opts are configured by optimization level; one can always pass -Z mir-opt-level=1 to turn those back on. (Most of them are off even in optimized builds, though, IIRC.)

The zero-optimization-fuel approach is interesting. I don't know if all the LLVM optimizations actually hook into that properly. If they do it sounds like an interesting approach, though.

@nikic
Contributor

nikic commented Oct 25, 2021

> > rustc emits LLVM-IR, yes, but if optimizations are on then the --emit=llvm-ir will show the IR after LLVM's optimization passes have run.
>
> Oh, I didn't know that : \
>
> > You can pass opt-level=0 to see the often-horrifying IR that rustc directly produces.
>
> A bit off topic (and please feel free to hide this comment), but wouldn't that also affect optimizations performed by rustc?
>
> For example, I notice that with opt-level=0 the subtraction in assert!(x <= u32::MAX - y); apparently gets compiled with overflow checking, like in debug builds.
>
> Would -C llvm-args=-opt-bisect-limit=0 be better to see the IR generated by rustc at the desired opt-level?

-C no-prepopulate-passes can be used for this purpose.

@jrmuizel
Contributor

jrmuizel commented Feb 7, 2022

Comparing checked_add in a loop vs step_by shows there's something special going on with step_by beyond the checked_add:
https://rust.godbolt.org/z/79aTac847
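The comparison presumably boils down to something like the following sketch (the code on godbolt isn't inlined above; function names and exact shapes here are mine). Both functions sum the even numbers below n:

```rust
// Hypothetical reconstruction of the comparison: a manual loop that
// steps with checked_add versus the step_by adapter.
pub fn sum_checked_add(n: u32) -> u32 {
    let mut sum = 0u32;
    let mut i = 0u32;
    while i < n {
        sum = sum.wrapping_add(i);
        // Step manually, with the same overflow guard step_by must perform.
        match i.checked_add(2) {
            Some(next) => i = next,
            None => break,
        }
    }
    sum
}

pub fn sum_step_by(n: u32) -> u32 {
    (0..n).step_by(2).fold(0u32, |s, i| s.wrapping_add(i))
}
```

The point of the comparison is that even when the manual loop pays for the same overflow check, step_by historically still produced worse code, so the checked addition alone doesn't explain the gap.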

@the8472
Member

the8472 commented Sep 3, 2023

#111850 fixed this.

https://rust.godbolt.org/z/h1M73Tf34

The iter and loop assembly look a bit different but in both cases there are just 3 branches in the loop:

  • panic
  • loop exit
  • backedge

A big improvement compared to the rustc 1.33 version, which had tons of branches.

The benchmarks from #59281 (comment) look identical now:

test allbar ... bench:     133,624 ns/iter (+/- 1,487)
test iter   ... bench:     133,673 ns/iter (+/- 1,271)
test r#loop ... bench:     133,629 ns/iter (+/- 1,067)

So I think this can be closed.

@the8472 the8472 closed this as completed Sep 3, 2023