Possible miscompilation in >= 1.24.0 #49336

Closed
bheisler opened this issue Mar 24, 2018 · 15 comments
Labels
C-bug: This is a bug.
P-high: High priority.
regression-from-stable-to-stable: Performance or correctness regression from one stable version to another.
T-compiler: Relevant to the compiler team, which will review and decide on the PR/issue.

Comments

@bheisler (Author)

A user reported this to Criterion.rs - bheisler/criterion.rs#133

The short version is this - when the user's test crate's benchmarks are run on 1.24.0 or newer, Criterion.rs calculates some values incorrectly even though the code looks correct to me. They claim to have reproduced this behavior on Arch Linux, Windows and Raspbian. I have only been able to confirm it on Windows. Each of us has confirmed this behavior on multiple machines.

I was initially reluctant to call this a miscompilation, but when I started investigating, every change I made caused the bug to stop happening: inserting println! calls, or even pushing values into a vector and printing them later, made the bug disappear. Eventually I tried simply adding a call to test::black_box, which should have no observable effect on the code except to inhibit certain compiler optimizations. That also stopped the bug from occurring. It may still be a bug in my code, but if so I can't find it.
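
For illustration, a minimal sketch of the kind of black_box call involved (fibonacci stands in for the benchmarked routine; test::black_box was nightly-only at the time, and std::hint::black_box is today's stable equivalent):

#![feature(test)]
extern crate test;

fn fibonacci(n: u64) -> u64 {
    match n {
        0 | 1 => 1,
        n => fibonacci(n - 1) + fibonacci(n - 2),
    }
}

fn main() {
    // black_box hides the argument from the optimizer, so LLVM cannot
    // constant-fold the call away; without it, fibonacci(3) can be
    // evaluated entirely at compile time.
    println!("{}", fibonacci(test::black_box(3)));
}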

I've tried to create a minimal test case, but was unsuccessful. This bug is quite fragile.

Steps to reproduce:

  • Clone https://github.com/mbillingr/criterion-test.rs
  • Edit the Cargo.toml file to disable the default features for criterion (this isn't necessary but will save some compilation time)
  • With 1.23.0-x86_64-pc-windows-msvc, run cargo bench --bench my_benchmark -- --verbose.
    • Criterion.rs will run two benchmarks and report two iteration counts. The second should be significantly smaller than the first - this is the desired behavior. For example:
Benchmarking fib 1: Collecting 100 samples in estimated 1.0000 s (2469702500 iterations)
...
Benchmarking fib 2: Collecting 100 samples in estimated 1.0000 s (132784700 iterations)
  • Switch to 1.24.1 and run the benchmark command again. Note that this time, the second iteration count is roughly the same as the first (this is the unexpected behavior):
Benchmarking fib 1: Collecting 100 samples in estimated 1.0000 s (2518899600 iterations)
...
Benchmarking fib 2: Collecting 100 samples in estimated 1.0000 s (2514900000 iterations)
  • Optional: Switch to a nightly compiler and verify that the unexpected behavior persists. Clone the rustc_fix branch of https://github.com/japaric/criterion.rs and modify the Cargo.toml file of criterion-test.rs to use that instead. This patch merely enables the test feature/crate and inserts one call to test::black_box in routine.rs:warm_up. Verify that the expected behavior is restored.
@est31 (Member) commented Mar 24, 2018

Manual bisection log using rustup:

good rustc 1.24.0-nightly (edbd7d2 2017-12-20)
bad rustc 1.25.0-nightly (73ac5d6 2018-01-11)
bad rustc 1.24.0-nightly (9389e23 2017-12-31)
good rustc 1.24.0-nightly (c284f88 2017-12-24)
bad rustc 1.24.0-nightly (1abeb43 2017-12-27) (nightly-2017-12-28 for rustup)
good rustc 1.24.0-nightly (cddc4a6 2017-12-26) (nightly-2017-12-27 for rustup)

So the change is inside the commit range cddc4a6...1abeb43. I only see bugfixes and ICE fixes in there...

Old state:

$ cargo +nightly-2017-12-27 bench --bench my_benchmark -- --verbose
    Finished release [optimized] target(s) in 0.0 secs
     Running target/release/deps/my_benchmark-c6ffda0acddb4b02
Benchmarking fib 1
Benchmarking fib 1: Warming up for 3.0000 s
Benchmarking fib 1: Collecting 100 samples in estimated 1.0000 s (1316272400 iterations)
Benchmarking fib 1: Analyzing
fib 1                   time:   [767.18 ps 778.06 ps 789.83 ps]
                        change: [-3.1129% -1.4570% +0.2070%] (p = 0.09 > 0.05)
                        No change in performance detected.
Found 17 outliers among 100 measurements (17.00%)
  2 (2.00%) high mild
  15 (15.00%) high severe
slope  [767.18 ps 789.83 ps] R^2            [0.7407196 0.7387507]
mean   [764.74 ps 782.08 ps] std. dev.      [29.638 ps 57.337 ps]
median [752.23 ps 756.69 ps] med. abs. dev. [3.0748 ps 9.0096 ps]

Benchmarking fib 2
Benchmarking fib 2: Warming up for 3.0000 s
Benchmarking fib 2: Collecting 100 samples in estimated 1.0000 s (224447250 iterations)
Benchmarking fib 2: Analyzing
fib 2                   time:   [4.3392 ns 4.3744 ns 4.4208 ns]
                        change: [-0.6421% +1.1830% +2.7979%] (p = 0.19 > 0.05)
                        No change in performance detected.
Found 4 outliers among 100 measurements (4.00%)
  3 (3.00%) high mild
  1 (1.00%) high severe
slope  [4.3392 ns 4.4208 ns] R^2            [0.8011943 0.7965177]
mean   [4.3847 ns 4.4583 ns] std. dev.      [130.47 ps 247.30 ps]
median [4.3365 ns 4.4123 ns] med. abs. dev. [111.97 ps 189.57 ps]

New state:

$ cargo +nightly-2017-12-28 bench --bench my_benchmark -- --verbose
    Finished release [optimized] target(s) in 0.0 secs
     Running target/release/deps/my_benchmark-08eeef1a988add44
Benchmarking fib 1
Benchmarking fib 1: Warming up for 3.0000 s
Benchmarking fib 1: Collecting 100 samples in estimated 1.0000 s (3763674100 iterations)
Benchmarking fib 1: Analyzing
fib 1                   time:   [266.56 ps 267.97 ps 269.51 ps]
                        change: [-65.621% -65.152% -64.693%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  2 (2.00%) low mild
  2 (2.00%) high mild
  3 (3.00%) high severe
slope  [266.56 ps 269.51 ps] R^2            [0.9507674 0.9500646]
mean   [267.64 ps 271.29 ps] std. dev.      [5.0293 ps 13.253 ps]
median [267.92 ps 269.51 ps] med. abs. dev. [2.5710 ps 5.5603 ps]

Benchmarking fib 2
Benchmarking fib 2: Warming up for 3.0000 s
Benchmarking fib 2: Collecting 100 samples in estimated 1.0000 s (3787974700 iterations)
Benchmarking fib 2: Analyzing
fib 2                   time:   [4.9535 ns 4.9735 ns 4.9952 ns]
                        change: [+11.531% +12.592% +13.578%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 4 outliers among 100 measurements (4.00%)
  1 (1.00%) low mild
  3 (3.00%) high mild
slope  [4.9535 ns 4.9952 ns] R^2            [0.9572156 0.9568266]
mean   [4.9577 ns 4.9946 ns] std. dev.      [76.681 ps 110.54 ps]
median [4.9522 ns 4.9900 ns] med. abs. dev. [55.560 ps 100.25 ps]

@kennytm kennytm added T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. regression-from-stable-to-stable Performance or correctness regression from one stable version to another. C-bug Category: This is a bug. labels Mar 24, 2018
@est31 (Member) commented Mar 24, 2018

As @kennytm rightly pointed out, this lies within the 90-day window for which per-commit CI builds are still available. So we can in fact compare single commits!

Further bisection gives me:

bad 71ed31f
good e8098c5
good 3fd27b2

So the change is caused by 3fd27b2...71ed31f, PR #47007 by @eddyb. cc @rust-lang/compiler

@eddyb (Member) commented Mar 24, 2018

Changes like mine usually only have a performance impact. Is there unsafe code involved? My fix could have let LLVM optimize some cases it never did before, which might break unsafe code.

@bheisler (Author)

There is unsafe code elsewhere in Criterion.rs, yes. Not in this part of the code though.

@nikomatsakis (Contributor)

So it seems like there are only two possible ways forward:

  • Continue attempting to reduce the problem somehow (perhaps starting from the generated LLVM IR and using creduce or something like that?)
  • Work backwards from the broken execution and try to see why it keeps going around the loop -- I suppose we could set a breakpoint on one of the middle iterations, or just study and compare the generated code (see the command sketch below).
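
One plausible way to produce that generated code for comparison (standard cargo/rustc flags; my_benchmark follows the repro above) is to emit the optimized LLVM IR and assembly and diff the two builds:

$ cargo rustc --release --bench my_benchmark -- --emit=llvm-ir,asm

The resulting .ll and .s files land under target/release/deps/ and can be diffed between a build with the black_box workaround and one without.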

@nikomatsakis (Contributor)

We're going to call this P-high for now and revisit next week. I think the most plausible way to get to the bottom of this bug though is to compare the generated code with and without the workaround and try to see what optimization is being inhibited.

@nikomatsakis (Contributor)

triage: P-high

@eddyb (Member) commented Apr 5, 2018

I looked briefly at the unsafe code in Criterion, and none of it seems capable of anything like this, so I'm assuming an LLVM bug or an unfixed Rust bug that affects optimizations.
I don't think #47007 even changed any alignment; it only unblocked some LLVM optimizations.

@bheisler (Author) commented Apr 7, 2018

I took another shot at minimizing this, taking a different approach. Rather than try to construct a minimal case from scratch, I instead made a branch of Criterion.rs and started deleting everything I could delete without masking the bug or breaking the test case. I can't guarantee that this is a minimal case, but it's much closer to it than I was able to provide before. It also compiles much faster than previously, which should make it easier to iterate.

Some things I discovered while doing so:

  • Unsafe code was not at fault - the bug still occurs with all unsafe code removed.
  • None of Criterion.rs' dependencies are at fault - the bug still occurs with all dependencies except std removed.
  • Moving the Routine trait and associated code from routine.rs to lib.rs causes the bug to stop occurring.

Updated steps to reproduce:

  • Clone rustc_minimal branch of https://github.com/japaric/criterion.rs/
  • Clone https://github.com/mbillingr/criterion-test.rs
  • Edit the Cargo.toml file of criterion-test.rs to use the path to criterion.rs
  • With 1.23.0-x86_64-pc-windows-msvc, run cargo bench --bench my_benchmark in criterion-test.rs.
    • Criterion.rs will run two benchmark warmups and report two iteration counts. The second should be significantly smaller than the first - this is the desired behavior. For example:
      fib 1: Calculated iterations: (2422959700 iterations)
      fib 3: Calculated iterations: (127573100 iterations)
  • Switch to 1.24.1 and run the benchmark command again. Note that this time, the second iteration count is roughly the same as the first (this is the unexpected behavior):
    fib 1: Calculated iterations: (2520631750 iterations)
    fib 3: Calculated iterations: (2521313500 iterations)

I hope this helps. If you need any further help tracking this down, let me know.

@ishitatsuyuki ishitatsuyuki added the P-high High priority label Apr 8, 2018
@est31 (Member) commented Apr 8, 2018

Had a look at this again. TL;DR: this doesn't seem like a miscompilation.

Looking at @mbillingr's example code:

use criterion::Criterion;

fn fibonacci(n: u64) -> u64 {
    match n {
        0 => 1,
        1 => 1,
        n => fibonacci(n-1) + fibonacci(n-2),
    }
}

fn criterion_benchmark1(c: &mut Criterion) {
    c.bench_function("fib 1", |b| b.iter(|| fibonacci(1)));
}

fn criterion_benchmark2(c: &mut Criterion) {
    c.bench_function("fib 2", |b| b.iter(|| fibonacci(3)));
}

We can see benchmarks being performed on Fibonacci-calculating functions. With this code I see the following behaviour:

before:
fib 1: Calculated iterations: 1248410500
fib 2: Calculated iterations: 176871200
after:
fib 1: Calculated iterations: 3885823500
fib 2: Calculated iterations: 3739878500

here, "before" means nightly-2017-12-27 and "after" means nightly-2017-12-28 from rustup.

Now let's change the two closures by evaluating the fibonacci expressions by hand (with this definition, fibonacci(1) == 1 and fibonacci(3) == 3):

fn criterion_benchmark1(c: &mut Criterion) {
    c.bench_function("fib 1", |b| b.iter(|| 1));
}

fn criterion_benchmark2(c: &mut Criterion) {
    c.bench_function("fib 2", |b| b.iter(|| 3));
}

We get:

before:
fib 1: Calculated iterations: 1282801000
fib 2: Calculated iterations: 1242027300
after:
fib 1: Calculated iterations: 3879278700
fib 2: Calculated iterations: 3869022150

Let's change it again to use black_box:

#![feature(test)]

extern crate test;

use test::black_box;

fn criterion_benchmark1(c: &mut Criterion) {
    c.bench_function("fib 1", |b| b.iter(|| fibonacci(black_box(3))));
}

fn criterion_benchmark2(c: &mut Criterion) {
    c.bench_function("fib 2", |b| b.iter(|| fibonacci(black_box(3))));
}

before:
fib 1: Calculated iterations: 473144600
fib 2: Calculated iterations: 206302600
after:
fib 1: Calculated iterations: 647743300
fib 2: Calculated iterations: 222977700

So it seems to me that @eddyb's change enabled an optimization that allows LLVM to evaluate the fibonacci(INTEGER_LITERAL) expression at compile time. Basically just LLVM doing its job.
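
To make the suspected folding concrete, a small sketch (reusing the fibonacci definition above, with std::hint::black_box as the stable stand-in for test::black_box):

fn fibonacci(n: u64) -> u64 {
    match n {
        0 | 1 => 1,
        n => fibonacci(n - 1) + fibonacci(n - 2),
    }
}

fn main() {
    // With a literal argument the whole recursion is visible to LLVM,
    // so this closure can be folded to the constant 3 at compile time
    // (fibonacci(3) == 3 with this definition).
    let folded = (|| fibonacci(3))();
    // black_box keeps the argument opaque, forcing the recursion to run.
    let computed = (|| fibonacci(std::hint::black_box(3)))();
    assert_eq!(folded, computed);
}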

@bheisler (Author) commented Apr 8, 2018

Forgive me if I've missed something, but I don't think that explains everything. If it is just LLVM correctly optimizing the code, then why does black-boxing an apparently-unrelated value in the warmup routine break the optimization? For that matter, it doesn't explain why moving the warmup function into the lib module without changing it also breaks the optimization.

Edited to add:
There's another important aspect that remains unexplained. Criterion.rs uses this iteration calculation to estimate how many times the benchmarked function should be called in order to fill a pre-configured measurement time. The original bug report that led to my raising this issue was that the incorrect iteration count was causing Criterion.rs to take much longer in the measurement phase than expected (minutes when it should be seconds), because it was iterating the benchmark many more times than it should have. If, as you suggest, the change is simply that an optimization made the fibonacci function faster and the iteration count is correct, then why does that iteration count result in an observed measurement time so much longer than expected? That could only happen if the fibonacci function were optimized more during warmup than it is during measurement.

@est31 (Member) commented Apr 8, 2018

If it is just LLVM correctly optimizing the code, then why does black-boxing an apparently-unrelated value in the warmup routine break the optimization?

Optimizers often behave in unintuitive ways; they rely on heuristics to decide whether or not to optimize something.

That could only happen if the fibonacci function was optimized more during warmup than it is during measurement.

Very much possible. LLVM could first inline the function and then optimize it down to a constant.

@bheisler (Author) commented Apr 8, 2018

I'm starting to think you might be right. I had discounted the possibility that the benchmarked function might get inlined all the way into the warmup and measurement routines, because I thought Rust couldn't inline functions across crates without #[inline]. But when I add #[inline(never)] to the Bencher iteration methods, that also seems to produce the expected results.
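
A minimal sketch of that workaround (the shape of Bencher and its iter method here is illustrative, not Criterion.rs's actual implementation):

use std::time::Instant;

pub struct Bencher {
    pub iters: u64,
    pub elapsed_ns: u128,
}

impl Bencher {
    // #[inline(never)] keeps this timing loop from being inlined (and
    // specialized) into the calling crate, so the benchmarked closure
    // is optimized the same way during warmup and measurement.
    #[inline(never)]
    pub fn iter<T, F: FnMut() -> T>(&mut self, mut f: F) {
        let start = Instant::now();
        for _ in 0..self.iters {
            std::hint::black_box(f());
        }
        self.elapsed_ns = start.elapsed().as_nanos();
    }
}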

@eddyb (Member) commented Apr 10, 2018

I thought Rust couldn't inline functions across crates without #[inline]

Depends: are there any type parameters? If so, the code can't be compiled without being instantiated cross-crate.
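
A hedged illustration of the distinction (the function names are made up):

// In a library crate:

pub fn call_generic<F: FnMut() -> u64>(mut f: F) -> u64 {
    // Generic: monomorphized in each calling crate, so its body is
    // available there and can be inlined even without #[inline].
    f()
}

pub fn call_concrete(f: fn() -> u64) -> u64 {
    // Non-generic: without #[inline] (or LTO), downstream crates see
    // only the compiled symbol and normally cannot inline this body.
    f()
}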

@bheisler (Author)

Ah, of course. You're correct; I'd forgotten about that. That suggests the solution is to add a few carefully chosen #[inline(never)] annotations. I've already tried that, and it does seem to work. Thanks for the help, and sorry for the false alarm!
