Number of iterations does not scale #133

Closed
mbillingr opened this issue Mar 21, 2018 · 8 comments
@mbillingr

mbillingr commented Mar 21, 2018

I tried to get started with Criterion by basically copy-pasting the Getting Started guide into a new cargo project. Then, running `cargo bench` never finished collecting samples.
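For reference, the Getting Started example boils down to the recursive-fibonacci benchmark below. This is a sketch reconstructed from the criterion 0.2-era API, not a verbatim copy of the guide:

```rust
#[macro_use]
extern crate criterion;

use criterion::Criterion;

// Deliberately naive recursive fibonacci, used purely as a workload.
fn fibonacci(n: u64) -> u64 {
    match n {
        0 | 1 => 1,
        n => fibonacci(n - 1) + fibonacci(n - 2),
    }
}

fn criterion_benchmark(c: &mut Criterion) {
    c.bench_function("fib 20", |b| b.iter(|| fibonacci(20)));
}

criterion_group!(benches, criterion_benchmark);
criterion_main!(benches);
```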

By reducing the fibonacci number and the measurement time I finally managed to get results.

This is the result for fibonacci(5):

```
     Running target/release/deps/my_benchmark-78226dbc85d4a50d
Benchmarking fib 20
Benchmarking fib 20: Warming up for 3.0000 s
Benchmarking fib 20: Collecting 100 samples in estimated 1.0000 s (3776268800 iterations)
Benchmarking fib 20: Analyzing
fib 20                  time:   [264.64 ps 265.14 ps 265.68 ps]
                        change: [-0.3371% +0.1094% +0.5162%] (p = 0.63 > 0.05)
                        No change in performance detected.
Found 10 outliers among 100 measurements (10.00%)
  5 (5.00%) high mild
  5 (5.00%) high severe
slope  [264.64 ps 265.68 ps] R^2            [0.9907413 0.9906445]
mean   [264.90 ps 266.51 ps] std. dev.      [2.1733 ps 5.7740 ps]
median [263.74 ps 264.71 ps] med. abs. dev. [0.3246 ps 1.6775 ps]

cargo bench --bench my_benchmark -- --verbose  18.72s user 0.30s system 189% cpu 10.044 total
```

And this is the result for fibonacci(10):

```
     Running target/release/deps/my_benchmark-78226dbc85d4a50d
Benchmarking fib 20
Benchmarking fib 20: Warming up for 3.0000 s
Benchmarking fib 20: Collecting 100 samples in estimated 1.0000 s (3777415150 iterations)
Benchmarking fib 20: Analyzing
fib 20                  time:   [169.19 ns 169.66 ns 170.14 ns]
                        change: [+63543.59036036591% +63803.46521022125% +64058.9672116039%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 20 outliers among 100 measurements (20.00%)
  4 (4.00%) high mild
  16 (16.00%) high severe
slope  [169.19 ns 170.14 ns] R^2            [0.9864471 0.9864016]
mean   [169.33 ns 170.21 ns] std. dev.      [1.6465 ns 2.7644 ns]
median [168.69 ns 168.72 ns] med. abs. dev. [50.328 ps 177.27 ps]

cargo bench --bench my_benchmark -- --verbose  656.87s user 0.32s system 101% cpu 10:49.57 total
```

The total cargo run times (18.72s and 656.87s of user time) include compilation and analysis, but those should be mostly negligible compared to the time spent collecting samples, especially in the second case.

Interestingly, the number of iterations is almost the same in both runs (3776M and 3777M). Since each call takes much longer in the second run, I would expect far fewer iterations. Could there be a problem with estimating the required number of iterations from the warm-up phase?
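For context, a warm-up phase along these lines times growing batches of the routine for a fixed budget, then extrapolates a per-iteration cost to size the measurement run. The following is a simplified sketch of that idea, not Criterion's actual routine.rs code, and the function and parameter names are made up:

```rust
use std::time::{Duration, Instant};

// Simplified sketch (not Criterion's actual code): time growing batches
// of the routine during the warm-up budget, derive a per-iteration cost,
// then pick an iteration count that fits the target measurement time.
fn estimate_iters_per_sample(
    mut routine: impl FnMut(),
    warm_up: Duration,
    target: Duration,
    samples: u64,
) -> u64 {
    let mut batch: u64 = 1;
    let mut total_iters: u64 = 0;
    let start = Instant::now();
    let mut elapsed = start.elapsed();
    while elapsed < warm_up {
        for _ in 0..batch {
            routine();
        }
        total_iters += batch;
        batch = batch.saturating_mul(2); // grow batches to amortize timing overhead
        elapsed = start.elapsed();
    }
    // If the optimizer deletes the routine, or the `elapsed` reads are
    // miscompiled, total_iters explodes -- consistent with the ~3.8e9
    // iterations reported above.
    let ns_per_iter = (elapsed.as_nanos() as u64 / total_iters.max(1)).max(1);
    // Split the target measurement time evenly across the samples.
    (target.as_nanos() as u64 / samples.max(1)) / ns_per_iter
}
```

With a 3 s warm-up, a 1 s target, and 100 samples, a correct ~170 ns per-iteration estimate for fibonacci(10) would put the whole run at roughly six million iterations, nowhere near the ~3.8 billion reported above.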


Update

I was able to reproduce the behavior on my laptop. Both systems run the stable-x86_64-unknown-linux-gnu toolchain with rustc 1.24.1. Here is the code I used: https://github.com/mbillingr/criterion-test.rs

mbillingr added a commit to mbillingr/criterion-test.rs that referenced this issue Mar 21, 2018
@bheisler
Owner

Hey, thanks for trying out Criterion.rs.

I checked out your test code. When I run it locally, it behaves as expected and completes the benchmarks in the expected amount of time. I tried changing the constants to 10 and 20 and ran some other experiments, but Criterion.rs continued to behave correctly.

I did my testing with the stable-x86_64-unknown-linux-gnu build of 1.24.1 as well.

I'd like to help, but I can't think of anything more I can do without being able to reproduce this issue myself. My only guess is that something causes the CPU or the benchmark process to slow down drastically right after the warmup phase completes. Power-management settings, maybe, or interference from other processes? I'm just guessing at this point.

Sorry I couldn't be more helpful. If you can find any more information, please share it.

@mbillingr
Author

Thank you for looking into the issue. I was both hoping and afraid this was a localized problem specific to me :)

This happens on two independent machines, a desktop PC and a notebook, so I would almost rule out power-saving issues. The only thing they have in common is that they both run Arch Linux configured by me, which is something to look into.

I will keep you updated when I find out anything useful.

@mbillingr
Author

mbillingr commented Mar 23, 2018

Here comes the update...

I ran my tests on a Raspberry Pi with the same results again. So at least it does not seem to be related to Arch Linux.

I then downgraded to the 1.23.0 toolchain and got the expected results. I have only tried this on one machine so far, but it seems to point strongly at the toolchain.

Searching for "time" in the release notes only turned up rust-lang/rust#46828. No idea whether it is related.


Update

Tested on Windows too with the same results. Here is an overview of what I tried and the results I got:

| System     | 1.23.0 | 1.24.0 | 1.24.1 | Nightly |
|------------|--------|--------|--------|---------|
| Arch Linux | ✔️     | ❔     | ❌     | ❔      |
| Raspbian   | ✔️     | ❔     | ❌     | ❔      |
| Windows    | ✔️     | ❔     | ❌     | ❔      |

Results: ✔️ = works; ❌ = does not work; ❔ = not tested
I can't test on toolchains < 1.23.0 because the pest dependency relies on a 1.23 feature.

@bheisler
Owner

bheisler commented Mar 24, 2018

Curiouser and curiouser.

I've run this test on one of my own machines, which runs Arch, with 1.23, 1.24.1, and nightly, and still was not able to reproduce this issue. On Windows with 1.24.1, however, I was able to reproduce it; it does not appear with 1.23 even on Windows.

Very strange. I will continue digging into this as soon as I have some time to do so. I'm very reluctant to call this a compiler regression, but it does seem to be correlated with the compiler version.

Edit: I have confirmed this on two different Windows machines, but with the Linux compiler (whether on Arch or through WSL) everything behaves as expected.

@bheisler
Owner

I should probably go to bed, but this is too strange.

I've narrowed this down to the `warm_up` function in routine.rs. It looks like it is calculating the number of iterations incorrectly (`total_iters` ends up as some very large value), except that if I insert a `println!` or push `total_iters` or `elapsed` into a `Vec` to investigate, the problem stops happening and the expected behavior returns.

Stranger still, if I change it to this:

```rust
let elapsed = ::test::black_box(start.elapsed());
```

It suddenly starts working correctly, even though black_box is supposed to have no observable effect on the behavior of code.

This is starting to look like a compiler regression.
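For anyone puzzled by black_box: it is an identity function the optimizer must treat as opaque, so values passing through it can be neither const-folded nor dead-code-eliminated. Below is a minimal sketch using today's stable std::hint::black_box (the `::test::black_box` above was its unstable precursor):

```rust
use std::hint::black_box;
use std::time::Instant;

fn fibonacci(n: u64) -> u64 {
    match n {
        0 | 1 => 1,
        n => fibonacci(n - 1) + fibonacci(n - 2),
    }
}

fn main() {
    let start = Instant::now();
    // black_box on the input blocks constant propagation; black_box on
    // the output keeps the call from being removed as dead code.
    let result = black_box(fibonacci(black_box(20)));
    // Wrapping elapsed() mirrors the workaround above: the timing value
    // itself is forced to be observed by the optimizer.
    let elapsed = black_box(start.elapsed());
    println!("fib(20) = {result} in {elapsed:?}");
}
```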

@Eh2406
Contributor

Eh2406 commented Mar 24, 2018

This does sound like a compiler regression. My unasked-for 2¢: it is time to open an issue on the rust repo. They are smart, knowledgeable people with good tools to track this down. And they don't bite.

@mbillingr
Author

Could the compiler, under certain conditions, optimize away the benchmark closure in the warmup phase? But if so, why doesn't it optimize away the sampling run?

Just wildly speculating...

@mbillingr
Author

Works for me too. Thank you!
