Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Performance regression with nightly benchmarks. #47062

Closed
pitdicker opened this issue Dec 29, 2017 · 17 comments
Closed

Performance regression with nightly benchmarks. #47062

pitdicker opened this issue Dec 29, 2017 · 17 comments
Labels
C-bug Category: This is a bug. I-slow Issue: Problems and improvements with respect to performance of generated code. P-high High priority regression-from-stable-to-beta Performance or correctness regression from stable to beta. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue.
Milestone

Comments

@pitdicker
Copy link
Contributor

The benchmarks for rand and for my crate small-rngs have regressed a lot with recent nightlies. A few extra branches are introduced at the end of the benchmark loop.

Relevant assembly with nightly 0077d12 2017-12-14:

       │     → callq  std::time::Instant::now
       │       xor    %r8d,%r8d
       │       lea    0xb0(%rsp),%rdi
       │       lea    0x8(%rsp),%rcx
       │       nop
       │4b0:   mov    $0x3e8,%esi
       │       data16 nopw %cs:0x0(%rax,%rax,1)
 22,85 │4c0:   mov    %rbp,%rax
  8,48 │       mul    %r13
       │       xor    %rbp,%r14
  1,01 │       rol    $0x37,%rbp
       │       mov    %rbp,%rbx
 23,14 │       xor    %r14,%rbx
       │       mov    %r14,%rbp
  1,29 │       shl    $0xe,%rbp
  2,06 │       xor    %rbx,%rbp
 20,55 │       rol    $0x24,%r14
  0,05 │       shld   $0x20,%rax,%rdx
  1,36 │       mov    %rdx,0x8(%rsp)
  3,13 │       dec    %rsi
       │     ↑ jne    4c0
  0,03 │       inc    %r8
  0,02 │       cmp    %r12,%r8
       │     ↑ jne    4b0
       │       mov    %r15,%rsi
       │     → callq  std::time::Instant::elapsed

With nightly 77efd68 2017-12-15:

       │     → callq  std::time::Instant::now
       │       mov    %rax,0x10(%rsp)
       │       mov    %rdx,0x18(%rsp)
       │       xor    %eax,%eax
       │       movabs $0x2545f4914f6cdd1d,%r9
       │       mov    %rsp,%r10
       │       cmp    %r14,%rax
       │     ↓ jb     48
       │     ↓ jmp    51
       │       data16 data16 nopw %cs:0x0(%rax,%rax,1)
       │ 40:   mov    %r8,%rax
       │       cmp    %r14,%rax
       │     ↓ jae    51
  0,02 │ 48:   mov    %rax,%r8
       │       add    $0x1,%r8
       │     ↓ jae    60
       │ 51:   mov    %rax,%r8
       │       xor    %ecx,%ecx
       │       xor    %eax,%eax
       │     ↓ jmp    6d
       │       nop
       │ 60:   movq   $0x1,(%rsp)
       │       mov    $0x1,%ecx
  0,02 │ 6d:   mov    %rax,(%rsp,%rcx,8)
  0,01 │       cmpq   $0x1,(%rsp)
       │     ↓ jne    10d
       │       xor    %eax,%eax
       │       cmp    $0x3e7,%rax
       │     ↓ jbe    d3
       │     ↓ jmp    dc
       │       nop
  0,00 │ 90:   mov    (%r15),%rdi
  9,24 │       mov    (%rdi),%rcx
       │       mov    %rcx,%rax
  9,94 │       mul    %r9
  0,00 │       mov    0x8(%rdi),%rsi
  0,84 │       xor    %rcx,%rsi
  0,01 │       rol    $0x37,%rcx
 14,55 │       xor    %rsi,%rcx
       │       mov    %rsi,%rbx
  1,64 │       shl    $0xe,%rbx
  6,80 │       xor    %rcx,%rbx
  6,76 │       mov    %rbx,(%rdi)
  0,07 │       rol    $0x24,%rsi
  5,16 │       mov    %rsi,0x8(%rdi)
  0,22 │       shld   $0x20,%rax,%rdx
  6,97 │       mov    %rdx,(%rsp)
       │       mov    %r11,%rax
  5,33 │       cmp    $0x3e7,%rax
       │     ↓ ja     dc
       │ d3:   mov    %rax,%r11
 13,70 │       add    $0x1,%r11
       │     ↓ jae    f0
       │ dc:   mov    %rax,%r11
  0,02 │       xor    %ecx,%ecx
       │       xor    %eax,%eax
  0,01 │     ↓ jmp    fd
       │       data16 nopw %cs:0x0(%rax,%rax,1)
       │ f0:   movq   $0x1,(%rsp)
  6,77 │       mov    $0x1,%ecx
  8,61 │ fd:   mov    %rax,(%rsp,%rcx,8)
  0,13 │       cmpq   $0x1,(%rsp)
  3,19 │     ↑ je     90
       │     ↑ jmpq   40
       │10d:   lea    0x10(%rsp),%rdi
       │     → callq  std::time::Instant::elapsed
@est31
Copy link
Member

est31 commented Dec 29, 2017

Commit range: 0077d12...77efd68

PRs that could have caused it: #46623 #46701 cc @eddyb

@est31
Copy link
Member

est31 commented Dec 29, 2017

Also maybe #46537 but it is supposed to be off by default.

@est31
Copy link
Member

est31 commented Dec 29, 2017

This bug looks very much like #46923 and #46897 , both of which are fixed on rustc 1.24.0-nightly (1abeb43 2017-12-27). @pitdicker can you check whether the issue still reproduces?

@kennytm kennytm added C-bug Category: This is a bug. I-slow Issue: Problems and improvements with respect to performance of generated code. labels Dec 29, 2017
@pitdicker
Copy link
Contributor Author

pitdicker commented Dec 29, 2017

Wow, that is fast @est31 😄
With nightly (77e189c 2017-12-28) I can still reproduce it.

@eddyb
Copy link
Member

eddyb commented Dec 30, 2017

It can't be #46701 unless #[repr(simd)] types and newtypes thereof are involved.
We have to look at LLVM IR. @pitdicker How did you get the assembly you posted here?

@nikomatsakis nikomatsakis added regression-from-stable-to-nightly Performance or correctness regression from stable to nightly. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. labels Jan 3, 2018
@nikomatsakis
Copy link
Contributor

I'm tagging this as a regression (it seems to be?)

@pitdicker
Copy link
Contributor Author

Sorry for the late reply.

This assembly happens to be from the gen_u64_xorshift_mt_64 becnhmark of the small_rngs crate, with the i128_support feature. But any other benchmark from it has similar results.

I did it a bit naïve: the assembly is copied from perf record and perf report, where I assume the parts between the timer functions are of interest.

@nikomatsakis
Copy link
Contributor

@pitdicker Is it possible to make a small, standalone example? That would no doubt be very helpful.

I'm going to call this P-high, but I'm not sure who to assign it to.

@nikomatsakis
Copy link
Contributor

triage: P-high

@rust-highfive rust-highfive added the P-high High priority label Jan 4, 2018
@pitdicker
Copy link
Contributor Author

Does this count as 'small, standalone'?

#[bench]
fn tiny_xorshift(b: &mut Bencher) {
    let mut y = 2463534242u32;
    b.bytes = 4 * 1000;
    b.iter(|| {
        for _ in 0..1000 {
            y ^= y << 13;
            y ^= y >> 17;
            y ^= y << 5;
            black_box(y);
        }
    });
}

Before:

test tiny_xorshift              ... bench:       1,586 ns/iter (+/- 4) = 2522 MB/s

After:

test tiny_xorshift              ... bench:       2,796 ns/iter (+/- 24) = 1430 MB/s

It has the same regression without the extra loop, but runs too fast to measure well:

#[bench]
fn tiny_xorshift(b: &mut Bencher) {
    let mut y = 2463534242u32;
    b.bytes = 4;
    b.iter(|| {
        y ^= y << 13;
        y ^= y >> 17;
        y ^= y << 5;
        black_box(y);
    });
}

@eddyb
Copy link
Member

eddyb commented Jan 4, 2018

I tried reproducing without the bench harness and I ended up at this, which doesn't repro:

#![crate_type = "lib"]
#![feature(test)]
extern crate test;
pub fn tiny_xorshift() {
    let mut y = 2463534242u32;
    let mut f = || {
        for _ in 0..1000 {
            y ^= y << 13;
            y ^= y >> 17;
            y ^= y << 5;
            test::black_box(y);
        }
    };
    for _ in 0..1000 {
        f();
    }
}

@pitdicker This is a pretty simple benchmark, so I'm wondering, is the bench harness regressed?

@pitdicker
Copy link
Contributor Author

That would be my guess. There are two somewhat recent PR's that touch the bench harness, but they don't seem to fit the nightly: #46636 and #46620

@Mark-Simulacrum Mark-Simulacrum added this to the 1.25 milestone Jan 15, 2018
@dtolnay dtolnay added regression-from-stable-to-beta Performance or correctness regression from stable to beta. and removed regression-from-stable-to-nightly Performance or correctness regression from stable to nightly. labels Jan 21, 2018
@pietroalbini pietroalbini modified the milestones: 1.25, 1.24 Jan 24, 2018
@eddyb
Copy link
Member

eddyb commented Jan 25, 2018

#46623 doesn't affect the code being benchmarked directly, but it does make the LLVM IR different enough that it's hard to distinguish any one specific change.
@est31 mentioned #46923 and #46897, and while this looks similar, @pitdicker says the nightly those got fixed on doesn't fix this issue.

Locally I get these results:

  • nightly-2017-12-15 (0077d12 2017-12-14): 1,833 ns
  • nightly-2017-12-16 (77efd68 2017-12-15): 6,866 ns
  • nightly-2017-12-29 (77e189c 2017-12-28): 3,230 ns
    • with -C codegen-units=1: 1,831 ns
  • nightly-2018-01-24 (4e3901d 2018-01-23): 3,223 ns
    • with -C codegen-units=1: 1,836 ns

@Mark-Simulacrum What's the easiest way to confirm that #46623 caused a regression here, and find where it got (partially) fixed?

EDIT: @nagisa pointed out that we started doing multiple codegen units which have an impact on top of the original regression (see #47665), obscuring the fix (for which @est31 was right AFAICT).

@eddyb
Copy link
Member

eddyb commented Jan 25, 2018

Closing as dupe of #46897 / #47665.

@Mark-Simulacrum
Copy link
Member

Since it was merged less than 90 days ago, we should have CI artifacts on both sides of #46623 (before/after) which you could run and get numbers from. Is that what you are looking for? As to where it got fixed, well, you could either run rust-bisect since it's within 90 days or go by nightly and then run bisect or its equivalent on the PRs within that nightly commit range.

Since this is a timing thing, it is probably easiest to use rust-bisect just for the install-sysroot command and use git bisect in rust-lang/rust manually. Shouldn't be too hard.

@pitdicker
Copy link
Contributor Author

@nagisa pointed out that we started doing multiple codegen units which have an impact on top of the original regression (see #47665), obscuring the fix (for which @est31 was right AFAICT).

Great, I can confirm I get the same performance with benchmarks as before with

export RUSTFLAGS="-C codegen-units=1"
cargo bench

(but it does not sound like a great permanent solution)

@eddyb
Copy link
Member

eddyb commented Jan 25, 2018

@pitdicker Right, which is why #47665 is still open - feel free to track / discuss on that issue!

@pitdicker pitdicker mentioned this issue Mar 7, 2018
3 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-bug Category: This is a bug. I-slow Issue: Problems and improvements with respect to performance of generated code. P-high High priority regression-from-stable-to-beta Performance or correctness regression from stable to beta. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue.
Projects
None yet
Development

No branches or pull requests

9 participants