Performance regression with nightly benchmarks. #47062

pitdicker · 2017-12-29T15:54:01Z

The benchmarks for rand and for my crate small-rngs have regressed a lot with recent nightlies. A few extra branches are introduced at the end of the benchmark loop.

Relevant assembly with nightly 0077d12 2017-12-14:

       │     → callq  std::time::Instant::now
       │       xor    %r8d,%r8d
       │       lea    0xb0(%rsp),%rdi
       │       lea    0x8(%rsp),%rcx
       │       nop
       │4b0:   mov    $0x3e8,%esi
       │       data16 nopw %cs:0x0(%rax,%rax,1)
 22,85 │4c0:   mov    %rbp,%rax
  8,48 │       mul    %r13
       │       xor    %rbp,%r14
  1,01 │       rol    $0x37,%rbp
       │       mov    %rbp,%rbx
 23,14 │       xor    %r14,%rbx
       │       mov    %r14,%rbp
  1,29 │       shl    $0xe,%rbp
  2,06 │       xor    %rbx,%rbp
 20,55 │       rol    $0x24,%r14
  0,05 │       shld   $0x20,%rax,%rdx
  1,36 │       mov    %rdx,0x8(%rsp)
  3,13 │       dec    %rsi
       │     ↑ jne    4c0
  0,03 │       inc    %r8
  0,02 │       cmp    %r12,%r8
       │     ↑ jne    4b0
       │       mov    %r15,%rsi
       │     → callq  std::time::Instant::elapsed

With nightly 77efd68 2017-12-15:

       │     → callq  std::time::Instant::now
       │       mov    %rax,0x10(%rsp)
       │       mov    %rdx,0x18(%rsp)
       │       xor    %eax,%eax
       │       movabs $0x2545f4914f6cdd1d,%r9
       │       mov    %rsp,%r10
       │       cmp    %r14,%rax
       │     ↓ jb     48
       │     ↓ jmp    51
       │       data16 data16 nopw %cs:0x0(%rax,%rax,1)
       │ 40:   mov    %r8,%rax
       │       cmp    %r14,%rax
       │     ↓ jae    51
  0,02 │ 48:   mov    %rax,%r8
       │       add    $0x1,%r8
       │     ↓ jae    60
       │ 51:   mov    %rax,%r8
       │       xor    %ecx,%ecx
       │       xor    %eax,%eax
       │     ↓ jmp    6d
       │       nop
       │ 60:   movq   $0x1,(%rsp)
       │       mov    $0x1,%ecx
  0,02 │ 6d:   mov    %rax,(%rsp,%rcx,8)
  0,01 │       cmpq   $0x1,(%rsp)
       │     ↓ jne    10d
       │       xor    %eax,%eax
       │       cmp    $0x3e7,%rax
       │     ↓ jbe    d3
       │     ↓ jmp    dc
       │       nop
  0,00 │ 90:   mov    (%r15),%rdi
  9,24 │       mov    (%rdi),%rcx
       │       mov    %rcx,%rax
  9,94 │       mul    %r9
  0,00 │       mov    0x8(%rdi),%rsi
  0,84 │       xor    %rcx,%rsi
  0,01 │       rol    $0x37,%rcx
 14,55 │       xor    %rsi,%rcx
       │       mov    %rsi,%rbx
  1,64 │       shl    $0xe,%rbx
  6,80 │       xor    %rcx,%rbx
  6,76 │       mov    %rbx,(%rdi)
  0,07 │       rol    $0x24,%rsi
  5,16 │       mov    %rsi,0x8(%rdi)
  0,22 │       shld   $0x20,%rax,%rdx
  6,97 │       mov    %rdx,(%rsp)
       │       mov    %r11,%rax
  5,33 │       cmp    $0x3e7,%rax
       │     ↓ ja     dc
       │ d3:   mov    %rax,%r11
 13,70 │       add    $0x1,%r11
       │     ↓ jae    f0
       │ dc:   mov    %rax,%r11
  0,02 │       xor    %ecx,%ecx
       │       xor    %eax,%eax
  0,01 │     ↓ jmp    fd
       │       data16 nopw %cs:0x0(%rax,%rax,1)
       │ f0:   movq   $0x1,(%rsp)
  6,77 │       mov    $0x1,%ecx
  8,61 │ fd:   mov    %rax,(%rsp,%rcx,8)
  0,13 │       cmpq   $0x1,(%rsp)
  3,19 │     ↑ je     90
       │     ↑ jmpq   40
       │10d:   lea    0x10(%rsp),%rdi
       │     → callq  std::time::Instant::elapsed

The text was updated successfully, but these errors were encountered:

est31 · 2017-12-29T16:24:40Z

Commit range: 0077d12...77efd68

PRs that could have caused it: #46623 #46701 cc @eddyb

est31 · 2017-12-29T16:25:36Z

Also maybe #46537 but it is supposed to be off by default.

est31 · 2017-12-29T16:27:40Z

This bug looks very much like #46923 and #46897 , both of which are fixed on rustc 1.24.0-nightly (1abeb43 2017-12-27). @pitdicker can you check whether the issue still reproduces?

pitdicker · 2017-12-29T16:39:02Z

Wow, that is fast @est31 😄
With nightly (77e189c 2017-12-28) I can still reproduce it.

eddyb · 2017-12-30T03:01:28Z

It can't be #46701 unless #[repr(simd)] types and newtypes thereof are involved.
We have to look at LLVM IR. @pitdicker How did you get the assembly you posted here?

nikomatsakis · 2018-01-03T18:46:22Z

I'm tagging this as a regression (it seems to be?)

pitdicker · 2018-01-03T19:33:38Z

Sorry for the late reply.

This assembly happens to be from the gen_u64_xorshift_mt_64 becnhmark of the small_rngs crate, with the i128_support feature. But any other benchmark from it has similar results.

I did it a bit naïve: the assembly is copied from perf record and perf report, where I assume the parts between the timer functions are of interest.

nikomatsakis · 2018-01-04T15:47:04Z

@pitdicker Is it possible to make a small, standalone example? That would no doubt be very helpful.

I'm going to call this P-high, but I'm not sure who to assign it to.

nikomatsakis · 2018-01-04T15:47:12Z

triage: P-high

pitdicker · 2018-01-04T16:58:59Z

Does this count as 'small, standalone'?

#[bench]
fn tiny_xorshift(b: &mut Bencher) {
    let mut y = 2463534242u32;
    b.bytes = 4 * 1000;
    b.iter(|| {
        for _ in 0..1000 {
            y ^= y << 13;
            y ^= y >> 17;
            y ^= y << 5;
            black_box(y);
        }
    });
}

Before:

test tiny_xorshift              ... bench:       1,586 ns/iter (+/- 4) = 2522 MB/s

After:

test tiny_xorshift              ... bench:       2,796 ns/iter (+/- 24) = 1430 MB/s

It has the same regression without the extra loop, but runs too fast to measure well:

#[bench]
fn tiny_xorshift(b: &mut Bencher) {
    let mut y = 2463534242u32;
    b.bytes = 4;
    b.iter(|| {
        y ^= y << 13;
        y ^= y >> 17;
        y ^= y << 5;
        black_box(y);
    });
}

eddyb · 2018-01-04T19:10:11Z

I tried reproducing without the bench harness and I ended up at this, which doesn't repro:

#![crate_type = "lib"]
#![feature(test)]
extern crate test;
pub fn tiny_xorshift() {
    let mut y = 2463534242u32;
    let mut f = || {
        for _ in 0..1000 {
            y ^= y << 13;
            y ^= y >> 17;
            y ^= y << 5;
            test::black_box(y);
        }
    };
    for _ in 0..1000 {
        f();
    }
}

@pitdicker This is a pretty simple benchmark, so I'm wondering, is the bench harness regressed?

pitdicker · 2018-01-04T19:56:48Z

That would be my guess. There are two somewhat recent PR's that touch the bench harness, but they don't seem to fit the nightly: #46636 and #46620

eddyb · 2018-01-25T12:12:17Z

#46623 doesn't affect the code being benchmarked directly, but it does make the LLVM IR different enough that it's hard to distinguish any one specific change.
@est31 mentioned #46923 and #46897, ~~and while this looks similar, @pitdicker says the nightly those got fixed on doesn't fix this issue~~.

Locally I get these results:

nightly-2017-12-15 (0077d12 2017-12-14): 1,833 ns
nightly-2017-12-16 (77efd68 2017-12-15): 6,866 ns
nightly-2017-12-29 (77e189c 2017-12-28): 3,230 ns
- with -C codegen-units=1: 1,831 ns
nightly-2018-01-24 (4e3901d 2018-01-23): 3,223 ns
- with -C codegen-units=1: 1,836 ns

~~@Mark-Simulacrum What's the easiest way to confirm that #46623 caused a regression here, and find where it got (partially) fixed?~~

EDIT: @nagisa pointed out that we started doing multiple codegen units which have an impact on top of the original regression (see #47665), obscuring the fix (for which @est31 was right AFAICT).

eddyb · 2018-01-25T15:43:45Z

Closing as dupe of #46897 / #47665.

Mark-Simulacrum · 2018-01-25T19:04:32Z

Since it was merged less than 90 days ago, we should have CI artifacts on both sides of #46623 (before/after) which you could run and get numbers from. Is that what you are looking for? As to where it got fixed, well, you could either run rust-bisect since it's within 90 days or go by nightly and then run bisect or its equivalent on the PRs within that nightly commit range.

Since this is a timing thing, it is probably easiest to use rust-bisect just for the install-sysroot command and use git bisect in rust-lang/rust manually. Shouldn't be too hard.

pitdicker · 2018-01-25T19:38:12Z

@nagisa pointed out that we started doing multiple codegen units which have an impact on top of the original regression (see #47665), obscuring the fix (for which @est31 was right AFAICT).

Great, I can confirm I get the same performance with benchmarks as before with

export RUSTFLAGS="-C codegen-units=1"
cargo bench

(but it does not sound like a great permanent solution)

eddyb · 2018-01-25T19:59:48Z

@pitdicker Right, which is why #47665 is still open - feel free to track / discuss on that issue!

kennytm added C-bug Category: This is a bug. I-slow Issue: Problems and improvements with respect to performance of generated code. labels Dec 29, 2017

nikomatsakis added regression-from-stable-to-nightly Performance or correctness regression from stable to nightly. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. labels Jan 3, 2018

rust-highfive added the P-high High priority label Jan 4, 2018

Mark-Simulacrum added this to the 1.25 milestone Jan 15, 2018

dtolnay added regression-from-stable-to-beta Performance or correctness regression from stable to beta. and removed regression-from-stable-to-nightly Performance or correctness regression from stable to nightly. labels Jan 21, 2018

pietroalbini modified the milestones: 1.25, 1.24 Jan 24, 2018

eddyb closed this as completed Jan 25, 2018

nagisa mentioned this issue Jan 25, 2018

codegen-units + ThinLTO is not as good as codegen-units = 1 #47745

Open

eddyb mentioned this issue Feb 1, 2018

LTO: 10x regression #47866

Closed

pitdicker mentioned this issue Mar 7, 2018

Rng::iter rust-random/rand#275

Closed

3 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Performance regression with nightly benchmarks. #47062

Performance regression with nightly benchmarks. #47062

pitdicker commented Dec 29, 2017

est31 commented Dec 29, 2017

est31 commented Dec 29, 2017

est31 commented Dec 29, 2017

pitdicker commented Dec 29, 2017 •

edited

Loading

eddyb commented Dec 30, 2017

nikomatsakis commented Jan 3, 2018

pitdicker commented Jan 3, 2018

nikomatsakis commented Jan 4, 2018

nikomatsakis commented Jan 4, 2018

pitdicker commented Jan 4, 2018

eddyb commented Jan 4, 2018

pitdicker commented Jan 4, 2018

eddyb commented Jan 25, 2018 •

edited

Loading

eddyb commented Jan 25, 2018

Mark-Simulacrum commented Jan 25, 2018

pitdicker commented Jan 25, 2018

eddyb commented Jan 25, 2018

Performance regression with nightly benchmarks. #47062

Performance regression with nightly benchmarks. #47062

Comments

pitdicker commented Dec 29, 2017

est31 commented Dec 29, 2017

est31 commented Dec 29, 2017

est31 commented Dec 29, 2017

pitdicker commented Dec 29, 2017 • edited Loading

eddyb commented Dec 30, 2017

nikomatsakis commented Jan 3, 2018

pitdicker commented Jan 3, 2018

nikomatsakis commented Jan 4, 2018

nikomatsakis commented Jan 4, 2018

pitdicker commented Jan 4, 2018

eddyb commented Jan 4, 2018

pitdicker commented Jan 4, 2018

eddyb commented Jan 25, 2018 • edited Loading

eddyb commented Jan 25, 2018

Mark-Simulacrum commented Jan 25, 2018

pitdicker commented Jan 25, 2018

eddyb commented Jan 25, 2018

pitdicker commented Dec 29, 2017 •

edited

Loading

eddyb commented Jan 25, 2018 •

edited

Loading