Rust insufficiently optimizes loop { match { } } state machines #80630

tuxmark5 · 2021-01-02T20:39:51Z

This is how a typical state machine might look in Rust:

#[inline(never)]
fn print(s: &str) {
    println!("{}", s);
}

#[inline(never)]
fn exec(mut s: i32) {
    loop {
        match s {
            0 => { print("0"); s = 2; },
            1 => { print("1"); s = 3; },
            2 => { print("2"); s = 1; },
            _ => return
        }
    }
}

fn main() {
    exec(0);
}

One would expect that the first jump from function start into one of the states (0, 1, 2 or _) would be compiled as a jumptable or a series of cmp/conditional jump instructions and once the execution is in one of the states the remaining jumps would be deterministic (non-conditional) jumps.

And this is how equivalent code in gcc is optimized.

However Rust/LLVM at the end of each state performs needless comparisons to determine to which label to jump to next, even though the jump target can be computed statically. You can inspect the generated assembly in Rust Playground.

This is probably more of a LLVM issue rather than Rust directly, but it's important to get this fixed/optimized, because now it's impossible to efficiently write such state machines in Rust (because there is no goto)

Meta

Tested with all versions available in Rust playground:

1.49.0
1.50.0-beta.2 (2020-12-31 25b3db3)
1.51.0-nightly (2020-12-30 e226704)

The text was updated successfully, but these errors were encountered:

the8472 · 2021-01-02T22:43:13Z

One would expect that the first jump from function start into one of the states (0, 1, 2 or _) would be compiled as a jumptable or a series of cmp/conditional jump instructions

Atleast that part happens once you add more states.

nagisa · 2021-01-02T23:13:39Z

Didn't look too deeply into it but the IR fed to LLVM around the loop-match construct is not that different between rustc and clang. However with clang it becomes a switchtable at llvm-ir level, and we rely on the codegen backend to do so with rust https://godbolt.org/z/9Erv94

FWIW from what I can tell the additional comparisons can be blamed on s = <new> state setting and the phi that results from it.

oli-obk · 2021-01-04T00:44:33Z

I thought we already had an optimization for this

    bb1: {
        switchInt(_1) -> [0_i32: bb3, 1_i32: bb5, 2_i32: bb7, otherwise: bb2]; // scope 0 at src/lib.rs:10:13: 10:14
    }
    bb4: {
        StorageDead(_3);                 // scope 0 at src/lib.rs:10:29: 10:30
        StorageDead(_4);                 // scope 0 at src/lib.rs:10:30: 10:31
        StorageDead(_2);                 // scope 0 at src/lib.rs:10:30: 10:31
        _1 = const 2_i32;                // scope 0 at src/lib.rs:10:32: 10:37
        goto -> bb1;                     // scope 0 at src/lib.rs:9:9: 14:10
    }

just looking locally at this, the goto -> bb1 should become goto -> bb7. Basically if a block assigns a constant to a local and then jumps to a block which does not touch that local and ends up conditionally branching on the local, then we can copy all statements from the branching block to the current block and make it an unconditional jump to the right target.

simonvandel · 2021-01-04T08:41:17Z

AFAICT this is optimized with #80475

nikic · 2021-01-04T08:52:04Z

LLVM can optimize this with the -C llvm-args=-jump-threading-across-loop-headers option. By default LLVM does not perform jump threading across loop headers, because it cannot reliably determine profitability for this case right now. Doing this naively results in natural loops becoming irreducible, which will pessimize all following loop optimizations.

camelid · 2021-03-20T22:20:52Z

AFAICT this is optimized with #80475

That PR is merged now. Did that fix this issue?

tuxmark5 · 2021-03-21T12:12:57Z

Testing it with (2021-03-20 61edfd5) on the playground shows no improvement.

oli-obk · 2021-03-22T09:57:44Z

Testing it with (2021-03-20 61edfd5) on the playground shows no improvement.

The playground does not actually run mir optimizations of level 3 or above, so you would need to run this locally with nightly and use -Zmir-opt-level=3

Doing this naively results in natural loops becoming irreducible, which will pessimize all following loop optimizations.

This is worrysome. Our optimization as written won't do anything fancy, so it will make loops more complex. The perf results on that optimization are mostly in the noise (ignoring the CTFE stress tests), though without a dedicated compile-time and runtime test, it's hard to tell. Considering that the compiler bootstrap did not regress, I don't think the opt messes with anything too badly.

camelid · 2021-03-23T19:03:17Z

Testing it with (2021-03-20 61edfd5) on the playground shows no improvement.

The playground does not actually run mir optimizations of level 3 or above, so you would need to run this locally with nightly and use -Zmir-opt-level=3

Or you can use Godbolt: https://rust.godbolt.org/z/doK6W9jPT

(I don't what the Assembly is supposed to look like so I can't tell if this is fixed or not.)

tuxmark5 · 2021-03-23T19:47:58Z

Unfortunately it still doesn't seem fixed. Assembly from godbolt link:

example::exec:  << The start of exec() function in rust
        sub     rsp, 8
        mov     dword ptr [rsp + 4], edi
.LBB2_1: << The block that implements the 'switch' 
        mov     eax, dword ptr [rsp + 4]
        mov     dword ptr [rsp], eax
        test    eax, eax
        je      .LBB2_3
        jmp     .LBB2_9
.LBB2_9: 
        mov     eax, dword ptr [rsp]
        sub     eax, 1
        je      .LBB2_5
        jmp     .LBB2_10
.LBB2_10:
        mov     eax, dword ptr [rsp]
        sub     eax, 2
        je      .LBB2_7
        jmp     .LBB2_2
.LBB2_2:
        pop     rax
        ret
.LBB2_3: << case 0
        lea     rdi, [rip + .L__unnamed_2]
        mov     esi, 1
        call    qword ptr [rip + example::print@GOTPCREL]
        mov     dword ptr [rsp + 4], 2 << explicit store to stack: s = 2
        jmp     .LBB2_1 << BAD: jumps back to the 'switch', but s value is known
.LBB2_5: << case 1
        lea     rdi, [rip + .L__unnamed_3]
        mov     esi, 1
        call    qword ptr [rip + example::print@GOTPCREL]
        mov     dword ptr [rsp + 4], 3 << explicit store to stack: s = 3
        jmp     .LBB2_1 << BAD: jumps back to the 'switch' , but s value is known
.LBB2_7: << case 2
        lea     rdi, [rip + .L__unnamed_4]
        mov     esi, 1
        call    qword ptr [rip + example::print@GOTPCREL]
        mov     dword ptr [rsp + 4], 1 << explicit store to stack: s = 1
        jmp     .LBB2_1 << BAD: jumps back to the 'switch' , but s value is known

I've annotated it a bit with << to make it easier to understand. Basically all the match arms eventually jump back to the 'switch' block, where s is compared/tested.

oli-obk · 2021-03-25T12:47:54Z

Maybe LLVM "de-optimizes" our MIR again. Not sure, we'll have to look at the MIR before LLVM, at LLVM-IR without opts and at LLVM-IR with opts to be sure. I'm not sure how to coax godbolt into doing MIR dumps though.

the8472 · 2021-03-25T13:20:25Z

Add --emit=mir

tuxmark5 · 2021-03-25T13:33:24Z

The same can be seen in MIR too:

bb0: {
    switchInt(_1) -> [0_i32: bb2, 1_i32: bb4, 2_i32: bb6, otherwise: bb1]; // scope 0 at ./example.rs:10:13: 10:14
}

bb1: {
    return;                          // scope 0 at ./example.rs:16:2: 16:2
}

bb2: {
    _4 = const "0";                  // scope 0 at ./example.rs:10:26: 10:29
    _3 = _4;                         // scope 0 at ./example.rs:10:26: 10:29
    _2 = print(move _3) -> bb3;      // scope 0 at ./example.rs:10:20: 10:30
}

bb3: {
    _1 = const 2_i32;                // scope 0 at ./example.rs:10:32: 10:37
    goto -> bb0;                     // scope 0 at ./example.rs:9:9: 14:10
}

bb4: {
    _7 = const "1";                  // scope 0 at ./example.rs:11:26: 11:29
    _6 = _7;                         // scope 0 at ./example.rs:11:26: 11:29
    _5 = print(move _6) -> bb5;      // scope 0 at ./example.rs:11:20: 11:30
}

bb5: {
    _1 = const 3_i32;                // scope 0 at ./example.rs:11:32: 11:37
    goto -> bb0;                     // scope 0 at ./example.rs:9:9: 14:10
}

bb6: {
    _10 = const "2";                 // scope 0 at ./example.rs:12:26: 12:29
    _9 = _10;                        // scope 0 at ./example.rs:12:26: 12:29
    _8 = print(move _9) -> bb7;      // scope 0 at ./example.rs:12:20: 12:30
}

bb7: {
    _1 = const 1_i32;                // scope 0 at ./example.rs:12:32: 12:37
    goto -> bb0;                     // scope 0 at ./example.rs:9:9: 14:10
}

After seeing the MIR I thought maybe the issue was the extra "suffix" block after each print operation, so i switcher around assignment and print (instead of doing print("0"); s = 2; in match arms I tried s = 2; print("0");). This yielded simpler MIR, but still the same issue was present:

bb0: {
    switchInt(_1) -> [0_i32: bb2, 1_i32: bb3, 2_i32: bb4, otherwise: bb1]; // scope 0 at ./example.rs:10:13: 10:14
}

bb1: {
    return;                          // scope 0 at ./example.rs:16:2: 16:2
}

bb2: {
    _1 = const 2_i32;                // scope 0 at ./example.rs:10:20: 10:25
    _4 = const "0";                  // scope 0 at ./example.rs:10:33: 10:36
    _3 = _4;                         // scope 0 at ./example.rs:10:33: 10:36
    _2 = print(move _3) -> bb0;      // scope 0 at ./example.rs:10:27: 10:37
}

bb3: {
    _1 = const 3_i32;                // scope 0 at ./example.rs:11:20: 11:25
    _7 = const "1";                  // scope 0 at ./example.rs:11:33: 11:36
    _6 = _7;                         // scope 0 at ./example.rs:11:33: 11:36
    _5 = print(move _6) -> bb0;      // scope 0 at ./example.rs:11:27: 11:37
}

bb4: {
    _1 = const 1_i32;                // scope 0 at ./example.rs:12:20: 12:25
    _10 = const "2";                 // scope 0 at ./example.rs:12:33: 12:36
    _9 = _10;                        // scope 0 at ./example.rs:12:33: 12:36
    _8 = print(move _9) -> bb0;      // scope 0 at ./example.rs:12:27: 12:37
}

oli-obk · 2021-03-25T13:41:13Z

I just looked at the optimization itself. I do not think the issue is related to the print. In fact the order as written is the only one the optimization targets right now. So I'm wondering if the MIR that the optimization sees has more statements that get filtered out later. In order to debug this I believe we need some new mir-opt tests and potentially some tracing statements in the optimization

cynecx · 2021-05-05T18:38:49Z

A new optimization pass is currently in review upstream: https://reviews.llvm.org/D99205

nikic · 2021-08-28T13:45:41Z

With new DFA jump threading pass: https://rust.godbolt.org/z/v1a4c689E

tuxmark5 · 2021-08-28T16:50:12Z

With new DFA jump threading pass: https://rust.godbolt.org/z/v1a4c689E

Optimization breaks if you mark exec as pub.

EDIT: It only seems to work, when LLVM can infer the initial value/state of the state machine, which doesn't happen often in practice when dealing with state machines. Marking exec as pub means that it can be called from other compilation units, hence there is no way to know the initial value of the s.

amehsan · 2022-01-11T02:38:44Z

With new DFA jump threading pass: https://rust.godbolt.org/z/v1a4c689E

Optimization breaks if you mark exec as pub.

EDIT: It only seems to work, when LLVM can infer the initial value/state of the state machine, which doesn't happen often in practice when dealing with state machines. Marking exec as pub means that it can be called from other compilation units, hence there is no way to know the initial value of the s.

I have not looked into this test case (just saw this issue today), but the limitation that you have mentioned here is not hard to fix. We will look into it and get back to you

Responding to a feature request from the Rust community: rust-lang/rust#80630 void foo(X) { for (...) switch (X) case A X = B case B X = C } Even though the initial switch value is non-constant, the switch statement can still be threaded: the initial value will hit the switch statement but the rest of the state changes will proceed by jumping unconditionally. The early predictability check is relaxed to allow unpredictable values anywhere, but later, after the paths through the switch statement have been enumerated, no non-constant state values are allowed along the paths. Any state value not along a path will be an initial switch value, which can be safely ignored. Differential Revision: https://reviews.llvm.org/D124394

bryanpkc · 2023-05-05T15:07:22Z

@tuxmark5 Does D124394 solve this issue for you?

tuxmark5 · 2023-05-05T15:32:42Z

@bryanpkc This looks like exactly what I need.

tuxmark5 added the C-bug Category: This is a bug. label Jan 2, 2021

nagisa added the A-LLVM Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues. label Jan 2, 2021

jyn514 added the I-slow Issue: Problems and improvements with respect to performance of generated code. label Jan 3, 2021

oli-obk added the A-mir-opt Area: MIR optimizations label Jan 4, 2021

Noratrieb added the T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. label Apr 5, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Rust insufficiently optimizes loop { match { } } state machines #80630

Rust insufficiently optimizes loop { match { } } state machines #80630

tuxmark5 commented Jan 2, 2021

the8472 commented Jan 2, 2021

nagisa commented Jan 2, 2021 •

edited

Loading

oli-obk commented Jan 4, 2021

simonvandel commented Jan 4, 2021

nikic commented Jan 4, 2021

camelid commented Mar 20, 2021

tuxmark5 commented Mar 21, 2021

oli-obk commented Mar 22, 2021

camelid commented Mar 23, 2021 •

edited

Loading

tuxmark5 commented Mar 23, 2021 •

edited

Loading

oli-obk commented Mar 25, 2021

the8472 commented Mar 25, 2021

tuxmark5 commented Mar 25, 2021

oli-obk commented Mar 25, 2021

cynecx commented May 5, 2021

nikic commented Aug 28, 2021

tuxmark5 commented Aug 28, 2021 •

edited

Loading

amehsan commented Jan 11, 2022

bryanpkc commented May 5, 2023

tuxmark5 commented May 5, 2023

Rust insufficiently optimizes loop { match { } } state machines #80630

Rust insufficiently optimizes loop { match { } } state machines #80630

Comments

tuxmark5 commented Jan 2, 2021

Meta

the8472 commented Jan 2, 2021

nagisa commented Jan 2, 2021 • edited Loading

oli-obk commented Jan 4, 2021

simonvandel commented Jan 4, 2021

nikic commented Jan 4, 2021

camelid commented Mar 20, 2021

tuxmark5 commented Mar 21, 2021

oli-obk commented Mar 22, 2021

camelid commented Mar 23, 2021 • edited Loading

tuxmark5 commented Mar 23, 2021 • edited Loading

oli-obk commented Mar 25, 2021

the8472 commented Mar 25, 2021

tuxmark5 commented Mar 25, 2021

oli-obk commented Mar 25, 2021

cynecx commented May 5, 2021

nikic commented Aug 28, 2021

tuxmark5 commented Aug 28, 2021 • edited Loading

amehsan commented Jan 11, 2022

bryanpkc commented May 5, 2023

tuxmark5 commented May 5, 2023

nagisa commented Jan 2, 2021 •

edited

Loading

camelid commented Mar 23, 2021 •

edited

Loading

tuxmark5 commented Mar 23, 2021 •

edited

Loading

tuxmark5 commented Aug 28, 2021 •

edited

Loading