
Built-in u128 performance compared to extprim crate #39078

Closed
leonardo-m opened this issue Jan 15, 2017 · 23 comments
Labels
C-enhancement Category: An issue proposing an enhancement or a PR with one. I-slow Issue: Problems and improvements with respect to performance of generated code.

Comments

@leonardo-m

This is a performance bug report.

This little program solves a problem derived from this (a ten times larger problem):
https://projecteuler.net/problem=55

The solution (this is not the fastest solution for this problem, but it's an easy-to-understand one):

#![feature(i128_type)]

fn is_lychrel(mut n: u128) -> bool {
    for _ in 0 .. 50 {
        n += n.to_string().chars().rev().collect::<String>().parse().unwrap();
        if n.to_string() == n.to_string().chars().rev().collect::<String>() {
            return false;
        }
    }
    true
}

fn e55() -> usize {
    (0u32 .. 100_000).filter(|&i| is_lychrel(i.into())).count()
}

fn main() {
    println!("{}", e55() == 6208);
}

On my PC it runs in about 1.12 seconds.

Using an external crate for the 128-bit values, with the same code (extprim = "1.1.0"):

extern crate extprim;
use extprim::u128::u128;

fn is_lychrel(mut n: u128) -> bool {
    for _ in 0 .. 50 {
        n += n.to_string().chars().rev().collect::<String>().parse().unwrap();
        if n.to_string() == n.to_string().chars().rev().collect::<String>() {
            return false;
        }
    }
    true
}

fn e55() -> usize {
    (0u32 .. 100_000).filter(|&i| is_lychrel(i.into())).count()
}

fn main() {
    println!("{}", e55() == 6208);
}

This runs in about 0.72 seconds.
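For what it's worth, much of the runtime here goes through `to_string` plus string reversal. A purely arithmetic digit reversal (a sketch, not part of the original report; `reverse_digits` is a hypothetical helper) performs a u128 division and remainder on every loop iteration, so it isolates the 128-bit division path (`__udivti3`/`__umodti3`) more directly:

```rust
// Sketch: reverse the decimal digits of a u128 arithmetically,
// avoiding the String allocations in the benchmark above.
// Each iteration does a u128 `%` and `/`, which lower to the
// __umodti3/__udivti3 intrinsics under discussion.
fn reverse_digits(mut n: u128) -> u128 {
    let mut reversed: u128 = 0;
    while n > 0 {
        reversed = reversed * 10 + n % 10;
        n /= 10;
    }
    reversed
}

fn main() {
    // A number is a decimal palindrome iff it equals its digit reversal.
    let n: u128 = 4994;
    println!("{}", reverse_digits(n) == n); // prints "true"
}
```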

@est31
Member

est31 commented Jan 15, 2017

On which target do you run this, and which one was it compiled for? Also, what does the to_string implementation look like? Is it different in the two examples?

@est31
Member

est31 commented Jan 15, 2017

cc #35118
@nagisa

@leonardo-m
Author

Right. I am using a 64-bit Windows GNU target.

@leonardo-m
Author

I think this is the formatter for that crate:

impl fmt::Display for u128 {
    fn fmt(&self, formatter: &mut fmt::Formatter) -> fmt::Result {
        if self.hi == 0 {
            self.lo.fmt(formatter)
        } else {
            const TEN19: u128 = u128 { lo: 10000000000000000000, hi: 0 };

            let mut buffer = [0u8; 39];
            let mut buf = FormatBuffer::new(&mut buffer);

            let (mid, lo) = div_rem(*self, TEN19);
            if mid.hi == 0 {
                try!(write!(&mut buf, "{}{:019}", mid.lo, lo.lo));
            } else {
                let (hi, mid) = div_rem(mid, TEN19);
                try!(write!(&mut buf, "{}{:019}{:019}", hi.lo, mid.lo, lo.lo));
            }

            formatter.pad_integral(true, "", unsafe { buf.into_str() })
        }
    }
}

where div_rem uses the function udivmodti4 from here:

https://github.com/kennytm/extprim/blob/master/src/compiler_rt.rs
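The same base-10^19 chunking strategy can be sketched with the built-in u128 (a hypothetical helper, just to illustrate what the extprim formatter is doing): each chunk fits in a u64, so after at most two 128-bit divisions all remaining formatting is ordinary 64-bit work.

```rust
// Sketch of the extprim formatting strategy using the built-in u128:
// split the value into base-10^19 limbs (each fits in a u64), then
// format the limbs with plain 64-bit code, zero-padding all but the top one.
fn to_decimal(n: u128) -> String {
    const TEN19: u128 = 10_000_000_000_000_000_000; // 10^19
    let lo = (n % TEN19) as u64;
    let rest = n / TEN19;
    if rest == 0 {
        return lo.to_string();
    }
    let mid = (rest % TEN19) as u64;
    let hi = (rest / TEN19) as u64; // at most 2 digits: a u128 has at most 39
    if hi == 0 {
        format!("{}{:019}", mid, lo)
    } else {
        format!("{}{:019}{:019}", hi, mid, lo)
    }
}
```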

@nagisa
Member

nagisa commented Jan 15, 2017

There are at most two places where the difference in runtime could come from:

  1. Implementation of the intrinsic function(s);
  2. Overhead due to ABI differences (unlikely to be 2x penalty).

Or some combination of the two. While the linked udivmodti4 intrinsic seems fairly similar to ours, one can spot a number of potentially relevant differences (as well as signature violations).

perf also shows about 55% of runtime is spent on a few particularly hot instructions on my machine. Notably there’s a shr that’s as hot as the hottest divisions, which seems to suggest some pathological case. Some gentle massaging of the intrinsic could help to improve performance of such code immensely.

This issue is applicable to intrinsics in general; all of them could benefit from having some of their low hanging fruit plucked.

@est31
Member

est31 commented Jan 15, 2017

There is at least one place to do a possible speed optimisation: https://github.com/rust-lang/rust/blob/master/src/libcompiler_builtins/lib.rs#L233

Now that we have a benchmark, we can actually measure it. However, any fix should also be applied to the libcompiler_builtins crate, otherwise the effort will be wasted once that crate gets merged.

@Mark-Simulacrum Mark-Simulacrum added the I-slow Issue: Problems and improvements with respect to performance of generated code. label May 19, 2017
@Mark-Simulacrum Mark-Simulacrum added C-enhancement Category: An issue proposing an enhancement or a PR with one. and removed C-enhancement Category: An issue proposing an enhancement or a PR with one. labels Jul 26, 2017
@steveklabnik
Member

Triage: when trying to turn this into a benchmark test, the test runner seemed to hang. With a simple main function that does the calculation, I'm still seeing results similar to @leonardo-m's.

@leonardo-m
Author

I have updated the first test code (note that N_ITER should be 60, unlike in the old version):

// Compiled with:
// -O -Z mir-opt-level=3 -C target-cpu=native -C lto -C panic=abort -C codegen-units=1

fn euler55b() -> usize {
    const LIMIT: u128 = 100_000;
    const N_ITER: usize = 60;

    fn reverse(n: u128) -> String { n.to_string().chars().rev().collect() }

    fn is_lychrel(mut n: u128) -> bool {
        for _ in 0 .. N_ITER {
            n += reverse(n).parse::<u128>().unwrap();
            if n.to_string() == reverse(n) {
                return false;
            }
        }
        true
    }

    (0 .. LIMIT).filter(|&i| is_lychrel(i)).count()
}

fn main() {
    println!("{}", euler55b() == 6_091);
}

Now it runs in about 0.86 seconds, on the same machine (rustc 1.48.0-nightly d006f57 2020-08-28).

@AaronKutch
Contributor

AaronKutch commented Sep 12, 2020

Did my code outperform the extprim crate? If so, then the only major remaining problem with divisions is #44545.
Edit: I realized that I need to tweak the inlining more to almost double the performance of __divti3

@AaronKutch
Contributor

Actually, it does not seem that the performance changes have made it to nightly yet.

@leonardo-m
Author

I see no performance change yet (rustc 1.48.0-nightly dbb73f8 2020-09-12).

@Amanieu
Member

Amanieu commented Sep 13, 2020

I published a new version of compiler-builtins with the new division code.

@josephlr
Contributor

@Amanieu It looks like Rust isn't getting the asm-related improvements because the "asm" feature isn't being enabled by alloc/std/etc...

I think we should set up compiler-builtins to enable that feature by default.

@AaronKutch
Contributor

It doesn't look like Rust is getting any improvements at all. I don't see any benchmark improvements with the latest nightly. I could know for sure if I had a way to look at the assembly of __divti3, but the assembly for that isn't shown in a normal cargo rustc --release -- --emit='asm'

@Amanieu
Member

Amanieu commented Oct 15, 2020

You can try dumping the disassembly of the final executable with objdump -d <file> or llvm-objdump -d <file>. Search for __divti3 in the output.

@AaronKutch
Contributor

When dividing random 128 bit integers in a test program, it doesn't generate more than 1 divq instruction (that is probably for the output formatting), but I would expect several. Is the latest nightly still not using the most recent compiler-builtins version? Is there a Cargo.toml with a specified compiler-builtins version somewhere or is it supposed to pull the latest one for every nightly build?
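A minimal test program of the kind described here might look like the following (a sketch; the xorshift-style generator and its constants are made up for illustration). It forces real calls into the 128-bit division intrinsics, so `__udivti3` / `__umodti3` can be located in the disassembly and checked for divq:

```rust
// Sketch: divide pseudo-random 128-bit values so that the
// __udivti3/__umodti3 intrinsics are actually invoked and show
// up in the final executable's disassembly.
fn xorshift(state: &mut u128) -> u128 {
    // Ad-hoc xorshift-style mixer; statistical quality doesn't matter
    // here, we only need nonzero, varied operands.
    *state ^= *state << 15;
    *state ^= *state >> 49;
    *state ^= *state << 26;
    *state
}

fn main() {
    let mut s: u128 = 0x9E37_79B9_7F4A_7C15_F39C_C060_5CED_C835;
    let mut acc: u128 = 0;
    for _ in 0..1_000_000 {
        let a = xorshift(&mut s);
        let b = xorshift(&mut s) | 1; // avoid division by zero
        acc = acc.wrapping_add(a / b).wrapping_add(a % b);
    }
    // Print the accumulator so the divisions aren't optimized out.
    println!("{}", acc);
}
```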

@AaronKutch
Contributor

I should clarify that I should see divq instructions being used regardless of whether the "asm" feature is enabled or not, because I am just using assembly to get more use out of divq than LLVM normally would.

@Amanieu
Member

Amanieu commented Oct 16, 2020

You need to update the compiler-builtins version in Cargo.lock of rust-lang/rust.

@AaronKutch
Contributor

I think we should wait until the "asm" feature flag issue is resolved, compiler-builtins PR #384 is merged (which is important for fixing an issue with cranelift and other performance issues), and after that a new version of compiler-builtins is released (which will also include the division tweaks I made).

@AaronKutch
Contributor

It looks like some of the perf changes (but not all of the changes since compiler-builtins 1.36) have made it into nightly; what do the benchmarks look like?

@leonardo-m
Author

leonardo-m commented Nov 21, 2020

With the latest Nightly the benchmark above (of Aug 29) can't even be compiled with the specified compiler switches (-Z mir-opt-level=3 fails).

Meanwhile, the original benchmark at the top of this page now takes about 0.42 seconds on the same PC and OS.

@AaronKutch
Contributor

The perf changes have made it into stable Rust by now; what do the benchmarks look like?

@leonardo-m
Author

leonardo-m commented Mar 30, 2021

Now the code at Aug 29 runs in 0.50-0.52 seconds, and the original (buggy) code runs in 0.43-0.47 seconds (this variability is caused by different compilation flags). I've tried the code that uses extprim, and the performance is the same. So we can close this issue down.

The next LLVM improvement step could be to optimize remainder/divisions by small constants:

pub fn foo(x: u128) -> u8 {
    (x % 10) as _
}

rustc 1.53.0-nightly 4a20eb6 2021-03-28 gives:

foo:
    sub     rsp, 8
    mov     edx, 10
    xor     ecx, ecx
    call    qword ptr [rip + __umodti3@GOTPCREL]
    pop     rcx
    ret

g++ (trunk) 11.0.1 20210329 gives:

#include <stdint.h>
uint8_t foo(const __uint128_t x) {
    return (uint8_t)(x % 10);
}
foo(unsigned __int128):
    mov     r9, rdi
    mov     r8, rsi
    mov     rsi, rdi
    mov     rcx, r9
    mov     rdi, r8
    add     rcx, r8
    movabs  r8, -3689348814741910323
    adc     rcx, 0
    xor     r11d, r11d
    mov     rax, rcx
    mul     r8
    mov     rax, rdx
    and     rdx, -4
    shr     rax, 2
    add     rdx, rax
    mov     rax, r9
    sub     rcx, rdx
    mov     rdx, rdi
    sub     rax, rcx
    sbb     rdx, r11
    mov     r11d, 10
    mov     rcx, rdx
    movabs  rdx, -3689348814741910324
    imul    rcx, r8
    imul    rdx, rax
    add     rcx, rdx
    mul     r8
    add     rdx, rcx
    shrd    rax, rdx, 1
    mul     r11
    sub     rsi, rax
    mov     eax, esi
    ret
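Until LLVM learns this transformation for 128-bit operands, a caller that only needs a small-constant remainder can avoid the libcall by hand with 64-bit limb arithmetic (a sketch, not from the thread): since 2^64 ≡ 6 (mod 10), writing x = hi·2^64 + lo gives x ≡ 6·(hi mod 10) + (lo mod 10) (mod 10), and the u64 `% 10` operations are strength-reduced by LLVM to multiply-shift sequences rather than calls.

```rust
// Sketch: u128 % 10 without calling __umodti3.
// 2^64 mod 10 == 6, so (hi * 2^64 + lo) mod 10 == (6*(hi % 10) + lo % 10) mod 10.
// The intermediate value is at most 6*9 + 9 == 63, so no overflow.
pub fn rem10(x: u128) -> u8 {
    let lo = x as u64;
    let hi = (x >> 64) as u64;
    ((6 * (hi % 10) + lo % 10) % 10) as u8
}
```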
