Large performance loss due to lacking LTO #39

Open
benruijl opened this issue Oct 18, 2022 · 5 comments

@benruijl
Contributor

The low-level operations on f128 numbers in C (__addtf3 etc.) are wrapped using the Wrapper type in f128.c. This causes an overhead of roughly a factor of 1.5 to 2, as can be seen from this flamegraph.

In C/C++, this substantial loss can be mitigated by compiling the C library with LTO:

gcc -O3 -flto -lgfortran -lquadmath -Bstatic -c f128.c
gcc-ar crf libf128.a f128.o
g++ -O3 test.c libf128.a -flto -lquadmath -o test

where test.c is a benchmark script:

#include <cstdio>
#include <quadmath.h>
typedef __float128 f128;

typedef union _Wrapper {
  f128 value;
  unsigned __int128 dat;
  char dat_alt[16];
} __attribute__ ((aligned (16))) Wrapper;

extern "C" {
    Wrapper f64_to_f128(double);
    void f128_to_str(Wrapper, int, char*, const char*);
    Wrapper f128_add(Wrapper*, Wrapper*);
    Wrapper f128_sub(Wrapper*, Wrapper*);
    Wrapper f128_mul(Wrapper*, Wrapper*);
    Wrapper f128_div(Wrapper*, Wrapper*);
}


int main() {
    Wrapper a = f64_to_f128(2.);
    Wrapper b = f64_to_f128(3.);
    Wrapper c = f64_to_f128(4.);
    Wrapper d = f64_to_f128(5.);
    for (long int i = 0; i  < (long int)10000000; i++) {
        a = f128_add(&a, &b);
        a = f128_sub(&a, &c);
        a = f128_mul(&a, &d);
        a = f128_div(&a, &c);
    }

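    // Wrapper is a union and cannot be passed to printf's %f;
    // format its __float128 member with libquadmath instead.
    char buf[64];
    quadmath_snprintf(buf, sizeof buf, "%.30Qg", a.value);
    printf("%s\n", buf);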

    return 0;
}

I am trying to achieve a similar performance boost in Rust, but I am struggling and wondering if it's even possible since Rust compiles with LLVM and we need to use g++ instead of clang for the quadmath extension.

I tried adding .flag("-flto") to the build script, but that causes linking errors (presumably because the LLVM linker cannot read g++ LTO info). Adding .flag("-ffat-lto-objects") does restore compilation but only because LLVM can now opt to not use LTO.

Does anyone here know a solution?

@jkarns275
Owner

I think your C compiler in this case would have to be clang, and you may have to tweak the compilation flags a bit.

This may be of use: https://doc.rust-lang.org/rustc/linker-plugin-lto.html
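
For reference, a build script that follows that guide might look roughly like the sketch below. It is a hypothetical build.rs, not the f128 crate's actual build script: the helper names are made up, it shells out to clang and ar directly instead of going through a build helper crate, and it assumes clang can find quadmath.h on its include path.

```rust
// build.rs sketch (hypothetical, not the f128 crate's actual build script).
use std::io;
use std::path::{Path, PathBuf};
use std::process::Command;

/// clang arguments that compile one C file to a ThinLTO (LLVM bitcode) object.
pub fn clang_lto_args(src: &Path, obj: &Path) -> Vec<String> {
    vec![
        "-c".into(),
        "-O2".into(),
        "-flto=thin".into(),
        src.display().to_string(),
        "-o".into(),
        obj.display().to_string(),
    ]
}

/// Runs a command and turns a non-zero exit status into an error.
fn run(cmd: &mut Command) -> io::Result<()> {
    let status = cmd.status()?;
    if status.success() {
        Ok(())
    } else {
        Err(io::Error::new(
            io::ErrorKind::Other,
            format!("{:?} failed: {}", cmd, status),
        ))
    }
}

/// Compiles `src` with clang and archives it as `lib<name>.a` inside `out_dir`.
pub fn build_static_lib(src: &Path, out_dir: &Path, name: &str) -> io::Result<PathBuf> {
    let obj = out_dir.join(format!("{}.o", name));
    let lib = out_dir.join(format!("lib{}.a", name));
    run(Command::new("clang").args(clang_lto_args(src, &obj)))?;
    run(Command::new("ar").arg("crus").arg(&lib).arg(&obj))?;
    // Tell cargo where the archive lives and that it should be linked statically.
    println!("cargo:rustc-link-search=native={}", out_dir.display());
    println!("cargo:rustc-link-lib=static={}", name);
    Ok(lib)
}
```

In a real build.rs, main would call build_static_lib with src/f128.c, the directory from the OUT_DIR environment variable, and the library name. The Rust side then still has to be built with -Clinker-plugin-lto and a clang/lld linker as the guide describes, and the guide also notes that the LLVM versions behind clang and rustc need to be compatible.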

@benruijl
Contributor Author

benruijl commented Oct 20, 2022

I tried to change the C compiler (as I also reported here), but the code crashes for some reason.

I tried to get f128 to compile with clang, making sure it finds quadmath:

CPATH=/usr/lib/gcc/x86_64-pc-linux-gnu/12.2.0/include/ clang f128.c -flto=thin -c -o ./f128clang.o -O2
ar crus libf128.a f128clang.o
CPATH=/usr/lib/gcc/x86_64-pc-linux-gnu/12.2.0/include/ clang test.c libf128.a -flto=thin -O2 -o test_clang

This works for my C test script test.c and the call overhead is removed.

Changing the f128 crate build script to:

println!(r"clang src/f128.c -flto=full -lquadmath -c -o ./f128clang.o -O2");
println!(r"ar crus libf128.a f128clang.o");

and running the small project

[package]
name = "f128perf"
version = "0.1.0"
edition = "2021"

[profile.dev]
opt-level = 2
lto = "fat"

[profile.release]
lto = "fat"

[dependencies]
f128 = {path="../f128"}
num-traits = "*"

and

use num_traits::cast::FromPrimitive;

fn main() {
    let mut a = f128::f128::from_f64(2.).unwrap();
    let b = f128::f128::from_f64(3.).unwrap();

    for _ in 0..10000000 {
        a = a + b - b;
        a *= b;
        a = a / b;
    }

    println!("a={}", a);
}

with

 RUSTFLAGS="-Clinker-plugin-lto -Clinker=clang -Clink-arg=-fuse-ld=lld -L ../f128/src" cargo build --release

it does compile but it gives a runtime crash (signal SIGSEGV (Address boundary error)):

#0  0x0000555555595c49 in f64_to_f128 ()
#1  0x00005555555646b9 in f128::f128_t::{impl#7}::from_f64 (n=2) at /home/ben/Sync/Research/f128/src/f128_t.rs:420
#2  f128perf::main () at /home/ben/Sync/Research/f128perf/src/main.rs:4

with rustc 1.61.0 and clang version 14.0.6 on x86_64-pc-linux-gnu.

This crash doesn't go away when I try -flto=thin.

Do you have any idea what may cause this crash?

@jkarns275
Owner

Tried taking a stab at this tonight and I ended up getting a different segfault. I'll dig into it this Saturday.

@jkarns275
Owner

Looking at this again - I suspect you may have to compile the crate yourself, modifying f128_internal's Cargo.toml with the appropriate LTO flags. Have you tried this already?

@jkarns275
Owner

@benruijl I recently compiled the f128 crate without any changes and dug into the disassembly.

The generated assembly for f128_internal::ffi::logq_f is as follows when I run cargo build --release:

000000000005be00 <logq_f>:
   5be00:	48 83 ec 18          	sub    $0x18,%rsp
   5be04:	48 89 3c 24          	mov    %rdi,(%rsp)
   5be08:	48 89 74 24 08       	mov    %rsi,0x8(%rsp)
   5be0d:	66 0f 6f 04 24       	movdqa (%rsp),%xmm0
   5be12:	e8 b9 83 fb ff       	call   141d0 <logq@plt>
   5be17:	0f 29 04 24          	movaps %xmm0,(%rsp)
   5be1b:	48 8b 04 24          	mov    (%rsp),%rax
   5be1f:	48 8b 54 24 08       	mov    0x8(%rsp),%rdx
   5be24:	48 83 c4 18          	add    $0x18,%rsp
   5be28:	c3                   	ret
   5be29:	0f 1f 80 00 00 00 00 	nopl   0x0(%rax)

Here the call to logq@plt goes through a PLT stub that jumps directly to the function in quadmath:

00000000000141d0 <logq@plt>:
   141d0:	ff 25 62 3d 0d 00    	jmp    *0xd3d62(%rip)        # e7f38 <logq@QUADMATH_1.0>
   141d6:	68 1a 00 00 00       	push   $0x1a
   141db:	e9 40 fe ff ff       	jmp    14020 <_init+0x20>

This makes me think that the Rust compiler is doing a competent job at LTO. I think the issue is the way some things are called, and the way data is passed to FFI functions.

Looking more closely at the generated assembly though, I think that both the C and Rust compilers are hesitant to use 128-bit registers to pass function values just because they're 128 bits in size:

   5be0d:	66 0f 6f 04 24       	movdqa (%rsp),%xmm0

Something like this happens for each argument of every function call. If our interfaces were completely transparent, the compiler would simply call the function without shuffling the registers (I think?).
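
To make the logq_f listing more concrete, here is a small Rust model of what its prologue and epilogue do with the bits. It only illustrates the data movement visible in the disassembly; it is not code from the crate, and the function names are made up for this sketch.

```rust
/// Models `mov %rdi,(%rsp); mov %rsi,0x8(%rsp); movdqa (%rsp),%xmm0`:
/// two 64-bit halves are stored to the stack and reloaded as one 128-bit value.
/// On little-endian x86-64 the half at offset 0 (from `rdi`) is the low half.
pub fn gprs_to_xmm(rdi: u64, rsi: u64) -> u128 {
    ((rsi as u128) << 64) | (rdi as u128)
}

/// Models `movaps %xmm0,(%rsp); mov (%rsp),%rax; mov 0x8(%rsp),%rdx`:
/// the 128-bit result is spilled and read back as (low, high) 64-bit halves.
pub fn xmm_to_gprs(xmm0: u128) -> (u64, u64) {
    (xmm0 as u64, (xmm0 >> 64) as u64)
}
```

Each wrapped call pays for this round trip through the stack once on the way in and once on the way out, on top of the call itself.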

However, rustc and any competent C compiler will use the SSE registers to pass 128-bit primitives (i128 in Rust, __int128 in gcc). This leads me to believe that we can simply replace the wrapper type entirely.

This solution is very biased towards x86 ISAs, however. I'm not familiar enough with other popular ISAs (e.g. Apple M1) to really say if this would work, and it would likely depend on compilers doing the same thing.

I guess the conclusion is: LTO works, the compilers are doing exactly what we tell them to do, it just so happens that we're telling them to do something sort of silly. If you still have a use for this library, I will attempt to make this happen.
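
As a rough illustration of what replacing the wrapper could mean on the Rust side, the sketch below carries the raw binary128 bit pattern in a plain u128. The names Wrapper (a Rust mirror of the C union, without the __float128 member that stable Rust cannot name), F128Bits, and TWO are assumptions made for this example and not the crate's API. Which registers such a value travels in depends on the platform ABI and is not shown here.

```rust
use std::mem::{align_of, size_of};

/// Rust mirror of the C `Wrapper` union, minus the `__float128` member.
/// Like the C version it is 16 bytes large and 16-byte aligned.
#[repr(C, align(16))]
#[derive(Clone, Copy)]
pub union Wrapper {
    pub dat: u128,
    pub dat_alt: [u8; 16],
}

/// Hypothetical replacement: the binary128 bit pattern in a plain integer.
#[repr(transparent)]
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct F128Bits(pub u128);

impl From<Wrapper> for F128Bits {
    fn from(w: Wrapper) -> Self {
        // Sound: both fields are plain 16-byte data without invalid bit patterns.
        F128Bits(unsafe { w.dat })
    }
}

impl From<F128Bits> for Wrapper {
    fn from(b: F128Bits) -> Self {
        Wrapper { dat: b.0 }
    }
}

/// IEEE 754 binary128 encoding of 2.0: sign 0, biased exponent 0x4000, zero fraction.
pub const TWO: F128Bits = F128Bits(0x4000u128 << 112);

/// Compile-time style check that the two representations have the same size.
pub fn same_size() -> bool {
    size_of::<Wrapper>() == size_of::<F128Bits>() && align_of::<Wrapper>() == 16
}
```

Because the two types have the same size and carry the same bits, converting between them is free; whether the plain integer actually removes the register shuffling would still need to be confirmed in the disassembly.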
