Implementing this operation (the multi-precision multiplication described in the extended description below) using inline assembly that emits the expected machine code:
#[inline(never)]
#[no_mangle]
pub fn mp_mul_asm(a: &[u64], b: &[u64]) -> Vec<u64> {
    unsafe {
        let len = a.len();
        assert_eq!(len, b.len());
        let mut c = vec![0; len * 2];
        asm!("
            # start iteration of `b_i`: `len`-1 downto 0
            lea -1($3), %rsi
        1:  # every iteration of `b_i`
            # load `b_elem`
            mov ($1, %rsi, 8), %rdx
            # clear `carry_lo`, `carry_hi`
            xor %r9, %r9
            # start iteration of `a_i`+1: `len` downto 1
            mov $3, %rcx
            jmp 3f
        2:  # the end of every iteration of `a_i`+1, except the last iteration
            # no displacement, RCX has already been decremented
            mov %r10, (%r11, %rcx, 8)
        3:  # every iteration of `a_i`+1
            # compute c + b_i
            lea ($2, %rsi, 8), %r11
            # compute a[a_i]*b_elem
            mulx -8($0, %rcx, 8), %r8, %r9
            # add low word
            mov (%r11, %rcx, 8), %r10
            adcx %r8, %r10
            mov %r10, (%r11, %rcx, 8)
            # add high word
            mov -8(%r11, %rcx, 8), %r10
            adox %r9, %r10
            # this is done later to be able to add in last carry in last iteration of `a_i`+1
            # mov %r10, -8(%r11, %rcx, 8)
            # advance `a_i`+1
            lea -1(%rcx), %rcx
            jrcxz 4f
            jmp 2b
        4:  # the end of the last iteration of `a_i`+1
            adc $$0, %r10
            mov %r10, (%r11)
            # advance `b_i`
            dec %rsi
            jns 1b
            "
            :
            : "r"(a.as_ptr()), "r"(b.as_ptr()), "r"(c.as_mut_ptr()), "r"(len)
            : "rsi", "rcx", "rdx", "r8", "r9", "r10", "r11", "flags"
        );
        c
    }
}
Benchmarking both versions (rust-lang/stdarch#666 (comment)) shows a significant performance difference: 390 ns/iteration for the inline assembly version vs. 507 ns/iteration for the version using llvm.x86.addcarryx.u64.
It appears that LLVM always replaces llvm.x86.addcarryx.u64 with a polyfill based on llvm.x86.addcarry.u64 and then fails to emit adcx instructions.
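To reproduce that kind of timing comparison, a harness along the following lines would work. This is purely illustrative: it is not the benchmark from stdarch#666, and the 16-limb input size, the constants, and the iteration count are all assumptions.

use std::time::Instant;

// Hypothetical harness: `mp_mul_asm` is the function above.
fn bench(name: &str, f: impl Fn(&[u64], &[u64]) -> Vec<u64>, a: &[u64], b: &[u64]) {
    const ITERS: u32 = 100_000;
    let mut sink = 0u64;
    let start = Instant::now();
    for _ in 0..ITERS {
        // Keep one limb of the result live so the multiply is not optimized out.
        sink = sink.wrapping_add(f(a, b)[0]);
    }
    let per_iter = start.elapsed() / ITERS;
    println!("{}: {:?} per multiply (sink: {})", name, per_iter, sink);
}

fn main() {
    // 16 limbs (1024-bit operands) is an arbitrary choice for illustration.
    let a: Vec<u64> = (1..=16u64).map(|i| i.wrapping_mul(0x9E37_79B9_7F4A_7C15)).collect();
    let b: Vec<u64> = (1..=16u64).map(|i| i.wrapping_mul(0xD1B5_4A32_D192_ED03)).collect();
    bench("inline asm", mp_mul_asm, &a, &b);
}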
The X86 backend isn't currently set up to model the C flag and O flag separately. We model all of the flags as one register. Because of this we can't interleave the flag dependencies. We would need to do something about that before it makes sense to implement _addcarryx_u64 as anything other than plain adc.
Even that's not enough. We need to use jrcxz for the loop control and an lea for the index adjustment. And we need to keep flags alive across basic block boundaries which I don't think we usually do.
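To illustrate the interleaving being referred to (a hypothetical fragment, not code from the issue): adcx reads and writes only CF, adox reads and writes only OF, and mulx leaves the flags untouched, so two independent carry chains can share one instruction stream:

mulx 8(%rsi), %r8, %r9    # 64x64 -> 128 multiply; writes no flags
adcx %r8, %r10            # carry chain A: consumes and produces CF only
adox %r9, %r11            # carry chain B: consumes and produces OF only
adcx %r12, %r13           # chain A continues; CF survived the adox above

Modeling CF and OF as a single flags register forces these chains to be serialized, which is why, per the comment above, it does not yet make sense to lower _addcarryx_u64 to anything other than plain adc.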
Extended Description
MP multiply is documented by Intel as one of the main use cases of the addcarryx intrinsics (https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/ia-large-integer-arithmetic-paper.pdf).
We implement this in Rust as:
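A minimal sketch of such an implementation, built on core::arch::x86_64::_addcarryx_u64, is shown below. This is an illustrative reconstruction rather than the exact code that was benchmarked; the function name and the use of a u128 widening multiply in place of _mulx_u64 are assumptions.

#[cfg(target_arch = "x86_64")]
use core::arch::x86_64::_addcarryx_u64;

// Schoolbook multi-precision multiply in the style of the Intel ADX paper:
// for each limb of `b`, two independent carry chains (one for the low halves
// of the partial products, one for the high halves) are added into `c`.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "adx", enable = "bmi2")]
pub unsafe fn mp_mul_intrinsics(a: &[u64], b: &[u64]) -> Vec<u64> {
    let len = a.len();
    assert_eq!(len, b.len());
    let mut c = vec![0u64; len * 2];
    for (i, &b_elem) in b.iter().enumerate() {
        let mut carry_lo = 0u8; // carry chain for the low words (would map to CF/adcx)
        let mut carry_hi = 0u8; // carry chain for the high words (would map to OF/adox)
        for (j, &a_elem) in a.iter().enumerate() {
            // 64x64 -> 128 multiply; with bmi2 enabled this can lower to mulx.
            let prod = (a_elem as u128) * (b_elem as u128);
            let (lo, hi) = (prod as u64, (prod >> 64) as u64);
            let idx = i + j;
            carry_lo = unsafe { _addcarryx_u64(carry_lo, lo, c[idx], &mut c[idx]) };
            carry_hi = unsafe { _addcarryx_u64(carry_hi, hi, c[idx + 1], &mut c[idx + 1]) };
        }
        // Fold the leftover carries into the top limb of this row; the full
        // product fits in 2*len limbs, so this cannot overflow.
        let top = i + len;
        c[top] = c[top]
            .wrapping_add(carry_lo as u64)
            .wrapping_add(carry_hi as u64);
    }
    c
}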
The LLVM-IR this produces after optimizations, and the machine code it gets compiled down to, can be seen at https://rust.godbolt.org/z/EJFEHB.
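As a sanity check, the two versions can be compared for equality. The test below is hypothetical and assumes the mp_mul_asm function and the mp_mul_intrinsics sketch above; the input values are arbitrary.

#[test]
fn asm_and_intrinsics_versions_agree() {
    if !is_x86_feature_detected!("adx") || !is_x86_feature_detected!("bmi2") {
        return; // skip on CPUs without ADX/BMI2
    }
    let a: Vec<u64> = (1..=8u64).map(|i| i.wrapping_mul(0x0123_4567_89AB_CDEF)).collect();
    let b: Vec<u64> = (1..=8u64).map(|i| i.wrapping_mul(0x0FED_CBA9_8765_4321)).collect();
    let via_asm = mp_mul_asm(&a, &b);
    let via_intrinsics = unsafe { mp_mul_intrinsics(&a, &b) };
    assert_eq!(via_asm, via_intrinsics);
}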