
Investigate ways to improve basic primitive speed #198

Open
jon-chuang opened this issue Feb 4, 2021 · 23 comments

Comments

@jon-chuang
Contributor

jon-chuang commented Feb 4, 2021

With some simple changes converting biginteger primitive iter ops to their unrolled counterparts, one obtains some decent results.
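
For reference, the shape of the change is roughly the following (a simplified sketch with my own names, not the exact ark-ff code): the limb loop is replaced by a straight-line carry chain.

#[inline(always)]
fn adc(a: u64, b: u64, carry: &mut u64) -> u64 {
    // Add two limbs plus the incoming carry via u128 widening;
    // the high half of the u128 becomes the outgoing carry.
    let tmp = (a as u128) + (b as u128) + (*carry as u128);
    *carry = (tmp >> 64) as u64;
    tmp as u64
}

// Iterator/loop form: LLVM may or may not unroll this and keep the
// carry chain in the flags register.
fn add_nocarry_loop(a: &mut [u64; 4], b: &[u64; 4]) -> bool {
    let mut carry = 0u64;
    for i in 0..4 {
        a[i] = adc(a[i], b[i], &mut carry);
    }
    carry != 0
}

// Unrolled form: the whole carry chain is laid out explicitly.
fn add_nocarry_unrolled(a: &mut [u64; 4], b: &[u64; 4]) -> bool {
    let mut carry = 0u64;
    a[0] = adc(a[0], b[0], &mut carry);
    a[1] = adc(a[1], b[1], &mut carry);
    a[2] = adc(a[2], b[2], &mut carry);
    a[3] = adc(a[3], b[3], &mut carry);
    carry != 0
}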

Hence, the hypothesis is that implementing assembly for these functions could improve these results further.

Here are the results (with some anomalies I believe due to overly aggressive optimisations in the benchmark code):
Screenshot from 2021-02-04 18-51-34
Screenshot from 2021-02-04 18-49-26

My guess is that gurvy would still not be as good as mcl, were mcl to improve its field mul and squaring.

This is certainly interesting, and I don't have enough knowledge to say how one can optimise all the way to mcl's speed, since the improvements above do not account for the full difference in speeds.

Further, our field mul and squaring are more or less on par with gurvy's (they have since released an update which is apparently faster, but I have not seen a benchmark showing how much faster).

Apart from the above changes, I will leave investigating further low-level optimisations to someone else who may be interested.


One interesting observation: with 11 M/S, if the cost of a G1 mixed add were truly dominated by M/S, then at 28ns per M/S the time required would be ~300ns. This is clearly not the case, which shows that something is wrong with the additions, subtractions and doublings, of which there are many.

In theory, each of these should cost between 1/(2 * limbs) and 1/limbs of an M/S. They are approximately as numerous as the M/S, so for 6 limbs their estimated relative cost is ~1/8, leading to a total time of ~340ns. But due to the inefficient implementation, their cost currently sits in [1/limbs, 2/limbs], i.e. > 1/4 for 6 limbs, leading to a total time of ~415ns.
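
To spell the estimate out (a rough model in my own notation: $t_{M/S} \approx 28$ns is the cost of one field mul/square, $\ell = 6$ the limb count, $n_{M/S} = 11$, and $n_A$ the number of additive ops):

$$
T \;\approx\; n_{M/S}\, t_{M/S} \;+\; n_A\, \alpha\, t_{M/S},
\qquad
\alpha \in \Big[\tfrac{1}{2\ell},\, \tfrac{1}{\ell}\Big] \ \text{in theory},
\qquad
\alpha \in \Big[\tfrac{1}{\ell},\, \tfrac{2}{\ell}\Big] \ \text{at present},
$$

which is where the ~340ns (ideal) versus ~415ns (current) totals above come from.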

A timing of 340ns (relative to my processor speed) would put us as a contender for top place alongside mcl and constantine for this particular function.

Thus, it is my view that much of the discrepancy with other libraries on group ops can be accounted for by more efficient additive and mul-by-constant ops.

@Pratyush
Member

Pratyush commented Feb 4, 2021

See #199

I think rust-lang/stdarch#666 also affects addition and subtraction, so we end up with non-optimal assembly:

add_nocarry assembly

@jon-chuang
Contributor Author

Hmm, although this (adcx, adox, etc.) should affect mul, which requires two carry chains, in theory LLVM can deal with a single carry chain (plain adc).

However, there are many questionable instructions here, such as movzx and setb. These could be artifacts of using u128. Hence, to obtain cleaner assembly, I will investigate simply implementing assembly versions of the key functions.
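
For illustration, the hand-written target looks roughly like this (a sketch using the asm! macro purely to show the shape; the actual framework generates its assembly through the proc macros, and the names here are mine):

use core::arch::asm;

// Sketch: a 4-limb add carried out in a single add/adc chain, with the
// carry staying in the flags register until the final setb.
#[cfg(target_arch = "x86_64")]
fn add_nocarry_asm(a: &mut [u64; 4], b: &[u64; 4]) -> bool {
    let carry: u8;
    unsafe {
        asm!(
            "mov {t}, qword ptr [{b}]",
            "add qword ptr [{a}], {t}",
            "mov {t}, qword ptr [{b} + 8]",
            "adc qword ptr [{a} + 8], {t}",
            "mov {t}, qword ptr [{b} + 16]",
            "adc qword ptr [{a} + 16], {t}",
            "mov {t}, qword ptr [{b} + 24]",
            "adc qword ptr [{a} + 24], {t}",
            "setb {c}",
            a = in(reg) a.as_mut_ptr(),
            b = in(reg) b.as_ptr(),
            t = out(reg) _,
            c = out(reg_byte) carry,
        );
    }
    carry != 0
}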

I may also investigate if it is worth implementing cmp_less and cmp_greater functions to reduce the generality bloat required to define cmp.

@Pratyush
Member

Pratyush commented Feb 5, 2021

In that case, maybe we can just directly use the intrinsic? https://doc.rust-lang.org/core/arch/x86_64/fn._addcarryx_u64.html
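
A minimal sketch of what that would look like (names are mine; per the docs the intrinsic sits behind the adx target feature, so the caller should ensure the CPU supports ADX, e.g. via is_x86_feature_detected!("adx")):

#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "adx")]
unsafe fn add_nocarry_adx(a: &mut [u64; 4], b: &[u64; 4]) -> bool {
    // _addcarryx_u64(carry_in, x, y, &mut out) -> carry_out
    let mut carry = 0u8;
    carry = core::arch::x86_64::_addcarryx_u64(carry, a[0], b[0], &mut a[0]);
    carry = core::arch::x86_64::_addcarryx_u64(carry, a[1], b[1], &mut a[1]);
    carry = core::arch::x86_64::_addcarryx_u64(carry, a[2], b[2], &mut a[2]);
    carry = core::arch::x86_64::_addcarryx_u64(carry, a[3], b[3], &mut a[3]);
    carry != 0
}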

@jon-chuang
Contributor Author

jon-chuang commented Feb 5, 2021

Possibly.

However, since we are already using an assembly framework, it may pay to have greater control over the assembly that is produced.

The same goes for sub_noborrow, for instance.

@jon-chuang
Contributor Author

jon-chuang commented Feb 5, 2021

I think I will revamp the assembly framework so there are fewer weird artifacts left over from when I went a little crazy with the proc-macro machinery and tried to build a DSL syntax.

Overall, I think the framework is an OK alternative to generating .S files, but right now it's rather chaotic and ad hoc.

@Pratyush
Member

Pratyush commented Feb 5, 2021

@jon-chuang
Contributor Author

Hmm fair enough. I guess it would be worth investigating both options.

@jon-chuang
Contributor Author

jon-chuang commented Feb 5, 2021

So there are some positive results regarding the intrinsics:
Screenshot from 2021-02-05 10-18-34

Here is the assembly produced for the following code snippet:

#[inline(never)]
fn sub_noborrow(x: &mut [u64; 6], other: &[u64; 6]) -> bool {
    let mut borrow = 0u8;

    unsafe {
        borrow = core::arch::x86_64::_subborrow_u64(borrow, x[0], other[0], &mut x[0]);
        borrow = core::arch::x86_64::_subborrow_u64(borrow, x[1], other[1], &mut x[1]);
        borrow = core::arch::x86_64::_subborrow_u64(borrow, x[2], other[2], &mut x[2]);
        borrow = core::arch::x86_64::_subborrow_u64(borrow, x[3], other[3], &mut x[3]);
        borrow = core::arch::x86_64::_subborrow_u64(borrow, x[4], other[4], &mut x[4]);
        borrow = core::arch::x86_64::_subborrow_u64(borrow, x[5], other[5], &mut x[5]);
    }

    borrow != 0
}
        movq	(%rsi), %rax
	subq	%rax, (%rdi)
	movq	8(%rsi), %rax
	sbbq	%rax, 8(%rdi)
	movq	16(%rsi), %rax
	sbbq	%rax, 16(%rdi)
	movq	24(%rsi), %rax
	sbbq	%rax, 24(%rdi)
	movq	32(%rsi), %rax
	sbbq	%rax, 32(%rdi)
	movq	40(%rsi), %rax
	sbbq	%rax, 40(%rdi)
	setb	%al
	retq

Not quite there but still encouraging.

@Pratyush
Member

Pratyush commented Feb 5, 2021

I get similar results for addition:

pub fn add_nocarry2(a: &mut [u64; 4], b: &[u64; 4]) -> bool {
    let mut carry = 0;

    for i in 0..4 {
        unsafe { 
            carry = core::arch::x86_64::_addcarry_u64(carry, a[i], b[i], &mut a[i]);
        }
    }

    carry != 0
}

produces this:

        mov     rax, qword ptr [rsi]
        add     qword ptr [rdi], rax
        mov     rax, qword ptr [rsi + 8]
        adc     qword ptr [rdi + 8], rax
        mov     rax, qword ptr [rsi + 16]
        adc     qword ptr [rdi + 16], rax
        mov     rax, qword ptr [rsi + 24]
        adc     qword ptr [rdi + 24], rax
        setb    al
        ret

@jon-chuang
Contributor Author

Hmm, did you do this with unroll? If not, it's nice that LLVM unrolls the loop in that case. I think it might not have worked for the mul/square case, as the loops there are nested.

@Pratyush
Member

Pratyush commented Feb 5, 2021

This is without unroll: https://godbolt.org/z/zv5435

@Pratyush
Member

Pratyush commented Feb 5, 2021

Opened a PR for this: #202

@jon-chuang
Contributor Author

jon-chuang commented Feb 5, 2021

With new benchmarks using intrinsics, fq::sub_assign and fq::add_assign are blazingly fast (~1.1-1.2ns), whereas fq::double is still pitifully slow (6.3ns).

I have feature-gated the use of intrinsics behind use_asm, and I don't think any further meddling with assembly is required for those two functions.

Now I will focus attention on double.

@jon-chuang
Contributor Author

> I may also investigate if it is worth implementing cmp_less and cmp_greater functions to reduce the generality bloat required to define cmp.

This turns out to be ill-conceived: a greater-than comparison on a limb also short-circuits the less-than comparison, so the current impl is optimal, imo.
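
For reference, the pattern in question is the usual most-significant-limb-first scan (a sketch, not the exact BigInteger code), where the first unequal limb settles both orderings at once:

use core::cmp::Ordering;

// Compare two little-endian 4-limb big integers by scanning from the
// most significant limb down; the first unequal limb decides the result.
fn cmp_limbs(a: &[u64; 4], b: &[u64; 4]) -> Ordering {
    for i in (0..4).rev() {
        match a[i].cmp(&b[i]) {
            Ordering::Equal => continue,
            non_eq => return non_eq,
        }
    }
    Ordering::Equal
}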

@jon-chuang
Contributor Author

jon-chuang commented Feb 5, 2021

Here are some results from converting the mul2 impl to add the biginteger to itself using the new blazing-fast intrinsics.
Screenshot from 2021-02-05 11-11-28
Here are the overall changes in perf against the old Fq.
Screenshot from 2021-02-05 11-11-00

Notice that even mul and square benefit due to improved performance for reduce.
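
The change is roughly of the following shape (a sketch with my own naming on a 4-limb example, not the exact diff): mul2, i.e. a shift left by one bit, becomes an addition of the biginteger to itself so that the same intrinsic carry chain is reused.

#[cfg(target_arch = "x86_64")]
fn mul2(a: &mut [u64; 4]) -> bool {
    // Double the big integer by adding it to itself with the carry-chain
    // intrinsic instead of a shift/or loop over the limbs.
    let mut carry = 0u8;
    unsafe {
        carry = core::arch::x86_64::_addcarry_u64(carry, a[0], a[0], &mut a[0]);
        carry = core::arch::x86_64::_addcarry_u64(carry, a[1], a[1], &mut a[1]);
        carry = core::arch::x86_64::_addcarry_u64(carry, a[2], a[2], &mut a[2]);
        carry = core::arch::x86_64::_addcarry_u64(carry, a[3], a[3], &mut a[3]);
    }
    carry != 0
}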

@jon-chuang
Contributor Author

jon-chuang commented Feb 5, 2021

Here are the changes to bls12_381::full_pairing
Screenshot from 2021-02-05 11-21-24

Scaled to the https://hackmd.io/@zkteam/eccbench results, our full pairing would take 1.03ms, which is better but can still be improved via algorithmic improvements.

@jon-chuang
Contributor Author

Unfortunately, mul2 via self-addition results in ugly assembly code, hence I think we need to resort to hand-written assembly.

@jon-chuang
Contributor Author

jon-chuang commented Feb 5, 2021

Actually, here is a comparison between Fp add_assign and double's assemblies:

double_in_place:

.Lfunc_begin234:
	.cfi_startproc
	pushq	%rbx
	.cfi_def_cfa_offset 16
	.cfi_offset %rbx, -16
	movq	%rdi, %rax
	movabsq	$1873798617647539866, %r8
	movq	(%rdi), %r10
	movq	8(%rdi), %r9
	addq	%r10, %r10
	movq	%r10, (%rdi)
	adcq	%r9, %r9
	movq	%r9, 8(%rdi)
	movq	16(%rdi), %rdi
	adcq	%rdi, %rdi
	movq	%rdi, 16(%rax)
	movq	24(%rax), %rcx
	adcq	%rcx, %rcx
	movq	%rcx, 24(%rax)
	movq	32(%rax), %rdx
	adcq	%rdx, %rdx
	movq	%rdx, 32(%rax)
	movq	40(%rax), %rsi
	adcq	%rsi, %rsi
	movq	%rsi, 40(%rax)
	cmpq	%r8, %rsi
	jb	.LBB234_12
	movabsq	$5412103778470702295, %r11
	jne	.LBB234_11
	cmpq	%r11, %rdx
	jb	.LBB234_12
	jne	.LBB234_11
	movabsq	$7239337960414712511, %rbx
	cmpq	%rbx, %rcx
	jb	.LBB234_12
	jne	.LBB234_11
	movabsq	$7435674573564081700, %rbx
	cmpq	%rbx, %rdi
	jb	.LBB234_12
	jne	.LBB234_11
	movabsq	$2210141511517208575, %rbx
	cmpq	%rbx, %r9
	jb	.LBB234_12
	jne	.LBB234_11
	movabsq	$-5044313057631688021, %rbx
	cmpq	%rbx, %r10
	jb	.LBB234_12
.LBB234_11:
	movabsq	$-5044313057631688021, %rbx
	subq	%rbx, %r10
	movabsq	$2210141511517208575, %rbx
	sbbq	%rbx, %r9
	movq	%r10, (%rax)
	movabsq	$7435674573564081700, %rbx
	sbbq	%rbx, %rdi
	movq	%r9, 8(%rax)
	movabsq	$7239337960414712511, %rbx
	sbbq	%rbx, %rcx
	movq	%rdi, 16(%rax)
	movq	%rcx, 24(%rax)
	sbbq	%r11, %rdx
	movq	%rdx, 32(%rax)
	sbbq	%r8, %rsi
	movq	%rsi, 40(%rax)
.LBB234_12:
	popq	%rbx
	.cfi_def_cfa_offset 8
	retq
.Lfunc_end234:

add_assign

.Lfunc_begin219:
	.cfi_startproc
	pushq	%rbx
	.cfi_def_cfa_offset 16
	.cfi_offset %rbx, -16
	movq	(%rdi), %r9
	movq	8(%rdi), %r8
	addq	(%rsi), %r9
	movq	%r9, (%rdi)
	adcq	8(%rsi), %r8
	movq	%r8, 8(%rdi)
	movq	16(%rdi), %r10
	adcq	16(%rsi), %r10
	movq	%r10, 16(%rdi)
	movq	24(%rdi), %rax
	adcq	24(%rsi), %rax
	movq	%rax, 24(%rdi)
	movq	32(%rdi), %rcx
	adcq	32(%rsi), %rcx
	movq	%rcx, 32(%rdi)
	movq	40(%rdi), %rdx
	adcq	40(%rsi), %rdx
	movabsq	$1873798617647539866, %rsi
	movq	%rdx, 40(%rdi)
	cmpq	%rsi, %rdx
	jb	.LBB219_12
	movabsq	$5412103778470702295, %r11
	jne	.LBB219_11
	cmpq	%r11, %rcx
	jb	.LBB219_12
	jne	.LBB219_11
	movabsq	$7239337960414712511, %rbx
	cmpq	%rbx, %rax
	jb	.LBB219_12
	jne	.LBB219_11
	movabsq	$7435674573564081700, %rbx
	cmpq	%rbx, %r10
	jb	.LBB219_12
	jne	.LBB219_11
	movabsq	$2210141511517208575, %rbx
	cmpq	%rbx, %r8
	jb	.LBB219_12
	jne	.LBB219_11
	movabsq	$-5044313057631688021, %rbx
	cmpq	%rbx, %r9
	jb	.LBB219_12
.LBB219_11:
	movabsq	$-5044313057631688021, %rbx
	subq	%rbx, %r9
	movabsq	$2210141511517208575, %rbx
	sbbq	%rbx, %r8
	movq	%r9, (%rdi)
	movabsq	$7435674573564081700, %rbx
	sbbq	%rbx, %r10
	movq	%r8, 8(%rdi)
	movabsq	$7239337960414712511, %rbx
	sbbq	%rbx, %rax
	movq	%r10, 16(%rdi)
	movq	%rax, 24(%rdi)
	sbbq	%r11, %rcx
	movq	%rcx, 32(%rdi)
	sbbq	%rsi, %rdx
	movq	%rdx, 40(%rdi)
.LBB219_12:
	popq	%rbx
	.cfi_def_cfa_offset 8
	retq
.Lfunc_end219:

They are virtually identical, yet there is a discrepancy in their benchmarks.

Notice, however, that there is already inefficiency in moving data to and from cache/registers.

This strengthens my conviction that we should move towards a full .S-style ASM system and convert more functions into pure ASM.

@jon-chuang
Contributor Author

I've decided to stop trusting the benchmarks for individual field/biginteger methods and to rely more on the overall benchmarks for the larger functions; there seems to be some weird discrepancy, and I can't pinpoint its source.

@jon-chuang
Contributor Author

jon-chuang commented Feb 5, 2021

Here are some comparisons with gurvy on my Ryzen 4800H (4.2GHz boost clock):
Gurvy timings in brackets.
bls12_381 Jacobian:
add 517ns (509ns)
mixed add 393ns (366ns)
double 257ns (286ns)

So it appears we are actually not too far off, and the gap is at most 10%.

In fact, some of these results suggest that the benchmarks at https://hackmd.io/@zkteam/eccbench were taken erroneously and may have used the wrong timings (extended Jacobian rather than the Jacobian repr); this has since been confirmed.

However, we are still lagging significantly in G2 timings. This is at least partly because Fp towers are still laggard.

@ValarDragon
Member

ValarDragon commented Feb 6, 2021

Something worth investigating is why there is such a large discrepancy between fq12 and fq additions. Per the benchmarks in the OP, fq12::add_assign / fq::add_assign = 22.76, whereas we'd expect this ratio to be 12.

Maybe inlining isn't functioning properly here.
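
For context on the expected factor of 12 (schematic only, my own types, not the actual ark-ff tower definitions): extension-field addition is purely coefficient-wise, so an fq12 add should amount to 12 fq adds, and if the base-field add_assign is not inlined across the tower layers, the per-call overhead is paid 12 times.

// Schematic: in ark-ff the tower is built as Fq -> Fq2 -> Fq6 -> Fq12, but
// since addition is coefficient-wise at every layer it flattens to 12 Fq adds.
pub struct Fq([u64; 6]);
pub struct Fq12([Fq; 12]);

impl Fq {
    #[inline(always)] // without inlining, 12 call overheads stack up per Fq12 add
    fn add_assign(&mut self, _other: &Fq) {
        // ... limb-level add plus a conditional subtraction of the modulus ...
    }
}

impl Fq12 {
    pub fn add_assign(&mut self, other: &Fq12) {
        for (a, b) in self.0.iter_mut().zip(other.0.iter()) {
            a.add_assign(b); // 12 base-field additions and nothing else
        }
    }
}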

@mratsim

mratsim commented Feb 10, 2021

Some comments:

  • LLVM will never generate adcx/adox even with the intrinsics after this patch https://reviews.llvm.org/D51754
  • addcarry (without x) does generate optimal code
  • Clang has a specialization, __builtin_addcll; I'm not sure whether Rust exposes it, but it would be portable to non-x86 (see the sketch below).
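
For the portability point, a plain-Rust equivalent would look roughly like this (a sketch; whether it actually compiles down to a single adc chain is then up to LLVM):

// Sketch: portable add-with-carry in plain Rust, roughly what
// __builtin_addcll expresses in C, with no architecture-specific code.
#[inline(always)]
fn addc(a: u64, b: u64, carry_in: bool) -> (u64, bool) {
    let (t, c1) = a.overflowing_add(b);
    let (sum, c2) = t.overflowing_add(carry_in as u64);
    (sum, c1 | c2)
}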

You can benchmark Constantine the following way, assuming you have Nim 1.4.2 installed.

git clone https://github.com/mratsim/constantine
cd constantine
nimble bench_fp_clang
nimble bench_fp2_clang
nimble bench_fp12_clang
nimble bench_ec_g1_clang
nimble bench_ec_g2_clang
nimble bench_pairing_bls12_381_clang

At some point I will add a per-curve recap as well.

There will be inline assembly generated in the nimcache folder.

> However, we are still lagging significantly in G2 timings. This is at least partly because Fp towers are still laggard.

In my case I had a significant issue with Nim parameter passing with concepts; after the upstream fix nim-lang/Nim#16897, or after dropping concepts altogether (so that I don't need to wait for a new compiler, mratsim/constantine#153), I improved pairing perf by 28%.

@Pratyush
Member

btw @mratsim this is a great resource, thanks! https://github.com/mratsim/constantine/blob/master/docs/optimizations.md
