
Investigate ways to improve basic primitive speed #198

Open
jon-chuang opened this issue Feb 4, 2021 · 23 comments

Comments

@jon-chuang
Contributor

jon-chuang commented Feb 4, 2021

With some simple changes converting biginteger primitive iter ops to their unrolled counterparts, one obtains some decent results.
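
For reference, the shape of the change is roughly the following (a simplified sketch with my own names, not the exact ark-ff code): the limb loop is replaced by a straight-line carry chain.

#[inline(always)]
fn adc(a: u64, b: u64, carry: &mut u64) -> u64 {
    // Add two limbs plus the incoming carry via u128 widening;
    // the high half of the u128 becomes the outgoing carry.
    let tmp = (a as u128) + (b as u128) + (*carry as u128);
    *carry = (tmp >> 64) as u64;
    tmp as u64
}

// Iterator/loop form: LLVM may or may not unroll this and keep the
// carry chain in the flags register.
fn add_nocarry_loop(a: &mut [u64; 4], b: &[u64; 4]) -> bool {
    let mut carry = 0u64;
    for i in 0..4 {
        a[i] = adc(a[i], b[i], &mut carry);
    }
    carry != 0
}

// Unrolled form: the whole carry chain is laid out explicitly.
fn add_nocarry_unrolled(a: &mut [u64; 4], b: &[u64; 4]) -> bool {
    let mut carry = 0u64;
    a[0] = adc(a[0], b[0], &mut carry);
    a[1] = adc(a[1], b[1], &mut carry);
    a[2] = adc(a[2], b[2], &mut carry);
    a[3] = adc(a[3], b[3], &mut carry);
    carry != 0
}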

Hence, the hypothesis is that implementing assembly for these functions could improve these results further.

Here are the results (with some anomalies I believe due to overly aggressive optimisations in the benchmark code):
Screenshot from 2021-02-04 18-51-34
Screenshot from 2021-02-04 18-49-26

My guess is that gurvy would still not be as good as mcl, were mcl to improve its field mul and squaring.

This is certainly interesting, and I don't have enough knowledge to say how one can optimise all the way to mcl's speed, since the improvements above do not account for the full difference in speeds.

Further, our field mul and squaring are more or less on par with gurvy's (they have since released an update which is apparently faster, but I have not seen a benchmark showing how much faster).

Apart from the above changes, I will leave investigating further low-level optimisations to someone else who may be interested.


One interesting observation: with 11 M/S, if the cost of a G1 mixed add were truly dominated by M/S, then at 28ns per M/S the time required would be ~300ns. This is clearly not the case, which shows that something is wrong with the additions, subtractions and doublings, of which there are many.

In theory, each of these should cost between 1/(2 * limbs) and 1/limbs of an M/S. They are approximately as numerous as the M/S, so for 6 limbs their estimated relative cost is ~1/8, leading to a total time of ~340ns. But due to the inefficient implementation, their cost currently sits in [1/limbs, 2/limbs], i.e. > 1/4 for 6 limbs, leading to a total time of ~415ns.
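
To spell the estimate out (a rough model in my own notation: $t_{M/S} \approx 28$ns is the cost of one field mul/square, $\ell = 6$ the limb count, $n_{M/S} = 11$, and $n_A$ the number of additive ops):

$$
T \;\approx\; n_{M/S}\, t_{M/S} \;+\; n_A\, \alpha\, t_{M/S},
\qquad
\alpha \in \Big[\tfrac{1}{2\ell},\, \tfrac{1}{\ell}\Big] \ \text{in theory},
\qquad
\alpha \in \Big[\tfrac{1}{\ell},\, \tfrac{2}{\ell}\Big] \ \text{at present},
$$

which is where the ~340ns (ideal) versus ~415ns (current) totals above come from.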

A timing of 340ns (relative to my processor speed) would put us as a contender for top place alongside mcl and constantine for this particular function.

Thus, it is my view that much of the discrepancy with other libraries on group ops can be accounted for by more efficient additive and mul-by-constant ops.

@Pratyush
Member

Pratyush commented Feb 4, 2021

See #199

I think rust-lang/stdarch#666 also affects addition and subtraction, so we end up with non-optimal assembly:

add_nocarry assembly

@jon-chuang
Contributor Author

Hmm, although this (adcx, adox, etc.) should affect mul, which requires two carry chains, in theory LLVM can deal with a single carry chain (plain adc).

However, there are many questionable instructions here, such as movzx and setb. These could be artifacts of using u128. Hence, to obtain cleaner assembly, I will investigate simply implementing assembly versions of the key functions.
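
For illustration, the hand-written target looks roughly like this (a sketch using the asm! macro purely to show the shape; the actual framework generates its assembly through the proc macros, and the names here are mine):

use core::arch::asm;

// Sketch: a 4-limb add carried out in a single add/adc chain, with the
// carry staying in the flags register until the final setb.
#[cfg(target_arch = "x86_64")]
fn add_nocarry_asm(a: &mut [u64; 4], b: &[u64; 4]) -> bool {
    let carry: u8;
    unsafe {
        asm!(
            "mov {t}, qword ptr [{b}]",
            "add qword ptr [{a}], {t}",
            "mov {t}, qword ptr [{b} + 8]",
            "adc qword ptr [{a} + 8], {t}",
            "mov {t}, qword ptr [{b} + 16]",
            "adc qword ptr [{a} + 16], {t}",
            "mov {t}, qword ptr [{b} + 24]",
            "adc qword ptr [{a} + 24], {t}",
            "setb {c}",
            a = in(reg) a.as_mut_ptr(),
            b = in(reg) b.as_ptr(),
            t = out(reg) _,
            c = out(reg_byte) carry,
        );
    }
    carry != 0
}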

I may also investigate if it is worth implementing cmp_less and cmp_greater functions to reduce the generality bloat required to define cmp.

@Pratyush
Member

Pratyush commented Feb 5, 2021

In that case, maybe we can just directly use the intrinsic? https://doc.rust-lang.org/core/arch/x86_64/fn._addcarryx_u64.html
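
A minimal sketch of what that would look like (names are mine; per the docs the intrinsic sits behind the adx target feature, so the caller should ensure the CPU supports ADX, e.g. via is_x86_feature_detected!("adx")):

#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "adx")]
unsafe fn add_nocarry_adx(a: &mut [u64; 4], b: &[u64; 4]) -> bool {
    // _addcarryx_u64(carry_in, x, y, &mut out) -> carry_out
    let mut carry = 0u8;
    carry = core::arch::x86_64::_addcarryx_u64(carry, a[0], b[0], &mut a[0]);
    carry = core::arch::x86_64::_addcarryx_u64(carry, a[1], b[1], &mut a[1]);
    carry = core::arch::x86_64::_addcarryx_u64(carry, a[2], b[2], &mut a[2]);
    carry = core::arch::x86_64::_addcarryx_u64(carry, a[3], b[3], &mut a[3]);
    carry != 0
}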

@jon-chuang
Contributor Author

jon-chuang commented Feb 5, 2021

Possibly.

However, since we are already using an assembly framework, it may pay to have greater control over the assembly that is produced.

The same goes for sub_noborrow, for instance.

@jon-chuang
Contributor Author

jon-chuang commented Feb 5, 2021

I think I will revamp the assembly framework so there are fewer weird artifacts left over from when I went a little crazy with the proc-macro machinery and tried to build a DSL syntax.

Overall, I think the framework is an OK alternative to generating .S files, but right now it's rather chaotic and ad hoc.

@Pratyush
Member

Pratyush commented Feb 5, 2021

@jon-chuang
Contributor Author

Hmm fair enough. I guess it would be worth investigating both options.

@jon-chuang
Contributor Author

jon-chuang commented Feb 5, 2021

So there are some positive results regarding the intrinsics:
Screenshot from 2021-02-05 10-18-34

Here is the assembly produced for the following code snippet:

#[inline(never)]
fn sub_noborrow(x: &mut [u64; 6], other: &[u64; 6]) -> bool {
    let mut borrow = 0u8;

    unsafe {
        borrow = core::arch::x86_64::_subborrow_u64(borrow, x[0], other[0], &mut x[0]);
        borrow = core::arch::x86_64::_subborrow_u64(borrow, x[1], other[1], &mut x[1]);
        borrow = core::arch::x86_64::_subborrow_u64(borrow, x[2], other[2], &mut x[2]);
        borrow = core::arch::x86_64::_subborrow_u64(borrow, x[3], other[3], &mut x[3]);
        borrow = core::arch::x86_64::_subborrow_u64(borrow, x[4], other[4], &mut x[4]);
        borrow = core::arch::x86_64::_subborrow_u64(borrow, x[5], other[5], &mut x[5]);
    }

    borrow != 0
}
        movq	(%rsi), %rax
	subq	%rax, (%rdi)
	movq	8(%rsi), %rax
	sbbq	%rax, 8(%rdi)
	movq	16(%rsi), %rax
	sbbq	%rax, 16(%rdi)
	movq	24(%rsi), %rax
	sbbq	%rax, 24(%rdi)
	movq	32(%rsi), %rax
	sbbq	%rax, 32(%rdi)
	movq	40(%rsi), %rax
	sbbq	%rax, 40(%rdi)
	setb	%al
	retq

Not quite there but still encouraging.

@Pratyush
Member

Pratyush commented Feb 5, 2021

I get similar results for addition:

pub fn add_nocarry2(a: &mut [u64; 4], b: &[u64; 4]) -> bool {
    let mut carry = 0;

    for i in 0..4 {
        unsafe { 
            carry = core::arch::x86_64::_addcarry_u64(carry, a[i], b[i], &mut a[i]);
        }
    }

    carry != 0
}

produces this:

        mov     rax, qword ptr [rsi]
        add     qword ptr [rdi], rax
        mov     rax, qword ptr [rsi + 8]
        adc     qword ptr [rdi + 8], rax
        mov     rax, qword ptr [rsi + 16]
        adc     qword ptr [rdi + 16], rax
        mov     rax, qword ptr [rsi + 24]
        adc     qword ptr [rdi + 24], rax
        setb    al
        ret

@jon-chuang
Contributor Author

Hmm, did you do this with unroll? If not, it's nice that LLVM unrolls the loop in that case. I think it might not have worked for the mul/square case, as the loops there are nested.

@Pratyush
Member

Pratyush commented Feb 5, 2021

This is without unroll: https://godbolt.org/z/zv5435

@Pratyush
Member

Pratyush commented Feb 5, 2021

Opened a PR for this: #202

@jon-chuang
Contributor Author

jon-chuang commented Feb 5, 2021

With new benchmarks using intrinsics, fq::sub_assign and fq::add_assign are blazingly fast (~1.1-1.2ns), whereas fq::double is still pitifully slow (6.3ns).

I have feature-gated the use of intrinsics behind use_asm, and I don't think any further meddling with assembly is required for those two functions.

Now I will focus attention on double.

@jon-chuang
Contributor Author

> I may also investigate if it is worth implementing cmp_less and cmp_greater functions to reduce the generality bloat required to define cmp.

This turns out to be ill-conceived: a greater-than comparison on a limb also short-circuits the less-than comparison, so the current impl is optimal, imo.
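
For reference, the pattern in question is the usual most-significant-limb-first scan (a sketch, not the exact BigInteger code), where the first unequal limb settles both orderings at once:

use core::cmp::Ordering;

// Compare two little-endian 4-limb big integers by scanning from the
// most significant limb down; the first unequal limb decides the result.
fn cmp_limbs(a: &[u64; 4], b: &[u64; 4]) -> Ordering {
    for i in (0..4).rev() {
        match a[i].cmp(&b[i]) {
            Ordering::Equal => continue,
            non_eq => return non_eq,
        }
    }
    Ordering::Equal
}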

@jon-chuang
Contributor Author

jon-chuang commented Feb 5, 2021

Here are some results from converting the mul2 impl to add the biginteger to itself using the new blazing-fast intrinsics.
Screenshot from 2021-02-05 11-11-28
Here are the overall changes in perf against the old Fq.
Screenshot from 2021-02-05 11-11-00

Notice that even mul and square benefit due to improved performance for reduce.
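
The change is roughly of the following shape (a sketch with my own naming on a 4-limb example, not the exact diff): mul2, i.e. a shift left by one bit, becomes an addition of the biginteger to itself so that the same intrinsic carry chain is reused.

#[cfg(target_arch = "x86_64")]
fn mul2(a: &mut [u64; 4]) -> bool {
    // Double the big integer by adding it to itself with the carry-chain
    // intrinsic instead of a shift/or loop over the limbs.
    let mut carry = 0u8;
    unsafe {
        carry = core::arch::x86_64::_addcarry_u64(carry, a[0], a[0], &mut a[0]);
        carry = core::arch::x86_64::_addcarry_u64(carry, a[1], a[1], &mut a[1]);
        carry = core::arch::x86_64::_addcarry_u64(carry, a[2], a[2], &mut a[2]);
        carry = core::arch::x86_64::_addcarry_u64(carry, a[3], a[3], &mut a[3]);
    }
    carry != 0
}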

@jon-chuang
Contributor Author

jon-chuang commented Feb 5, 2021

Here are the changes to bls12_381::full_pairing
Screenshot from 2021-02-05 11-21-24

Scaled to the https://hackmd.io/@zkteam/eccbench results, our full pairing would take 1.03ms, which is better but can still be improved via algorithmic improvements.

@jon-chuang
Contributor Author

Unfortunately, mul2 via self-addition results in ugly assembly code, hence I think we need to resort to hand-written assembly.

@jon-chuang
Contributor Author

jon-chuang commented Feb 5, 2021

Actually, here is a comparison between Fp add_assign and double's assemblies:

double_in_place:

.Lfunc_begin234:
	.cfi_startproc
	pushq	%rbx
	.cfi_def_cfa_offset 16
	.cfi_offset %rbx, -16
	movq	%rdi, %rax
	movabsq	$1873798617647539866, %r8
	movq	(%rdi), %r10
	movq	8(%rdi), %r9
	addq	%r10, %r10
	movq	%r10, (%rdi)
	adcq	%r9, %r9
	movq	%r9, 8(%rdi)
	movq	16(%rdi), %rdi
	adcq	%rdi, %rdi
	movq	%rdi, 16(%rax)
	movq	24(%rax), %rcx
	adcq	%rcx, %rcx
	movq	%rcx, 24(%rax)
	movq	32(%rax), %rdx
	adcq	%rdx, %rdx
	movq	%rdx, 32(%rax)
	movq	40(%rax), %rsi
	adcq	%rsi, %rsi
	movq	%rsi, 40(%rax)
	cmpq	%r8, %rsi
	jb	.LBB234_12
	movabsq	$5412103778470702295, %r11
	jne	.LBB234_11
	cmpq	%r11, %rdx
	jb	.LBB234_12
	jne	.LBB234_11
	movabsq	$7239337960414712511, %rbx
	cmpq	%rbx, %rcx
	jb	.LBB234_12
	jne	.LBB234_11
	movabsq	$7435674573564081700, %rbx
	cmpq	%rbx, %rdi
	jb	.LBB234_12
	jne	.LBB234_11
	movabsq	$2210141511517208575, %rbx
	cmpq	%rbx, %r9
	jb	.LBB234_12
	jne	.LBB234_11
	movabsq	$-5044313057631688021, %rbx
	cmpq	%rbx, %r10
	jb	.LBB234_12
.LBB234_11:
	movabsq	$-5044313057631688021, %rbx
	subq	%rbx, %r10
	movabsq	$2210141511517208575, %rbx
	sbbq	%rbx, %r9
	movq	%r10, (%rax)
	movabsq	$7435674573564081700, %rbx
	sbbq	%rbx, %rdi
	movq	%r9, 8(%rax)
	movabsq	$7239337960414712511, %rbx
	sbbq	%rbx, %rcx
	movq	%rdi, 16(%rax)
	movq	%rcx, 24(%rax)
	sbbq	%r11, %rdx
	movq	%rdx, 32(%rax)
	sbbq	%r8, %rsi
	movq	%rsi, 40(%rax)
.LBB234_12:
	popq	%rbx
	.cfi_def_cfa_offset 8
	retq
.Lfunc_end234:

add_assign

.Lfunc_begin219:
	.cfi_startproc
	pushq	%rbx
	.cfi_def_cfa_offset 16
	.cfi_offset %rbx, -16
	movq	(%rdi), %r9
	movq	8(%rdi), %r8
	addq	(%rsi), %r9
	movq	%r9, (%rdi)
	adcq	8(%rsi), %r8
	movq	%r8, 8(%rdi)
	movq	16(%rdi), %r10
	adcq	16(%rsi), %r10
	movq	%r10, 16(%rdi)
	movq	24(%rdi), %rax
	adcq	24(%rsi), %rax
	movq	%rax, 24(%rdi)
	movq	32(%rdi), %rcx
	adcq	32(%rsi), %rcx
	movq	%rcx, 32(%rdi)
	movq	40(%rdi), %rdx
	adcq	40(%rsi), %rdx
	movabsq	$1873798617647539866, %rsi
	movq	%rdx, 40(%rdi)
	cmpq	%rsi, %rdx
	jb	.LBB219_12
	movabsq	$5412103778470702295, %r11
	jne	.LBB219_11
	cmpq	%r11, %rcx
	jb	.LBB219_12
	jne	.LBB219_11
	movabsq	$7239337960414712511, %rbx
	cmpq	%rbx, %rax
	jb	.LBB219_12
	jne	.LBB219_11
	movabsq	$7435674573564081700, %rbx
	cmpq	%rbx, %r10
	jb	.LBB219_12
	jne	.LBB219_11
	movabsq	$2210141511517208575, %rbx
	cmpq	%rbx, %r8
	jb	.LBB219_12
	jne	.LBB219_11
	movabsq	$-5044313057631688021, %rbx
	cmpq	%rbx, %r9
	jb	.LBB219_12
.LBB219_11:
	movabsq	$-5044313057631688021, %rbx
	subq	%rbx, %r9
	movabsq	$2210141511517208575, %rbx
	sbbq	%rbx, %r8
	movq	%r9, (%rdi)
	movabsq	$7435674573564081700, %rbx
	sbbq	%rbx, %r10
	movq	%r8, 8(%rdi)
	movabsq	$7239337960414712511, %rbx
	sbbq	%rbx, %rax
	movq	%r10, 16(%rdi)
	movq	%rax, 24(%rdi)
	sbbq	%r11, %rcx
	movq	%rcx, 32(%rdi)
	sbbq	%rsi, %rdx
	movq	%rdx, 40(%rdi)
.LBB219_12:
	popq	%rbx
	.cfi_def_cfa_offset 8
	retq
.Lfunc_end219:

They are virtually identical, yet there is a discrepancy in their benchmarks.

Notice, however, that there is already inefficiency in moving data to and from cache/registers.

This strengthens my conviction that we should move towards a full .S-style ASM system and convert more functions into pure ASM.

@jon-chuang
Contributor Author

I've decided to stop trusting the benchmarks for individual field/biginteger methods and to rely more on the overall benchmarks for the larger functions; there seems to be some weird discrepancy, and I can't pinpoint its source.

@jon-chuang
Contributor Author

jon-chuang commented Feb 5, 2021

Here are some comparisons with gurvy on my Ryzen 4800H (4.2GHz boost clock):
Gurvy timings in brackets.
bls12_381 Jacobian:
add 517ns (509ns)
mixed add 393ns (366ns)
double 257ns (286ns)

So it appears we are actually not too far off, and the gap is at most 10%.

In fact, some of these results suggest that the benchmarks at https://hackmd.io/@zkteam/eccbench were taken erroneously and may have used the wrong timings (extended Jacobian rather than the Jacobian repr); this has since been confirmed.

However, we are still lagging significantly in G2 timings. This is at least partly because Fp towers are still laggard.

@ValarDragon
Member

ValarDragon commented Feb 6, 2021

Something worth investigating is why there is such a large discrepancy between fq12 and fq additions. Per the benchmarks in the OP, fq12::add_assign / fq::add_assign = 22.76, whereas we'd expect this ratio to be 12.

Maybe inlining isn't functioning properly here.
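
For context on the expected factor of 12 (schematic only, my own types, not the actual ark-ff tower definitions): extension-field addition is purely coefficient-wise, so an fq12 add should amount to 12 fq adds, and if the base-field add_assign is not inlined across the tower layers, the per-call overhead is paid 12 times.

// Schematic: in ark-ff the tower is built as Fq -> Fq2 -> Fq6 -> Fq12, but
// since addition is coefficient-wise at every layer it flattens to 12 Fq adds.
pub struct Fq([u64; 6]);
pub struct Fq12([Fq; 12]);

impl Fq {
    #[inline(always)] // without inlining, 12 call overheads stack up per Fq12 add
    fn add_assign(&mut self, _other: &Fq) {
        // ... limb-level add plus a conditional subtraction of the modulus ...
    }
}

impl Fq12 {
    pub fn add_assign(&mut self, other: &Fq12) {
        for (a, b) in self.0.iter_mut().zip(other.0.iter()) {
            a.add_assign(b); // 12 base-field additions and nothing else
        }
    }
}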

@mratsim

mratsim commented Feb 10, 2021

Some comments:

  • LLVM will never generate adcx/adox even with the intrinsics after this patch https://reviews.llvm.org/D51754
  • addcarry (without x) does generate optimal code
  • Clang has a specialization, __builtin_addcll; I'm not sure whether Rust exposes it, but it would be portable to non-x86 (see the sketch below).
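
For the portability point, a plain-Rust equivalent would look roughly like this (a sketch; whether it actually compiles down to a single adc chain is then up to LLVM):

// Sketch: portable add-with-carry in plain Rust, roughly what
// __builtin_addcll expresses in C, with no architecture-specific code.
#[inline(always)]
fn addc(a: u64, b: u64, carry_in: bool) -> (u64, bool) {
    let (t, c1) = a.overflowing_add(b);
    let (sum, c2) = t.overflowing_add(carry_in as u64);
    (sum, c1 | c2)
}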

You can benchmark Constantine the following way, assuming you have Nim 1.4.2 installed.

git clone https://github.com/mratsim/constantine
cd constantine
nimble bench_fp_clang
nimble bench_fp2_clang
nimble bench_fp12_clang
nimble bench_ec_g1_clang
nimble bench_ec_g2_clang
nimble bench_pairing_bls12_381_clang

At some point I will add a per-curve recap as well.

There will be inline assembly generated in the nimcache folder.

> However, we are still lagging significantly in G2 timings. This is at least partly because Fp towers are still laggard.

In my case I had a significant issue with Nim parameter passing with concepts; after the upstream fix nim-lang/Nim#16897, or after dropping concepts altogether (so that I don't need to wait for a new compiler, mratsim/constantine#153), I improved pairing perf by 28%.

@Pratyush
Member

btw @mratsim this is a great resource, thanks! https://github.com/mratsim/constantine/blob/master/docs/optimizations.md
