
What is needed to make this crate ready #1

Closed
dignifiedquire opened this issue Mar 8, 2019 · 35 comments

@dignifiedquire

Hey, I am interested in using this code, as it is quite a lot faster than the implementation I currently rely on (https://github.com/rustcrypto/hashes). So I was wondering what it would take, from your perspective, to get this code ready to publish and use. Happy to help if there is anything I can do.

@oconnor663
Owner

I've been substantially reworking blake2b_simd (https://github.com/oconnor663/blake2b_simd/tree/compress4_loop), and my plan is to try to combine the two crates once that's done. I'm trying to figure out the right way to expose a hash_many interface that takes any number of inputs from the caller, of any length, and does the best it can. But that ends up being a lot more complicated than update8.

One thing I could really use help with is how to approach code sharing between BLAKE2b and BLAKE2s. I've seen people use giant macros to define the entire crate, such that you can substitute in u32 or u64. But a) giant macros make me sad, and b) the various SIMD implementations make things more difficult, with different intrinsic names and explicit array unpackings of different lengths. So my current best plan is to duplicate a lot of code. Let me know if you have any better ideas :-D

@dignifiedquire
Author

One thing I could really use help with, is how to approach code sharing between BLAKE2b and BLAKE2s.

I'll take a look at the latest code and see if I have any ideas.

One other question: have you given any thought to being compatible with https://crates.io/crates/digest from the RustCrypto project? This would be really useful, since it makes it much nicer in many places to abstract over the hash function.

@oconnor663
Owner

When I looked at it before, it wasn't clear to me that a complete implementation of BLAKE2 would fit well with the traits there. There's the variable output length, the other input parameters, and the final node flag. Those could be exposed through concrete methods apart from the traits, of course, but I wasn't sure the result would be graceful. For example, if I construct a digest with a short output length, but the digest type itself implements FixedOutput, then you're in the awkward situation where calling FixedOutput::fixed_result gives you more output bytes than you're supposed to have.

There's probably a reasonable answer to all of this, but I haven't thought about it long enough.

@dignifiedquire
Author

dignifiedquire commented Mar 8, 2019

I was just trying to figure out how to add more configuration to the BLAKE2 impl in there: RustCrypto/hashes#75. I'll give some more thought to how this could fit in there effectively.

I believe the current approach taken by the RustCrypto BLAKE2 impl is that only the standard sizes implement FixedOutput, and the configurable version gives you variable output, which seems to be good enough for most use cases. Of course, it would be great if you could do something like Blake2b::new_fixed(24) and get back a type that implements fixed output of size 24, but I am not sure Rust can express that quite yet.

@oconnor663
Owner

Yeah I assume a lot of this is going to get overhauled when Rust finally ships const generics.
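
Just to sketch the idea (hypothetical names only, nothing here exists in this crate or in digest): with const generics the output length could become a type-level parameter, so a 24-byte digest would be a different type from a 64-byte one and a FixedOutput-style trait could be honest about its size.

```rust
// Hypothetical sketch only: none of these names are real APIs.
pub struct Blake2bFixed<const N: usize> {
    // the real hasher state (parameter block, chaining value, buffer) would live here
}

impl<const N: usize> Blake2bFixed<N> {
    pub fn new() -> Self {
        // a real impl would set digest_length = N in the parameter block
        Blake2bFixed {}
    }

    pub fn finalize(self) -> [u8; N] {
        // placeholder output; a real impl would return the first N hash bytes
        [0u8; N]
    }
}

fn main() {
    // The 24-byte output length is part of the type, known at compile time.
    let out: [u8; 24] = Blake2bFixed::<24>::new().finalize();
    assert_eq!(out.len(), 24);
}
```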

@dignifiedquire
Author

dignifiedquire commented Apr 16, 2019

1. I have found an AVX-512 impl of BLAKE2s in WireGuard, which we could use as inspiration for such an impl: https://git.zx2c4.com/WireGuard/tree/src/crypto/zinc/blake2s/blake2s-x86_64.S. We could also compare their AVX2 impl against this one (unless you already did).

2. Regarding duplication, I think duplicating the functions is probably the best way to keep a navigable implementation for now. What I would personally prefer is to use macros for things like rounds inside functions, as that makes things easier to read and understand in my experience (rough sketch below).
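
Something along these lines for the portable G function, purely as a sketch with made-up names, parameterized over the word type and rotation constants:

```rust
// Illustrative only: not code from either crate.
macro_rules! define_g {
    ($name:ident, $word:ty, $r1:expr, $r2:expr, $r3:expr, $r4:expr) => {
        #[inline(always)]
        fn $name(v: &mut [$word; 16], a: usize, b: usize, c: usize, d: usize,
                 x: $word, y: $word) {
            v[a] = v[a].wrapping_add(v[b]).wrapping_add(x);
            v[d] = (v[d] ^ v[a]).rotate_right($r1);
            v[c] = v[c].wrapping_add(v[d]);
            v[b] = (v[b] ^ v[c]).rotate_right($r2);
            v[a] = v[a].wrapping_add(v[b]).wrapping_add(y);
            v[d] = (v[d] ^ v[a]).rotate_right($r3);
            v[c] = v[c].wrapping_add(v[d]);
            v[b] = (v[b] ^ v[c]).rotate_right($r4);
        }
    };
}

// BLAKE2b mixes 64-bit words with rotations 32, 24, 16, 63;
// BLAKE2s mixes 32-bit words with rotations 16, 12, 8, 7.
define_g!(g_blake2b, u64, 32, 24, 16, 63);
define_g!(g_blake2s, u32, 16, 12, 8, 7);
```

As you say, the SIMD backends are where this gets ugly, since the intrinsic names and unpacking code differ between widths.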

@oconnor663
Owner

It looks like WireGuard is GPL. Since my crates are intended to be MIT, I'm afraid I can't use GPL code for inspiration. (Unless an OSS lawyer comes along and tells me I have that wrong?) It's awesome that it's there though.

I'm getting very close to a big blake2b_simd release, with the new hash_many / update_many interface. Maybe I'll get it out today. Once it's out I'll be able to start merging these two implementations.

@dignifiedquire
Author

It looks like WireGuard is GPL. Since my crates are intended to be MIT, I'm afraid I can't use GPL code for inspiration. (Unless an OSS lawyer comes along and tells me I have that wrong?) It's awesome that it's there though.

Yeah that is tricky to navigate :/ I have heard advice in both directions. Knowing it exists is great though, in any case.

I'm getting very close to a big blake2b_simd release, with the new hash_many / update_many interface. Maybe I'll get it out today. Once it's out I'll be able to start merging these two implementations.

awesome

@dignifiedquire
Author

Not promising anything, but if I find the time I'll take a stab at converting the existing impl to AVX-512 myself. I really want to know how fast this can go.

@oconnor663
Owner

If you look at avx2.rs:blake2s_round_8x and sse41.rs:round_4 (inconsistently named at the moment), you'll see that the round functions of the AVX2 implementation and the SSE4.1 implementation are essentially identical, just with different vector widths throughout. It should be pretty straightforward to do the same thing with AVX512 widths. And yes, I'm very curious what the overall throughput is going to be.
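
For concreteness, one BLAKE2s G step at AVX-512 width would look roughly like this (a sketch, not actual crate code; it assumes bindings for the AVX-512F intrinsics, written Intel-style with the rotate amount as a plain argument). Each register holds one state word across 16 independent inputs:

```rust
use core::arch::x86_64::*;

// Sketch of one BLAKE2s G step over 16 transposed states, one 32-bit lane per input.
#[target_feature(enable = "avx512f")]
unsafe fn g_step(
    a: &mut __m512i, b: &mut __m512i, c: &mut __m512i, d: &mut __m512i,
    mx: __m512i, my: __m512i,
) {
    *a = _mm512_add_epi32(_mm512_add_epi32(*a, *b), mx);
    *d = _mm512_ror_epi32(_mm512_xor_si512(*d, *a), 16);
    *c = _mm512_add_epi32(*c, *d);
    *b = _mm512_ror_epi32(_mm512_xor_si512(*b, *c), 12);
    *a = _mm512_add_epi32(_mm512_add_epi32(*a, *b), my);
    *d = _mm512_ror_epi32(_mm512_xor_si512(*d, *a), 8);
    *c = _mm512_add_epi32(*c, *d);
    *b = _mm512_ror_epi32(_mm512_xor_si512(*b, *c), 7);
}
```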

By the way, this is where I think the hash_many / bao APIs start to get really interesting. With AVX2 they have equivalent throughput to BLAKE2bp/BLAKE2sp. But with AVX512, they start to pull ahead, because BLAKE2bp/BLAKE2sp have fixed parallelism degrees and can't take advantage of the wider vectors.

@oconnor663
Owner

Maybe I'll get it out today.

lol no, ran into an issue with secret keys that's got me reworking the API again

@dignifiedquire
Author

Well, I am working on adding the AVX-512 intrinsics to Rust in the first place, so it's no "quick experiment" :D

@dignifiedquire
Author

First benchmarks are looking very promising:

test bench_blake2s_avx2_compress8                   ... bench:         260 ns/iter (+/- 0) = 1969 MB/s
test bench_blake2s_avx2_compress8_transposed_all    ... bench:         215 ns/iter (+/- 0) = 2381 MB/s
test bench_blake2s_avx2_compress8_vectorized        ... bench:         227 ns/iter (+/- 0) = 2255 MB/s
test bench_blake2s_avx512_compress16_transposed_all ... bench:         206 ns/iter (+/- 0) = 4970 MB/s

Code: https://github.com/dignifiedquire/blake2s_simd/blob/feat/avx512/src/avx512.rs
Stdsimd fork: https://github.com/dignifiedquire/stdsimd/tree/feat/avx512f

@oconnor663
Owner

I...wow...that's kind of beyond my wildest dreams for the instruction set. How can it be more than a factor of two? Does that mean it's faster to put a 256-bit vector in a 512-bit register and waste half of it, than it is to work on it in a 256-bit register?

@dignifiedquire
Author

there are two possible reasons

(1) I messed up somewhere
(2) there are new instructions for rotation in avx512f, which make the rotations more efficient
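
To illustrate (2) with a sketch (intrinsic bindings assumed; exact signatures depend on the stdsimd version): on AVX2 a 32-bit rotate has to be emulated with two shifts and an OR (or a byte shuffle for the 8/16-bit cases), while AVX-512F has a dedicated rotate instruction.

```rust
use core::arch::x86_64::*;

// AVX2: emulate right-rotate by 12 with shift + shift + OR (three ops).
#[target_feature(enable = "avx2")]
unsafe fn rot12_avx2(x: __m256i) -> __m256i {
    _mm256_or_si256(_mm256_srli_epi32(x, 12), _mm256_slli_epi32(x, 20))
}

// AVX-512F: the same rotate is a single vprord instruction.
#[target_feature(enable = "avx512f")]
unsafe fn rot12_avx512(x: __m512i) -> __m512i {
    _mm512_ror_epi32(x, 12)
}
```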

@oconnor663
Owner

which make the rotations more efficient

Good point! God damn I'm so excited to try this out.

@dignifiedquire
Author

Those instructions make a good difference. I added them to the SSE4.1 version of compress, with the following results:

before

test bench_blake2s_sse41_compress                   ... bench:         113 ns/iter (+/- 0) = 566 MB/s
test bench_blake2s_sse41_compress4_transposed       ... bench:         227 ns/iter (+/- 0) = 1127 MB/s

after

test bench_blake2s_sse41_compress                   ... bench:          96 ns/iter (+/- 0) = 666 MB/s
test bench_blake2s_sse41_compress4_transposed       ... bench:         205 ns/iter (+/- 0) = 1248 MB/s

@oconnor663
Owner

Oh interesting, I might be confused. Does AVX512 add rotation instructions that apply to 128-bit vectors? I didn't realize that.

@dignifiedquire
Author

Does AVX512 add rotation instructions that apply to 128-bit vectors? I didn't realize that.

It does: https://software.intel.com/sites/landingpage/IntrinsicsGuide/#expand=33,34,4990,4709,5146,5137&text=_mm_ror
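
(Those 128-bit and 256-bit forms are EVEX-encoded, so they need AVX512VL on top of AVX512F. A quick sketch, assuming the bindings exist:)

```rust
use core::arch::x86_64::*;

// The 128-bit rotate needs both features enabled, since the EVEX encoding of
// vprord on xmm registers comes from the AVX-512VL extension.
#[target_feature(enable = "avx512f,avx512vl")]
unsafe fn rot7_128(x: __m128i) -> __m128i {
    _mm_ror_epi32(x, 7)
}
```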

@dignifiedquire
Author

dignifiedquire commented Apr 18, 2019

I think it should also be possible to construct an 8x version of BLAKE2s and a 4x version of BLAKE2b using AVX-512, by combining things inside a round in the same way the SSE4.1 implementation works, or am I missing something? This would be very nice, because it would let BLAKE2bp and BLAKE2sp use the speedups too.

Edit: With the difference that only 2 ops instead of 4 ops are combined per round.

@oconnor663
Owner

Yes, totally possible. The BLAKE2bp/sp case sounds like the best reason to do it, since that's restricted to 4/8 inputs. There could also be some speedups for the update_many API or for bao in the case where the caller doesn't have enough inputs to drive compress16, though our flashiest benchmarks will be getting all of their performance out of compress16.

@oconnor663
Owner

I'm really interested to see what speedup we can get for compress1 with these rotations too. Speeding up standard BLAKE2b/s is a big deal, since that's so commonly used. It would also help performance for bao in the very short input case.

@oconnor663
Owner

Finally finished the blake2b_simd rework: https://github.com/oconnor663/blake2b_simd/releases/tag/0.5.0

@dignifiedquire
Author

great stuff 💯

@oconnor663 any chance you could publish the current master of this crate to crates.io? I am trying to get the filecoin-project/rust-fil-proofs crates published and we are already using this, pulling it through git.

@oconnor663
Owner

oconnor663 commented May 17, 2019

I'm a little hesitant to publish this repo as-is, since the current version isn't well tested. Would it be alright if I tried to get a new version out within the next few days, based on the current blake2b_simd code?

@dignifiedquire
Author

That would be really cool, if you can do that!

@dignifiedquire
Author

@oconnor663 how are things going? Anything I can help with?

@oconnor663
Owner

It's going well. You can follow along in this branch: https://github.com/oconnor663/blake2b_simd/tree/copying_blake2s

Fingers crossed I'll have a first published release of blake2s_simd tomorrow.

@oconnor663
Owner

We're live! https://crates.io/crates/blake2s_simd

@dignifiedquire
Author

Awesome, thank you!!!

@oconnor663
Owner

oconnor663 commented May 22, 2019

Let me know how the new interface works for you, especially if you touch anything in many::.

@oconnor663
Owner

oconnor663 commented Jun 22, 2019

@dignifiedquire have you kept working on the AVX512 stuff? I've noticed an issue when I've tried to do it myself, and it seems to repro on your branch too. If I write a function like this:

#[target_feature(enable = "avx512f")]
pub unsafe fn example(a: __m512i) -> __m512i {
    _mm512_ror_epi32(a, 1)
}

The assembly output is this:

 mov     rax, rdi
 vprorq    zmm0, zmmword ptr [rsi], 1
 vmovdqa64 zmmword ptr [rdi], zmm0
 vzeroupper
 ret

I think that vprorq is wrong. It should be vprord, right? For 32-bit "doublewords"? This seems to give wrong answers when I test it too.

Did you ever notice this corrupting your output? Is this some kind of LLVM bug?
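
For reference, this is the kind of check where the wrong answers show up (a sketch, assuming the intrinsic bindings from your branch). Each 64-bit lane holds 1, so its two 32-bit halves differ, and a 32-bit rotate and a 64-bit rotate give different results:

```rust
use core::arch::x86_64::*;

// A correct 32-bit rotate-right-by-1 (vprord) turns each qword 0x00000000_00000001
// into 0x00000000_80000000, while the 64-bit rotate (vprorq) would instead
// produce 0x80000000_00000000.
#[target_feature(enable = "avx512f")]
unsafe fn ror_is_32_bit() -> bool {
    let input = _mm512_set1_epi64(1);
    let rotated = _mm512_ror_epi32(input, 1);
    let expected = _mm512_set1_epi64(0x8000_0000);
    // all 16 dword lanes must match for the mask to be 0xffff
    _mm512_cmpeq_epi32_mask(rotated, expected) == 0xffff
}
```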

@dignifiedquire
Author

dignifiedquire commented Jun 23, 2019 via email

@oconnor663
Owner

Yeah I was experimenting with some bindings of my own and comparing that to your feat/avx512f branch.

@oconnor663
Owner

Oh, I think I figured it out. The llvm.fshr intrinsic is polymorphic over its argument type, and __m512i is interpreted as i64x8. But if I explicitly pass an i32x16 to it, then I get the right behavior.
