
What is needed to make this crate ready #1

Closed
dignifiedquire opened this issue Mar 8, 2019 · 35 comments

@dignifiedquire

Hey, I am interested in using this code, as it is quite a lot faster than the implementation I currently rely on (https://github.com/rustcrypto/hashes). So I was wondering what it would take, from your perspective, to get this code ready to publish and use. Happy to help if there is anything I can do.

@oconnor663
Owner

I've been substantially reworking blake2b_simd (https://github.com/oconnor663/blake2b_simd/tree/compress4_loop), and my plan is to try to combine the two crates once that's done. I'm trying to figure out the right way to expose a hash_many interface that takes any number of inputs from the caller, of any length, and does the best it can. But that ends up being a lot more complicated than update8.

One thing I could really use help with is how to approach code sharing between BLAKE2b and BLAKE2s. I've seen people use giant macros to define the entire crate, such that you can substitute in u32 or u64. But a) giant macros make me sad, and b) the various SIMD implementations make things more difficult, with different intrinsic names and explicit array unpackings of different lengths. So my current best plan is to duplicate a lot of code. Let me know if you have any better ideas :-D

@dignifiedquire
Author

One thing I could really use help with, is how to approach code sharing between BLAKE2b and BLAKE2s.

I'll take a look at the latest code and see if I have any ideas.

One other question: have you given any thought to being compatible with https://crates.io/crates/digest from the RustCrypto project? This would be really useful, since it makes it much nicer in many places to abstract over the hash function.

@oconnor663
Owner

When I looked at it before, it wasn't clear to me that a complete implementation of BLAKE2 would fit well with the traits there. There's the variable output length, the other input parameters, and the final node flag. Those could be exposed through concrete methods apart from the traits, of course, but I wasn't sure the result would be graceful. For example, if I construct a digest with a short output length, but the digest type itself implements FixedOutput, then you're in the awkward situation where calling FixedOutput::fixed_result gives you more output bytes than you're supposed to have.

There's probably a reasonable answer to all of this, but I haven't thought about it long enough.

@dignifiedquire
Author

dignifiedquire commented Mar 8, 2019

I was just trying to figure out how to add more configuration to the BLAKE2 impl in there: RustCrypto/hashes#75. I'll give some more thought to how this could fit in there effectively.

I believe the current approach taken by the RustCrypto BLAKE2 impl is that only the standard sizes implement FixedOutput, and the configurable version gives you variable output, which seems to be good enough for most use cases. Of course, it would be great if you could do something like Blake2b::new_fixed(24) and get back a type that implements fixed output of size 24, but I am not sure Rust can express that quite yet.

@oconnor663
Owner

Yeah I assume a lot of this is going to get overhauled when Rust finally ships const generics.
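
Just to sketch the idea (hypothetical names only, nothing here exists in this crate or in digest): with const generics the output length could become a type-level parameter, so a 24-byte digest would be a different type from a 64-byte one and a FixedOutput-style trait could be honest about its size.

```rust
// Hypothetical sketch only: none of these names are real APIs.
pub struct Blake2bFixed<const N: usize> {
    // the real hasher state (parameter block, chaining value, buffer) would live here
}

impl<const N: usize> Blake2bFixed<N> {
    pub fn new() -> Self {
        // a real impl would set digest_length = N in the parameter block
        Blake2bFixed {}
    }

    pub fn finalize(self) -> [u8; N] {
        // placeholder output; a real impl would return the first N hash bytes
        [0u8; N]
    }
}

fn main() {
    // The 24-byte output length is part of the type, known at compile time.
    let out: [u8; 24] = Blake2bFixed::<24>::new().finalize();
    assert_eq!(out.len(), 24);
}
```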

@dignifiedquire
Author

dignifiedquire commented Apr 16, 2019

1. I have found an AVX-512 impl of BLAKE2s in WireGuard, which we could use as inspiration for such an impl: https://git.zx2c4.com/WireGuard/tree/src/crypto/zinc/blake2s/blake2s-x86_64.S. We could also compare their AVX2 impl against this one (unless you already did).

2. Regarding duplication, I think duplicating the functions is probably the best way to keep a navigable implementation for now. What I would personally prefer is to use macros for things like rounds inside functions, as that makes things easier to read and understand in my experience (rough sketch below).
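
Something along these lines for the portable G function, purely as a sketch with made-up names, parameterized over the word type and rotation constants:

```rust
// Illustrative only: not code from either crate.
macro_rules! define_g {
    ($name:ident, $word:ty, $r1:expr, $r2:expr, $r3:expr, $r4:expr) => {
        #[inline(always)]
        fn $name(v: &mut [$word; 16], a: usize, b: usize, c: usize, d: usize,
                 x: $word, y: $word) {
            v[a] = v[a].wrapping_add(v[b]).wrapping_add(x);
            v[d] = (v[d] ^ v[a]).rotate_right($r1);
            v[c] = v[c].wrapping_add(v[d]);
            v[b] = (v[b] ^ v[c]).rotate_right($r2);
            v[a] = v[a].wrapping_add(v[b]).wrapping_add(y);
            v[d] = (v[d] ^ v[a]).rotate_right($r3);
            v[c] = v[c].wrapping_add(v[d]);
            v[b] = (v[b] ^ v[c]).rotate_right($r4);
        }
    };
}

// BLAKE2b mixes 64-bit words with rotations 32, 24, 16, 63;
// BLAKE2s mixes 32-bit words with rotations 16, 12, 8, 7.
define_g!(g_blake2b, u64, 32, 24, 16, 63);
define_g!(g_blake2s, u32, 16, 12, 8, 7);
```

As you say, the SIMD backends are where this gets ugly, since the intrinsic names and unpacking code differ between widths.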

@oconnor663
Owner

It looks like WireGuard is GPL. Since my crates are intended to be MIT, I'm afraid I can't use GPL code for inspiration. (Unless an OSS lawyer comes along and tells me I have that wrong?) It's awesome that it's there though.

I'm getting very close to a big blake2b_simd release, with the new hash_many / update_many interface. Maybe I'll get it out today. Once it's out I'll be able to start merging these two implementations.

@dignifiedquire
Author

It looks like WireGuard is GPL. Since my crates are intended to be MIT, I'm afraid I can't use GPL code for inspiration. (Unless an OSS lawyer comes along and tells me I have that wrong?) It's awesome that it's there though.

Yeah that is tricky to navigate :/ I have heard advice in both directions. Knowing it exists is great though, in any case.

I'm getting very close to a big blake2b_simd release, with the new hash_many / update_many interface. Maybe I'll get it out today. Once it's out I'll be able to start merging these two implementations.

awesome

@dignifiedquire
Author

Not promising anything, but if I find the time I'll take a stab at converting the existing impl to AVX-512 myself. I really want to know how fast this can go.

@oconnor663
Owner

If you look at avx2.rs:blake2s_round_8x and sse41.rs:round_4 (inconsistently named at the moment), you'll see that the round functions of the AVX2 implementation and the SSE4.1 implementation are essentially identical, just with different vector widths throughout. It should be pretty straightforward to do the same thing with AVX512 widths. And yes, I'm very curious what the overall throughput is going to be.
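
For concreteness, one BLAKE2s G step at AVX-512 width would look roughly like this (a sketch, not actual crate code; it assumes bindings for the AVX-512F intrinsics, written Intel-style with the rotate amount as a plain argument). Each register holds one state word across 16 independent inputs:

```rust
use core::arch::x86_64::*;

// Sketch of one BLAKE2s G step over 16 transposed states, one 32-bit lane per input.
#[target_feature(enable = "avx512f")]
unsafe fn g_step(
    a: &mut __m512i, b: &mut __m512i, c: &mut __m512i, d: &mut __m512i,
    mx: __m512i, my: __m512i,
) {
    *a = _mm512_add_epi32(_mm512_add_epi32(*a, *b), mx);
    *d = _mm512_ror_epi32(_mm512_xor_si512(*d, *a), 16);
    *c = _mm512_add_epi32(*c, *d);
    *b = _mm512_ror_epi32(_mm512_xor_si512(*b, *c), 12);
    *a = _mm512_add_epi32(_mm512_add_epi32(*a, *b), my);
    *d = _mm512_ror_epi32(_mm512_xor_si512(*d, *a), 8);
    *c = _mm512_add_epi32(*c, *d);
    *b = _mm512_ror_epi32(_mm512_xor_si512(*b, *c), 7);
}
```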

By the way, this is where I think the hash_many / bao APIs start to get really interesting. With AVX2 they have equivalent throughput to BLAKE2bp/BLAKE2sp. But with AVX512, they start to pull ahead, because BLAKE2bp/BLAKE2sp have fixed parallelism degrees and can't take advantage of the wider vectors.

@oconnor663
Owner

Maybe I'll get it out today.

lol no, ran into an issue with secret keys that's got me reworking the API again

@dignifiedquire
Author

Well, I am working on adding the AVX-512 intrinsics to Rust in the first place, so it's no "quick experiment" :D

@dignifiedquire
Author

First benchmarks are looking very promising:

test bench_blake2s_avx2_compress8                   ... bench:         260 ns/iter (+/- 0) = 1969 MB/s
test bench_blake2s_avx2_compress8_transposed_all    ... bench:         215 ns/iter (+/- 0) = 2381 MB/s
test bench_blake2s_avx2_compress8_vectorized        ... bench:         227 ns/iter (+/- 0) = 2255 MB/s
test bench_blake2s_avx512_compress16_transposed_all ... bench:         206 ns/iter (+/- 0) = 4970 MB/s

Code: https://github.com/dignifiedquire/blake2s_simd/blob/feat/avx512/src/avx512.rs
Stdsimd fork: https://github.com/dignifiedquire/stdsimd/tree/feat/avx512f

@oconnor663
Owner

I...wow...that's kind of beyond my wildest dreams for the instruction set. How can it be more than a factor of two? Does that mean it's faster to put a 256-bit vector in a 512-bit register and waste half of it, than it is to work on it in a 256-bit register?

@dignifiedquire
Author

there are two possible reasons

(1) I messed up somewhere
(2) there are new instructions for rotation in avx512f, which make the rotations more efficient
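
To illustrate (2) with a sketch (intrinsic bindings assumed; exact signatures depend on the stdsimd version): on AVX2 a 32-bit rotate has to be emulated with two shifts and an OR (or a byte shuffle for the 8/16-bit cases), while AVX-512F has a dedicated rotate instruction.

```rust
use core::arch::x86_64::*;

// AVX2: emulate right-rotate by 12 with shift + shift + OR (three ops).
#[target_feature(enable = "avx2")]
unsafe fn rot12_avx2(x: __m256i) -> __m256i {
    _mm256_or_si256(_mm256_srli_epi32(x, 12), _mm256_slli_epi32(x, 20))
}

// AVX-512F: the same rotate is a single vprord instruction.
#[target_feature(enable = "avx512f")]
unsafe fn rot12_avx512(x: __m512i) -> __m512i {
    _mm512_ror_epi32(x, 12)
}
```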

@oconnor663
Owner

which make the rotations more efficient

Good point! God damn I'm so excited to try this out.

@dignifiedquire
Author

Those instructions make a good difference. I added them to the SSE4.1 version of compress, with the following results:

before

test bench_blake2s_sse41_compress                   ... bench:         113 ns/iter (+/- 0) = 566 MB/s
test bench_blake2s_sse41_compress4_transposed       ... bench:         227 ns/iter (+/- 0) = 1127 MB/s

after

test bench_blake2s_sse41_compress                   ... bench:          96 ns/iter (+/- 0) = 666 MB/s
test bench_blake2s_sse41_compress4_transposed       ... bench:         205 ns/iter (+/- 0) = 1248 MB/s

@oconnor663
Owner

Oh interesting, I might be confused. Does AVX512 add rotation instructions that apply to 128-bit vectors? I didn't realize that.

@dignifiedquire
Author

Does AVX512 add rotation instructions that apply to 128-bit vectors? I didn't realize that.

It does: https://software.intel.com/sites/landingpage/IntrinsicsGuide/#expand=33,34,4990,4709,5146,5137&text=_mm_ror
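
(Those 128-bit and 256-bit forms are EVEX-encoded, so they need AVX512VL on top of AVX512F. A quick sketch, assuming the bindings exist:)

```rust
use core::arch::x86_64::*;

// The 128-bit rotate needs both features enabled, since the EVEX encoding of
// vprord on xmm registers comes from the AVX-512VL extension.
#[target_feature(enable = "avx512f,avx512vl")]
unsafe fn rot7_128(x: __m128i) -> __m128i {
    _mm_ror_epi32(x, 7)
}
```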

@dignifiedquire
Author

dignifiedquire commented Apr 18, 2019

I think it should also be possible to construct an 8x version of BLAKE2s and a 4x version of BLAKE2b using AVX-512, by combining things inside a round in the same way the SSE4.1 implementation works, or am I missing something? This would be very nice, because it would let BLAKE2bp and BLAKE2sp use the speedups too.

Edit: With the difference that only 2 ops instead of 4 ops are combined per round.

@oconnor663
Owner

Yes, totally possible. The BLAKE2bp/sp case sounds like the best reason to do it, since that's restricted to 4/8 inputs. There could also be some speedups for the update_many API or for bao in the case where the caller doesn't have enough inputs to drive compress16, though our flashiest benchmarks will be getting all of their performance out of compress16.

@oconnor663
Owner

I'm really interested to see what speedup we can get for compress1 with these rotations too. Speeding up standard BLAKE2b/s is a big deal, since that's so commonly used. It would also help performance for bao in the very short input case.

@oconnor663
Owner

Finally finished the blake2b_simd rework: https://github.com/oconnor663/blake2b_simd/releases/tag/0.5.0

@dignifiedquire
Author

great stuff 💯

@oconnor663 any chance you could publish the current master of this crate to crates.io? I am trying to get the filecoin-project/rust-fil-proofs crates published and we are already using this, pulling it through git.

@oconnor663
Owner

oconnor663 commented May 17, 2019

I'm a little hesitant to publish this repo as-is, since the current version isn't well tested. Would it be alright if I tried to get a new version out within the next few days, based on the current blake2b_simd code?

@dignifiedquire
Author

That would be really cool, if you can do that!

@dignifiedquire
Author

@oconnor663 how are things going? Anything I can help with?

@oconnor663
Owner

It's going well. You can follow along in this branch: https://github.com/oconnor663/blake2b_simd/tree/copying_blake2s

Fingers crossed I'll have a first published release of blake2s_simd tomorrow.

@oconnor663
Owner

We're live! https://crates.io/crates/blake2s_simd

@dignifiedquire
Author

Awesome, thank you!!!

@oconnor663
Owner

oconnor663 commented May 22, 2019

Let me know how the new interface works for you, especially if you touch anything in many::.

@oconnor663
Owner

oconnor663 commented Jun 22, 2019

@dignifiedquire have you kept working on the AVX512 stuff? I've noticed an issue when I've tried to do it myself, and it seems to repro on your branch too. If I write a function like this:

#[target_feature(enable = "avx512f")]
pub unsafe fn example(a: __m512i) -> __m512i {
    _mm512_ror_epi32(a, 1)
}

The assembly output is this:

 mov     rax, rdi
 vprorq    zmm0, zmmword ptr [rsi], 1
 vmovdqa64 zmmword ptr [rdi], zmm0
 vzeroupper
 ret

I think that vprorq is wrong. It should be vprord, right? For 32-bit "doublewords"? This seems to give wrong answers when I test it too.

Did you ever notice this corrupting your output? Is this some kind of LLVM bug?
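
For reference, this is the kind of check where the wrong answers show up (a sketch, assuming the intrinsic bindings from your branch). Each 64-bit lane holds 1, so its two 32-bit halves differ, and a 32-bit rotate and a 64-bit rotate give different results:

```rust
use core::arch::x86_64::*;

// A correct 32-bit rotate-right-by-1 (vprord) turns each qword 0x00000000_00000001
// into 0x00000000_80000000, while the 64-bit rotate (vprorq) would instead
// produce 0x80000000_00000000.
#[target_feature(enable = "avx512f")]
unsafe fn ror_is_32_bit() -> bool {
    let input = _mm512_set1_epi64(1);
    let rotated = _mm512_ror_epi32(input, 1);
    let expected = _mm512_set1_epi64(0x8000_0000);
    // all 16 dword lanes must match for the mask to be 0xffff
    _mm512_cmpeq_epi32_mask(rotated, expected) == 0xffff
}
```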

@dignifiedquire
Author

dignifiedquire commented Jun 23, 2019 via email

@oconnor663
Owner

Yeah I was experimenting with some bindings of my own and comparing that to your feat/avx512f branch.

@oconnor663
Owner

Oh, I think I figured it out. The llvm.fshr intrinsic is polymorphic over its argument type, and __m512i is interpreted as i64x8. But if I explicitly pass an i32x16 to it, then I get the right behavior.
