switch the BLAKE2 implementation to blake2b_simd/blake2s_simd #88
Ah, it seems like one issue is that |
Greetings! I would strongly prefer to merge your implementations into Also how do you handle feature detection? Are you using CPUID directly or utilize Regarding MSRV, in |
Ah no I hadn't. Looking at that branch now, the BLAKE2b long input performance seems to be about 1.5% improved relative to master. However, enabling the I also notice that that branch adds BLAKE2bp and BLAKE2sp implementations, with comparable performance to BLAKE2b and BLAKE2s. However, this way of implementing the parallel variants (calling compress in a loop on each of the inner instances) isn't the best. It turns out that if you reorganize your SIMD vectors so that they hold a state word from each input, the performance goes through the roof, I think mainly because you never need to diagonalize the vectors. Here are the relevant numbers from blake2_simd:
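To make the "one state word from each input per vector" idea concrete, here is a rough portable sketch. It uses plain arrays in place of SIMD registers, and the names `Lanes`, `g_scalar`, `g_lanes`, and `lanes_match_scalar` are made up for illustration; this is not the code from blake2_simd. The point is that in the transposed layout, a diagonal step just names different registers, so no lane shuffling is needed.

```rust
// Portable illustration only: plain arrays stand in for SIMD registers.
// All names here are hypothetical, not blake2_simd APIs.

/// One state word taken from each of four independent BLAKE2b inputs.
type Lanes = [u64; 4];

/// The BLAKE2b G mixing function applied to a single 16-word state.
fn g_scalar(v: &mut [u64; 16], a: usize, b: usize, c: usize, d: usize, x: u64, y: u64) {
    v[a] = v[a].wrapping_add(v[b]).wrapping_add(x);
    v[d] = (v[d] ^ v[a]).rotate_right(32);
    v[c] = v[c].wrapping_add(v[d]);
    v[b] = (v[b] ^ v[c]).rotate_right(24);
    v[a] = v[a].wrapping_add(v[b]).wrapping_add(y);
    v[d] = (v[d] ^ v[a]).rotate_right(16);
    v[c] = v[c].wrapping_add(v[d]);
    v[b] = (v[b] ^ v[c]).rotate_right(63);
}

/// The same G function applied to four states at once. Each `v[w]` holds
/// word `w` of all four states, so every operation is purely lane-wise.
fn g_lanes(v: &mut [Lanes; 16], a: usize, b: usize, c: usize, d: usize, x: Lanes, y: Lanes) {
    for i in 0..4 {
        v[a][i] = v[a][i].wrapping_add(v[b][i]).wrapping_add(x[i]);
        v[d][i] = (v[d][i] ^ v[a][i]).rotate_right(32);
        v[c][i] = v[c][i].wrapping_add(v[d][i]);
        v[b][i] = (v[b][i] ^ v[c][i]).rotate_right(24);
        v[a][i] = v[a][i].wrapping_add(v[b][i]).wrapping_add(y[i]);
        v[d][i] = (v[d][i] ^ v[a][i]).rotate_right(16);
        v[c][i] = v[c][i].wrapping_add(v[d][i]);
        v[b][i] = (v[b][i] ^ v[c][i]).rotate_right(63);
    }
}

/// Checks that the lane-wise version agrees with four scalar runs.
pub fn lanes_match_scalar() -> bool {
    // Four arbitrary, distinct states and message words.
    let mut states = [[0u64; 16]; 4];
    for (i, state) in states.iter_mut().enumerate() {
        for (w, word) in state.iter_mut().enumerate() {
            *word = (i as u64 + 1)
                .wrapping_mul(0x9E37_79B9_7F4A_7C15)
                .rotate_left(w as u32);
        }
    }
    let x: Lanes = [1, 2, 3, 4];
    let y: Lanes = [5, 6, 7, 8];

    // Transpose: lane `i` of word `w` comes from state `i`.
    let mut lanes = [[0u64; 4]; 16];
    for i in 0..4 {
        for w in 0..16 {
            lanes[w][i] = states[i][w];
        }
    }

    // One column step and one diagonal step. The diagonal step only
    // changes which words are named; no lanes are rotated or shuffled.
    g_lanes(&mut lanes, 0, 4, 8, 12, x, y);
    g_lanes(&mut lanes, 0, 5, 10, 15, x, y);
    for i in 0..4 {
        g_scalar(&mut states[i], 0, 4, 8, 12, x[i], y[i]);
        g_scalar(&mut states[i], 0, 5, 10, 15, x[i], y[i]);
    }

    (0..4).all(|i| (0..16).all(|w| lanes[w][i] == states[i][w]))
}
```

A real implementation would replace the inner `for i in 0..4` loop with one vector instruction per line, which is where the speedup would come from.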
I use the macro. With
I haven't done a very careful comparison yet. Trying it just now, it seems like with
I'm not sure we can share that much code between B and S. The SIMD implementations -- particularly message loads that differ from round to round -- are done "by hand" and can't really be generated from a shared macro. We could maybe share some of the boilerplate in the higher level types like Can you say more about why you prefer to vendor the implementation? In my mind it would be a bit of a maintenance burden for both crates. Note that the compression functions have both received substantial optimizations as recently as a couple weeks ago, see oconnor663/blake2_simd@32065b5 and oconnor663/blake2_simd@e26796e. The higher level APIs are also a core dependency of my Bao project, and they're getting tweaks as I get more experience using them. The |
I went ahead and added BLAKE2bp and BLAKE2sp support to this PR, to mirror what was done in #72, since it's not very much additional code. (Note though, I haven't figured out how to add test vectors for those yet.) Here's where things stand performance-wise, as measured with the benchmarks in this crate on my laptop:
If something like this lands, these figures would change the recommendation I'd make for application developers. In particular, I'd recommend that most applications concerned about hash performance use BLAKE2sp, if they have a choice. In addition to being faster on modern x86 machines with AVX2 support, it's also faster on 32-bit embedded systems without SIMD, because of its 32-bit words. The main downside of BLAKE2sp is reduced throughput for very short messages compared to BLAKE2s and BLAKE2b. That's probably not a big deal for most applications, since bottlenecking on hashing performance usually means you're hashing something big, but it's the kind of thing you need to measure. |
Sorry for the late reply! What do you think about moving your crates into this repository? I am ready to give you full access to it, and with the new code owners feature, all PRs affecting the BLAKE2 crates would go through you. |
My first instinct is that it wouldn't make sense to merge them. The Rather than having these odd ducks in the RustCrypto repo, wouldn't it make more sense to just wrap their APIs? If taking a dependency is an issue, I suppose vendoring the relevant parts of the code could also make sense. The core hash implementations are likely to be stable for the foreseeable future, at least until Rust stabilizes AVX-512 and NEON intrinsics. |
This case is covered by the
Keying is supported via Personalized blocks and
It's about improving the visibility of the crates and reducing the number of groups that people have to trust. I would also like to avoid copying code around if possible, so vendoring the code would be sub-optimal as well. |
@oconnor663 would you like to continue to work on this PR, and if so, can you rebase? |
I'd be happy to rebase it on |
fb0ba5e to 0a91489
I've rebased (more like rewritten) the branch on top of current master. As before, I've added support and benchmarks for BLAKE2bp and BLAKE2sp, but I haven't yet added new test vectors for those. There's some complexity due to the fact that my *_simd crates don't support personalization or salting for BLAKE2bp or BLAKE2sp. (I've never seen any implementation that does, and I'm not aware of any test vectors covering those cases, official or otherwise. So I'm hesitant to "innovate" in that direction.) |
Replace the internal implementation of BLAKE2b and BLAKE2s with calls to the blake2b_simd and blake2s_simd crates. Those crates contain optimized implementations for SSE4.1 and AVX2, and they use runtime CPU feature detection to select the best implementation. Running the long-input benchmarks on an Intel i9-9880H with AVX2 support, this change is a performance improvement of about 1.5x for BLAKE2b and 1.35x for BLAKE2s.

This change deletes the undocumented `with_parameter_block` method, as the raw parameter block is not exposed by blake2b_simd or blake2s_simd. Callers who need BLAKE2 tree mode parameters can use the upstream crates directly. They provide a complete set of parameter methods.

This change also deletes the `finalize_last_node` method. This method was arguably attached to the wrong types, `VarBlake2b` and `VarBlake2s`, where it would panic with a non-default output length. It's not very useful without the other tree parameters, so rather than moving it to the fixed-length `Blake2b` and `Blake2s` types where it belongs, we just delete it. This also simplifies the addition of BLAKE2bp and BLAKE2sp support in the following commit, as those algorithms use the last node flag internally and cannot expose it.
On an Intel i9-9880H with AVX2 support, both BLAKE2bp and BLAKE2sp are about 1.75x faster than BLAKE2b. Note that while these algorithms can be implemented with multi-threading, these implementations from blake2b_simd and blake2s_simd are single-threaded, using only SIMD parallelism. The blake2b_simd and blake2s_simd crates don't support salting or personalization for BLAKE2bp and BLAKE2sp, so the `with_params` methods are moved out into blake2b.rs and blake2s.rs.
On x86 targets, SSE4.1 and AVX2 implementations are always compiled. With the `std` feature enabled, runtime CPU feature detection is used to select between them. With `std` disabled (e.g. `--no-default-features`), the only way to activate SIMD is something like `export RUSTFLAGS="-C target-cpu=native"`.
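For readers unfamiliar with the two modes, here is a minimal sketch of the general pattern, assuming the standard library's `is_x86_feature_detected!` macro for the runtime case and `cfg!(target_feature = ...)` for the compile-time case. The `Backend` enum and both function names are hypothetical; the actual dispatch code in the *_simd crates may be organized differently.

```rust
/// Hypothetical label for the implementation a dispatcher would select.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Backend {
    Avx2,
    Sse41,
    Portable,
}

/// Runtime selection. `is_x86_feature_detected!` lives in std, which is
/// why this path is only available when std is.
#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
pub fn detect_backend() -> Backend {
    if is_x86_feature_detected!("avx2") {
        return Backend::Avx2;
    }
    if is_x86_feature_detected!("sse4.1") {
        return Backend::Sse41;
    }
    Backend::Portable
}

/// On other architectures this sketch has nothing to detect.
#[cfg(not(any(target_arch = "x86", target_arch = "x86_64")))]
pub fn detect_backend() -> Backend {
    Backend::Portable
}

/// Compile-time selection, the only option without std. The target
/// features are fixed when the crate is built, for example through
/// RUSTFLAGS="-C target-cpu=native".
pub fn static_backend() -> Backend {
    if cfg!(target_feature = "avx2") {
        Backend::Avx2
    } else if cfg!(target_feature = "sse4.1") {
        Backend::Sse41
    } else {
        Backend::Portable
    }
}
```

The trade-off is that `static_backend` costs nothing at runtime but bakes the choice into the binary, while `detect_backend` lets one binary adapt to whatever CPU it lands on.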
@oconnor663 taking a look at this again, I think I also share @newpavlov's concerns that this is relying on external crates. Would you consider merging your work directly into the |
I've leaned against it in the past, but I could change my mind on that. I guess the best long term outcome would be to deprecate the *_simd crates, and to port their internals into this crate. Maintaining them there vs maintaining them here shouldn't really make much of a difference in practice, as long as we can avoid needing to maintain them in two places. The biggest hurdle for me would be figuring out how to keep something like the More broadly/philosophically, I think we have different preferences for trait methods vs inherent methods. My preference has been to make One minor issue: I decided to split BLAKE2b/BLAKE2s into two repos in the past, mainly because building all the intrinsics implementations for different SIMD instruction sets takes a non-trivial amount of time, and it's nice not to double it. Presumably we'd be combining them here and paying that cost in build times for callers who don't need both. (I think in practice most callers just want BLAKE2b.) That's definitely not the end of the world, but we might want to measure it and make an explicit judgment call. |
Probably. You can always add inherent methods that accept them.
There's not a dichotomy here: you can always have inherent methods with the same name, and have the traits call the inherent methods (we do this quite a bit in the
We also have held off on a 1.0 until const generics land. The goal is definitely to use them ASAP when they become available.
Any reason this can't be solved with cargo features? |
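The "inherent method with the same name, called by the trait impl" pattern mentioned above could look roughly like this. `Update` here is a hypothetical stand-in trait and `ToyHasher` is a toy type that just sums bytes; neither is the real RustCrypto trait or a real hash.

```rust
/// Hypothetical stand-in for a shared hashing trait.
pub trait Update {
    fn update(&mut self, data: &[u8]);
}

/// Toy "hasher" that sums its input bytes. Not a real hash function.
pub struct ToyHasher {
    sum: u64,
}

impl ToyHasher {
    pub fn new() -> Self {
        ToyHasher { sum: 0 }
    }

    /// Inherent method: callable without importing any trait.
    pub fn update(&mut self, data: &[u8]) {
        for &byte in data {
            self.sum = self.sum.wrapping_add(byte as u64);
        }
    }

    pub fn finalize(&self) -> u64 {
        self.sum
    }
}

impl Update for ToyHasher {
    fn update(&mut self, data: &[u8]) {
        // `ToyHasher::update` resolves to the inherent method, because
        // inherent items take priority over trait items in path lookup,
        // so this delegates rather than recursing.
        ToyHasher::update(self, data);
    }
}

/// Generic code sees only the trait method.
pub fn feed<U: Update>(hasher: &mut U, data: &[u8]) {
    hasher.update(data);
}
```

Callers who use the concrete type get `update` with no imports, and generic code written against the trait gets the same behavior through the delegating impl.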
Ah cool, I wasn't sure whether this was something you all were intentionally trying to avoid. Going more extreme in that direction, how would you feel about an incompatible version bump in this crate that wholesale replaces this crate's API with the existing *_simd ones (plus
That's a good point. Maybe Another open question here: the |
Done correctly it shouldn't be "breaking" so long as the types are named the same and the same traits are impl'd. Merging your existing code in and adding the appropriate trait impls sounds good to me. |
Yes. I don't see a problem in exposing additional functionality not covered by the existing traits via inherent methods and additional types.
Personally I dislike duplicating functionality in both inherent and trait methods. And I think it's worth continuing to do so for consistency's sake. Also note that such an approach is used not only by our crates, but also by But I do understand why some would prefer to have inherent methods, though I think a better solution would be to have "inherent traits" in the language. |
Yeah strongly agreed on that one.
I think By the way, there are other little differences in the APIs between our two crates that will come up as we look at this more closely. Another one is that the *_simd crates have a couple of fluent method interfaces, including the

    let hash = Params::new()
        .hash_length(16)
        .to_state()
        .update(b"foo")
        .update(b"bar")
        .update(b"baz")
        .finalize();

I don't think the |
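As a rough illustration of what makes a chain like that compile, here is a toy sketch in which the setters and `update` return `&mut Self`. `ToyParams` and `ToyState` are hypothetical types with a made-up mixing step, not the real `Params` and `State` from the *_simd crates.

```rust
/// Hypothetical parameter builder.
pub struct ToyParams {
    hash_length: usize,
}

impl ToyParams {
    pub fn new() -> Self {
        ToyParams { hash_length: 8 }
    }

    /// Returning `&mut Self` lets setters be chained on a temporary.
    pub fn hash_length(&mut self, length: usize) -> &mut Self {
        self.hash_length = length;
        self
    }

    pub fn to_state(&self) -> ToyState {
        ToyState { hash_length: self.hash_length, acc: 0 }
    }
}

/// Hypothetical incremental state with a toy mixing step (not a hash).
pub struct ToyState {
    hash_length: usize,
    acc: u64,
}

impl ToyState {
    /// Also returns `&mut Self`, so repeated `update` calls chain.
    pub fn update(&mut self, data: &[u8]) -> &mut Self {
        for &byte in data {
            self.acc = self.acc.wrapping_mul(31).wrapping_add(byte as u64);
        }
        self
    }

    /// Returns an owned value, so the temporaries in the chain can be
    /// dropped at the end of the statement.
    pub fn finalize(&self) -> Vec<u8> {
        let bytes = self.acc.to_le_bytes();
        bytes[..self.hash_length.min(8)].to_vec()
    }
}

/// One-expression use of the fluent interface.
pub fn chained() -> Vec<u8> {
    ToyParams::new()
        .hash_length(4)
        .to_state()
        .update(b"foo")
        .update(b"bar")
        .finalize()
}
```

The design choice that matters is `update` returning `&mut Self` rather than `()` or `Self`: the chain works on a temporary, and the same method still works in a loop on a named variable.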
The |
Oh neat! Though it looks like we took different positions on the classic debate of |
The |
Ah yeah, there's another difference. |
It's not just an efficiency problem: we're trying to catch errors like doing
|
Update after a few months: I'm still supportive of merging these crates, but I'm going to be busy with other things for the foreseeable future. If anyone else wants to tackle this, I'm happy to make suggestions, review PRs, and update READMEs as needed. |
@oconnor663 great! I probably won't have time to look myself in the short term, but I think it shouldn't be too difficult to make this transition in an almost purely additive manner, copying over your existing crates and then adding If I do have some time to take a look at this, I'll make a mention on this issue before proceeding. |
Question for both @oconnor663 and @newpavlov: should we preserve the original git history or not? |
After a merger, I'd certainly keep the original history around at https://github.com/oconnor663/blake2_simd, maybe as an "archived" repo or maybe just with some prominent comments and links about how development has moved. So I don't feel strongly about whether the same history is migrated over vs whether everything gets squashed. |
Cool, in that case, it sounds like we could just do a single-commit import of the current sources. |
I started poking at this locally. The first consideration is that RustCrypto owns the I have reached out to the current owner of the Failing that, locally I am working on trying to shoehorn them both into a I'll wait a bit to hear back from the |
Having said all that, I went ahead and tried to do a quick-and-dirty integration: #228 Closing this out in favor of that PR. |
[This is a very large change which I haven't discussed with anyone yet, so I'm
not sure it's the right choice for the project. But hopefully this is a good starting
point for discussion.]
This is mostly a large performance improvement. The BLAKE2b bench_10000
case is improved by about 30%. This implementation also detects SIMD
support at runtime, so the feature flags related to SIMD support are
removed.
The only performance loss is in the bench_10 cases, where the caller
repeatedly feeds input slices less than one block long. The BLAKE2s
bench_10 case is almost 20% slower. I'm not sure exactly why, but this
implementation optimizes for avoiding copies on long runs of input, so
it might just be that it's doing more math up front. This performance
issue disappears if the inputs are a full block or longer.
The only API consequence of this change is that the undocumented
with_parameter_block constructor is no longer supported. Callers who
need other parameters might prefer to use the blake2b_simd/blake2s_simd
APIs directly, which expose them in a safer way through a Params object.