-
Notifications
You must be signed in to change notification settings - Fork 13k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Speed up SipHasher128
.
#68914
Speed up SipHasher128
.
#68914
Conversation
BTW, I'm planning to make the equivalent change to @bors try @rust-timer queue |
Awaiting bors try build completion |
Speed up `SipHasher128`. The current code in `SipHasher128::short_write` is inefficient. It uses `u8to64_le` (which is complex and slow) to extract just the right number of bytes of the input into a u64 and pad the result with zeroes. It then left-shifts that value in order to bitwise-OR it with `self.tail`. For example, imagine we have a u32 input 0xIIHH_GGFF and only need three bytes to fill up `self.tail`. The current code uses `u8to64_le` to construct 0x0000_0000_00HH_GGFF, which is just 0xIIHH_GGFF with the 0xII removed and zero-extended to a u64. The code then left-shifts that value by five bytes -- discarding the 0x00 byte that replaced the 0xII byte! -- to give 0xHHGG_FF00_0000_0000. It then then ORs that value with self.tail. There's a much simpler way to do it: zero-extend to u64 first, then left shift. E.g. 0xIIHH_GGFF is zero-extended to 0x0000_0000_IIHH_GGFF, and then left-shifted to 0xHHGG_FF00_0000_0000. We don't have to take time to exclude the unneeded 0xII byte, because it just gets shifted out anyway! It also avoids multiple occurrences of `unsafe`. There's a similar story with the setting of `self.tail` at the method's end. The current code uses `u8to64_le` to extract the remaining part of the input, but the same effect can be achieved more quickly with a right shift on the zero-extended input. This commit changes `SipHasher128` to use the simpler shift-based approach. The code is also smaller, which means that `short_write` is now inlined where previously it wasn't, which makes things faster again. This gives big speed-ups for all incremental builds, especially "baseline" incremental builds. r? @michaelwoerister
Local results (
It's notable that every benchmark except for |
☀️ Try build successful - checks-azure |
Queued 05cb825 with parent 442ae7f, future comparison URL. |
Thanks for the PR, @nnethercote. Looks like a great find! We are doing lots of hashing Did you think about which implications these changes might have on big endian systems? Hashing needs to be stable across platforms for cross-compilation. The changes are probably fine with respect to this but it's something to look out for. I'll review in detail soon. |
I did think about big-endian. AFAIK, the new code will work fine there. Things are simpler because the code operates mostly on integers, with fewer conversions. But I'm happy to hear a second opinion. Is there a way to know for sure if I'm right about this? Do you know if the tests cover BE, or if we have access to any BE machines for testing? |
We do run tests on some big endian platforms, I think (right, @rust-lang/infra?) In the past we had problems with symbol name mismatches when compiling some things on little-endian and the rest on big-endian, because the symbol hashes didn't match up. But we now know which kinds of bug reports to look out for after a change like this and testing should be better too now. |
https://cfarm.tetaneutral.net/ provides access to some big endian systems. |
@michaelwoerister the CI run tests on ARM, x86 and WASM only, so no big-endian platforms. |
e606fe7
to
a2da54b
Compare
@michaelwoerister: After some thought I see that my original code was not correct for big-endian. Fortunately there is a cheap and easy fix. Consider this 9 byte stream:
On little-endian it is equivalent to On big-endian it is equivalent to The new To make it work for big-endian, we just need to call I have updated the PR to do this. I haven't tested it on a big-endian machine but I'm fairly confident it's correct. But this stuff is tricky to think about so, again, I'm happy to hear second opinions. |
a2da54b
to
7edcdc8
Compare
The current code in `SipHasher128::short_write` is inefficient. It uses `u8to64_le` (which is complex and slow) to extract just the right number of bytes of the input into a u64 and pad the result with zeroes. It then left-shifts that value in order to bitwise-OR it with `self.tail`. For example, imagine we have a u32 input 0xIIHH_GGFF and only need three bytes to fill up `self.tail`. The current code uses `u8to64_le` to construct 0x0000_0000_00HH_GGFF, which is just 0xIIHH_GGFF with the 0xII removed and zero-extended to a u64. The code then left-shifts that value by five bytes -- discarding the 0x00 byte that replaced the 0xII byte! -- to give 0xHHGG_FF00_0000_0000. It then then ORs that value with self.tail. There's a much simpler way to do it: zero-extend to u64 first, then left shift. E.g. 0xIIHH_GGFF is zero-extended to 0x0000_0000_IIHH_GGFF, and then left-shifted to 0xHHGG_FF00_0000_0000. We don't have to take time to exclude the unneeded 0xII byte, because it just gets shifted out anyway! It also avoids multiple occurrences of `unsafe`. There's a similar story with the setting of `self.tail` at the method's end. The current code uses `u8to64_le` to extract the remaining part of the input, but the same effect can be achieved more quickly with a right shift on the zero-extended input. All that works on little-endian. It doesn't work for big-endian, but we can just do a `to_le` before calling `short_write` and then it works. This commit changes `SipHasher128` to use the simpler shift-based approach. The code is also smaller, which means that `short_write` is now inlined where previously it wasn't, which makes things faster again. This gives big speed-ups for all incremental builds, especially "baseline" incremental builds.
7edcdc8
to
f8a0286
Compare
Note that we already call |
OK, I wrote the following test program that compares hash values before and after this PR: https://github.com/michaelwoerister/sip-endian/blob/master/main.rs On a little endian machine everything works as expected. However, when I tried it on a big endian machine (gcc110 from cfarm.tetaneutral.net), I got different values until I removed the The requirement here is that the same sequence of However, the implementation in this PR does not operate on byte slices anymore, so there is no need to do the whole byte-swapping dance. The new So the correct fix, in my opinion, is to remove the |
Also, we should be able to replace the /// Loads up to 7 bytes from a byte-slice into a u64.
#[inline]
fn u8to64_le(buf: &[u8], start: usize, len: usize) -> u64 {
assert!(len <= 8 && start + len <= buf.len());
let mut out = 0u64;
unsafe {
let out_ptr = &mut out as *mut _ as *mut u8;
ptr::copy_nonoverlapping(buf.as_ptr().offset(start as isize), out_ptr, len);
}
#[cfg(target_endian = "big")]
{
// If this is a big endian system we swap bytes, so that the first
// byte ends up in the lowest order byte, like SipHash expects.
out = out.swap_bytes();
}
out
} |
I requested access to the GCC farm on Saturday, but I am still waiting for a response.
Hmm. I was using this code as the basis for my reasoning: Lines 297 to 326 in e6ec0d1
The use of
On little-endian it is equivalent to I was taking this equivalence to be axiomatic (i.e. required). But it makes sense that But should
Thank you for doing this checking. Here's what I was expecting from
Can you write down the values you got for the four combinations? For
because of the extra |
In case it helps, here is what I think should happen in the four
I have confirmed that the two little-endian cases are correct, I haven't been able to confirm the big-endian cases. |
This whole snippet can be simplified to |
@michaelwoerister: I have added some debugging This is the output I get on little-endian:
This is the output I expect on big-endian:
Can you check the big-endian results? |
Here is what I get on the big endian machine:
|
So it looks the same as on little-endian. This is what I expected because the code in question operates on integers, not on byte sequences, i.e. the number
I see, that is interesting. I didn't know that the So I think our options for
I prefer option (1) as it is just simpler.
Yeah, I know. I just find |
The new code operates on integers, so I agree that the new But the old code involves byte sequences in I just looked more closely at your
Yes.
I'd prefer it to handle it the same way...
I prefer option (3). My desires are:
The only way to satisfy all of these is via (3), which the current PR code implements. The downside is extra Does that sound reasonable?
That's interesting. I find lots of things about little-endian/big-endian confusing, but I don't have trouble with |
One more tidbit: the unit tests have a test of rust/src/librustc_data_structures/sip128/tests.rs Lines 401 to 418 in 3f32e30
So the test confirms that the |
This makes it faster and also changes it to a safe function. (Thanks to Michael Woerister for the suggestion.) `load_int_le!` is also no longer necessary.
@michaelwoerister: I have left the previous changes in place, because I think option (3) is the best. I have also added another commit that makes the |
Yes, I'm OK with that.
I think what I find even more confusing about // We have some bytes that encode the number 1 as a 32 bit integer in LE format,
let le_bytes = [1, 0, 0, 0];
// Load the bytes into the `u32` verbatim
let x: u32 = *(&le_bytes as &u32);
// On a big endian machine x is now the number 16777216, so we convert
// to big endian by, obviously, calling `to_le()`
x.to_le() It does the right thing but it spells the opposite of what it does. What I usually really want, and I'm glad that Rust has it since recently, is // We have some bytes that encode the number 1 as a 32 bit integer in LE format,
let le_bytes = [1, 0, 0, 0];
// So much nicer!
let x = u32::from_le_bytes(le_bytes); I would actually prefer implementing @bors r+ |
📌 Commit 9aea154 has been approved by |
💡 This pull request was already approved, no need to approve it again.
|
📌 Commit 9aea154 has been approved by |
…r=michaelwoerister Speed up `SipHasher128`. The current code in `SipHasher128::short_write` is inefficient. It uses `u8to64_le` (which is complex and slow) to extract just the right number of bytes of the input into a u64 and pad the result with zeroes. It then left-shifts that value in order to bitwise-OR it with `self.tail`. For example, imagine we have a u32 input `0xIIHH_GGFF` and only need three bytes to fill up `self.tail`. The current code uses `u8to64_le` to construct `0x0000_0000_00HH_GGFF`, which is just `0xIIHH_GGFF` with the `0xII` removed and zero-extended to a u64. The code then left-shifts that value by five bytes -- discarding the `0x00` byte that replaced the `0xII` byte! -- to give `0xHHGG_FF00_0000_0000`. It then then ORs that value with `self.tail`. There's a much simpler way to do it: zero-extend to u64 first, then left shift. E.g. `0xIIHH_GGFF` is zero-extended to `0x0000_0000_IIHH_GGFF`, and then left-shifted to `0xHHGG_FF00_0000_0000`. We don't have to take time to exclude the unneeded `0xII` byte, because it just gets shifted out anyway! It also avoids multiple occurrences of `unsafe`. There's a similar story with the setting of `self.tail` at the method's end. The current code uses `u8to64_le` to extract the remaining part of the input, but the same effect can be achieved more quickly with a right shift on the zero-extended input. This commit changes `SipHasher128` to use the simpler shift-based approach. The code is also smaller, which means that `short_write` is now inlined where previously it wasn't, which makes things faster again. This gives big speed-ups for all incremental builds, especially "baseline" incremental builds. r? @michaelwoerister
Rollup of 8 pull requests Successful merges: - #67585 (Improve `char::is_ascii_*` codegen) - #68914 (Speed up `SipHasher128`.) - #68994 (rustbuild: include channel in sanitizers installed name) - #69032 (ICE in nightly-2020-02-08: handle TerminatorKind::Yield in librustc_mir::transform::promote_consts::Validator method) - #69034 (parser: Remove `Parser::prev_token_kind`) - #69042 (Remove backtrace header text) - #69059 (Remove a few unused objects) - #69089 (Properly use the darwin archive format on Apple targets) Failed merges: r? @ghost
|
`SipHasher128`'s `u8to64_le` function was simplified in rust-lang#68914. Unfortunately, the new version is slower, because it introduces `memcpy` calls with non-statically-known lengths. This commit reverts the change, and adds an explanatory comment (which is also added to `libcore/hash/sip.rs`). This barely affects `SipHasher128`'s speed because it doesn't use `u8to64_le` much, but it does result in `SipHasher128` once again being consistent with `libcore/hash/sip.rs`.
…lwoerister Revert `u8to64_le` changes from #68914. `SipHasher128`'s `u8to64_le` function was simplified in #68914. Unfortunately, the new version is slower, because it introduces `memcpy` calls with non-statically-known lengths. This commit reverts the change, and adds an explanatory comment (which is also added to `libcore/hash/sip.rs`). This barely affects `SipHasher128`'s speed because it doesn't use `u8to64_le` much, but it does result in `SipHasher128` once again being consistent with `libcore/hash/sip.rs`. r? @michaelwoerister
This commit changes `sip::Hasher` to use the faster `short_write` approach that was used for `SipHasher128` in rust-lang#68914. This has no effect because `sip::Hasher::short_write` is currently unused. See the next commit for more details, and a fix. (One difference with rust-lang#68914 is that this commit doesn't apply the `u8to64_le` change from that PR, because I found it is slower, because it introduces `memcpy` calls with non-statically-known lengths. Therefore, this commit also undoes the `u8to64_le` change in `SipHasher128` for this reason. This doesn't affect `SipHasher128` much because it doesn't use `u8to64_le` much, but I am making the change to keep the two implementations consistent.)
The current code in
SipHasher128::short_write
is inefficient. It usesu8to64_le
(which is complex and slow) to extract just the right number ofbytes of the input into a u64 and pad the result with zeroes. It then
left-shifts that value in order to bitwise-OR it with
self.tail
.For example, imagine we have a u32 input
0xIIHH_GGFF
and only need three bytesto fill up
self.tail
. The current code usesu8to64_le
to construct0x0000_0000_00HH_GGFF
, which is just0xIIHH_GGFF
with the0xII
removed andzero-extended to a u64. The code then left-shifts that value by five bytes --
discarding the
0x00
byte that replaced the0xII
byte! -- to give0xHHGG_FF00_0000_0000
. It then then ORs that value withself.tail
.There's a much simpler way to do it: zero-extend to u64 first, then left shift.
E.g.
0xIIHH_GGFF
is zero-extended to0x0000_0000_IIHH_GGFF
, and thenleft-shifted to
0xHHGG_FF00_0000_0000
. We don't have to take time to excludethe unneeded
0xII
byte, because it just gets shifted out anyway! It also avoidsmultiple occurrences of
unsafe
.There's a similar story with the setting of
self.tail
at the method's end.The current code uses
u8to64_le
to extract the remaining part of the input,but the same effect can be achieved more quickly with a right shift on the
zero-extended input.
This commit changes
SipHasher128
to use the simpler shift-based approach. Thecode is also smaller, which means that
short_write
is now inlined wherepreviously it wasn't, which makes things faster again. This gives big
speed-ups for all incremental builds, especially "baseline" incremental
builds.
r? @michaelwoerister