Fix isize optimization in StableHasher for big-endian architectures #93615

Merged (1 commit) on Feb 5, 2022
10 changes: 7 additions & 3 deletions compiler/rustc_data_structures/src/stable_hasher.rs
@@ -133,18 +133,18 @@ impl Hasher for StableHasher {

     #[inline]
     fn write_isize(&mut self, i: isize) {
-        // Always treat isize as i64 so we get the same results on 32 and 64 bit
+        // Always treat isize as a 64-bit number so we get the same results on 32 and 64 bit
         // platforms. This is important for symbol hashes when cross compiling,
         // for example. Sign extending here is preferable as it means that the
         // same negative number hashes the same on both 32 and 64 bit platforms.
-        let value = (i as i64).to_le() as u64;
+        let value = i as u64;

         // Cold path
         #[cold]
         #[inline(never)]
         fn hash_value(state: &mut SipHasher128, value: u64) {
             state.write_u8(0xFF);
-            state.write_u64(value);
+            state.write_u64(value.to_le());
         }

         // `isize` values often seem to have a small (positive) numeric value in practice.
@@ -161,6 +161,10 @@ impl Hasher for StableHasher {
         // 8 bytes. Since this prefix cannot occur when we hash a single byte, when we hash two
         // `isize`s that fit within a different amount of bytes, they should always produce a different
         // byte stream for the hasher.
+        //
+        // To ensure that this optimization hashes the exact same bytes on both little-endian and
+        // big-endian architectures, we compare the value with 0xFF before we convert the number
+        // into a unified representation (little-endian).
Member

The comment is correct, but I think we could put more emphasis on the fact that the endianness conversion must be the last step, because it creates platform-dependent values in order to get platform-independent bytes.

It would be clearer if siphasher::write were generic over [u8; N] instead of taking different primitives. Oh well.

Contributor Author

Well, to be fair, it contains an optimized implementation for these primitives, so it's probably worth it.
Should I add something like

> First, we have to compare the value (which has to be done in a platform-dependent manner) and only then can we convert the number to the little-endian format (to ensure platform-independent bytes being hashed).

?

Member

> (which has to be done in a platform-dependent manner)

That's probably confusing. We're going from [platform-dependent byte-representation, platform-independent value] to [platform-independent byte-representation, platform-dependent value]. Which means all operations that depend on the value must happen before that and afterwards we could only do bit-twiddling operations.
It would be more obvious if we used to_le_bytes.

I don't mean to explain endianness, it's just about which things must be done before and after the conversion. That's what I didn't consider during the review. 😓
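
The ordering issue discussed above can be sketched outside the compiler. The following is a minimal illustration, not the rustc code: both function names are hypothetical, and `swap_bytes()` is used to emulate what `to_le()` does on a big-endian host so the effect is visible on any machine.

```rust
/// Hypothetical helper: the bytes fed to the hasher when the comparison is
/// done on the numeric value and the byte order is fixed only afterwards
/// (the order the fix uses).
fn encode_compare_first(value: u64) -> Vec<u8> {
    if value < 0xFF {
        vec![value as u8]
    } else {
        let mut out = vec![0xFF];
        out.extend_from_slice(&value.to_le_bytes());
        out
    }
}

/// Hypothetical helper: what a big-endian host would feed to the hasher if it
/// called `to_le()` *before* the comparison. `swap_bytes()` stands in for
/// `to_le()` on such a host, and `to_be_bytes()` for its native memory layout.
fn encode_convert_first_on_big_endian(value: u64) -> Vec<u8> {
    let converted = value.swap_bytes();
    if converted < 0xFF {
        // The comparison now looks at a byte-swapped number.
        vec![converted as u8]
    } else {
        let mut out = vec![0xFF];
        out.extend_from_slice(&converted.to_be_bytes());
        out
    }
}
```

For the value `1`, the first function yields the single byte `[0x01]`, while the emulated big-endian path yields `0xFF` followed by eight bytes, because the swapped number is no longer smaller than `0xFF`. The two platforms would therefore hash different byte streams for the same `isize`.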

Member

I find .to_le() and .to_be() to be really confusing and always use to_le_bytes() and to_be_bytes() instead, which makes it much less likely to get things accidentally wrong (by converting twice for example).

Now that we have const generics it would probably be easy to just change SipHasher128::short_write() to SipHasher128::short_write<const LEN: usize>(&mut self, bytes: &[u8; LEN]).
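
A rough sketch of that idea, under stated assumptions: `ByteSink` is a hypothetical stand-in that only records bytes (it is not `SipHasher128`), and the `short_write` signature mirrors the one proposed in the comment rather than any existing API. Because the sink accepts only byte arrays, the byte order has to be chosen explicitly with `to_le_bytes()` at the call site.

```rust
/// Hypothetical stand-in for a hasher state: it only records the bytes it is fed.
#[derive(Default)]
struct ByteSink {
    bytes: Vec<u8>,
}

impl ByteSink {
    /// Sketch of a const-generic `short_write` taking a fixed-size byte array.
    fn short_write<const LEN: usize>(&mut self, bytes: &[u8; LEN]) {
        self.bytes.extend_from_slice(bytes);
    }
}

/// The small-value compression from the diff, expressed with `to_le_bytes()`.
fn write_isize(sink: &mut ByteSink, i: isize) {
    // Sign-extend to 64 bits; every comparison below uses the numeric value.
    let value = i as u64;
    if value < 0xFF {
        sink.short_write(&[value as u8]);
    } else {
        // Marker byte, then the full value in little-endian byte order.
        sink.short_write(&[0xFF]);
        sink.short_write(&value.to_le_bytes());
    }
}

/// Convenience wrapper returning the bytes produced for a single `isize`.
fn encoded(i: isize) -> Vec<u8> {
    let mut sink = ByteSink::default();
    write_isize(&mut sink, i);
    sink.bytes
}
```

With this shape there is no intermediate "converted" integer to compare against by accident: `encoded(5)` is `[0x05]`, and `encoded(-1)` is `0xFF` followed by eight `0xFF` bytes on any host.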

Member

That sounds like a good plan. Since this is a portability bug let's fix it first and then improve the design.

         if value < 0xFF {
             self.state.write_u8(value as u8);
         } else {
1 change: 1 addition & 0 deletions compiler/rustc_data_structures/src/stable_hasher/tests.rs
@@ -159,4 +159,5 @@ fn test_isize_compression() {
     check_hash(0xAAAA, 0xAAAAAA);
     check_hash(0xAAAAAA, 0xAAAAAAAA);
     check_hash(0xFF, 0xFFFFFFFFFFFFFFFF);
+    check_hash(u64::MAX /* -1 */, 1);
 }
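
The property these test cases rely on, that values of different encoded lengths never produce the same byte stream, can be illustrated with a small encoder and decoder. This is a sketch under assumptions: `encode_stream` and `decode_stream` are hypothetical names, `i64` replaces `isize` to keep the example platform-independent, and the real test compares hashes rather than raw bytes.

```rust
/// Hypothetical encoder: one byte for values below 0xFF, otherwise a 0xFF
/// marker followed by the 64-bit value in little-endian byte order.
fn encode_stream(values: &[i64]) -> Vec<u8> {
    let mut out = Vec::new();
    for &v in values {
        let value = v as u64;
        if value < 0xFF {
            out.push(value as u8);
        } else {
            out.push(0xFF);
            out.extend_from_slice(&value.to_le_bytes());
        }
    }
    out
}

/// Hypothetical decoder: a leading 0xFF always means "eight more bytes follow",
/// so the stream can be split back into values without ambiguity.
fn decode_stream(bytes: &[u8]) -> Option<Vec<i64>> {
    let mut values = Vec::new();
    let mut rest = bytes;
    while let Some((&first, tail)) = rest.split_first() {
        if first < 0xFF {
            values.push(first as i64);
            rest = tail;
        } else {
            if tail.len() < 8 {
                return None; // truncated stream
            }
            let (word, after) = tail.split_at(8);
            let mut buf = [0u8; 8];
            buf.copy_from_slice(word);
            values.push(u64::from_le_bytes(buf) as i64);
            rest = after;
        }
    }
    Some(values)
}
```

Since a single-byte value can never be `0xFF`, decoding recovers the original sequence, so `[-1, 1]` and `[1, -1]` necessarily map to different byte streams.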