Auto merge of #93432 - Kobzol:stable-hash-isize-hash-compression, r=t…

…he8472 Compress amount of hashed bytes for `isize` values in StableHasher This is another attempt to land #92103, this time hopefully with a correct implementation w.r.t. stable hashing guarantees. The previous PR was [reverted](#93014) because it could produce the [same hash](#92103 (comment)) for different values even in quite simple situations. I have since added a basic [test](#93193) that should guard against that situation, I also added a new test in this PR, specialised for this optimization. ## Why this optimization helps Since the original PR, I have tried to analyze why this optimization even helps (and why it especially helps for `clap`). I found that the vast majority of stable-hashing `i64` actually comes from hashing `isize` (which is converted to `i64` in the stable hasher). I only found a single place where is this datatype used directly in the compiler, and this place has also been showing up in traces that I used to find out when is `isize` being hashed. This place is `rustc_span::FileName::DocTest`, however, I suppose that isizes also come from other places, but they might not be so easy to find (there were some other entries in the trace). `clap` hashes about 8.5 million `isize`s, and all of them fit into a single byte, which is why this optimization has helped it [quite a lot](#92103 (comment)). Now, I'm not sure if special casing `isize` is the correct solution here, maybe something could be done with that `isize` inside `DocTest` or in other places, but that's for another discussion I suppose. In this PR, instead of hardcoding a special case inside `SipHasher128`, I instead put it into `StableHasher`, and only used it for `isize` (I tested that for `i64` it doesn't help, or at least not for `clap` and other few benchmarks that I was testing). ## New approach Since the most common case is a single byte, I added a fast path for hashing `isize` values which positive value fits within a single byte, and a cold path for the rest of the values. To avoid the previous correctness problem, we need to make sure that each unique `isize` value will produce a unique hash stream to the hasher. By hash stream I mean a sequence of bytes that will be hashed (a different sequence should produce a different hash, but that is of course not guaranteed). We have to distinguish different values that produce the same bit pattern when we combine them. For example, if we just simply skipped the leading zero bytes for values that fit within a single byte, `(0xFF, 0xFFFFFFFFFFFFFFFF)` and `(0xFFFFFFFFFFFFFFFF, 0xFF)` would send the same hash stream to the hasher, which must not happen. To avoid this situation, values `[0, 0xFE]` are hashed as a single byte. When we hash a larger (treating `isize` as `u64`) value, we first hash an additional byte `0xFF`. Since `0xFF` cannot occur when we apply the single byte optimization, we guarantee that the hash streams will be unique when hashing two values `(a, b)` and `(b, a)` if `a != b`: 1) When both `a` and `b` are within `[0, 0xFE]`, their hash streams will be different. 2) When neither `a` and `b` are within `[0, 0xFE]`, their hash streams will be different. 3) When `a` is within `[0, 0xFE]` and `b` isn't, when we hash `(a, b)`, the hash stream will definitely not begin with `0xFF`. When we hash `(b, a)`, the hash stream will definitely begin with `0xFF`. Therefore the hash streams will be different. r? `@the8472`
rust-lang · Feb 3, 2022 · 1be5c8f · 1be5c8f
2 parents b380086 + 8de59be
commit 1be5c8f
Show file tree

Hide file tree

Showing 5 changed files with 57 additions and 9 deletions.
diff --git a/compiler/rustc_data_structures/src/stable_hasher.rs b/compiler/rustc_data_structures/src/stable_hasher.rs
@@ -137,7 +137,35 @@ impl Hasher for StableHasher {
         // platforms. This is important for symbol hashes when cross compiling,
         // for example. Sign extending here is preferable as it means that the
         // same negative number hashes the same on both 32 and 64 bit platforms.
-        self.state.write_i64((i as i64).to_le());
+        let value = (i as i64).to_le() as u64;
+
+        // Cold path
+        #[cold]
+        #[inline(never)]
+        fn hash_value(state: &mut SipHasher128, value: u64) {
+            state.write_u8(0xFF);
+            state.write_u64(value);
+        }
+
+        // `isize` values often seem to have a small (positive) numeric value in practice.
+        // To exploit this, if the value is small, we will hash a smaller amount of bytes.
+        // However, we cannot just skip the leading zero bytes, as that would produce the same hash
+        // e.g. if you hash two values that have the same bit pattern when they are swapped.
+        // See https://github.com/rust-lang/rust/pull/93014 for context.
+        //
+        // Therefore, we employ the following strategy:
+        // 1) When we encounter a value that fits within a single byte (the most common case), we
+        // hash just that byte. This is the most common case that is being optimized. However, we do
+        // not do this for the value 0xFF, as that is a reserved prefix (a bit like in UTF-8).
+        // 2) When we encounter a larger value, we hash a "marker" 0xFF and then the corresponding
+        // 8 bytes. Since this prefix cannot occur when we hash a single byte, when we hash two
+        // `isize`s that fit within a different amount of bytes, they should always produce a different
+        // byte stream for the hasher.
+        if value < 0xFF {
+            self.state.write_u8(value as u8);
+        } else {
+            hash_value(&mut self.state, value);
+        }
     }
 }
 

diff --git a/compiler/rustc_data_structures/src/stable_hasher/tests.rs b/compiler/rustc_data_structures/src/stable_hasher/tests.rs
@@ -39,7 +39,7 @@ fn test_hash_integers() {
     test_isize.hash(&mut h);
 
     // This depends on the hashing algorithm. See note at top of file.
-    let expected = (2736651863462566372, 8121090595289675650);
+    let expected = (1784307454142909076, 11471672289340283879);
 
     assert_eq!(h.finalize(), expected);
 }
@@ -67,7 +67,7 @@ fn test_hash_isize() {
     test_isize.hash(&mut h);
 
     // This depends on the hashing algorithm. See note at top of file.
-    let expected = (14721296605626097289, 11385941877786388409);
+    let expected = (2789913510339652884, 674280939192711005);
 
     assert_eq!(h.finalize(), expected);
 }
@@ -140,3 +140,23 @@ fn test_attribute_permutation() {
     test_type!(i64);
     test_type!(i128);
 }
+
+// Check that the `isize` hashing optimization does not produce the same hash when permuting two
+// values.
+#[test]
+fn test_isize_compression() {
+    fn check_hash(a: u64, b: u64) {
+        let hash_a = hash(&(a as isize, b as isize));
+        let hash_b = hash(&(b as isize, a as isize));
+        assert_ne!(
+            hash_a, hash_b,
+            "The hash stayed the same when permuting values `{a}` and `{b}!",
+        );
+    }
+
+    check_hash(0xAA, 0xAAAA);
+    check_hash(0xFF, 0xFFFF);
+    check_hash(0xAAAA, 0xAAAAAA);
+    check_hash(0xAAAAAA, 0xAAAAAAAA);
+    check_hash(0xFF, 0xFFFFFFFFFFFFFFFF);
+}
diff --git a/src/test/debuginfo/function-names.rs b/src/test/debuginfo/function-names.rs
@@ -37,7 +37,7 @@
 // Const generic parameter
 // gdb-command:info functions -q function_names::const_generic_fn.*
 // gdb-check:[...]static fn function_names::const_generic_fn_bool<false>();
-// gdb-check:[...]static fn function_names::const_generic_fn_non_int<{CONST#fe3cfa0214ac55c7}>();
+// gdb-check:[...]static fn function_names::const_generic_fn_non_int<{CONST#3fcd7c34c1555be6}>();
 // gdb-check:[...]static fn function_names::const_generic_fn_signed_int<-7>();
 // gdb-check:[...]static fn function_names::const_generic_fn_unsigned_int<14>();
 
@@ -76,7 +76,7 @@
 // Const generic parameter
 // cdb-command:x a!function_names::const_generic_fn*
 // cdb-check:[...] a!function_names::const_generic_fn_bool<false> (void)
-// cdb-check:[...] a!function_names::const_generic_fn_non_int<CONST$fe3cfa0214ac55c7> (void)
+// cdb-check:[...] a!function_names::const_generic_fn_non_int<CONST$3fcd7c34c1555be6> (void)
 // cdb-check:[...] a!function_names::const_generic_fn_unsigned_int<14> (void)
 // cdb-check:[...] a!function_names::const_generic_fn_signed_int<-7> (void)
 

diff --git a/src/test/ui/symbol-names/basic.legacy.stderr b/src/test/ui/symbol-names/basic.legacy.stderr
@@ -1,10 +1,10 @@
-error: symbol-name(_ZN5basic4main17h7c2c715a9b77648bE)
+error: symbol-name(_ZN5basic4main17h611df9c6948c15f7E)
   --> $DIR/basic.rs:8:1
    |
 LL | #[rustc_symbol_name]
    | ^^^^^^^^^^^^^^^^^^^^
 
-error: demangling(basic::main::h7c2c715a9b77648b)
+error: demangling(basic::main::h611df9c6948c15f7)
   --> $DIR/basic.rs:8:1
    |
 LL | #[rustc_symbol_name]

diff --git a/src/test/ui/symbol-names/issue-60925.legacy.stderr b/src/test/ui/symbol-names/issue-60925.legacy.stderr
@@ -1,10 +1,10 @@
-error: symbol-name(_ZN11issue_609253foo37Foo$LT$issue_60925..llv$u6d$..Foo$GT$3foo17h419983d0842a72aeE)
+error: symbol-name(_ZN11issue_609253foo37Foo$LT$issue_60925..llv$u6d$..Foo$GT$3foo17h5425dadb5b1e5fb6E)
   --> $DIR/issue-60925.rs:21:9
    |
 LL |         #[rustc_symbol_name]
    |         ^^^^^^^^^^^^^^^^^^^^
 
-error: demangling(issue_60925::foo::Foo<issue_60925::llvm::Foo>::foo::h419983d0842a72ae)
+error: demangling(issue_60925::foo::Foo<issue_60925::llvm::Foo>::foo::h5425dadb5b1e5fb6)
   --> $DIR/issue-60925.rs:21:9
    |
 LL |         #[rustc_symbol_name]