
Reduce the number of bytes hashed by IchHasher. #37427

Merged: 2 commits merged into rust-lang:master from nnethercote:opt-IchHasher, Nov 5, 2016

Conversation

nnethercote (Contributor)

IchHasher uses blake2b hashing, which is expensive, so the fewer bytes hashed
the better. There are two big ways to reduce the number of bytes hashed.

  • Filenames in spans account for ~66% of all bytes (for builds with debuginfo).
    The vast majority of spans have the same filename for the start of the span
    and the end of the span, so hashing the filename just once in those cases is
    a big win.
  • u32, u64, and usize values account for ~25--33% of all bytes (for builds
    with debuginfo). The vast majority of these are small, i.e. fit in a u8, so
    shrinking them down before hashing is also a big win.

This PR implements these two optimizations. I'm certain the first one is safe.
I'm about 90% sure that the second one is safe.
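
To make the first optimization concrete, here is a minimal sketch. `Span`, `hash_bytes`, and the one-byte tag are illustrative stand-ins, not the PR's actual code; the tag keeps the one-filename and two-filename cases from ever producing the same byte stream.

```rust
// Hypothetical sketch of hashing a span's filename(s) only as needed.
struct Span { lo_file: String, hi_file: String, lo: u32, hi: u32 }

fn hash_span(span: &Span, hash_bytes: &mut impl FnMut(&[u8])) {
    if span.lo_file == span.hi_file {
        hash_bytes(&[0]);                    // tag: one filename follows
        hash_bytes(span.lo_file.as_bytes()); // common case: hash it once
    } else {
        hash_bytes(&[1]);                    // tag: two filenames follow
        hash_bytes(span.lo_file.as_bytes());
        hash_bytes(span.hi_file.as_bytes()); // rare: span crosses files
    }
    hash_bytes(&span.lo.to_le_bytes());
    hash_bytes(&span.hi.to_le_bytes());
}
```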

Here are measurements of the number of bytes hashed when doing
debuginfo-enabled builds of stdlib and
rustc-benchmarks/syntex-0.42.2-incr-clean.

```
                    stdlib   syntex-incr
                    ------   -----------
original       156,781,386   255,095,596
half-SawSpan   106,744,403   176,345,419
short-ints      45,890,534   118,014,227
no-SawSpan[*]    6,831,874    45,875,714

[*] don't hash the SawSpan at all. Not part of this PR, just implemented for
    comparison's sake.
```

For debug builds of syntex-0.42.2-incr-clean, the two changes give a 1--2%
speed-up.

@rust-highfive (Collaborator)

r? @eddyb

(rust_highfive has picked a reviewer for you, use r? to override)

@eddyb (Member) commented Oct 27, 2016

r? @michaelwoerister

@michaelwoerister (Member)

The first optimization looks good, I'm happy to approve that.
I don't think the second one is safe, e.g. &[0x1234, 0x56] will have the same hash as &[0x12, 0x3456], which we don't want here. You could try encoding integers with leb128 (there's an implementation somewhere in the compiler) and see if that is faster than hashing them directly.
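
For reference, a minimal sketch of unsigned LEB128 encoding (the in-tree implementation mentioned here lives in libserialize; this version is purely illustrative). The high bit of each byte flags a continuation, so every value's encoding is self-delimiting:

```rust
// Illustrative uleb128 encoder, not the compiler's implementation.
fn write_uleb128(out: &mut Vec<u8>, mut value: u64) {
    loop {
        let mut byte = (value & 0x7f) as u8; // low seven bits
        value >>= 7;
        if value != 0 {
            byte |= 0x80; // continuation bit: more bytes follow
        }
        out.push(byte);
        if value == 0 {
            break;
        }
    }
}
```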

@nnethercote (Contributor, Author)

I've changed the second commit to leb128-encode integers.

@michaelwoerister (Member)

> I've changed the second commit to leb128-encode integers.

Do you have numbers on the performance impact of that?

@nnethercote (Contributor, Author)

leb128 made very little difference to the performance. The number of bytes hashed increased slightly (a few percent) compared to the "just truncate" version, mostly because 255 is a fairly common value and it leb128-encodes to two bytes; I think it's used by Hasher to indicate the end of some types, or something like that?

The speed difference was negligible -- leb128 encoding is much cheaper than blake2b hashing :)
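
Continuing the illustrative encoder sketch above, the boundary behaviour described here is easy to check: values up to 127 take one byte, while 255 needs a continuation byte.

```rust
// Sketch only: demonstrates the one-byte/two-byte uleb128 boundary.
fn main() {
    let mut buf = Vec::new();
    write_uleb128(&mut buf, 127);
    assert_eq!(buf, [0x7f]);       // 0..=127 fit in a single byte
    buf.clear();
    write_uleb128(&mut buf, 255);
    assert_eq!(buf, [0xff, 0x01]); // 255 spills into a second byte
}
```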

Review comment on the diff at `impl Hasher for IchHasher` (hunk `@@ -50,23 +73,28 @@`):

```rust
#[inline]
fn write_u8(&mut self, i: u8) {
    self.write_uleb128(i as u64);
}
```
@michaelwoerister (Member) commented Oct 29, 2016

u8 doesn't need to use leb128. There's nothing to be gained here.

@nnethercote (Contributor, Author)

I think it does need to use leb128. If you have a u16 value like 256 it gets encoded as [128,2]. That would be indistinguishable from two u8 values 128 and 2.

@arielb1 (Contributor) commented Oct 31, 2016

But the encoding is typeful - a u8 can never appear in the same place as a u16. A sufficient criterion is that the encoding for each type is prefix-free.

@michaelwoerister (Member)

For a hash like this to be correct, it has to be unambiguous. My personal benchmark for this, the one I find most intuitive, is: If we were serializing this data instead of hashing it, would we be able to correctly deserialize it again? In this case yes, because we always know whether we are reading a u8 or a u16 or something else next. A problem only arises if we have a bunch of, say, u16s in a row but they are encoded with an ambiguous variable length encoding.
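
The "could we deserialize it?" test can be made concrete with a matching decoder sketch: reading a value back consumes exactly the bytes its encoding produced, so a run of same-typed integers splits unambiguously. Again illustrative, not the compiler's code:

```rust
// Illustrative uleb128 decoder; returns the value and the bytes consumed.
fn read_uleb128(buf: &[u8]) -> (u64, usize) {
    let mut value = 0u64;
    let mut shift = 0u32;
    let mut pos = 0usize;
    loop {
        let byte = buf[pos];
        pos += 1;
        value |= u64::from(byte & 0x7f) << shift;
        if byte & 0x80 == 0 {
            return (value, pos); // continuation bit clear: value complete
        }
        shift += 7;
    }
}
```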

@michaelwoerister (Member)

It's a bit unfortunate that we have to duplicate the leb128 implementation. Can you factor things so that we can have a unit test for it (e.g. by making write_uleb128 a free-standing function that writes its output to a &mut [u8])? These hashes are critical for correctness, so I'd like them to be well tested.
Alternatively, you could add a Vec<u8> to the hasher struct and re-use that to avoid allocations.
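
A round-trip unit test along the lines suggested here might look like the following, assuming the free-standing write_uleb128/read_uleb128 sketches above (names and signatures are illustrative, not the PR's actual test):

```rust
// Sketch of a round-trip test for the illustrative uleb128 helpers.
#[test]
fn uleb128_round_trip() {
    let mut buf = Vec::new();
    for &v in &[0u64, 1, 127, 128, 255, 0x1234, u64::MAX] {
        buf.clear();
        write_uleb128(&mut buf, v);
        let (decoded, consumed) = read_uleb128(&buf);
        assert_eq!((decoded, consumed), (v, buf.len()));
    }
}
```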

@bors (Contributor) commented Oct 31, 2016

☔ The latest upstream changes (presumably #37439) made this pull request unmergeable. Please resolve the merge conflicts.

This significantly reduces the number of bytes hashed by IchHasher.
@nnethercote (Contributor, Author)

I redid the second commit so it uses the leb128 encoding from libserialize.

@michaelwoerister (Member) left a comment

Looks good to me! Happy to merge it after you rename the bytes field and remove the comment about usize.


Review comment on the diff at `fn write_usize`:

```rust
#[inline]
fn write_usize(&mut self, i: usize) {
    // always hash as u64, so we don't depend on the size of `usize`
    // ...
```
@michaelwoerister (Member)

Since we are encoding as leb128 before hashing, it should not make a difference whether the number was a usize or a u64 before; the leb128 representation of both will be identical.


Review comment on the diff at `IchHasher`:

```rust
#[derive(Debug)]
pub struct IchHasher {
    state: ArchIndependentHasher<Blake2bHasher>,
    bytes: Vec<u8>,
}
```
@michaelwoerister (Member)

Can you rename this to something like leb128_helper? Otherwise one might think that it contains the bytes that were hashed so far.

This significantly reduces the number of bytes hashed by IchHasher.
@nnethercote (Contributor, Author)

Updated to address comments.

@michaelwoerister (Member)

@bors r+

Thanks, @nnethercote!

@bors (Contributor) commented Nov 2, 2016

📌 Commit d73c68c has been approved by michaelwoerister

@bors (Contributor) commented Nov 3, 2016

⌛ Testing commit d73c68c with merge 69c4cfb...

@bors (Contributor) commented Nov 3, 2016

💔 Test failed - auto-mac-64-opt-rustbuild

@michaelwoerister (Member)

@bors retry

@bors (Contributor) commented Nov 3, 2016

⌛ Testing commit d73c68c with merge e67a9ba...

@bors (Contributor) commented Nov 3, 2016

💔 Test failed - auto-mac-64-opt-rustbuild

alexcrichton added a commit to alexcrichton/rust that referenced this pull request Nov 4, 2016: Reduce the number of bytes hashed by IchHasher.
@bors (Contributor) commented Nov 5, 2016

⌛ Testing commit d73c68c with merge aa90c30...

@bors (Contributor) commented Nov 5, 2016

💔 Test failed - auto-win-msvc-64-opt-rustbuild

@alexcrichton (Member)

@bors: retry


@bors (Contributor) commented Nov 5, 2016

⌛ Testing commit d73c68c with merge 149d86a...

@bors (Contributor) commented Nov 5, 2016

💔 Test failed - auto-mac-64-opt-rustbuild

bors added a commit that referenced this pull request Nov 5, 2016
@alexcrichton (Member)

@bors: retry


@bors (Contributor) commented Nov 5, 2016

⌛ Testing commit d73c68c with merge 0883996...

bors added a commit that referenced this pull request Nov 5, 2016: Reduce the number of bytes hashed by IchHasher.
bors merged commit d73c68c into rust-lang:master Nov 5, 2016
alexcrichton added a commit to alexcrichton/rust that referenced this pull request Nov 5, 2016: Reduce the number of bytes hashed by IchHasher.
bors added a commit that referenced this pull request Nov 6, 2016
nnethercote deleted the opt-IchHasher branch November 6, 2016