
Seed HashMaps thread-locally, straight from the OS. #31356

Closed

Conversation

@huonw (Member) commented Feb 2, 2016

This reduces how much rand is required in std, and makes creating
hash-maps noticeably faster. The first invocation of HashMap::new() in a
thread goes from ~130_000 ns to ~1500 ns (i.e. ~86x faster) and later
invocations (non-reseeding ones) go from 10-20 ns to a more consistent
1.8-2.0 ns.

The mean for many invocations in a loop, on a single thread, goes from
78ns to ~1.9ns, and the standard deviation drops from 2800ns to
33ns (the massive variance of the old scheme comes from the occasional
reseeding of the thread rng).

These new numbers are only slightly higher than creating the
RandomState outside the loop and calling with_state in it (i.e.
only measuring the non-random parts of hashmap initialisation): 1.3 and
18 ns respectively.

This new scheme has the slight downside of being consistent on a single
thread, so users may unintentionally rely on the order being
fixed (e.g. two hashmaps with the same contents).

Closes #27243.

@rust-highfive (Collaborator) commented

r? @brson

(rust_highfive has picked a reviewer for you, use r? to override)

@huonw (Member, Author) commented Feb 2, 2016

cc @rust-lang/libs

This reduces how much rand is required in std

I only had a few minutes to write this, and wanted to get the conversation started if there is one to have. The next step will be a patch that moves the rest of the private std::rand out of std (there are some fiddly bits that may take a few compile-edit cycles, since various tests use it, which is why I haven't done it here).

@brson (Contributor) commented Feb 2, 2016

This new scheme has the slight downside of being consistent on a single
thread

Can you expand on what this means and why it is? I don't see why changing the source of randomness would affect the behavior.

let mut bytes = [0_u8; 16];
internal_rand::fill_bytes(&mut bytes);

let keys: (u64, u64) = unsafe { mem::transmute(bytes) };
Contributor:

Weird transmute. Is this really ok? It seems like the alignment of the u8 array is probably less than that of the u64 tuple.

Contributor:

Agreed, should start with keys, and transmute keys to bytes before passing them to internal_rand.

Member (author):

A transmute like this should work fine: it's reinterpreting the value, not the memory location. That is, let out = transmute::<A, B>(x) is/should be equivalent to (in C)

B out;
memcpy(&out, &x, sizeof(B));

It would definitely be problematic if it was &[u8; 16] to &(u64, u64).

(If the above is not the case, it seems to me like it is an unnecessary footgun.)

Contributor:

(Do we guarantee tuple layout?)

Contributor:

No, but there's no way for that to matter here. If the transmute succeeds then you've loaded up 128 bits with randomness, and that's all that matters.

I am intrigued by the by-val transmute issue... I don't think I see a lot of transmuting of non-pointers... I have no idea how it should behave!

Member (author):

Adding padding will make the tuple larger, and hence cause the transmute to fail. Also, note that this is in the standard library, and hence can rely on current behaviours of the compiler.

Member:

@gankro You shouldn't see many transmutes of pointers either, since those are covered by regular casts (and &* to create a reference).

Member:

Maybe a more natural way to write this is to create a [u64; 2] and take a slice of it as &mut [u8] (its byte representation) and pass it to fill_bytes?
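
For illustration, a rough sketch of that alternative: fill_bytes mirrors the helper in the diff above, but the rest (seed_keys, the stub body) is hypothetical, and the unsafe byte view is presumably what makes this fiddlier than the transmute:

use std::slice;

// Hypothetical stand-in for the OS-randomness call in this PR.
fn fill_bytes(_bytes: &mut [u8]) { /* would read from the OS here */ }

fn seed_keys() -> (u64, u64) {
    let mut keys = [0_u64; 2];
    // View the two u64s as 16 bytes; starting from the more-aligned type
    // and narrowing keeps the alignment question out of the picture.
    let bytes = unsafe {
        slice::from_raw_parts_mut(keys.as_mut_ptr() as *mut u8, 16)
    };
    fill_bytes(bytes);
    (keys[0], keys[1])
}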

Contributor:

That is, let out = transmute::<A, B>(x) is/should be equivalent to (in C) ... (memcpy)

should be, but it seems to me LLVM could decide there are alignment-dependent instructions that would work better. Just handwaving.

Member (author):

Maybe a more natural way to write this is to create a [u64; 2] and take a slice of it as &mut [u8] (its byte representation) and pass it to fill_bytes?

I did this originally, but it is significantly more fiddly.

llvm could decide there are alignment-dependent instructions that would work better. just handwaving

I think that would mean that LLVM is explicitly disobeying our instructions and hence would be a miscompilation. (I had a look at the IR, and it does compile down to a memcpy.)

@brson added the relnotes label (marks issues that should be documented in the release notes of the next release) on Feb 2, 2016
@Gankra (Contributor) commented Feb 2, 2016

@brson today HashMap::new is like:

get_random_seed_from_os()

with the PR it's like

rngs.entry(thread_id).or_insert_with(get_random_seed_from_os)

Today if you make two HashMaps and perform identical operations on them, you'll get different iteration orders. With this patch, they will iterate identically. Historically, programmers have been all too happy to latch onto any perceived determinism in HashMap iteration order. This is why JavaScript now specifies that an object is semantically a LinkedHashMap: people relied on this fact, and when it didn't hold, drop-downs got shuffled.

Now, this is less of a risk for Rust, because the JS case is relying on individual maps being really reliable, and this doesn't hold today or with this patch. However, one could imagine something to the effect of:

fn test() {
    let mut results: Vec<Vec<u8>> = vec![];
    for vec in get_data_sets() {
        let mut map = HashMap::new();
        for k in vec {
            map.insert(k, compute(k));
        }
        results.push(map.values().cloned().collect());
    }
    assert!(results[0] == results[1]);
}

Today this would be hopelessly broken, but with this patch it will "happen to work" if the input/output sequences line up. This is one of those situations where we can wag our finger and tell people not to do it, but at the end of the day it's externally observable, so it can and will be leveraged.

Note that the following will still be hopelessly and obviously broken:

fn test() {
  // I got this when I ran it the first time, shouldn't change!
  let expected = vec![1, 3, 5, 6, 11, 13];
  let mut map = HashMap::new();
  for k in get_test_input() {
    map.insert(k, compute(k));
  }
  let result: Vec<_> = map.values().cloned().collect();
  assert_eq!(result, expected);
}

Which is the thing I would most worry about.


pub use self::imp::fill_bytes;

#[cfg(all(unix, not(target_os = "ios"), not(target_os = "openbsd")))]
Member:

Would it be possible to deduplicate this with the existing randomness support? At least in terms of an internal implementation detail it'd be good to not have to keep this in sync in two places.

Member (author):

As I mention above, I was/am going to remove the existing support from std since we only need what's in this file now. However, the rand crate will still need to exist for internal testing purposes; should it still be dedup'd with that? (It's definitely possible, it just means slightly more boilerplate in here.)

Member:

Ah sorry for letting this slip through the cracks, but yeah, that seems fine to me!

@alexcrichton (Member) commented

Wow those are some impressive numbers! In the past I don't remember them being quite so dramatic...

If we're moving to thread-locals I wouldn't mind moving to full-blown process-globals. We've concluded in the past that there's no need in terms of DOS protection to have per-hashmap or even per-thread keys, a global one should be good enough.

@huonw (Member, Author) commented Feb 2, 2016

Can you expand on what this means and why it is? I don't see why changing the source of randomness would affect the behavior.

Currently each hashmap has different keys since they retrieve new ones from the thread local on creation. This patch changes those u64s to be generated once and cached per-thread so all maps created on a single thread have the same keys.
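
Roughly, the per-thread caching described in the previous paragraph amounts to something like this sketch (fill_bytes and hashmap_keys are illustrative names, and from_ne_bytes stands in for the transmute in the diff; it is not the actual patch):

// Hypothetical stand-in for the OS-randomness call in this PR.
fn fill_bytes(_bytes: &mut [u8]) { /* would read from the OS here */ }

thread_local! {
    // Evaluated once per thread, on first use: a single OS read then seeds
    // every HashMap created on this thread.
    static KEYS: (u64, u64) = {
        let mut bytes = [0_u8; 16];
        fill_bytes(&mut bytes);
        let mut lo = [0_u8; 8];
        let mut hi = [0_u8; 8];
        lo.copy_from_slice(&bytes[..8]);
        hi.copy_from_slice(&bytes[8..]);
        (u64::from_ne_bytes(lo), u64::from_ne_bytes(hi))
    };
}

fn hashmap_keys() -> (u64, u64) {
    KEYS.with(|&keys| keys)
}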

Wow those are some impressive numbers! In the past I don't remember them being quite so dramatic...

It'll probably depend on the platform (incl. Linux kernel version), since that'll change the performance characteristics of retrieving random numbers (the old thread rng requires retrieving kilobytes, while this just needs 16 bytes).

If we're moving to thread-locals I wouldn't mind moving to full-blown process-globals. We've concluded in the past that there's no need in terms of DOS protection to have per-hashmap or even per-thread keys, a global one should be good enough.

Yeah, definitely possible, but this implementation is particularly simple (don't have to worry about unsafe/CASing anything etc). Very happy to switch to that if people prefer it.

@ranma42 (Contributor) commented Feb 2, 2016

It would be possible to address @gankro's concern by generating the actual keys with a per-thread PRNG seeded with the OS random data.

EDIT: but isn't this what the old implementation was already doing?

@ranma42 (Contributor) commented Feb 2, 2016

The improvement from 130_000 ns to 3000 ns looks huge. Would it make sense to investigate why thread_rng is so slow (and maybe backport these changes to it)?

@huonw (Member, Author) commented Feb 2, 2016

EDIT: but isn't this what the old implementation was already doing?

Yes, it was, and it's why I was sure to point out this change has that downside in the PR. If we absolutely want to proactively help people avoid relying on hashmap order, a simpler and more efficient method would be to just increment one of the cached keys each time. (NB. this doesn't work quite so well with global rather than thread-local values: the locked increment/contention makes things slower.)

Would it make sense to investigate why thread_rng is so slow (and maybe backport these changes to it)?

Maybe, but it's almost certainly fundamental to the algorithm (it requires an extensive initialisation). In any case, the thread_rng in std is only used here, and hence is being/can be removed.

@ranma42 (Contributor) commented Feb 2, 2016

If the update of the global cached keys is that simple, we could do atomic increments. There might be some contention, but it should still be way faster than locking.

Another option would be to keep the thread-local approach and use a simpler (non-crypto-safe) RNG to get cheaper initialisation and key generation. In the repo for the rand crate there has been some discussion about providing fast weak RNGs, which might be a good fit for this purpose: they would be marginally slower than the no-op key copy and would still shuffle the keys in a different way for each HashMap.
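
For comparison, a sketch of the process-global variant under discussion, with an atomic bump per map (seed_from_os and hashmap_keys are hypothetical placeholders); the single-threaded cost of the fetch_add is what the measurements below refer to:

use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Once;

// Hypothetical placeholder for a single read of OS randomness.
fn seed_from_os() -> (u64, u64) { (0, 0) }

static INIT: Once = Once::new();
static KEY0: AtomicU64 = AtomicU64::new(0);
static KEY1: AtomicU64 = AtomicU64::new(0);

fn hashmap_keys() -> (u64, u64) {
    INIT.call_once(|| {
        let (k0, k1) = seed_from_os();
        KEY0.store(k0, Ordering::Relaxed);
        KEY1.store(k1, Ordering::Relaxed);
    });
    // The atomic bump keeps per-map iteration orders distinct, but every
    // HashMap::new on every thread now hits the same cache line.
    (KEY0.fetch_add(1, Ordering::Relaxed), KEY1.load(Ordering::Relaxed))
}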

This reduces how much rand is required in std, and makes creating
hash-maps noticeably faster. The first invocation of HashMap::new() in a
thread goes from ~130_000 ns to ~1500 ns (i.e. 86x faster) and later
invocations go from 10-20 ns to a more consistent 1.8-2.0ns.

The mean for many invocations in a loop, on a single thread, goes from
~77ns to ~1.9ns, and the standard deviation drops from 2800ns to
33ns (the *massive* variance of the old scheme comes from the occasional
reseeding of the thread rng).

These new numbers are only slightly higher than creating the
`RandomState` outside the loop and calling `with_state` in it (i.e.
only measuring the non-random parts of hashmap initialisation): 1.3 and
18 ns respectively.

This new scheme has the slight downside of being consistent on a single
thread, so users may unintentionally rely on the order being
fixed (e.g. two hashmaps with the same contents).

Closes rust-lang#27243.
@huonw (Member, Author) commented Feb 3, 2016

(Incidentally, my method of measuring had a lot of overhead due to the timers; I've updated the numbers.)


If the update of the global cached keys is that simple, we could do atomic increments. There might be some contention, but it should still be way faster than locking.

Yes, but... locking was never proposed? In the single threaded case (i.e. no contention), doing the atomic increment seems to be more than 2.8 times slower than the non-incrementing global version and both thread local ones (incrementing or not): it's about 5.4 ns / invocation.

Another option would be to keep the thread-local approach and use a simpler (non crypto-safe) RNG to have cheaper initialisation and key generation. In the repo for the rand crate there has been some discussion about providing fast weak RNG, which might be a good fit for this purpose: they would be marginally slower than the no-op key copy and still shuffle the keys in a different way for each HashMap

One could say that the incrementing was exactly this sort of non-crypto-safe RNG (and is about as fast as you can get). There are two factors at play here:

  • the key for a hashmap should be hard to deduce externally to protect against HashDoS
  • the order of keys within a hashmap is undefined, and hence people shouldn't accidentally rely on it (to ensure their code doesn't break when things change slightly)

The first we are assuming (see e.g. #27243) is guaranteed by the combination of 128-bits of randomness plus the design of SipHash, so this discussion is all about the second.

@brson (Contributor) commented Feb 28, 2016

I'm still a fan of this. @huonw is this ready to go?

@huonw added the I-nominated and T-libs-api (relevant to the library API team, which will review and decide on the PR/issue) labels on Feb 29, 2016
@alexcrichton (Member) commented

The libs team discussed this during triage yesterday and the conclusion was:

  • The speedup wins here seem quite good, so we should land
  • It would be nice for now to preserve nondeterministic iteration order among all hash maps
  • We can probably recover this via a Cell<u64> in TLS that's just bumped each time we hit it (see the sketch below).

Once that's in place this should be good to go, thanks @huonw!
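
A minimal sketch of that Cell-in-TLS bump, again with hypothetical placeholder names (seed_from_os, hashmap_keys); the version that actually landed is in #33318, referenced below:

use std::cell::Cell;

// Hypothetical placeholder for a single read of OS randomness per thread.
fn seed_from_os() -> (u64, u64) { (0, 0) }

thread_local! {
    // Seeded lazily on first use, then bumped on every HashMap creation so
    // two maps built on the same thread still iterate in different orders.
    static KEYS: Cell<(u64, u64)> = Cell::new(seed_from_os());
}

fn hashmap_keys() -> (u64, u64) {
    KEYS.with(|keys| {
        let (k0, k1) = keys.get();
        keys.set((k0.wrapping_add(1), k1));
        (k0, k1)
    })
}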

@alexcrichton removed the I-nominated and T-libs-api labels on Mar 3, 2016
@pczarn (Contributor) commented Mar 24, 2016

I suggest preserving nondeterministic iteration order only in debug builds, not release builds. However, there's no rush to change the team's decision.

(It seems the only flag for conditional compilation in debug builds is cfg!(debug_assertions).)

@bluss (Member) commented Mar 24, 2016

alex's solution sounds good. Making it conditional on debug_assertions inside libstd doesn't affect most users: almost all users will use a libstd compiled with debug_assertions off; until recently, not even our buildbots would test such a configuration.

@bors (Contributor) commented Apr 1, 2016

☔ The latest upstream changes (presumably #32635) made this pull request unmergeable. Please resolve the merge conflicts.

@alexcrichton (Member) commented

I've started to continue this in #33318 so I'm gonna close this in favor of that

alexcrichton added a commit to alexcrichton/rust that referenced this pull request May 19, 2016
This is a rebase and extension of rust-lang#31356 where we cache the keys in thread local
storage. This should give us a nice speed boost in creating hash maps along with
mostly retaining the property that all maps have a nondeterministic iteration
order.

Closes rust-lang#27243
bors added a commit that referenced this pull request May 20, 2016
std: Cache HashMap keys in TLS

This is a rebase and extension of #31356 where we not only cache the keys in
thread local storage but we also bump each key every time a new `HashMap` is
created. This should give us a nice speed boost in creating hash maps along with
retaining the property that all maps have a nondeterministic iteration order.

Closes #27243
Labels
relnotes

Successfully merging this pull request may close these issues.

Make HashMap use a per-process or per-thread seed instead of a per-instance seed

10 participants