
Speed up range sampling. #115

Closed
Conversation

@bhickey commented Aug 26, 2016

This change removes division from the rejection sampler in random_range().
Replace divisions in range sampling with bitshifting. Instead of finding a
well fitting range, we generate a number [0,2^n) and reject out of range
values.

In synthetic benchmarks, this approximately doubles the throughput.
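The approach can be sketched roughly as follows (a minimal illustrative sketch, not the PR's actual code; the `sample_range` helper and the LCG used as a stand-in RNG are hypothetical):

```rust
// Bitmask rejection sampling for [low, high): generate n-bit values,
// where 2^n is the smallest power of two >= span, and reject values
// that fall outside the span. No division or modulus is needed.
fn sample_range(low: u64, high: u64, next_u64: &mut impl FnMut() -> u64) -> u64 {
    let span = high - low;
    if span <= 1 {
        return low; // degenerate range
    }
    // Mask keeping just enough low bits to cover `span`.
    let mask = u64::MAX >> (span - 1).leading_zeros();
    loop {
        let x = next_u64() & mask; // uniform in [0, 2^n)
        if x < span {
            return low + x; // accept; otherwise reject and retry
        }
    }
}

fn main() {
    // A simple LCG as a stand-in RNG, for demonstration only.
    let mut state: u64 = 0x853c49e6748fea9b;
    let mut rng = || {
        state = state
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        state
    };
    for _ in 0..1_000 {
        let v = sample_range(10, 10_000, &mut rng);
        assert!((10..10_000).contains(&v));
    }
    println!("all samples in range");
}
```

Because the mask is derived from `leading_zeros()`, it can be computed once per range (the `lz` field discussed below) instead of per sample.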

@dhardy (Member) commented Aug 17, 2017

> In synthetic benchmarks, this approximately doubles the throughput.

I can't duplicate this. In my tests:

  • ~3070ns per i64 with the old algorithm
  • ~3200ns per i64 with your algorithm
  • ~3340ns per i64 with your algorithm, but recalculating leading_zeros() each call instead of storing an lz field
  • ~10200ns per i64 with the old algorithm, but not storing the rejection zone field

So your algorithm is definitely useful if we wish to cut out the extra field, or where the range can't be pre-computed, but not with the benchmark you used(?).

Actually, my test was using a different range (Range::new(3i64, 134217671i64)) and type. Using Range::new(10, 10000) from your benchmark, I get ~2600ns with the old algorithm and ~6200ns with your algorithm: more than twice as slow (and slower for i32 than i64). I'm not sure why it's this bad.

@pitdicker (Contributor) commented:
Using bitshifts instead of a modulus improves performance.
But the usable zone is a bad approximation: depending on the range, only about 50% of the generated values may pass. So there is a high chance the RNG has to run multiple times, and branch prediction becomes terrible.

Benchmarking this method is a bit more involved, because the results very much depend on how close a range is to a power of two.
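To see that dependence concretely: the acceptance probability of the mask-and-reject step is span / 2^n, where 2^n is the smallest power of two covering the span. A small illustrative calculation (not code from the PR):

```rust
// Fraction of masked values that fall inside the span, i.e. the
// probability that a single RNG draw is accepted rather than rejected.
// Assumes span >= 2.
fn acceptance(span: u64) -> f64 {
    let mask = u64::MAX >> (span - 1).leading_zeros();
    span as f64 / (mask as f64 + 1.0)
}

fn main() {
    println!("{:.3}", acceptance(1024)); // exact power of two: 1.000
    println!("{:.3}", acceptance(1025)); // just above one: ~0.500
    println!("{:.3}", acceptance(9990)); // span of Range::new(10, 10000): ~0.610
}
```

So a benchmark that happens to pick a near-power-of-two span will look far better than one that lands just above a power of two.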

I think dhardy#2, dhardy#68 and dhardy#69 do more to improve the performance of Range. The techniques are: never working on values smaller than 32 bits, using a widening multiply instead of a modulus, and a sample_single function using the bitshift technique you described.
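For reference, the widening-multiply idea maps a full random word onto the span via the high half of a double-width product. A rough sketch under that description (not the actual code in those PRs, and shown without the rejection step the real implementation adds to remove the small bias):

```rust
// Map a uniform 32-bit value onto [0, span) with a widening multiply:
// floor(x * span / 2^32), computed as the high 32 bits of a 64-bit
// product. This replaces the per-sample division/modulus entirely.
fn map_to_range(x: u32, span: u32) -> u32 {
    ((x as u64 * span as u64) >> 32) as u32
}

fn main() {
    assert_eq!(map_to_range(0, 10), 0);          // smallest input -> 0
    assert_eq!(map_to_range(u32::MAX, 10), 9);   // largest input -> span - 1
    assert_eq!(map_to_range(1 << 31, 10), 5);    // midpoint input -> mid output
    println!("ok");
}
```

Unlike mask-and-reject, the multiply consumes exactly one RNG word per sample (plus a rare rejection when de-biasing), which is why it behaves better for spans far from a power of two.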

@dhardy (Member) commented Dec 12, 2017

Yes, sounds like we can close this now (still need to get that code merged of course)!

@dhardy closed this Dec 12, 2017
@pitdicker (Contributor) commented:
Thanks for the effort on this though!

3 participants