
Floating point numbers #105

Open
curiousleo opened this issue Apr 22, 2020 · 9 comments


@curiousleo
Collaborator

curiousleo commented Apr 22, 2020

This issue motivates and gives context for #102, where I have implemented a method that lets us generate every representable number in the inclusive unit interval with the correct probability based on Downey's method [1].


Float representation

There are roughly two ways to create a random floating point number in the unit interval: non-injective ones and injective ones.

As a quick refresher, an IEEE binary32 floating point number ("float") is laid out as follows (the | symbols are just for illustration):

b31 | b30 ... b23 | b22 ... b0

which is interpreted as the number

(-1)^(b31) * 2^([b30 ... b23] - 127) * 1.[b22 ... b0]

where [bx ... by] is the number we get if we interpret the bits bx through by, written consecutively, as a binary number.

  • b31 is the sign
  • [b30 ... b23] is the exponent, which is biased by -127
  • [b22 ... b0] is the mantissa
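
The fields can be inspected programmatically; here is a small Python sketch (roughly what castFloatToWord32 exposes on the Haskell side):

```python
import struct

def float32_fields(x):
    """Decode a float into its binary32 (sign, biased exponent, mantissa)."""
    (bits,) = struct.unpack('<I', struct.pack('<f', x))
    sign     = bits >> 31            # b31
    exponent = (bits >> 23) & 0xff   # [b30 ... b23], biased by 127
    mantissa = bits & 0x7fffff       # [b22 ... b0]
    return sign, exponent, mantissa

# 1.0 = (-1)^0 * 2^(127 - 127) * 1.0, so biased exponent 127, mantissa 0
print(float32_fields(1.0))   # (0, 127, 0)
print(float32_fields(0.5))   # (0, 126, 0)
```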

Float is the set of all representable IEEE binary32 floating point numbers. Let's refer to the subset of Float that represents the range [0,1] as UnitFloat.

What form do the numbers in UnitFloat take? They are:

  • the non-negative floats smaller than one: all those floats with b31 = 0 and exponent < 0, that is, [b30 ... b23] < 127; this includes, as a special case, the float interpreted as zero
  • the float interpreted as one: b31 = 0, [b30 ... b23] = 127 and [b22 ... b0] = 0

Non-injective (Word32 -> UnitFloat)

Constant exponent

The method used right now is this:

(castWord32ToFloat $ (w32 `unsafeShiftR` 9) .|. 0x3f800000) - 1.0

where w32 is a random Word32.

What is happening here? The shift moves the top bits of the random Word32 into the lowest 32-9=23 bits, i.e. the mantissa. The .|. 0x3f800000 then sets the top nine bits to 0|01111111:

  • b31 = 0, so the sign is positive
  • [b30 ... b23] = 01111111 = 127, so the exponent is 127 - 127 = 0.

In other words, this method generates numbers of the form

(-1)^0 * 2^0 * 1.[b22 ... b0] = 1.[b22 ... b0]

These numbers all lie in [1, 2):

  • if b22 to b0 are all zero, we get 1, so 1 is included
  • if b22 to b0 are all one, we get close to 2, but we never hit it

By subtracting 1.0 we get numbers in the range [0, 1).

AFAICT the v1.1 method does something very similar.
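
For illustration, the same bit trick can be written in Python (a sketch, not the actual library code; struct does the bit reinterpretation that castWord32ToFloat does in Haskell):

```python
import struct

def word32_to_unit_float(w32):
    """Map a random Word32 to [0, 1) via the constant-exponent trick."""
    assert 0 <= w32 < 2**32
    # top 23 bits of w32 become the mantissa; sign/exponent fixed to 0|01111111
    bits = (w32 >> 9) | 0x3f800000
    (f,) = struct.unpack('<f', struct.pack('<I', bits))  # a float in [1, 2)
    return f - 1.0

print(word32_to_unit_float(0))           # 0.0
print(word32_to_unit_float(2**32 - 1))   # just below 1.0
```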

Multiplication

mwc-random uses multiplication and addition to scale a Word32 into the (0,1] range.
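
For reference, that scaling can be transliterated to Python like this (a sketch based on mwc-random's wordToFloat; computed here in double precision with one final rounding to binary32, so individual results may differ slightly from the all-Float Haskell arithmetic):

```python
import struct

def f32(x):
    """Round a double to the nearest binary32 value."""
    return struct.unpack('<f', struct.pack('<f', x))[0]

def word_to_float(w32):
    """Scale a Word32 into (0, 1] by multiplication and addition,
    in the style of mwc-random's wordToFloat."""
    assert 0 <= w32 < 2**32
    i = w32 - 2**32 if w32 >= 2**31 else w32  # reinterpret as Int32
    return f32(i * 2.0**-32 + 0.5 + 2.0**-33)

print(word_to_float(2**31))      # 2^-33: smallest output, so 0 is excluded
print(word_to_float(2**31 - 1))  # 1.0: largest output, so 1 is included
```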

Comparison

Fortunately, it is possible to simply enumerate all Word32 values. This is what I've done in the following.

  • full is the cardinality of UnitFloat
  • count is the cardinality of the image of the respective conversion functions
  • percentage hit is count / full, expressed as a percentage
random: word32ToFloatInUnitInterval
full  = 1065353218
count = 8388608
percentage hit = 0.787%
includes 0 = 1
includes 1 = 0

mwc-random: wordToFloat
full  = 1065353218
count = 25165825
percentage hit = 2.362%
includes 0 = 0
includes 1 = 1

downey baseline: wordToFloatByDivision
full  = 1065353218
count = 83886081
percentage hit = 7.874%
includes 0 = 1
includes 1 = 1

downey baseline: wordToFloatByDivisionDouble
full  = 1065353218
count = 83886081
percentage hit = 7.874%
includes 0 = 1
includes 1 = 1

java: nextFloat/wordToFloatByDivision24
full  = 1065353218
count = 16777216
percentage hit = 1.5748031466499028%
includes 0 = 1
includes 1 = 0

Generated by https://gist.github.com/curiousleo/d8222f63be859f8207e805e71d13095a.

Injective (Word150 -> UnitFloat)

What would we need to do to actually hit every possible UnitFloat with the correct probability? Downey [1] shows one method.

Roughly, it amounts to the following:

  • pick the mantissa bits [b22 ... b0] at random
  • to pick the biased exponent, compute 126 - n, where n is the number of failed Bernoulli trials before the first success, capped at 126. This can be done by drawing a random 126-bit bitstring and counting its leading zeros

This maps to every possible float in the range [0, 2^-1*1.[11 ... 11]]. To de-bias the method and include 1, the following correction is made:

  • if the mantissa is all zeros, then with p=0.5, increase the exponent by one

I won't go into the probabilistic justification for this here, but it should be clear that this makes it possible to hit 1: exactly when the mantissa is all zeros, the unbiased exponent before the correction is -1, and the coin toss results in increasing the exponent by one. Then we get 2^0 = 1.

This process consumes 150 bits of randomness overall (in most cases, a lot less: we can stop the Bernoulli trials once we've had a single success). For every UnitFloat, we can find at least one conceptual Word150 that maps to it, i.e. the map is surjective onto UnitFloat.
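
The steps above can be sketched in Python (an illustration of Downey's method, not the implementation from #102; rng is anything that supplies random bits):

```python
import random
import struct

def downey_unit_float(rng):
    """Generate a binary32 value in [0, 1] by Downey's method (sketch)."""
    # Bernoulli trials: count failures before the first success, up to 126.
    n = 0
    while n < 126 and rng.getrandbits(1) == 0:
        n += 1
    exponent = 126 - n                  # biased exponent in [0, 126]
    mantissa = rng.getrandbits(23)      # [b22 ... b0]
    # De-biasing correction: if the mantissa is all zeros, increase the
    # exponent by one with probability 1/2. This is what makes 1 reachable.
    if mantissa == 0 and rng.getrandbits(1) == 1:
        exponent += 1
    bits = (exponent << 23) | mantissa  # sign bit is 0
    (f,) = struct.unpack('<f', struct.pack('<I', bits))
    return f

rng = random.Random(2020)
samples = [downey_unit_float(rng) for _ in range(10000)]
assert all(0.0 <= x <= 1.0 for x in samples)
```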

[1] http://allendowney.com/research/rand/


Double

The same considerations apply to generating uniformly distributed Double values. Here, Downey's method consumes up to 1022 bits for Bernoulli trials + 52 bits for the mantissa + 1 bit for the exponent correction = 1075 bits.


Unit interval vs. general range

In order to generate a floating point number in an arbitrary interval [lo, hi] with lo < hi, we can generate x in the unit interval and then calculate lo + x * (hi - lo).

Alternatively, it is conceivable to use a variant of Downey's method to directly generate a floating point number uniformly distributed over an arbitrary interval. That is what rademacher-fpl [2] does. Pro: it's "perfect": hitting every representable number in the range with the right probability. Con: it's pretty involved. Generating a floating point number in the unit interval is a very simple special case of the more general procedure.

In order to weigh these options, I wonder how we can analyse the quality of lo + x * (hi - lo): how many representable numbers in the resulting range can we hit in this way? Also, is there a better way to do this calculation, perhaps x * hi - lo * (x - 1), that improves the "hit rate"?

[2] https://gitlab.com/christoph-conrads/rademacher-fpl/-/blob/master/include/rademacher-fpl/impl/uniform-real-distribution.hpp

@curiousleo
Collaborator Author

Relevant: #113 (comment), which suggests thinking about FP numbers as triples (sign, exponent, significand) explicitly.

@curiousleo
Collaborator Author

Benchmarks for the different methods, as implemented here: https://gist.github.com/curiousleo/d8222f63be859f8207e805e71d13095a

word32ToFloatInUnitInterval:
pure/random/Float                        mean 334.2 μs  ( +- 10.47 μs  )
pure/uniformR/unbounded/Float            mean 335.0 μs  ( +- 7.438 μs  )

wordToFloat:
pure/random/Float                        mean 28.78 μs  ( +- 2.373 μs  )
pure/uniformR/unbounded/Float            mean 28.63 μs  ( +- 2.400 μs  )

wordToFloatByDivision:
pure/random/Float                        mean 28.20 μs  ( +- 2.218 μs  )
pure/uniformR/unbounded/Float            mean 27.62 μs  ( +- 950.8 ns  )

It appears that of these "simple" methods (non-Downey-inspired), wordToFloatByDivision has both the largest coverage (see issue description) and is no slower than any other method.

@idontgetoutmuch
Owner

idontgetoutmuch commented Apr 29, 2020

I think this says we should go with the method used in MWC i.e. select 32 bits convert to double, multiply by 2^-23 - this is a compromise between performance and coverage and much better than is in 1.1.

Also simplifies the code base as we can delete CastFloatWord.cmm

@curiousleo
Collaborator Author

> I think this says we should go with the method used in MWC i.e. select 32 bits convert to double, multiply by 2^-23 - this is a compromise between performance and coverage and much better than is in 1.1.
>
> Also simplifies the code base as we can delete CastFloatWord.cmm

What is the advantage of mwc-random's wordToFloat over wordToFloatByDivision? They have the same performance (within each other's stddev), but wordToFloatByDivision has a larger coverage.

@idontgetoutmuch
Owner

Oh sorry I thought they were the same but let's use the one with better coverage.

@curiousleo
Collaborator Author

Based on @christoph-conrads' comments on #113, I've started looking into shortcomings of the translation approach, i.e. "given a ≤ b, generate x in the unit interval and return a + x * (b - a)", using the marvelous sbv package. Here's what I've found so far:

Translating Float

∀ a b x. a ≤ b & 0 ≤ x ≤ 1 ⇒ a ≤ a + x × (b - a)
Falsifiable. Counter-example:
  a               = -1.7159568e38 :: Float
  b               =  1.7159568e38 :: Float
  x               =           0.0 :: Float
  a + x * (b - a) =           NaN :: Float

∀ a b x. a ≤ b & 0 < x < 1 ⇒ a ≤ a + x × (b - a)
Falsifiable. Counter-example:
  a               =    -Infinity :: Float
  b               = -4.2949673e9 :: Float
  x               = 4.656614e-10 :: Float
  a + x * (b - a) =          NaN :: Float

∀ a b x. a ≤ b & 0 ≤ x ≤ 1 ⇒ a + x × (b - a) ≤ b
Falsifiable. Counter-example:
  a               = -4.7021254e-38 :: Float
  b               =     -1.481e-42 :: Float
  x               =            1.0 :: Float
  a + x * (b - a) =      -1.48e-42 :: Float

∀ a b x. a ≤ b & 0 < x < 1 ⇒ a + x × (b - a) ≤ b
[still running ...]

Translating Double

∀ a b x. a ≤ b & 0 ≤ x ≤ 1 ⇒ a ≤ a + x × (b - a)
Falsifiable. Counter-example:
  a               = -3.511119404031257e305 :: Double
  b               = 1.7941820154582846e308 :: Double
  x               =                    0.0 :: Double
  a + x * (b - a) =                    NaN :: Double

∀ a b x. a ≤ b & 0 < x < 1 ⇒ a ≤ a + x × (b - a)
Falsifiable. Counter-example:
  a               =              -Infinity :: Double
  b               = -1.39009830559643e-309 :: Double
  x               = 2.786775579063254e-309 :: Double
  a + x * (b - a) =                    NaN :: Double

∀ a b x. a ≤ b & 0 ≤ x ≤ 1 ⇒ a + x × (b - a) ≤ b
Falsifiable. Counter-example:
  a               =   -1.84467440737107e19 :: Double
  b               = -2.3364633538570005e12 :: Double
  x               =                    1.0 :: Double
  a + x * (b - a) =     -2.336463353856e12 :: Double

∀ a b x. a ≤ b & 0 < x < 1 ⇒ a + x × (b - a) ≤ b
[still running ...]
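
These counterexamples can be reproduced outside of sbv by rounding every intermediate result to binary32 (for +, -, and ×, computing in double and rounding once gives the correctly rounded binary32 operation):

```python
import struct

def f32(x):
    """Round a double to the nearest binary32 value."""
    return struct.unpack('<f', struct.pack('<f', x))[0]

# third Float counterexample above: a <= b and x = 1.0, yet a + x * (b - a) > b
a, b, x = f32(-4.7021254e-38), f32(-1.481e-42), f32(1.0)
result = f32(a + f32(x * f32(b - a)))
assert a <= b and 0.0 <= x <= 1.0
assert result > b   # the upper bound b is overshot
```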

@christoph-conrads

christoph-conrads commented Apr 30, 2020

> Based on @christoph-conrads' comments on #113, I've started looking into shortcomings of the translation approach, i.e. "given a ≤ b, generate x in the unit interval and return a + x * (b - a)", using the marvelous sbv package. Here's what I've found so far:

My comments are misleading in that they obscure the fundamental reason why the affine transformation a + x * (b-a) cannot work.

When computing random integers, it is well known that it may be necessary to reject certain random number generator (RNG) variates. For example, given an RNG computing values 0, 1, 2 with equal probability, the only way to select between an apple and a pear with 50% probability is by

  • taking the apple if the RNG output is 0,
  • taking the pear if the RNG output is 1,
  • trying again if the RNG output is 2.
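
In code, the apple-or-pear selection is the classic rejection loop (illustrative sketch):

```python
import random

def pick_fruit(rng):
    """Choose between two options with equal probability using a
    three-valued RNG, by rejecting the third outcome."""
    while True:
        r = rng.randrange(3)   # 0, 1 or 2, each with probability 1/3
        if r == 0:
            return "apple"
        if r == 1:
            return "pear"
        # r == 2: no fair assignment exists, so reject and redraw

rng = random.Random(42)
picks = [pick_fruit(rng) for _ in range(10000)]
print(picks.count("apple") / len(picks))   # ≈ 0.5
```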

Floating-point numbers are a finite set and so we must reject values, too. As an example, you are given an RNG with output in [0,1) and you want uniformly distributed floating-point values in [0, 0.75). To simplify the problem, consider an FP format with two significand bits, such that the FP values in [0, 1) are 0, 1/8, 2/8, 3/8, 4/8, 6/8. Because of the interval definition, the value 0.5 should be drawn with probability 1/3.

Here is the key question: How do you map the available set of numbers to [0, 0.75) to achieve this?

To help you, here is a table with the probabilities.

FP value | Probability | Exponent
-------- | ----------- | --------
0        | 1/8         | 0
1/8      | 1/8         | 0
2/8      | 1/8         | 1
3/8      | 1/8         | 1
4/8      | 1/4         | 2
6/8      | 1/4         | 2

There is no way to sum up a finite number of values 2^-k to yield exactly 1/3. So the correct answer is: Draw from the RNG and redraw if the output is >=0.75.

This leads me to a nice conjecture:

Given floating-point numbers a, b and a uniformly distributed x with a ≥ 0 and 0 ≤ x < 1, x * (b - a) + a yields a uniform distribution if and only if rounding toward zero is used and b - a is a power of two (in real arithmetic).

Algorithm:

def uniform(a, b):
    assert a <= b

    if b <= 0:
        return -uniform(-b, -a)

    if a < 0 and b > 0:
        x = next_larger_power_of_two(abs(a))
        y = next_larger_power_of_two(abs(b))
        u = uniform(0, x) with probability x/(x+y), uniform(0, y) otherwise

        check if u is in [a,b], try again if not

        return u

    while True:
        x = random01()
        d = b-a with upward rounding
        e = next_larger_power_of_two(d)

        u = x * e + a with rounding toward zero

        if a <= u and u <= b:
            return u

@curiousleo
Collaborator Author

@christoph-conrads thank you so much for this clear explanation! When I was trying to come up with an algorithm, I was thinking along similar lines but with much less clarity :)

I believe that the code has 1.5 bugs:

  • if b = 0 then it will loop forever, unless I'm missing some subtlety related to the different zeros (in this case, was the intention that -0 < +0? According to Wikipedia, that is not the case ...)
  • in the a < 0 && 0 < b case, I believe there is a sign mistake:
    - u = uniform(0, x) with probability x/(x+y), uniform(0, y) otherwise
    + u = -uniform(0, x) with probability x/(x+y), uniform(0, y) otherwise

@christoph-conrads

> if b = 0 then it will loop forever, unless I'm missing some subtlety related to the different zeros (in this case, was the intention that -0 < +0? According to Wikipedia, that is not the case ...)

Yes, this should read as signbit(b) == 1.

> in the a < 0 && 0 < b case, I believe there is a sign mistake:

Yes.

def uniform(a, b):
    assert a <= b

    if signbit(b) == 1:
        return -uniform(-b, -a)

    if a < 0 and b > 0:
        x = next_larger_or_equal_power_of_two(abs(a))
        y = next_larger_or_equal_power_of_two(abs(b))
        u = -uniform(0, x) with probability x/(x+y), uniform(0, y) otherwise

        check if u is in [a,b], try again if not

        return u

    while True:
        x = random01()
        d = b-a with upward rounding
        e = next_larger_or_equal_power_of_two(d)

        u = x * e + a with rounding toward zero

        if a <= u and u <= b:
            return u
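
For illustration, here is a runnable approximation of the final (non-negative) branch in Python. Everything is computed in double precision with round-to-nearest, so the upward-rounding and round-toward-zero requirements of the pseudocode are not honoured; this only shows the rejection structure:

```python
import math
import random

def next_larger_or_equal_power_of_two(x):
    """Smallest power of two >= x, for x > 0."""
    assert x > 0
    return 2.0 ** math.ceil(math.log2(x))

def uniform_nonneg(a, b, rng):
    """Rejection sampling for [a, b] with 0 <= a < b (sketch of the
    final branch of the pseudocode above, without rounding-mode control)."""
    assert 0 <= a < b
    e = next_larger_or_equal_power_of_two(b - a)
    while True:
        x = rng.random()   # stand-in for random01(), output in [0, 1)
        u = x * e + a
        if a <= u <= b:
            return u

rng = random.Random(105)
samples = [uniform_nonneg(0.25, 0.75, rng) for _ in range(1000)]
assert all(0.25 <= s <= 0.75 for s in samples)
```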
