Seeding #18

curiousleo · 2020-03-03T12:38:00Z

Initialising PRNGs correctly is hard. For casual users in particular, it should be easy to Do The Right Thing.

Typical things users should be able to do and the common solutions are:

User: initialise a PRNG without specifying a seed at all
Solution: draw non-deterministic randomness from a system source to create a good seed
User: initialise a PRNG from a low entropy constant (e.g. small integers like 1337 or 42) without getting punished for it
Solution: mix user-provided seed deterministically to turn low-entropy seed (e.g. with lots of zero bytes) into high-entropy seed
User: initialise multiple instances of a PRNG
Solution: provide a seed sequence with a spawn function, see below

Prior art

Some RNG APIs have a separate "seeding" layer that takes care of this, see #13.

C++ `randutils` / `seed_seq`

Initialising a PRNG without parameters uses a system randomness source to seed it (source here)
Builds on top of the seed_seq idea from the C++ standard library; this is a kind of low level PRNG that can only fill arrays of 32-bit numbers with random values, nothing else - its only purpose is to seed other (more convenient and faster) PRNGs.

numpy `SeedSequence`

Initialising a SeedSequence without parameters uses a system randomness source to seed it
SeedSequence.spawn initialises a number of new SeedSequence instances. The unique spawn key is hashed into the sequence's entropy
SeedSequence.generate_state returns an array of random numbers for seeding
PRNG constructors take a SeedSequence from which they draw their seed

`mwc-random`

create and initialize combine a user-provided seed with a fixed default seed - this is a form of deterministic mixing
withSystemRandom and createSystemRandom seed the PRNG with system randomness

`splitmix`

seedSMGen uses the given seeds without mixing
mkSMGen mixes the given Word64 before initialising the SMGen
initSMGen uses system time
newSMGen splits an new generator off from the global generator

`random` right now

mkStdGen (public) throws away all but the lowest 32 bits of the given Int and then creates the StdGen via mkStdGen32. createStdGen (private) does the same, but takes an Integer.
split is used to create new RNG instances

Design considerations

Mixing user seeds

random as it is designed right now provides no method that deterministically mixes the user seed. This is easily fixed e.g. by calling out to the relevant splitmix function.

Spawning and splitting

In the random API, split is part of the RandomGen class. This means that there is an assumption that each RNG must be split in its own way.
This is in contrast to the C++ and numpy APIs, where there is a source of seeds separate from any particular RNG implementation (seed_seq and SeedSequence, respectively) which is used to seed all RNGs.

For concreteness' sake, what I have in mind is something along those lines:

-- | Size of the entropy pool, in bytes (default 4 or 8)
type PoolSize = Int

-- | Non-deterministic seed using system randomness source
systemSeed :: IO SeedSeq
systemSeedWithPoolSize :: PoolSize -> IO SeedSeq
-- | Deterministic seed generated from user specified input
fixedSeed :: Word64 -> SeedSeq
fixedSeedWithPoolSize :: PoolSize -> Word64 -> SeedSeq

-- | Seed a single 'RandomGen'
spawn :: RandomGen g => SeedSeq -> (g, SeedSeq)
-- | Seed multiple 'RandomGen'
spawnN :: RandomGen g => SeedSeq -> Int -> ([g], SeedSeq)

This is a simplified translation of the numpy / C++ approach. In the background, it would use something like this initialize function in RandomGen to create RNGs.

Pros of such an API with a separate, opaque source of seeds:

Allows us to build a 'good seed by default' API for any compatible RNG (see systemSeed and fixedSeed above)
Intent is communicated clearly: Seed is meant to be used for seeding, so its design constraints are different: it must create reasonable seeds for any RNG, but it can be slower than other RNGs because in the expected use case, only a small number of seeds are required compared to the overall number of random values generated by RNGs over the course of a program

Cons:

A bad source of seeds would lead to a 'bad seed by default' API
It might interact weirdly with some particular RNG algorithm

Implementation notes

entropy could be used to get entropy from the system.
The PRNG behind Seed could be

the non-cryptographic algorithm designed by O'Neill for randutils, which is also used by numpy: it passes statistical tests and is now widely used
a cryptographic RNG

Thoughts?

Edit: renamed Seed to SeedSeq to avoid clashes with the existing definition of Seed inside MonadRandom.

The text was updated successfully, but these errors were encountered:

curiousleo · 2020-03-03T12:43:46Z

Related: #4, #7.

peteroupc · 2020-03-03T13:10:39Z

For the case of initializing multiple instances of a PRNG, using a seed sequence generator to produce multiple seeds is a generic approach to seeding. However, this has the risk of producing overlapping, correlated, or even identical random number sequences. There are several ways to reduce this risk (some of which are PRNG-specific):

By using a seed sequence generator that ensures each generated seed is unique. As an example, O'Neill's seed sequence generator, but not the seed_seq provided by C++, produces unique seeds from unique inputs of a given size.
By using a counter-based PRNG or another PRNG that gives each seed its own non-overlapping and independent sequence of numbers.
By using a PRNG that allows a huge number of outputs to be discarded efficiently ("jump-ahead").
By using PRNGs of different designs among the processes.

For PRNGs that support jump-ahead or "one sequence per seed", there should be a way to allow these PRNGs to use an alternative strategy to initialize multiple PRNGs based on a common seed. The use of a seed sequence generator can remain the default behavior, especially because the risk of overlap is often negligible (S. Vigna, "On the probability of overlap of random subsequences of pseudorandom number generators", 2020).

Shimuuar · 2020-03-03T16:50:16Z

@curiousleo Thanks for the excellent summary!

Another consideration is saving and restoring of generator state. It could be important for long-running simulations for creation of snapshot which allows to continue from latest snapshot instead of replaying hour and possibly days of compute.

Single Seed wouldn't work very well because size of generator state could vary by order of magnitude. From 1-2 Word64 to 1032B for mwc-random and around same value for Mersenne twister.

AFAIR crypto RNGs allow to add additional entropy midcalculation but I'm not sure about this.

lehins · 2020-03-03T17:06:20Z

crypto RNGs allow to add additional entropy midcalculation

I think here is an example API that seems to allow such reseeding: https://hackage.haskell.org/package/crypto-api-0.13.3/docs/Crypto-Random.html#v:reseed

I don't think we need to create the most general API though. We need to be careful with feature creep. I'd hate it to mess up our efforts.

Please guys, correct me if I am wrong, but in my mind the goal for random package is to be a defacto place to go for efficient pseudo random number generation needed for daily use. If it can be used with some cryptographic algorithms it would be a plus, but not a requirement.

Shimuuar · 2020-03-03T17:18:11Z

Please guys, correct me if I am wrong, but in my mind the goal for random package is to be a defacto place to go for efficient pseudo random number generation needed for daily use. If it can be used with some cryptographic algorithms it would be a plus, but not a requirement.

Totally agree. In fact I'm somewhat wary of CPRNGs. Requirements are just so different. So we should probably leave these to crypto libraries. They will hopefully know what they're doing

I just more comfortable with "we decided to leave this thing out" than "No one thought about this"

peteroupc · 2020-03-03T17:39:59Z

Please guys, correct me if I am wrong, but in my mind the goal for random package is to be a defacto place to go for efficient pseudo random number generation needed for daily use. If it can be used with some cryptographic algorithms it would be a plus, but not a requirement.

Rather, the random package should be for random number generation for purposes outside of information security. For applications that require information security, only a cryptographic RNG is appropriate, as are libraries designed for security.

Thus, it would not be within the scope of this package to "reseed" an existing PRNG (that is, initialize it in the same way as initializing a new PRNG) or to add "entropy" to an RNG to increase its "randomness", and it may or may not be within its scope to save and restore a PRNG's state, which is a very complicated matter because restoring the state this way requires knowing the exact PRNG algorithm and keeping it unchanged.

To be clear, for the use case of initializing a new (noncryptographic) PRNG "without specifying a seed at all" (use case 1 in the opening post), it's appropriate to generate a seed using a cryptographic RNG (such as one provided by the operating system) or with an entropy source.

curiousleo · 2020-03-04T10:45:04Z

@Shimuuar wrote:

Another consideration is saving and restoring of generator state. It could be important for long-running simulations for creation of snapshot which allows to continue from latest snapshot instead of replaying hour and possibly days of compute.

We already have save and restore for that:

random/System/Random.hs

Lines 208 to 209 in 656ef17

    
           restore :: Seed g -> m g 
        
           save :: g -> m (Seed g)

Did you mean that there should be save and restore for SeedSeq as well?

It's good you mention this though: save, restore and initialize (or whatever we call the class method that effectively takes a byte string and creates a new PRNG instance) interact and should probably be considered together.

Single Seed wouldn't work very well because size of generator state could vary by order of magnitude. From 1-2 Word64 to 1032B for mwc-random and around same value for Mersenne twister.

This is exactly what O'Neill's seed_seq_fe (which stands for "Fixed-Entropy Seed sequence") and numpy's SeedSequence do: they keep a small entropy pool and then use a PRNG to generate seeds of arbitrary size from it.

curiousleo · 2020-04-14T11:51:31Z

We've stepped away from the idea of a default split implementation for now. Closing.

curiousleo mentioned this issue Mar 4, 2020

Generic split function #12

Closed

peteroupc mentioned this issue Mar 6, 2020

Cmdstan3 stan-dev/design-docs#15

Closed

curiousleo closed this as completed Apr 14, 2020

curiousleo mentioned this issue Nov 14, 2021

Reimplementation of Random using the Xoshiro256++ PRNG ocaml/ocaml#10701

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Seeding #18

Seeding #18

curiousleo commented Mar 3, 2020 •

edited

Loading

curiousleo commented Mar 3, 2020

peteroupc commented Mar 3, 2020 •

edited

Loading

Shimuuar commented Mar 3, 2020

lehins commented Mar 3, 2020

Shimuuar commented Mar 3, 2020

peteroupc commented Mar 3, 2020 •

edited

Loading

curiousleo commented Mar 4, 2020 •

edited

Loading

curiousleo commented Apr 14, 2020

Seeding #18

Seeding #18

Comments

curiousleo commented Mar 3, 2020 • edited Loading

Prior art

C++ randutils / seed_seq

numpy SeedSequence

mwc-random

splitmix

random right now

Design considerations

Mixing user seeds

Spawning and splitting

Implementation notes

curiousleo commented Mar 3, 2020

peteroupc commented Mar 3, 2020 • edited Loading

Shimuuar commented Mar 3, 2020

lehins commented Mar 3, 2020

Shimuuar commented Mar 3, 2020

peteroupc commented Mar 3, 2020 • edited Loading

curiousleo commented Mar 4, 2020 • edited Loading

curiousleo commented Apr 14, 2020

curiousleo commented Mar 3, 2020 •

edited

Loading

C++ `randutils` / `seed_seq`

numpy `SeedSequence`

`mwc-random`

`splitmix`

`random` right now

peteroupc commented Mar 3, 2020 •

edited

Loading

peteroupc commented Mar 3, 2020 •

edited

Loading

curiousleo commented Mar 4, 2020 •

edited

Loading