Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Seeding #18

Closed
curiousleo opened this issue Mar 3, 2020 · 8 comments
Closed

Seeding #18

curiousleo opened this issue Mar 3, 2020 · 8 comments

Comments

@curiousleo
Copy link
Collaborator

curiousleo commented Mar 3, 2020

Initialising PRNGs correctly is hard. For casual users in particular, it should be easy to Do The Right Thing.

Typical things users should be able to do and the common solutions are:

  1. User: initialise a PRNG without specifying a seed at all
    Solution: draw non-deterministic randomness from a system source to create a good seed
  2. User: initialise a PRNG from a low entropy constant (e.g. small integers like 1337 or 42) without getting punished for it
    Solution: mix user-provided seed deterministically to turn low-entropy seed (e.g. with lots of zero bytes) into high-entropy seed
  3. User: initialise multiple instances of a PRNG
    Solution: provide a seed sequence with a spawn function, see below

Prior art

Some RNG APIs have a separate "seeding" layer that takes care of this, see #13.

C++ randutils / seed_seq

  • Initialising a PRNG without parameters uses a system randomness source to seed it (source here)
  • Builds on top of the seed_seq idea from the C++ standard library; this is a kind of low level PRNG that can only fill arrays of 32-bit numbers with random values, nothing else - its only purpose is to seed other (more convenient and faster) PRNGs.

numpy SeedSequence

  • Initialising a SeedSequence without parameters uses a system randomness source to seed it
  • SeedSequence.spawn initialises a number of new SeedSequence instances. The unique spawn key is hashed into the sequence's entropy
  • SeedSequence.generate_state returns an array of random numbers for seeding
  • PRNG constructors take a SeedSequence from which they draw their seed

mwc-random

  • create and initialize combine a user-provided seed with a fixed default seed - this is a form of deterministic mixing
  • withSystemRandom and createSystemRandom seed the PRNG with system randomness

splitmix

  • seedSMGen uses the given seeds without mixing
  • mkSMGen mixes the given Word64 before initialising the SMGen
  • initSMGen uses system time
  • newSMGen splits an new generator off from the global generator

random right now

  • mkStdGen (public) throws away all but the lowest 32 bits of the given Int and then creates the StdGen via mkStdGen32. createStdGen (private) does the same, but takes an Integer.
  • split is used to create new RNG instances

Design considerations

Mixing user seeds

random as it is designed right now provides no method that deterministically mixes the user seed. This is easily fixed e.g. by calling out to the relevant splitmix function.

Spawning and splitting

In the random API, split is part of the RandomGen class. This means that there is an assumption that each RNG must be split in its own way.
This is in contrast to the C++ and numpy APIs, where there is a source of seeds separate from any particular RNG implementation (seed_seq and SeedSequence, respectively) which is used to seed all RNGs.

For concreteness' sake, what I have in mind is something along those lines:

-- | Size of the entropy pool, in bytes (default 4 or 8)
type PoolSize = Int

-- | Non-deterministic seed using system randomness source
systemSeed :: IO SeedSeq
systemSeedWithPoolSize :: PoolSize -> IO SeedSeq
-- | Deterministic seed generated from user specified input
fixedSeed :: Word64 -> SeedSeq
fixedSeedWithPoolSize :: PoolSize -> Word64 -> SeedSeq

-- | Seed a single 'RandomGen'
spawn :: RandomGen g => SeedSeq -> (g, SeedSeq)
-- | Seed multiple 'RandomGen'
spawnN :: RandomGen g => SeedSeq -> Int -> ([g], SeedSeq)

This is a simplified translation of the numpy / C++ approach. In the background, it would use something like this initialize function in RandomGen to create RNGs.

Pros of such an API with a separate, opaque source of seeds:

  • Allows us to build a 'good seed by default' API for any compatible RNG (see systemSeed and fixedSeed above)
  • Intent is communicated clearly: Seed is meant to be used for seeding, so its design constraints are different: it must create reasonable seeds for any RNG, but it can be slower than other RNGs because in the expected use case, only a small number of seeds are required compared to the overall number of random values generated by RNGs over the course of a program

Cons:

  • A bad source of seeds would lead to a 'bad seed by default' API
  • It might interact weirdly with some particular RNG algorithm

Implementation notes

entropy could be used to get entropy from the system.
The PRNG behind Seed could be

  • the non-cryptographic algorithm designed by O'Neill for randutils, which is also used by numpy: it passes statistical tests and is now widely used
  • a cryptographic RNG

Thoughts?

Edit: renamed Seed to SeedSeq to avoid clashes with the existing definition of Seed inside MonadRandom.

@curiousleo
Copy link
Collaborator Author

Related: #4, #7.

@peteroupc
Copy link

peteroupc commented Mar 3, 2020

For the case of initializing multiple instances of a PRNG, using a seed sequence generator to produce multiple seeds is a generic approach to seeding. However, this has the risk of producing overlapping, correlated, or even identical random number sequences. There are several ways to reduce this risk (some of which are PRNG-specific):

  • By using a seed sequence generator that ensures each generated seed is unique. As an example, O'Neill's seed sequence generator, but not the seed_seq provided by C++, produces unique seeds from unique inputs of a given size.
  • By using a counter-based PRNG or another PRNG that gives each seed its own non-overlapping and independent sequence of numbers.
  • By using a PRNG that allows a huge number of outputs to be discarded efficiently ("jump-ahead").
  • By using PRNGs of different designs among the processes.

For PRNGs that support jump-ahead or "one sequence per seed", there should be a way to allow these PRNGs to use an alternative strategy to initialize multiple PRNGs based on a common seed. The use of a seed sequence generator can remain the default behavior, especially because the risk of overlap is often negligible (S. Vigna, "On the probability of overlap of random subsequences of pseudorandom number generators", 2020).

@Shimuuar
Copy link

Shimuuar commented Mar 3, 2020

@curiousleo Thanks for the excellent summary!

Another consideration is saving and restoring of generator state. It could be important for long-running simulations for creation of snapshot which allows to continue from latest snapshot instead of replaying hour and possibly days of compute.

Single Seed wouldn't work very well because size of generator state could vary by order of magnitude. From 1-2 Word64 to 1032B for mwc-random and around same value for Mersenne twister.

AFAIR crypto RNGs allow to add additional entropy midcalculation but I'm not sure about this.

@lehins
Copy link
Collaborator

lehins commented Mar 3, 2020

crypto RNGs allow to add additional entropy midcalculation

I think here is an example API that seems to allow such reseeding: https://hackage.haskell.org/package/crypto-api-0.13.3/docs/Crypto-Random.html#v:reseed

I don't think we need to create the most general API though. We need to be careful with feature creep. I'd hate it to mess up our efforts.

Please guys, correct me if I am wrong, but in my mind the goal for random package is to be a defacto place to go for efficient pseudo random number generation needed for daily use. If it can be used with some cryptographic algorithms it would be a plus, but not a requirement.

@Shimuuar
Copy link

Shimuuar commented Mar 3, 2020

Please guys, correct me if I am wrong, but in my mind the goal for random package is to be a defacto place to go for efficient pseudo random number generation needed for daily use. If it can be used with some cryptographic algorithms it would be a plus, but not a requirement.

Totally agree. In fact I'm somewhat wary of CPRNGs. Requirements are just so different. So we should probably leave these to crypto libraries. They will hopefully know what they're doing

I just more comfortable with "we decided to leave this thing out" than "No one thought about this"

@peteroupc
Copy link

peteroupc commented Mar 3, 2020

Please guys, correct me if I am wrong, but in my mind the goal for random package is to be a defacto place to go for efficient pseudo random number generation needed for daily use. If it can be used with some cryptographic algorithms it would be a plus, but not a requirement.

Rather, the random package should be for random number generation for purposes outside of information security. For applications that require information security, only a cryptographic RNG is appropriate, as are libraries designed for security.

Thus, it would not be within the scope of this package to "reseed" an existing PRNG (that is, initialize it in the same way as initializing a new PRNG) or to add "entropy" to an RNG to increase its "randomness", and it may or may not be within its scope to save and restore a PRNG's state, which is a very complicated matter because restoring the state this way requires knowing the exact PRNG algorithm and keeping it unchanged.


To be clear, for the use case of initializing a new (noncryptographic) PRNG "without specifying a seed at all" (use case 1 in the opening post), it's appropriate to generate a seed using a cryptographic RNG (such as one provided by the operating system) or with an entropy source.

@curiousleo
Copy link
Collaborator Author

curiousleo commented Mar 4, 2020

@Shimuuar wrote:

Another consideration is saving and restoring of generator state. It could be important for long-running simulations for creation of snapshot which allows to continue from latest snapshot instead of replaying hour and possibly days of compute.

We already have save and restore for that:

random/System/Random.hs

Lines 208 to 209 in 656ef17

restore :: Seed g -> m g
save :: g -> m (Seed g)

Did you mean that there should be save and restore for SeedSeq as well?

It's good you mention this though: save, restore and initialize (or whatever we call the class method that effectively takes a byte string and creates a new PRNG instance) interact and should probably be considered together.

Single Seed wouldn't work very well because size of generator state could vary by order of magnitude. From 1-2 Word64 to 1032B for mwc-random and around same value for Mersenne twister.

This is exactly what O'Neill's seed_seq_fe (which stands for "Fixed-Entropy Seed sequence") and numpy's SeedSequence do: they keep a small entropy pool and then use a PRNG to generate seeds of arbitrary size from it.

@curiousleo
Copy link
Collaborator Author

We've stepped away from the idea of a default split implementation for now. Closing.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants