Pseudo-randomness is not totally reproducible #218

aedobbyn · 2018-02-01T21:57:53Z

@krlmlr did all the work for me here: http://rpubs.com/krlmlr/drake-random-seed

krlmlr · 2018-02-01T22:12:21Z

To expand a bit: It appears that the random seed isn't fixed automatically. The simple example runs a plan with random output, then reruns parts of it. The output looks different for the two cases.

I think we should be fixing the random seed for each target, for reproducibility. Issue in remake: richfitz/remake#106.

wlandau · 2018-02-01T22:28:00Z

@aedobbyn the basic example takes random bootstraps from mtcars. Are you looking for more randomness than that? Maybe something that explains how drake tries to handle reproducible random numbers?

@krlmlr the way drake approaches reproducible random seeds is in R/random.R and tests/testthat/test-random.R. drake preserves the random seed using withr::with_seed() and salts that seed for individual targets.

Sorry I did not provide links. The internet connection crashed here and I am using my phone.

wlandau · 2018-02-01T22:32:40Z

Also, the seed is stored in the config object, which you can recover with read_drake_config (). So drake does try to be reproducible under randomness. What I worry about is that the seed is salted deterministically for different targets based on the target name (in build_target()). Should the salting be random?

krlmlr · 2018-02-01T22:33:56Z

Thanks for the pointers! The example appears to do something different than the tests: I'm not clearing everything, rather I'm deleting an intermediate target.

wlandau · 2018-02-01T22:35:38Z

Hmm... It should still work then. I wonder what the problem is.

wlandau · 2018-02-01T22:36:04Z

See also #56 and #109.

wlandau · 2018-02-01T23:52:33Z

Kirill, in that example, did you have anything leftover in your cache from a previous make ()? I am starting to think the problem is that drake does not try to load the root seed if one is already cached. See where get_valid_seed() is called in drake_config().

wlandau · 2018-02-02T08:10:45Z

We should also probably add a drake_read_seed() function. I am resistant to a potential drake_set_seed(), however, because then some targets could be built with one root seed and others could be built with another. For reproducibility's sake, I think users should have to destroy the cache to reset the seed.

wlandau · 2018-02-02T08:15:54Z

And a new test in test-random.R should use set_seed() to reproduce the original error first.

krlmlr · 2018-02-02T08:42:49Z

In that example I cleaned the random2 target, so I guess the cache was partially populated.

Would it work if the seed was only a function of the target name, and would be set unconditionally? Users shouldn't be worried about the random seed at all, things should just work.

wlandau · 2018-02-02T10:30:26Z

Things would run, yes. But I think the result should at least depend on the random seed of the R session that first created the cache. That way, set_seed(1234); make(...) and set_seed(5678); make(...) produce different results if the cache is not already created. I think R users are accustomed to setting seeds to ensure reproducibility, and some stochastic methods should probably be tested repeatedly on different seeds.

wlandau · 2018-02-02T10:54:27Z

I think I fixed this issue as of 307bbe1. The help file of the new read_drake_seed() function shows what drake now does with pseudo-random numbers.

Before, I was concerned about preserving the mathematical properties of pseudo-randomness when it came to deriving target-specific seeds from the root seed, but after talking with you both today, I am less worried.

krlmlr · 2018-02-02T14:51:39Z

Agreed to the part about running the same project with different random seeds.

How would you now embed the random seed in the workflow, so that different seeds are used with the same cache? In the following plan, I see different but stable results for the same rule:

library(drake)

unlink(".drake", recursive = TRUE)

plan <-
  drake_plan(
    data1 = 1,
    data2 = 1,
    data3 = 1,
    f_data1 = runif(data1),
    f_data2 = runif(data2),
    f_data3 = runif(data3)
  )

set.seed(123)
make(plan)
#> cache /tmp/RtmpavUNa7/.drake
#> connect 1 import: plan
#> connect 6 targets: data1, data2, data3, f_data1, f_data2, f_data3
#> check 1 item: runif
#> check 3 items: data1, data2, data3
#> target data1
#> target data2
#> target data3
#> check 3 items: f_data1, f_data2, f_data3
#> target f_data1
#> target f_data2
#> target f_data3

f_data1
#> [1] 0.414735
f_data2
#> [1] 0.9176192
f_data3
#> [1] 0.4733941

Created on 2018-02-02 by the reprex package (v0.1.1.9000).

Is there a more elegant way to write this?

Does make() warn you if the cache is empty and you haven't set the random seed?

wlandau · 2018-02-02T15:31:08Z

The user can set the seed before make() as you point out, but she does not actually need to. If no seed is set explicitly, drake makes sure your R session has one. This is exactly what withr does with with_seed() and with_preserve_seed(). Then, the session's guaranteed .Random.seed is cached. Given this root seed, drake deterministically gives different seeds to different targets in a way that avoids race conditions. That is why f_data1 through f_data3 have different but reproducible values.

Does that answer your questions? Do you think any of this behavior should be changed?

wlandau · 2018-02-02T18:12:54Z

Does make() warn you if the cache is empty and you haven't set the random seed?

Right now, there is no warning. My current opinion is that drake should not force users to pick a seed manually. The way I see it, the choice of seed does not matter as long as a project has one. After that, drake >= 5.0.1.9001 stays faithful to that seed.

The R session's .Random.seed seemed like a natural choice at the time. This seemed like yet another way that drake is R-focused and friendly. On the other hand, I am beginning to see your point: if someone else tries to replicate your results with a fresh cache, they may get a different seed if there is no set_seed() before make(), and that may cause problems. What about a deterministic hash of all the target names? That way, the default seed will be the same across different sessions, but it won't be the same hard-coded number everywhere.

wlandau · 2018-02-02T18:16:55Z

Then again, target names could change without the project's seed changing. I think we'll have to use a seed argument with a default value. You have convinced me.

wlandau · 2018-02-02T18:25:33Z

Like you said, make() should have a seed argument. I think the default should be NULL. Cases:

New cache, `make(seed = NULL)`

Drake should provide a fixed hard-coded seed that does not change based on the R session.

New cache, `make(seed = 1234)`

Drake should use the seed 1234.

Old cache, `make(seed = NULL)`

Drake should use the already-cached seed.

Old cache, `make(seed = 1234)`

Quit in error because the project already has a seed.

wlandau · 2018-02-02T19:08:32Z

Update: we should only throw an error if the non-NULL user-defined seed is not the same as the one in the cache. That piece makes it work without friction.

wlandau changed the title ~~No existing workflow for random outputs~~ Pseudo-randomness is not totally reproducible Feb 2, 2018

wlandau added the type: bug label Feb 2, 2018

wlandau-lilly added a commit that referenced this issue Feb 2, 2018

Make a test to reproduce #218 (deliberately fails)

c29bcbd

wlandau mentioned this issue Feb 2, 2018

Convert roxygen2 documentation to Markdown #204

Merged

wlandau closed this as completed Feb 2, 2018

wlandau reopened this Feb 2, 2018

wlandau removed the type: bug label Feb 2, 2018

wlandau closed this as completed in d219633 Feb 3, 2018

wlandau mentioned this issue Feb 3, 2018

Add a random seed argument to make() and drake_config(). #221

Merged

wlandau mentioned this issue Mar 11, 2018

Reproducibility with random numbers #313

Closed

billdenney mentioned this issue Oct 1, 2018

Documentation Request: Details of Random Seed Behavior? #526

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Pseudo-randomness is not totally reproducible #218

Pseudo-randomness is not totally reproducible #218

aedobbyn commented Feb 1, 2018

krlmlr commented Feb 1, 2018

wlandau commented Feb 1, 2018 •

edited

Loading

wlandau commented Feb 1, 2018 •

edited

Loading

krlmlr commented Feb 1, 2018

wlandau commented Feb 1, 2018

wlandau commented Feb 1, 2018

wlandau commented Feb 1, 2018

wlandau commented Feb 2, 2018 •

edited

Loading

wlandau commented Feb 2, 2018 •

edited

Loading

krlmlr commented Feb 2, 2018

wlandau commented Feb 2, 2018

wlandau commented Feb 2, 2018

krlmlr commented Feb 2, 2018

wlandau commented Feb 2, 2018

wlandau commented Feb 2, 2018

wlandau commented Feb 2, 2018

wlandau commented Feb 2, 2018 •

edited

Loading

wlandau commented Feb 2, 2018

Pseudo-randomness is not totally reproducible #218

Pseudo-randomness is not totally reproducible #218

Comments

aedobbyn commented Feb 1, 2018

krlmlr commented Feb 1, 2018

wlandau commented Feb 1, 2018 • edited Loading

wlandau commented Feb 1, 2018 • edited Loading

krlmlr commented Feb 1, 2018

wlandau commented Feb 1, 2018

wlandau commented Feb 1, 2018

wlandau commented Feb 1, 2018

wlandau commented Feb 2, 2018 • edited Loading

wlandau commented Feb 2, 2018 • edited Loading

krlmlr commented Feb 2, 2018

wlandau commented Feb 2, 2018

wlandau commented Feb 2, 2018

krlmlr commented Feb 2, 2018

wlandau commented Feb 2, 2018

wlandau commented Feb 2, 2018

wlandau commented Feb 2, 2018

wlandau commented Feb 2, 2018 • edited Loading

New cache, make(seed = NULL)

New cache, make(seed = 1234)

Old cache, make(seed = NULL)

Old cache, make(seed = 1234)

wlandau commented Feb 2, 2018

wlandau commented Feb 1, 2018 •

edited

Loading

wlandau commented Feb 1, 2018 •

edited

Loading

wlandau commented Feb 2, 2018 •

edited

Loading

wlandau commented Feb 2, 2018 •

edited

Loading

wlandau commented Feb 2, 2018 •

edited

Loading

New cache, `make(seed = NULL)`

New cache, `make(seed = 1234)`

Old cache, `make(seed = NULL)`

Old cache, `make(seed = 1234)`