Document reproducibility guarantees #298
Comments
Good plan. I agree that setting a default seed is sensible under the circumstances. Did you have a suggestion for where in the documentation this should go? Presumably in the basic tutorial. |
So the question becomes whether the default should favor speed or reproducibility. I, myself, would prefer speed over reproducibility and just print a warning or hint. |
I can see the merits of either approach. Maybe I can crowdsource this. |
Thanks for starting the vote and discussion @lmcinnes: https://twitter.com/leland_mcinnes/status/1177367770679435267. The result was to make the default reproducible/slower (60%), rather than faster/non-reproducible (40%). I'll submit a PR. |
@tomwhite I have been out of the loop for a couple of days. Can you refresh my memory as to why using rng does not lead to reproducible visuals? I might have an idea, but I am not sure I remember the problem correctly. |
@sleighsoft do you mean why using a RNG with a seed and Numba parallel does not lead to reproducible output? It's because while Numba parallel has per-thread random state, it's not possible to seed these states. |
While the poll came out for reproducible, I have to admit that some of the discussion led me to lean the opposite way. I fear that fixing a random state will lead people to believe that the algorithm is deterministic. I know the docs say otherwise, but the reality is that people don't read the docs, they just run stuff and see what happens. In particular I like this quote: 'For me setting a random seed is like signing a waiver "I am aware that this is a stochastic algorithm and I have done sufficient tests to confirm that my main conclusions are not affected by this randomness".' |
I agree with having |
I can see arguments for both sides. I am however uncomfortable with asking users what they prefer via a poll then ignoring the result. |
It is also unclear whether the phrasing of the Twitter question was misleading in the first place. I think following a proven approach (sklearn) is a good thing. |
@tomwhite I agree with that statement, but I do believe there was some confusion (I think, in the end, I phrased the question badly). I am planning on putting together a notebook to go in the tutorial documentation that documents this clearly, and gives the justification for the choice made. |
@tomwhite
import numba
import numpy as np

@numba.njit(parallel=True)
def test():
    a = np.empty(20, dtype=np.float64)
    for i in numba.prange(20):
        # Seed the thread-local generator immediately before each draw
        np.random.seed(42)
        a[i] = np.random.uniform(0.0, 1.0)
    return a

test()
This will print
Just make sure If you need more help just let me know. |
Or do
import numba
import numpy as np

@numba.njit(parallel=True)
def test(seeds):
    a = np.empty(20, dtype=np.float64)
    for i in numba.prange(20):
        # Each iteration uses its own seed, fixed in advance
        np.random.seed(seeds[i])
        a[i] = np.random.uniform(0.0, 1.0)
    return a

# Derive the per-iteration seeds from a single master seed
np.random.seed(42)
seeds = np.random.randint(np.iinfo(np.int32).max, size=[20])
test(seeds)
For different but constant values |
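The per-iteration seeding idea above does not depend on Numba. As a minimal sketch, assuming only the Python standard library (the names draw and run are hypothetical, and a thread pool stands in for prange), giving each index its own seeded generator makes the drawn values independent of how the work is scheduled across threads:

```python
import random
from concurrent.futures import ThreadPoolExecutor


def draw(seed):
    # Each task owns a private generator, so no state is shared between threads.
    return random.Random(seed).uniform(0.0, 1.0)


def run(master_seed, n=20, workers=4):
    # Derive one seed per index from a single master seed (hypothetical scheme).
    master = random.Random(master_seed)
    seeds = [master.randrange(2**31 - 1) for _ in range(n)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns results in input order, whatever order the threads ran in.
        return list(pool.map(draw, seeds))


first = run(42)
second = run(42, workers=8)  # different thread count, same master seed
print(first == second)
```

This only covers the random draws. It does not by itself make concurrent writes to shared arrays deterministic.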
I played with this a little and I think for this to work properly this Line 70 in d214e5d
head containing the same index multiple times numba basically operates on the same values in parallel and the state of the head_embedding and tail_embedding arrays cannot be controlled.
I will look into this a little more if I find the time. For me it looks like having parallel return the same result multiple times is possible. I am just lacking understanding of that code part. Maybe @lmcinnes can explain what is going on there. Edit: |
I haven't looked at it closely again to be sure, but my understanding is that |
I'm new to UMAP, so this may be a naive question, and if so please point me to the docs. A user can specify a random seed for reproducibility. If the user doesn't specify a seed, I assume the algorithm chooses one. Can we obtain the seed used by the algorithm in that case? That way, if you get a result that pleases you without specifying the seed, you could reproduce it by obtaining the seed the algorithm chose. Yes, I could just save the original result, but consider having the seed as insurance. |
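One way to get that insurance without any change to the library is to choose the seed yourself and record it. The following is a minimal sketch using the standard library only: fresh_seed and sample are hypothetical helpers, and the UMAP call appears only in a comment, since the random_state parameter is the only part of the API this thread confirms. Note that, per the issue text, with random_state set to None there is no seed that would reproduce the output.

```python
import random
import secrets


def fresh_seed():
    # Draw a 32-bit seed from the OS entropy source (hypothetical helper).
    return secrets.randbits(32)


def sample(seed, n=3):
    # Stand-in for any seeded stochastic routine.
    rng = random.Random(seed)
    return [rng.random() for _ in range(n)]


seed = fresh_seed()
print(f"using seed {seed}")  # record this alongside the result

# The recorded seed can then be passed explicitly, for example:
#   umap.UMAP(random_state=seed)
run_a = sample(seed)
run_b = sample(seed)  # same seed, same values
```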
Following up on the discussion here, it would be good to document how to get reproducible results with UMAP.
I think we should consider changing random_state in the UMAP constructor to a seed (e.g. 42, like the new transform_seed default) so that UMAP is reproducible by default.
We should document that users can set random_state to None to get faster results at the expense of reproducibility. In this mode there is no seed that would produce the same output due to the multithreading. (This was introduced in #294.)