
Document reproducibility guarantees #298

Open
tomwhite opened this issue Sep 25, 2019 · 16 comments

Comments

@tomwhite
Collaborator

Following up on the discussion here, it would be good to document how to get reproducible results with UMAP.

I think we should consider changing random_state in the UMAP constructor to a seed (e.g. 42, like the new transform_seed default) so that UMAP is reproducible by default.

We should document that users can set random_state to None to get faster results at the expense of reproducibility. In this mode there is no seed that can reproduce the output, because of the multithreading. (This was introduced in #294.)
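
For illustration, the two modes would look something like this (a minimal sketch; the placeholder data X is mine, not from the issue):

import numpy as np
import umap

X = np.random.rand(100, 5)  # placeholder data for illustration

# Fixed seed: single-threaded optimization, identical output every run
emb_a = umap.UMAP(random_state=42).fit_transform(X)
emb_b = umap.UMAP(random_state=42).fit_transform(X)
assert np.allclose(emb_a, emb_b)

# random_state=None: multithreaded SGD, faster, but no seed can
# reproduce the exact output
emb_fast = umap.UMAP(random_state=None).fit_transform(X)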

@lmcinnes
Owner

Good plan. I agree that setting a default seed is sensible under the circumstances. Did you have a suggestion for where in the documentation this should go? Presumably in the basic tutorial.

@sleighsoft
Collaborator

We should document that users can set random_state to None to get faster results at the expense of reproducibility.

So this becomes a question of whether you want the default to be speed or reproducibility.

I myself would prefer speed over reproducibility, with just a warning or hint printed.

@lmcinnes
Owner

I can see the merits of either approach. Maybe I can crowdsource this.

@tomwhite
Collaborator Author

tomwhite commented Oct 2, 2019

Thanks for starting the vote and discussion @lmcinnes: https://twitter.com/leland_mcinnes/status/1177367770679435267. The results were to make the default reproducible/slower (60%), rather than faster/non-reproducible (40%). I'll submit a PR.

@sleighsoft
Collaborator

@tomwhite I have been out of the loop for a couple of days. Can you refresh my memory as to why using an RNG does not lead to reproducible visuals?

I might have an idea, but I am not sure I remember the problem correctly.

@tomwhite
Collaborator Author

tomwhite commented Oct 2, 2019

@sleighsoft do you mean why using a RNG with a seed and Numba parallel does not lead to reproducible output? It's because while Numba parallel has per-thread random state, it's not possible to seed these states.
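
A minimal sketch of the problem (my own toy example, not UMAP code):

import numba
import numpy as np

@numba.njit(parallel=True)
def draw():
    a = np.empty(8, dtype=np.float64)
    # each worker thread draws from its own internal Numba random state
    for i in numba.prange(8):
        a[i] = np.random.uniform(0.0, 1.0)
    return a

# Seeding here seeds NumPy's generator, not Numba's per-thread states,
# so repeated calls generally produce different arrays.
np.random.seed(42)
print(draw())
print(draw())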

@lmcinnes
Owner

lmcinnes commented Oct 3, 2019

While the poll came out for reproducible, I have to admit that some of the discussion led me to lean the opposite way. I fear that fixing a random state will lead people to believe that the algorithm is deterministic. I know the docs say otherwise, but the reality is that people don't read the docs; they just run stuff and see what happens. In particular I like this quote: 'For me setting a random seed is like signing a waiver "I am aware that this is a stochastic algorithm and I have done sufficient tests to confirm that my main conclusions are not affected by this randomness". Thus having it reproducible should not be the default.' (https://twitter.com/ZanotelliVRT/status/1177470041475837952). I think users should be forced to set a seed for reproducibility.

@dillondaudert

I agree with having random_state=None be the default. This matches how sklearn treats other stochastic algorithms, like SGDClassifier.
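
For reference, that sklearn convention looks like this (short sketch using the real SGDClassifier API):

from sklearn.linear_model import SGDClassifier

# sklearn's stochastic estimators default to random_state=None;
# reproducibility is opt-in
clf_fast = SGDClassifier()                   # non-reproducible across runs
clf_repro = SGDClassifier(random_state=0)    # reproducible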

@tomwhite
Collaborator Author

tomwhite commented Oct 9, 2019

I can see arguments for both sides. I am, however, uncomfortable with asking users what they prefer via a poll and then ignoring the result.

@sleighsoft
Collaborator

It is also unclear whether the phrasing of the Twitter question was misleading in the first place. I think following a proven approach (sklearn's) is a good thing.

@sleighsoft added the 0.4 label Oct 9, 2019
@lmcinnes
Owner

lmcinnes commented Oct 9, 2019

@tomwhite I agree with that statement, but I do believe there was some confusion (I think, in the end, I phrased the question badly). I am planning on putting together a notebook to go in the tutorial documentation that documents this clearly, and gives the justification for the choice made.

@sleighsoft
Collaborator

sleighsoft commented Oct 10, 2019

@tomwhite
You can seed the parallel execution like this:

import numba
import numpy as np

@numba.njit(parallel=True)
def test():
    a = np.empty(20, dtype=np.float64)
    for i in numba.prange(20):
        # seed inside the prange body so every thread's state is reset
        np.random.seed(42)
        a[i] = np.random.uniform(0.0, 1.0)
    return a

test()

This returns

array([0.37454012, 0.37454012, 0.37454012, 0.37454012, 0.37454012,
       0.37454012, 0.37454012, 0.37454012, 0.37454012, 0.37454012,
       0.37454012, 0.37454012, 0.37454012, 0.37454012, 0.37454012,
       0.37454012, 0.37454012, 0.37454012, 0.37454012, 0.37454012])

Just make sure np.random.seed is called within the prange loop.

If you need more help just let me know.

@sleighsoft
Collaborator

Or do:

import numba
import numpy as np

@numba.njit(parallel=True)
def test(seeds):
    a = np.empty(20, dtype=np.float64)
    for i in numba.prange(20):
        # give each iteration its own pre-generated seed
        np.random.seed(seeds[i])
        a[i] = np.random.uniform(0.0, 1.0)
    return a

# generate a fixed set of per-iteration seeds up front
np.random.seed(42)
seeds = np.random.randint(np.iinfo(np.int32).max, size=[20])
test(seeds)

This yields values that differ across iterations but stay constant across runs.

@sleighsoft
Collaborator

sleighsoft commented Oct 11, 2019

I played with this a little, and I think for this to work properly this line

j = head[i]

has to be reworked. Because head contains the same index multiple times, numba basically operates on the same values in parallel, and the state of the head_embedding and tail_embedding arrays cannot be controlled.

I will look into this a little more if I find the time.

For me it looks like having parallel return the same result multiple times is possible. I am just lacking understanding of that code part. Maybe @lmcinnes can explain what is going on there.

Edit:
To me it even looks like a bug / unknown behaviour when parallel is set to True.

@lmcinnes
Owner

I haven't looked at it closely again to be sure, but my understanding is that parallel=True is essentially going to be non-deterministic due to race conditions on updating the embedding. This is, I believe, the unknown behaviour you are thinking of. In practice everything is sparse, and for large datasets the odds of race conditions causing actual issues are very low. This is essentially the benefit of SGD over standard GD.
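
A toy sketch of that kind of race (a hypothetical kernel loosely modeled on the j = head[i] pattern above, not UMAP's actual code):

import numba
import numpy as np

@numba.njit(parallel=True)
def racy_step(embedding, head, tail, lr):
    # head may contain the same index several times, so parallel
    # iterations can read and write the same row concurrently; the
    # result depends on thread scheduling (a Hogwild-style benign race)
    for i in numba.prange(head.shape[0]):
        j = head[i]
        k = tail[i]
        grad = embedding[k] - embedding[j]
        embedding[j] += lr * grad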

@MikeB2019x

I'm new to UMAP, so this may be a naive question (if so, point me to the docs), but: a user can specify a random seed for reproducibility. If the user doesn't specify a seed, I assume the algorithm picks one. Can we obtain the seed used by the algorithm in that case? That way, if a run without a specified seed produces a result that pleases, you could reproduce it by obtaining the seed the algorithm chose. Yes, I could just save the original result, but consider having the seed as insurance.
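
One workaround sketch (my own suggestion; as far as this thread indicates, UMAP does not expose an internally generated seed): draw and record a seed yourself before fitting.

import numpy as np
import umap

X = np.random.rand(100, 5)  # placeholder data for illustration

# draw a seed, log it, then pass it in -- the same insurance as
# retrieving one the algorithm chose
seed = np.random.randint(2**31 - 1)
print("UMAP seed:", seed)
embedding = umap.UMAP(random_state=seed).fit_transform(X)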
