Document reproducibility guarantees #298

tomwhite · 2019-09-25T09:33:42Z

Following up on the discussion here, it would be good to document how to get reproducible results with UMAP.

I think we should consider changing random_state in the UMAP constructor to a seed (e.g. 42, like the new transform_seed default) so that UMAP is reproducible by default.

We should document that users can set random_state to None to get faster results at the expense of reproducibility. In this mode there is no seed that would produce the same output due to the multithreading. (This was introduced in #294.)


lmcinnes · 2019-09-25T14:17:43Z

Good plan. I agree that setting a default seed is sensible under the circumstances. Did you have a suggestion for where in the documentation this should go? Presumably in the basic tutorial.

sleighsoft · 2019-09-25T16:15:31Z

We should document that users can set random_state to None to get faster results at the expense of reproducibility.

So this becomes a question of whether you want the default to be speed or reproducibility.

I, myself, would prefer speed over reproducibility and just print a warning or hint.

lmcinnes · 2019-09-26T23:31:39Z

I can see the merits of either approach. Maybe I can crowdsource this.

tomwhite · 2019-10-02T10:03:03Z

Thanks for starting the vote and discussion @lmcinnes: https://twitter.com/leland_mcinnes/status/1177367770679435267. The result favoured making the default reproducible/slower (60%) over faster/non-reproducible (40%). I'll submit a PR.

sleighsoft · 2019-10-02T10:58:34Z

@tomwhite I have been out of the loop for a couple of days. Can you refresh my memory as to why using rng does not lead to reproducible visuals?

I might have an idea, but I am not sure I remember the problem correctly.

tomwhite · 2019-10-02T11:07:07Z

@sleighsoft do you mean why using an RNG with a seed and Numba parallel does not lead to reproducible output? It's because while Numba parallel has per-thread random state, it's not possible to seed these states.

lmcinnes · 2019-10-03T15:26:08Z

While the poll came out for reproducible, I have to admit that some of the discussion led me to lean the opposite way. I fear that fixing a random state will lead people to believe that the algorithm is deterministic. I know the docs say otherwise, but the reality is that people don't read the docs, they just run stuff and see what happens. In particular I like this quote: 'For me setting a random seed is like signing a waiver "I am aware that this is a stochastic algorithm and I have done sufficient tests to confirm that my main conclusions are not affected by this randomness". Thus having it reproducible should not be the default.' (https://twitter.com/ZanotelliVRT/status/1177470041475837952). I think users should be forced to set a seed for reproducibility.

dillondaudert · 2019-10-03T19:07:50Z

I agree with having random_state=None be the default. This matches how sklearn treats other stochastic algorithms, like SGDClassifier.

tomwhite · 2019-10-09T09:00:47Z

I can see arguments for both sides. I am however uncomfortable with asking users what they prefer via a poll then ignoring the result.

sleighsoft · 2019-10-09T10:49:34Z

It is also unclear whether the phrasing of the Twitter question was misleading in the first place. I think following a proven approach (sklearn) is a good thing.

lmcinnes · 2019-10-09T13:33:06Z

@tomwhite I agree with that statement, but I do believe there was some confusion (I think, in the end, I phrased the question badly). I am planning on putting together a notebook to go in the tutorial documentation that documents this clearly, and gives the justification for the choice made.

sleighsoft · 2019-10-10T19:07:41Z

@tomwhite
You can seed the parallel execution like this

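import numba
import numpy as np

@numba.njit(parallel=True)
def test():
    a = np.empty(20, dtype=np.float64)
    for i in numba.prange(20):
        # Re-seed inside the loop so every iteration starts from the same
        # state, whichever thread runs it.
        np.random.seed(42)
        a[i] = np.random.uniform(0.0, 1.0)
    return a

test()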

This will print

array([0.37454012, 0.37454012, 0.37454012, 0.37454012, 0.37454012,
       0.37454012, 0.37454012, 0.37454012, 0.37454012, 0.37454012,
       0.37454012, 0.37454012, 0.37454012, 0.37454012, 0.37454012,
       0.37454012, 0.37454012, 0.37454012, 0.37454012, 0.37454012])

Just make sure np.random.seed is within the prange.

If you need more help just let me know.

sleighsoft · 2019-10-10T19:24:10Z

Or do

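import numba
import numpy as np

@numba.njit(parallel=True)
def test(seeds):
    a = np.empty(20, dtype=np.float64)
    for i in numba.prange(20):
        # Each iteration is seeded with its own value.
        np.random.seed(seeds[i])
        a[i] = np.random.uniform(0.0, 1.0)
    return a

# Derive the per-iteration seeds from a single master seed.
np.random.seed(42)
seeds = np.random.randint(np.iinfo(np.int32).max, size=[20])
test(seeds)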

This gives values that differ between iterations but stay constant from run to run.
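
A possible variant of the snippet above, sketched with NumPy only (no Numba): `np.random.SeedSequence` can derive the per-iteration seeds from one master seed. The helper name `per_iteration_draws` and the plain Python loop standing in for `numba.prange` are assumptions for illustration. The point is that each value depends only on its own seed, not on the order in which the iterations happen to run.

```python
import numpy as np


def per_iteration_draws(master_seed, n, order=None):
    """Draw one uniform value per iteration, each from its own seeded generator.

    Hypothetical helper: a plain loop stands in for numba.prange here.
    """
    # One 32-bit seed per iteration, all derived from the master seed.
    seeds = np.random.SeedSequence(master_seed).generate_state(n)
    out = np.empty(n, dtype=np.float64)
    # Visit the iterations in an arbitrary order to mimic thread scheduling.
    indices = range(n) if order is None else order
    for i in indices:
        rng = np.random.RandomState(seeds[i])
        out[i] = rng.uniform(0.0, 1.0)
    return out


forward = per_iteration_draws(42, 20)
shuffled = per_iteration_draws(42, 20, order=np.random.permutation(20))
print(np.array_equal(forward, shuffled))
```

The two arrays should compare equal, because the visiting order changes only when each slot is filled, not what it is filled with.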

sleighsoft · 2019-10-11T21:45:54Z

I played with this a little and I think for this to work properly this

umap/umap/layouts.py

Line 70 in d214e5d

j = head[i]

has to be reworked. Because head contains the same index multiple times, Numba basically operates on the same values in parallel, and the state of the head_embedding and tail_embedding arrays cannot be controlled.

I will look into this a little more if I find the time.

To me it looks like having parallel return the same result multiple times is possible. I am just lacking understanding of that part of the code. Maybe @lmcinnes can explain what is going on there.

Edit:
To me it even looks like a bug / unknown behaviour when "parallel" is set to True.

lmcinnes · 2019-10-12T02:46:27Z

I haven't looked at it closely again to be sure, but my understanding is that parallel=True is essentially going to be non-deterministic due to race conditions on updating the embedding. This is, I believe, the unknown behaviour you are thinking of. In practice everything is sparse and for large datasets the odds of race conditions causing actual issues are very low. This is essentially the benefit of the SGD rather than a standard GD.
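
To make the order dependence concrete, here is a toy sketch (not UMAP's actual layout code): each update pulls one point toward another using the embedding's current state, so when the same index appears in several updates, the final result depends on the order in which they are applied. The function `apply_updates`, the step size, and the toy `head`/`tail` arrays are all assumptions for illustration. A real race under `parallel=True` also involves interleaved reads and writes, which this sequential sketch does not model.

```python
import numpy as np


def apply_updates(embedding, head, tail, order, step=0.5):
    """Sequentially move embedding[head[i]] toward embedding[tail[i]].

    Hypothetical stand-in for an SGD-style layout step; `order` plays the
    role of the (uncontrolled) thread schedule.
    """
    emb = embedding.copy()
    for i in order:
        j, k = head[i], tail[i]
        # The update reads the current state, so earlier updates affect later ones.
        emb[j] += step * (emb[k] - emb[j])
    return emb


embedding = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
head = np.array([0, 0, 1])  # index 0 appears twice, as it can in an edge list
tail = np.array([1, 2, 0])

first = apply_updates(embedding, head, tail, order=[0, 1, 2])
second = apply_updates(embedding, head, tail, order=[2, 1, 0])
print(np.allclose(first, second))  # False: the two orders disagree
```

When no index is shared between updates, the order stops mattering, which fits the observation that sparse, large datasets rarely show a visible effect.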

MikeB2019x · 2023-03-10T14:14:13Z

I'm new to UMAP so this may be a naive question and if so point me to the docs BUT a user can specify a random seed for reproducibility. If the user doesn't specify a seed I assume the algorithm does. Can we obtain the seed used by the algorithm in that case? That way if you have a result without specifying the seed that pleases, you could reproduce it by obtaining the seed the algorithm chose. Yes I could just save the original result but consider having the seed as insurance.
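
One possible way to get that insurance, sketched under assumptions rather than taken from UMAP's API: draw a seed yourself, record it, and pass it in as random_state. The helper `fresh_seed` is hypothetical, and a NumPy `RandomState` stands in for the estimator so the snippet runs without UMAP installed. As discussed earlier in this thread, setting a seed gives up the faster multithreaded mode.

```python
import numpy as np


def fresh_seed():
    """Return a new 32-bit seed drawn from OS entropy (hypothetical helper)."""
    return int(np.random.SeedSequence().generate_state(1)[0])


seed = fresh_seed()
print("seed used:", seed)  # log this next to the result

# Stand-in for a stochastic fit; with UMAP installed, the same seed would be
# passed to the constructor's random_state parameter instead.
first = np.random.RandomState(seed).normal(size=5)
again = np.random.RandomState(seed).normal(size=5)
print(np.array_equal(first, again))
```

Keeping the logged seed alongside the saved result means the run can be repeated later with the same seed.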

lmcinnes added documentation good first issue labels Sep 25, 2019

sleighsoft mentioned this issue Oct 2, 2019

Document reproducibility (#298) #304

Open

sleighsoft added the 0.4 label Oct 9, 2019

Fil mentioned this issue Jul 20, 2020

UMAP produces different results with the same parameters for a loop #466

Open
