
Break out inner loop of optimize_layout function #231

Closed · wants to merge 1 commit

Conversation

@tomwhite (Collaborator) commented May 3, 2019

...so numba can parallelize it effectively across all cores of a machine.

During benchmarking, I found that numba was not parallelizing optimize_layout, which is the most compute-intensive part of UMAP. By extracting the code that runs for each epoch into its own function and decorating it with numba's parallel=True, the computation can take advantage of all cores in a machine.
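For concreteness, a minimal sketch of the refactoring idea (not the actual diff: the kernel body below is a toy placeholder, and only the structure of extracting a single-epoch function and compiling it with parallel=True and prange reflects this PR):

```python
import numpy as np
from numba import njit, prange


@njit(parallel=True)
def _optimize_layout_single_epoch(head_embedding, alpha):
    # prange marks the per-row loop as parallelizable so numba can spread the
    # iterations across all cores; the body is a toy stand-in for the real
    # sampled attractive/repulsive updates.
    for i in prange(head_embedding.shape[0]):
        for d in range(head_embedding.shape[1]):
            head_embedding[i, d] -= alpha * 0.01 * head_embedding[i, d]


def optimize_layout(head_embedding, n_epochs, initial_alpha=1.0):
    # The epoch loop stays in plain Python; only the extracted kernel is
    # compiled with parallel=True.
    for n in range(n_epochs):
        alpha = initial_alpha * (1.0 - n / float(n_epochs))
        _optimize_layout_single_epoch(head_embedding, alpha)
    return head_embedding
```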

For example, running locally on a 4-core Mac I found one benchmark (of optimize_layout) went from ~110s to ~45s. And on a 16-core Linux box, the UMAP step of Scanpy went from ~206s to ~73s (for 130K rows).

@pep8speaks commented May 3, 2019

Hello @tomwhite! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

Line 857:80: E501 line too long (86 > 79 characters)

Comment last updated at 2019-06-19 10:55:20 UTC

@coveralls

Coverage Status

Coverage increased (+0.003%) to 94.538% when pulling 8314d5d on tomwhite:embedding_optimization into ec860e9 on lmcinnes:master.

@tomwhite (Collaborator, Author) commented May 3, 2019

cc @falexwolf

@lmcinnes (Owner) commented May 3, 2019

This looks intriguing. I had tried playing with various ways to encourage numba to parallelise (rather than having to do it by hand myself), and this seems to have achieved that. I would be interested to know how different the resulting embeddings are -- have you tried it on a few datasets? Does it give comparable/similar results? I would think the parallel version should potentially take more iterations to get the same quality result.

@tomwhite (Collaborator, Author) commented May 3, 2019

I assume that the parallel version has exactly the same output as the non-parallel one, but admittedly I haven't checked this! I should run some tests to see. (BTW, in case the diff isn't clear, I have simply extracted a function; there is no change to the logic anywhere.)

@lmcinnes (Owner) commented May 3, 2019

In principle I believe the parallel version can have race conditions in updating which can result in different output. There's also a read/locking issue that can result in more work being required to get the same quality result. I could well be wrong and the result may not be different. If that's the case all the better and we can merge this ASAP for a free performance boost.

@falexwolf

Cool! 😄

@tomwhite (Collaborator, Author) commented May 8, 2019

I ran the MNIST dataset example (using this code), using master and the code in this PR. The 2D embedding plot looked identical, and the runtime went from around 100 seconds to around 60 seconds. So it looks like this change is safe and quite beneficial.

@jlmelville (Collaborator)

@tomwhite, do you get the same coordinate values on multiple runs with the same random number seed?

@tomwhite (Collaborator, Author) commented May 8, 2019

@jlmelville, no, the coordinates change even with the same seed. In _optimize_layout_single_epoch there is a shared rng_state object, and with multiple threads this state may not be updated deterministically. In pynndescent there is a seed_per_row option which computes the seed based on the row number and uses an rng_state object per thread, but this only works because the threading there is done manually, not via numba's parallel=True. Unless there's a way of doing something similar with parallel=True, perhaps we should take the same approach here.
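For reference, a rough sketch of the per-row seeding idea (the helper names are illustrative, not the pynndescent API; it assumes tau_rand_int is importable from umap.utils as in the current codebase): deriving a fresh 3-word rng_state from the row index means every row sees the same random stream regardless of which thread processes it.

```python
import numpy as np
from umap.utils import tau_rand_int  # xorshift-style generator over a 3-word state


def make_rng_state(row, base_seed=42):
    # Deterministic per-row state: the stream depends only on (base_seed, row),
    # not on which thread happens to process the row.
    rng = np.random.RandomState(base_seed + row)
    return rng.randint(
        np.iinfo(np.int32).min, np.iinfo(np.int32).max, 3
    ).astype(np.int64)


def process_rows(n_rows):
    for row in range(n_rows):
        rng_state = make_rng_state(row)
        r = tau_rand_int(rng_state)  # placeholder for this row's sampled updates
```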

@lmcinnes (Owner) commented May 8, 2019

Sounds good. I will merge this into the 0.4dev branch if possible so I can give it a little more testing before actually getting released into the wild.

@lmcinnes changed the base branch from master to 0.4dev on May 8, 2019, 19:28
@jlmelville (Collaborator)

My sense from developing uwot was that a UMAP implementation that doesn't give reproducible results with a fixed seed would get a lot of push-back from scientific users (which I understand, being a lapsed scientist myself). Is there a way to turn this off if needed?

Personally, I only use UMAP for visualization, so I look forward to a faster optimization.

@tomwhite (Collaborator, Author) commented May 9, 2019

Totally agree about reproducible results. So this optimization is useful, but there needs to be a way of getting the same results if you set a seed. So I'm not sure this is quite right yet.

@tomwhite (Collaborator, Author)

I had a bit more of a look at Numba's parallel implementation, and while it has per-thread random state, it does not seed these states, so it's not possible to make the calculation reproducible in this case.

So one way to fix this would be to use parallel=True only if UMAP's random_state is None, otherwise it would drop back to the sequential version for reproducibility, perhaps with a warning. I've prototyped this successfully.
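A minimal sketch of that fallback, with a toy kernel body (the real code differs; the point is only the dispatch on random_state):

```python
import warnings
from numba import njit, prange


def _epoch_body(embedding, alpha):
    # prange behaves exactly like range when parallel is not enabled, so the
    # same body can be compiled both ways; the update is a toy placeholder.
    for i in prange(embedding.shape[0]):
        for d in range(embedding.shape[1]):
            embedding[i, d] -= alpha * 0.01 * embedding[i, d]


_epoch_parallel = njit(parallel=True)(_epoch_body)
_epoch_sequential = njit(_epoch_body)


def optimize_layout(embedding, n_epochs, random_state=None, initial_alpha=1.0):
    if random_state is None:
        epoch_fn = _epoch_parallel  # no seed requested: use all cores
    else:
        warnings.warn(
            "random_state is set; using the sequential kernel for reproducibility"
        )
        epoch_fn = _epoch_sequential
    for n in range(n_epochs):
        epoch_fn(embedding, initial_alpha * (1.0 - n / float(n_epochs)))
    return embedding
```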

Another related issue: it would be good to use NumPy's random library for random number generation, instead of the tau_rand_int function. This is because Numba creates per-thread state for np.random, whereas tau_rand_int relies on a shared state and it's unclear whether that is safe across threads. Are there any objections to replacing tau_rand_int with np.random?

@jlmelville (Collaborator)

In a C++ implementation of the optimization loop, I found that the random number generation is the performance bottleneck for low-dimensional embeddings. So it's worth measuring the performance effect before and after.

@lmcinnes (Owner)

The short answer is that (in my recollection) np.random is significantly slower, which is why the specialised random number routines exist at all. That is partly because np.random is higher quality in terms of pseudo-random numbers, but since we need speed more than quality (the quality is fairly moot for the way umap uses the randomness) I went with speed. I agree with James: it would be nice to see benchmarking on this. It is possible things have changed and np.random is fine now.
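One way to measure this (a rough sketch, assuming tau_rand_int is importable from umap.utils; timings will vary by machine and numba version, so no numbers are claimed here):

```python
import time
import numpy as np
from numba import njit
from umap.utils import tau_rand_int


@njit
def draw_tau(state, n):
    # Many draws from the xorshift-style generator with a shared 3-word state.
    total = 0
    for _ in range(n):
        total += tau_rand_int(state)
    return total


@njit
def draw_np(n):
    # The same number of draws via numba's np.random support.
    total = 0
    for _ in range(n):
        total += np.random.randint(0, 2147483647)
    return total


n = 10_000_000
state = np.random.RandomState(42).randint(
    np.iinfo(np.int32).min, np.iinfo(np.int32).max, 3
).astype(np.int64)

draw_tau(state, 1)  # trigger JIT compilation before timing
draw_np(1)

t0 = time.perf_counter(); draw_tau(state, n); t1 = time.perf_counter()
t2 = time.perf_counter(); draw_np(n); t3 = time.perf_counter()
print("tau_rand_int: %.2fs  np.random.randint: %.2fs" % (t1 - t0, t3 - t2))
```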

@tomwhite (Collaborator, Author)

Some notes after further investigation:

  1. UMAP's tau_rand random number generation doesn't play well with Numba's parallelization, since the latter doesn't provide a way to have per-thread state (there's an open issue though: Threading improvements in roadmap numba/numba#3656). Having a random number state shared between threads is potentially unsafe, and causes problems with reproducibility.
  2. Conversely, Numba's np.random random number generation doesn't play well with joblib, since Numba doesn't expose the per-thread state (it's only available to its own parallelization) and it doesn't support creating random state instances. So, again, having a single global state shared between threads is potentially unsafe and precludes reproducibility.
  3. Using Numba's np.random random number generation with Numba's parallelization would be safe (since internally it uses per-thread states), but these states cannot be seeded, which impacts reproducibility. Also, the threading support in pynndescent uses joblib, since it can't be expressed using Numba parallelization, and it probably makes sense to use the same random number generators in both projects (tau_rand).
  4. UMAP's tau_rand random number generation does work well with joblib, since the latter allows management of per-thread state, which means that the code is threadsafe and reproducible. (Reproducibility can be achieved by seeding each random state with a chunk index.) The downside is that it is a bit more work to add - it's not just a question of adding a Numba decorator.

I've implemented 4 in this branch: https://github.com/tomwhite/umap/tree/embedding_optimization_joblib. It achieves a comparable speedup to this PR, and the output of MNIST looks very similar.
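Roughly, the structure in that branch looks like the following (a hedged sketch with illustrative names and a placeholder per-row update, not the actual code): each joblib worker gets its own rng_state seeded from its chunk index, so a fixed seed plus a fixed chunking gives reproducible output.

```python
import numpy as np
from joblib import Parallel, delayed
from umap.utils import tau_rand_int


def _optimize_chunk(embedding, rows, chunk_index, base_seed):
    # Per-chunk state derived deterministically from (base_seed, chunk_index),
    # so no random state is shared between threads.
    rng = np.random.RandomState(base_seed + chunk_index)
    rng_state = rng.randint(
        np.iinfo(np.int32).min, np.iinfo(np.int32).max, 3
    ).astype(np.int64)
    for i in rows:
        r = tau_rand_int(rng_state)  # deterministic stream for this chunk
        embedding[i] += 0.0 * r      # placeholder for the real per-row update


def optimize_epoch(embedding, n_threads=4, base_seed=42):
    chunks = np.array_split(np.arange(embedding.shape[0]), n_threads)
    Parallel(n_jobs=n_threads, prefer="threads")(
        delayed(_optimize_chunk)(embedding, rows, idx, base_seed)
        for idx, rows in enumerate(chunks)
    )
    return embedding
```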

Having got to this point, I'm not sure this should be merged though! The reason is that the inner loop of optimize_layout does non-local updates of the embedding matrix (i.e. not just row i), so when multiple threads run in parallel their updates can interfere and rows are not updated atomically. (Note that this observation is unrelated to points 1-4 above, and would be a problem under any of those implementations.) This may not be a problem in practice - and indeed this is reminiscent of Hogwild!, although Hogwild! depends on atomic updates of single components of a weight vector, which is not what is happening here.

Perhaps some of the other approaches for optimizing SGD in #37 would be better.

@jlmelville (Collaborator)

The non-local update is how LargeVis works, so there is a precedent. I have implemented this scheme in uwot, and visually, it doesn't make much difference, so for visualizations it is a useful option to have. But it is not on by default in the name of preserving reproducibility.

@tomwhite (Collaborator, Author)

Interesting to hear that @jlmelville. When I ran the MNIST example it didn't make a difference visually either. In the branch (https://github.com/tomwhite/umap/tree/embedding_optimization_joblib) the default is to use a single thread for the sake of reproducibility, so this might be a useful change to consider after all.

@lmcinnes (Owner)

I think this is definitely worth considering. The non-local update aspect was what was concerning me in terms of the embedding quality, but if it makes little difference visually then perhaps we are mostly good (as long as users are made aware of the trade-offs).

@tomwhite (Collaborator, Author) commented Aug 9, 2019

Closing in favour of #255
