
Break out inner loop of optimize_layout function #231

Closed · wants to merge 1 commit

Conversation

@tomwhite (Collaborator) commented May 3, 2019

...so numba can parallelize it effectively across all cores of a machine.

During benchmarking, I found that numba was not parallelizing optimize_layout, which is the most compute-intensive part of UMAP. By extracting the code that runs for each epoch into its own function and decorating it with numba's parallel=True, the computation can take advantage of all cores in a machine.
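For concreteness, a minimal sketch of the refactoring idea (not the actual diff: the kernel body below is a toy placeholder, and only the structure of extracting a single-epoch function and compiling it with parallel=True and prange reflects this PR):

```python
import numpy as np
from numba import njit, prange


@njit(parallel=True)
def _optimize_layout_single_epoch(head_embedding, alpha):
    # prange marks the per-row loop as parallelizable so numba can spread the
    # iterations across all cores; the body is a toy stand-in for the real
    # sampled attractive/repulsive updates.
    for i in prange(head_embedding.shape[0]):
        for d in range(head_embedding.shape[1]):
            head_embedding[i, d] -= alpha * 0.01 * head_embedding[i, d]


def optimize_layout(head_embedding, n_epochs, initial_alpha=1.0):
    # The epoch loop stays in plain Python; only the extracted kernel is
    # compiled with parallel=True.
    for n in range(n_epochs):
        alpha = initial_alpha * (1.0 - n / float(n_epochs))
        _optimize_layout_single_epoch(head_embedding, alpha)
    return head_embedding
```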

For example, running locally on a 4-core Mac I found one benchmark (of optimize_layout) went from ~110s to ~45s. And on a 16-core Linux box, the UMAP step of Scanpy went from ~206s to ~73s (for 130K rows).

@pep8speaks commented May 3, 2019

Hello @tomwhite! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

Line 857:80: E501 line too long (86 > 79 characters)

Comment last updated at 2019-06-19 10:55:20 UTC

@coveralls

Coverage Status

Coverage increased (+0.003%) to 94.538% when pulling 8314d5d on tomwhite:embedding_optimization into ec860e9 on lmcinnes:master.

@tomwhite (Collaborator, Author) commented May 3, 2019

cc @falexwolf

@lmcinnes (Owner) commented May 3, 2019

This looks intriguing. I had tried playing with various ways to encourage numba to parallelise (rather than having to do it by hand myself), and this seems to have achieved that. I would be interested to know how different the resulting embeddings are -- have you tried it on a few datasets? Does it give comparable/similar results? I would think the parallel version should potentially take more iterations to get the same quality result.

@tomwhite (Collaborator, Author) commented May 3, 2019

I assume that the parallel version has exactly the same output as the non-parallel one, but admittedly I haven't checked this! I should run some tests to see. (BTW, in case the diff isn't clear, I have simply extracted a function; there is no change to the logic anywhere.)

@lmcinnes (Owner) commented May 3, 2019

In principle I believe the parallel version can have race conditions in updating which can result in different output. There's also a read/locking issue that can result in more work being required to get the same quality result. I could well be wrong and the result may not be different. If that's the case all the better and we can merge this ASAP for a free performance boost.

@falexwolf

Cool! 😄

@tomwhite (Collaborator, Author) commented May 8, 2019

I ran the MNIST dataset example (using this code), using master and the code in this PR. The 2D embedding plot looked identical, and the runtime went from around 100 seconds to around 60 seconds. So it looks like this change is safe and quite beneficial.

@jlmelville (Collaborator)

@tomwhite, do you get the same coordinate values on multiple runs with the same random number seed?

@tomwhite (Collaborator, Author) commented May 8, 2019

@jlmelville, no, the coordinates change even with the same seed. In _optimize_layout_single_epoch there is a shared rng_state object, and with multiple threads this state may not be updated deterministically. In pynndescent there is a seed_per_row option which computes the seed based on the row number and uses an rng_state object per thread, but this only works because the threading there is done manually, not via numba's parallel=True. Unless there's a way of doing something similar with parallel=True, perhaps we should take the same approach here.
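For reference, a rough sketch of the per-row seeding idea (the helper names are illustrative, not the pynndescent API; it assumes tau_rand_int is importable from umap.utils as in the current codebase): deriving a fresh 3-word rng_state from the row index means every row sees the same random stream regardless of which thread processes it.

```python
import numpy as np
from umap.utils import tau_rand_int  # xorshift-style generator over a 3-word state


def make_rng_state(row, base_seed=42):
    # Deterministic per-row state: the stream depends only on (base_seed, row),
    # not on which thread happens to process the row.
    rng = np.random.RandomState(base_seed + row)
    return rng.randint(
        np.iinfo(np.int32).min, np.iinfo(np.int32).max, 3
    ).astype(np.int64)


def process_rows(n_rows):
    for row in range(n_rows):
        rng_state = make_rng_state(row)
        r = tau_rand_int(rng_state)  # placeholder for this row's sampled updates
```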

@lmcinnes (Owner) commented May 8, 2019

Sounds good. I will merge this into the 0.4dev branch if possible so I can give it a little more testing before actually getting released into the wild.

@lmcinnes changed the base branch from master to 0.4dev on May 8, 2019, 19:28
@jlmelville (Collaborator)

My sense from developing uwot was that a UMAP implementation that doesn't give reproducible results with a fixed seed would get a lot of push-back from scientific users (which I understand, being a lapsed scientist myself). Is there a way to turn this off if needed?

Personally, I only use UMAP for visualization, so I look forward to a faster optimization.

@tomwhite (Collaborator, Author) commented May 9, 2019

Totally agree about reproducible results. So this optimization is useful, but there needs to be a way of getting the same results if you set a seed. So I'm not sure this is quite right yet.

@tomwhite (Collaborator, Author)

I had a bit more of a look at Numba's parallel implementation, and while it has per-thread random state, it does not seed these states, so it's not possible to make the calculation reproducible in this case.

So one way to fix this would be to use parallel=True only if UMAP's random_state is None, otherwise it would drop back to the sequential version for reproducibility, perhaps with a warning. I've prototyped this successfully.
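A minimal sketch of that fallback, with a toy kernel body (the real code differs; the point is only the dispatch on random_state):

```python
import warnings
from numba import njit, prange


def _epoch_body(embedding, alpha):
    # prange behaves exactly like range when parallel is not enabled, so the
    # same body can be compiled both ways; the update is a toy placeholder.
    for i in prange(embedding.shape[0]):
        for d in range(embedding.shape[1]):
            embedding[i, d] -= alpha * 0.01 * embedding[i, d]


_epoch_parallel = njit(parallel=True)(_epoch_body)
_epoch_sequential = njit(_epoch_body)


def optimize_layout(embedding, n_epochs, random_state=None, initial_alpha=1.0):
    if random_state is None:
        epoch_fn = _epoch_parallel  # no seed requested: use all cores
    else:
        warnings.warn(
            "random_state is set; using the sequential kernel for reproducibility"
        )
        epoch_fn = _epoch_sequential
    for n in range(n_epochs):
        epoch_fn(embedding, initial_alpha * (1.0 - n / float(n_epochs)))
    return embedding
```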

Another related issue: it would be good to use NumPy's random library for random number generation, instead of the tau_rand_int function. This is because Numba creates per-thread state for np.random, whereas tau_rand_int relies on a shared state and it's unclear whether that is safe across threads. Are there any objections to replacing tau_rand_int with np.random?

@jlmelville (Collaborator)

In a C++ implementation of the optimization loop, I found that the random number generation is the performance bottleneck for low-dimensional embeddings. So it's worth measuring the performance effect before and after.

@lmcinnes (Owner)

The short answer is that (in my recollection) np.random is significantly slower, which is why the specialised random number routines exist at all. That is partly because np.random is higher quality in terms of pseudo-random numbers, but since we need speed more than quality (the quality is fairly moot for the way umap uses the randomness) I went with speed. I agree with James: it would be nice to see benchmarking on this. It is possible things have changed and np.random is fine now.
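One way to measure this (a rough sketch, assuming tau_rand_int is importable from umap.utils; timings will vary by machine and numba version, so no numbers are claimed here):

```python
import time
import numpy as np
from numba import njit
from umap.utils import tau_rand_int


@njit
def draw_tau(state, n):
    # Many draws from the xorshift-style generator with a shared 3-word state.
    total = 0
    for _ in range(n):
        total += tau_rand_int(state)
    return total


@njit
def draw_np(n):
    # The same number of draws via numba's np.random support.
    total = 0
    for _ in range(n):
        total += np.random.randint(0, 2147483647)
    return total


n = 10_000_000
state = np.random.RandomState(42).randint(
    np.iinfo(np.int32).min, np.iinfo(np.int32).max, 3
).astype(np.int64)

draw_tau(state, 1)  # trigger JIT compilation before timing
draw_np(1)

t0 = time.perf_counter(); draw_tau(state, n); t1 = time.perf_counter()
t2 = time.perf_counter(); draw_np(n); t3 = time.perf_counter()
print("tau_rand_int: %.2fs  np.random.randint: %.2fs" % (t1 - t0, t3 - t2))
```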

@tomwhite (Collaborator, Author)

Some notes after further investigation:

  1. UMAP's tau_rand random number generation doesn't play well with Numba's parallelization, since the latter doesn't provide a way to have per-thread state (there's an open issue though: Threading improvements in roadmap numba/numba#3656). Having a random number state shared between threads is potentially unsafe, and causes problems with reproducibility.
  2. Conversely, Numba's np.random random number generation doesn't play well with joblib, since Numba doesn't expose the per-thread state (it's only available to its own parallelization) and it doesn't support creating random state instances. So, again, having a single global state shared between threads is potentially unsafe and precludes reproducibility.
  3. Using Numba's np.random random number generation with Numba's parallelization would be safe (since internally it uses per-thread states), but these states cannot be seeded, which impacts reproducibility. Also, the threading support in pynndescent uses joblib, since it can't be expressed using Numba parallelization, and it probably makes sense to use the same random number generators in both projects (tau_rand).
  4. UMAP's tau_rand random number generation does work well with joblib, since the latter allows management of per-thread state, which means that the code is threadsafe and reproducible. (Reproducibility can be achieved by seeding each random state with a chunk index.) The downside is that it is a bit more work to add - it's not just a question of adding a Numba decorator.

I've implemented 4 in this branch: https://github.com/tomwhite/umap/tree/embedding_optimization_joblib. It achieves a comparable speedup to this PR, and the output of MNIST looks very similar.
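Roughly, the structure in that branch looks like the following (a hedged sketch with illustrative names and a placeholder per-row update, not the actual code): each joblib worker gets its own rng_state seeded from its chunk index, so a fixed seed plus a fixed chunking gives reproducible output.

```python
import numpy as np
from joblib import Parallel, delayed
from umap.utils import tau_rand_int


def _optimize_chunk(embedding, rows, chunk_index, base_seed):
    # Per-chunk state derived deterministically from (base_seed, chunk_index),
    # so no random state is shared between threads.
    rng = np.random.RandomState(base_seed + chunk_index)
    rng_state = rng.randint(
        np.iinfo(np.int32).min, np.iinfo(np.int32).max, 3
    ).astype(np.int64)
    for i in rows:
        r = tau_rand_int(rng_state)  # deterministic stream for this chunk
        embedding[i] += 0.0 * r      # placeholder for the real per-row update


def optimize_epoch(embedding, n_threads=4, base_seed=42):
    chunks = np.array_split(np.arange(embedding.shape[0]), n_threads)
    Parallel(n_jobs=n_threads, prefer="threads")(
        delayed(_optimize_chunk)(embedding, rows, idx, base_seed)
        for idx, rows in enumerate(chunks)
    )
    return embedding
```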

Having got to this point, I'm not sure this should be merged though! The reason is that the inner loop of optimize_layout does non-local updates of the embedding matrix (i.e. not just row i), so when multiple threads run in parallel their updates can interfere and rows are not updated atomically. (Note that this observation is unrelated to points 1-4 above, and would be a problem under any of those implementations.) This may not be a problem in practice - and indeed this is reminiscent of Hogwild!, although Hogwild! depends on atomic updates of single components of a weight vector, which is not what is happening here.

Perhaps some of the other approaches for optimizing SGD in #37 would be better.

@jlmelville (Collaborator)

The non-local update is how LargeVis works, so there is a precedent. I have implemented this scheme in uwot, and visually, it doesn't make much difference, so for visualizations it is a useful option to have. But it is not on by default in the name of preserving reproducibility.

@tomwhite (Collaborator, Author)

Interesting to hear that @jlmelville. When I ran the MNIST example it didn't make a difference visually either. In the branch (https://github.com/tomwhite/umap/tree/embedding_optimization_joblib) the default is to use a single thread for the sake of reproducibility, so this might be a useful change to consider after all.

@lmcinnes (Owner)

I think this is definitely worth considering. The non-local update aspect was what was concerning me in terms of the embedding quality, but if it makes little difference visually then perhaps we are mostly good (as long as users are made aware of the trade-offs).

@tomwhite (Collaborator, Author) commented Aug 9, 2019

Closing in favour of #255
