Break out inner loop of optimize_layout function #231
Conversation
Hello @tomwhite! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:
Comment last updated at 2019-06-19 10:55:20 UTC |
cc @falexwolf |
This looks intriguing. I had tried playing with various ways to encourage numba to parallelise (rather than have to do it by hand myself). This seems to have achieved that. I would be interested to know how different the resulting embeddings are -- have you tried it on a few datasets? Does it give comparable/similar results? I would think the parallel version should potentially take more iterations to get the same quality result. |
I assume that the parallel version has exactly the same output as the non-parallel one, but admittedly I haven't checked this! I should run some tests to see. (BTW in case the diff isn't clear, I have simply extracted a function, there is no change to the logic anywhere.) |
In principle I believe the parallel version can have race conditions in updating which can result in different output. There's also a read/locking issue that can result in more work being required to get the same quality result. I could well be wrong and the result may not be different. If that's the case all the better and we can merge this ASAP for a free performance boost. |
Cool! 😄 |
I ran the MNIST dataset example (using this code), using master and the code in this PR. The 2D embedding plot looked identical, and the runtime went from around 100 seconds to around 60 seconds. So it looks like this change is safe and quite beneficial. |
@tomwhite, do you get the same coordinate values on multiple runs with the same random number seed? |
@jlmelville, no, the coordinates change even with the same seed. |
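For anyone wanting to repeat that check, here is a quick sketch (assuming only the public umap-learn API with its `random_state` parameter; `X` stands in for your own data): fit the same data twice with the same seed and compare the coordinates.

```python
# Quick reproducibility check: fit the same data twice with the same
# random_state and compare the coordinates. X is any
# (n_samples, n_features) array; with the parallel code path the two
# embeddings can differ even though the seed is fixed.
import numpy as np
import umap

X = np.random.RandomState(0).normal(size=(1000, 50))

e1 = umap.UMAP(random_state=42).fit_transform(X)
e2 = umap.UMAP(random_state=42).fit_transform(X)

print("identical coordinates:", np.allclose(e1, e2))
```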
Sounds good. I will merge this into the 0.4dev branch if possible so I can give it a little more testing before it actually gets released into the wild. |
Personally, I only use UMAP for visualization, so I look forward to a faster optimization. |
Totally agree about reproducible results. So this optimization is useful, but there needs to be a way of getting the same results if you set a seed. So I'm not sure this is quite right yet. |
I had a bit more of a look at Numba's parallel implementation, and while it has per-thread random state, it does not seed these states, so it's not possible to make the calculation reproducible in this case. Another related issue: it would be good to use NumPy's random library for random number generation, instead of the custom routines currently used. |
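One possible workaround for the seeding problem (a sketch only, with made-up names like `lcg` and `parallel_work`, and not necessarily what umap ended up doing): tie a deterministically seeded RNG state to each chunk of work rather than to a thread, so the random stream no longer depends on thread scheduling.

```python
# Sketch: per-chunk RNG states seeded up front make the parallel loop
# deterministic, because each state belongs to a chunk index rather
# than to whichever thread happens to run that chunk.
import numpy as np
import numba


@numba.njit
def lcg(states, i):
    # 64-bit LCG step (Knuth's MMIX constants); mutates states[i] in place.
    states[i] = states[i] * np.uint64(6364136223846793005) + np.uint64(1442695040888963407)
    return states[i]


@numba.njit(parallel=True)
def parallel_work(states, out, chunk_size):
    for c in numba.prange(states.shape[0]):
        total = 0.0
        for _ in range(chunk_size):
            # map the top bits to a float in [0, 1)
            total += (lcg(states, c) >> np.uint64(11)) * (1.0 / (1 << 53))
        out[c] = total


seeds = np.random.RandomState(42).randint(1, 2**31, size=8).astype(np.uint64)
out = np.zeros(8)
parallel_work(seeds, out, 100_000)
print(out)  # the same values on every run, regardless of thread count
```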
In a C++ implementation of the optimization loop, I found that the random number generation is the performance bottleneck for low-dimensional embeddings. So it's worth measuring the performance effect before and after. |
The short answer is that (in my recollection) np.random is definitely significantly slower (which is why any of the specialised random number routines exist at all); this is partly because it is higher quality in terms of pseudo-random numbers, but since we need speed more than quality (the quality is fairly moot for the way umap uses the randomness) I went with speed. I agree with James: it would be nice to see benchmarking on this. It is possible things have changed and np.random is fine now. |
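In case it helps, a rough way to measure that difference (a sketch; `draw_np` and `draw_lcg` are made-up stand-ins, not the actual routines in umap):

```python
# Rough micro-benchmark: numba-compiled np.random vs a hand-rolled
# integer generator. Numbers will vary by machine and numba version.
import time
import numpy as np
import numba


@numba.njit
def draw_np(n):
    total = 0
    for _ in range(n):
        total += np.random.randint(0, 1000)
    return total


@numba.njit
def draw_lcg(n, seed):
    state = np.uint64(seed)
    total = 0
    for _ in range(n):
        state = state * np.uint64(6364136223846793005) + np.uint64(1442695040888963407)
        total += np.int64(state >> np.uint64(54))  # top 10 bits -> 0..1023
    return total


for name, fn, args in (("np.random", draw_np, (10_000_000,)),
                       ("lcg", draw_lcg, (10_000_000, 42))):
    fn(*args)  # first call compiles
    t0 = time.perf_counter()
    fn(*args)
    print(name, round(time.perf_counter() - t0, 3), "s")
```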
Some notes after further investigation:
I've implemented option 4 in this branch: https://github.com/tomwhite/umap/tree/embedding_optimization_joblib. It achieves a comparable speedup to this PR, and the output on MNIST looks very similar. Having got to this point, I'm not sure if this should be merged though! The reason is that running the inner loop of optimize_layout in parallel means the coordinate updates are applied non-locally, which changes the algorithm. Perhaps some of the other approaches for optimizing SGD in #37 would be better. |
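For reference, roughly the shape of such a joblib approach (illustrative names and a placeholder gradient, not the actual code in the branch): chunk the edges and let threads share one embedding array, with the per-chunk function compiled `nogil` so the threads can actually run concurrently.

```python
# Sketch of a joblib threading version: each worker optimizes its own
# chunk of edges while sharing the embedding array. The placeholder
# gradient stands in for UMAP's real attractive/repulsive updates.
import numpy as np
import numba
from joblib import Parallel, delayed


@numba.njit(nogil=True)
def _optimize_chunk(embedding, head, tail, alpha):
    # nogil=True releases the GIL, so several chunks can run at once
    # under a threading backend.
    for i in range(head.shape[0]):
        j = head[i]
        k = tail[i]
        for d in range(embedding.shape[1]):
            grad_d = (embedding[j, d] - embedding[k, d]) * 0.01  # placeholder
            embedding[j, d] -= alpha * grad_d
            embedding[k, d] += alpha * grad_d


def optimize_layout_joblib(embedding, head, tail, n_epochs, n_jobs=1):
    # assumes n_jobs >= 1
    chunks = np.array_split(np.arange(head.shape[0]), max(n_jobs, 1))
    for n in range(n_epochs):
        alpha = 1.0 - n / n_epochs
        Parallel(n_jobs=n_jobs, prefer="threads")(
            delayed(_optimize_chunk)(embedding, head[c], tail[c], alpha)
            for c in chunks
        )
    return embedding
```

Keeping `n_jobs=1` as the default runs the chunks sequentially in a fixed order, which matches the reproducible single-thread default mentioned below.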
The non-local update is how LargeVis works, so there is a precedent. I have implemented this scheme in uwot, and visually, it doesn't make much difference, so for visualizations it is a useful option to have. But it is not on by default in the name of preserving reproducibility. |
Interesting to hear that @jlmelville. When I ran the MNIST example it didn't make a difference visually either. In the branch (https://github.com/tomwhite/umap/tree/embedding_optimization_joblib) the default is to use a single thread for the sake of reproducibility, so this might be a useful change to consider after all. |
I think this is definitely worth considering. The non-local update aspect was what was concerning me in terms of the embedding quality, but if it makes little difference visually then perhaps we are mostly good (as long as users are made aware of the trade-offs). |
(force-pushed from eb4d0b7 to 8314d5d)
Closing in favour of #255 |
...so numba can parallelize it effectively across all cores of a machine.
During benchmarking, I found that numba was not parallelizing `optimize_layout`, which is the most compute-intensive part of UMAP. By extracting the code that runs for each epoch into its own function and decorating it with numba's `parallel=True`, I found that the computation can take advantage of all cores in a machine.

For example, running locally on a 4-core Mac, one benchmark (of `optimize_layout`) went from ~110s to ~45s. And on a 16-core Linux box, the UMAP step of Scanpy went from ~206s to ~73s (for 130K rows).
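The change boils down to something of this shape (a minimal sketch with a placeholder gradient, not the actual PR code; the real `optimize_layout` computes UMAP's attractive/repulsive forces with negative sampling):

```python
# Minimal sketch of the structure described above: the per-epoch work is
# pulled into its own numba function compiled with parallel=True, so the
# loop over edges is distributed across all cores via prange.
import numpy as np
import numba


@numba.njit(parallel=True)
def _optimize_epoch(embedding, head, tail, alpha):
    for i in numba.prange(head.shape[0]):
        j = head[i]
        k = tail[i]
        for d in range(embedding.shape[1]):
            grad_d = (embedding[j, d] - embedding[k, d]) * 0.01  # placeholder
            embedding[j, d] -= alpha * grad_d
            embedding[k, d] += alpha * grad_d


def optimize_layout(embedding, head, tail, n_epochs, initial_alpha=1.0):
    for n in range(n_epochs):
        alpha = initial_alpha * (1.0 - n / n_epochs)
        _optimize_epoch(embedding, head, tail, alpha)
    return embedding
```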