Parallel distance calculations #261
Conversation
Quick benchmarks:

```python
import numpy as np
from MDAnalysis.core.parallel.distances import distance_array as da_p
from MDAnalysis.core.distances import distance_array as da

N = 1000
a = np.random.random(N * 3).reshape(N, 3).astype(np.float32)
b = np.random.random(N * 3).reshape(N, 3).astype(np.float32)
```

```
In [1]: r1 = da(a, b)
In [2]: r2 = da_p(a, b)
In [3]: r1 == r2
In [4]: %timeit da(a, b)
In [5]: %timeit da_p(a, b)
```

This is using 8 cores... I think the forced copy (done in serial) is slowing us down a lot. |
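As an aside for readers reproducing this benchmark: elementwise `==` on float32 results is brittle, so a tolerance-based comparison against a brute-force reference is safer. The sketch below is illustrative only; `distance_array_ref` is not part of MDAnalysis, and the `da`/`da_p` names follow the imports in the benchmark above.

```python
import numpy as np

def distance_array_ref(a, b):
    """Brute-force pairwise distances between the rows of a and b."""
    # (N, 1, 3) - (1, M, 3) broadcasts to an (N, M, 3) array of differences.
    diff = a[:, None, :] - b[None, :, :]
    return np.sqrt((diff * diff).sum(axis=-1))

N = 1000
a = np.random.random((N, 3)).astype(np.float32)
b = np.random.random((N, 3)).astype(np.float32)

# Compare both backends against the reference with a tolerance rather than ==,
# since float32 round-off can differ between the serial and OpenMP code paths:
# np.testing.assert_allclose(da(a, b), distance_array_ref(a, b), rtol=1e-5)
# np.testing.assert_allclose(da_p(a, b), distance_array_ref(a, b), rtol=1e-5)
```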
If there are a bunch of frames, I think it's better to split them across different cores; the parallel scaling is better. I did this for http://nbviewer.ipython.org/github/pytraj/pytraj/blob/master/note-books/parallel/rmsd_mpi.ipynb |
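A minimal sketch of this frame-splitting idea using `multiprocessing` instead of MPI; `analyse_frame`, the fake in-memory trajectory, and the worker count are all placeholders, not pytraj or MDAnalysis API.

```python
import numpy as np
from multiprocessing import Pool

def analyse_frame(coords):
    """Placeholder per-frame analysis: radius of gyration with unit masses."""
    centred = coords - coords.mean(axis=0)
    return float(np.sqrt((centred ** 2).sum(axis=1).mean()))

if __name__ == "__main__":
    # Stand-in for a trajectory already loaded into memory: 100 frames x 17000 atoms.
    traj = np.random.random((100, 17000, 3)).astype(np.float32)

    # Frames are independent, so giving each worker a subset of frames is
    # pleasingly parallel and usually scales better than threading one frame.
    with Pool(processes=4) as pool:
        results = pool.map(analyse_frame, traj)
    print(len(results))
```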
I'm not saying this is the fastest existing option, but it definitely has a place. That's a very cool notebook though. Is this line:
Loading the entire trajectory into memory? |
For the above example: there is a reason I designed it so (I am thinking about a version using http://nbviewer.ipython.org/github/pytraj/pytraj/blob/master/note-books/speed_test_2_trajs.ipynb). Hai |
The parallel version has a place, in particular for distance selections. If the parallel version will do PBC then we can use it as a drop-in replacement. We should have a way to set the number of threads, perhaps using core.flags. There should also be a way to tell the distance-based selections to default to parallel or single versions. I'm definitely with @hainm that the best case is to do the work pleasingly parallel on chunks, but sometimes it is just easier to do it serially, and then distance_array is often the bottleneck (e.g. for
@richardjgowers: I haven't had time to look at the code and don't have final answers for your questions, but would raise the following points for discussion:
|
To the best of my recollection, the
Obviously this isn't available as an option when you're using the pragmas yourself in C (as here), but ifdef guarding with |
Just want to confirm what @rmcgibbo is saying. I got a cythonized file without OPENMP: https://github.com/pytraj/pytraj/blob/master/pytraj/ActionList.cpp#L284 |
@orbeckst I agree about parallel for a single Frame. I never thought about needing to perform analysis with >100K atoms too. Hai |
So to save everyone having to read C code... the tl;dr of the commit is that our C distances library (src/numtools/calc_distances.h) can now look like:

```c
#ifdef PARALLEL
#include <omp.h>
#endif

static void some_function(args)
{
#ifdef PARALLEL
    /* This only gets added if PARALLEL was defined when calling the C compiler */
    #pragma omp parallel for
#endif
    for (i = 0; i < n; i++) {
        /* do math */
    }
}
```

Our setup.py then builds 2 extensions which both refer to calc_distances.h, but one pass includes the parallel stuff (i.e. -DPARALLEL).

I'll add in selecting the number of cores and flags (thanks @orbeckst), then I'll just need someone to test on MacOS. |
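A rough sketch of what that two-extension setup.py could look like, assuming GCC-style OpenMP flags; the module and source file names here are illustrative, not necessarily those used in the actual commit.

```python
from setuptools import setup, Extension

# Two thin C wrappers that both include calc_distances.h; the second build
# defines PARALLEL so the OpenMP pragmas are compiled in, and links against
# the OpenMP runtime.
serial = Extension(
    "MDAnalysis.core.distances",
    sources=["src/numtools/distances.c"],
    include_dirs=["src/numtools"],
)
parallel = Extension(
    "MDAnalysis.core.parallel.distances",
    sources=["src/numtools/distances_parallel.c"],
    include_dirs=["src/numtools"],
    define_macros=[("PARALLEL", None)],
    extra_compile_args=["-fopenmp"],
    extra_link_args=["-fopenmp"],
)

setup(name="mdanalysis-distances-example", ext_modules=[serial, parallel])
```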
So for this, what I think I'll do is add a keyword to these Cython functions which will do as Oliver suggested. The other cool performance stuff should and will become a different PR. It might be a good juncture to tidy the namespace; I'd like to:
So analysis.distances is more the "user" module and core._distances the "dev" one. This will also centralise the distance-related code (currently split in two between core and analysis). So I'll do this in the next few days unless someone objects wildly. |
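A hypothetical sketch of the backend keyword on the user-facing wrapper; the module paths follow the imports in the benchmark earlier in this thread, but the final locations were still under discussion, so treat all names as placeholders.

```python
def distance_array(reference, configuration, backend="serial"):
    """Illustrative wrapper that picks a serial or OpenMP-compiled backend."""
    if backend == "serial":
        from MDAnalysis.core.distances import distance_array as _impl
    elif backend == "OpenMP":
        from MDAnalysis.core.parallel.distances import distance_array as _impl
    else:
        raise ValueError("backend must be 'serial' or 'OpenMP'")
    return _impl(reference, configuration)
```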
Something like this (which is similar to https://github.com/pytraj/pytraj/blob/master/pytraj/Frame.pyx#L978). After your implementation, can you post your benchmark? The openmp stuff for the distance calculation does not really make the calculation a lot faster (and the scaling is poor too). I used 8 cores but only got a speed-up of 3-4 times for a 10**8 atom-pair calculation. Hai |
@richardjgowers: Keep in mind that selections need distance_array() and all of this needs to run without MDAnalysis.analysis. We do not want any dependencies on MDAnalysis.analysis in the core (in the future we might split that part off to make the core library light-weight). As long as the above is fulfilled I'm happy with restructuring. Also note that we have core.transformations that do somewhat similar calculations, and your fast dihedral and angle calculations, and then there's KD-tree. Perhaps we should think about bundling all of them under MDAnalysis.lib or numerical or similar?

Oliver |
I get pretty close to n times faster in the thin wrapper; it's just the copy statement (in serial) that kills it. |
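A quick Amdahl's-law estimate shows why a serial copy caps the overall speedup; the 20% serial fraction below is only an assumed number for illustration, not a measured one.

```python
def amdahl_speedup(serial_fraction, n_cores):
    """Maximum speedup when serial_fraction of the runtime cannot be parallelised."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_cores)

# If roughly 20% of the wall time is spent in the serial copy, 8 cores can
# give at most ~3.3x overall, in line with the 3-4x reported above.
print(amdahl_speedup(0.20, 8))  # ~3.33
```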
@orbeckst ahhh, maybe it is more complicated than I hoped. Our numerics should all be dependency-free, so bundling them into lib or numerical sounds smart... I'll see what makes the most sense. |
It would be interesting if you could include the iterating time too. For example, you can check my notebook here (my testing system is not that huge though, only ~17K atoms, 1000 frames). Anyway, I just want to make the point that sometimes the bottleneck is in I/O. |
Not to add noise to the conversation, but perhaps related to #238? |
Ok so with commit bb7f382 I've moved the cython interface to
The parallel distances in this PR will become
|
Sounds good, although how about adding more order right away, at least for serial/parallel versions: lib.parallel._distances? Especially if we do more of this stuff in the future we might end up with lib.cuda, lib.mpi, or lib.parallel.cuda, lib.parallel.openmp... Not sure if this is already overspecifying things, but something along those lines would make it a bit easier to slot new stuff in. Alternatively, focus at the top level on the functionality, e.g. lib.geometry.distances.openmp etc., but then almost each submodule is named serial, which seems a bit dumb. (Or do 'import .distances_serial as serial'.) Opinions???

Oliver |
@richardjgowers, should we just close the pull request, or does it still contain stuff that is not in bb7f382? For the organization of lib I'd prefer (along your original lines):
because that leaves us open for |
Yeah the branch in the original PR is outdated now, so I'll close this. @orbeckst Ok I'll organise like that. If anyone wants to buy me a CUDA rig to play on I'll write distances.cuda too! |
Hypothetically speaking, what would you like :-)?

Oliver |
Haha, I think I was using M2050s before. Of course I'd need 2 to test the multi-GPU performance. |
GTX 690s are still pretty decent in our hands.

Oliver |
So currently we have a parallel version of distance_array, but these are done in Cython and only exist if someone writes a separate piece of code for the parallel version.
By putting openmp statements behind conditional compilation options, we can use the same C code twice, and this then generates a serial and parallel version of each function. This is nice as it keeps the code DRY, and also all functions exist in the serial/parallel namespace, regardless of whether the directives have been written yet.
This needs to have a separate .pyx file, and will generate a different submodule. This is a little confusing, and arguably there should be a single point of reference for people accessing these functions. It would be nice to change the call signature to something like
Then in analysis.distances it chooses one of the correct Cython backends. Using Flags to default to parallel/not is also possible.
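A sketch of how a core flag could set the default backend, assuming a simple dict-like registry; the flag name and the plain-dict stand-in for the real MDAnalysis.core.flags machinery are assumptions for illustration.

```python
# Hypothetical flag registry; MDAnalysis.core.flags is richer than a plain
# dict, but the lookup idea is the same.
flags = {"use_parallel_distances": False}

def _select_backend(backend=None):
    """Explicit keyword wins; otherwise consult the flag for the default."""
    if backend is not None:
        return backend
    return "OpenMP" if flags["use_parallel_distances"] else "serial"

# analysis.distances.distance_array(...) would call _select_backend() and then
# dispatch to the matching Cython extension, as in the earlier sketch.
print(_select_backend())              # 'serial'
flags["use_parallel_distances"] = True
print(_select_backend())              # 'OpenMP'
```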
So questions that need answering are: