Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Parallel trajectory analysis (revisited) #618

Conversation

mattiafelice-palermo
Copy link
Contributor

With reference to #617
MDAnalysis Discussion --> link
Branch --> link

Aim

Speed up trajectory analysis by spreading the computation on multiple threads. Slices of the trajectory are assigned to each thread, which then proceeds running AnalysisBase istances.

Current syntax (may change)

In order to run a parallel analysis, just invoke the run() method with the optional arguments parallel=True and threads=<nthreads>.

import MDAnalysis as mda
from MDAnalysisTests.datafiles import DCD, PSF
import electromagnetism # some analysis module

universe = mda.Universe(PSF, DCD)
selection = universe.select_atoms('resname TIP3')

el = electromagnetism.TotalDipole(universe=universe, selection=selection)]
el.run(parallel=True, nthreads=4)

To do

  • Implement a run(parallel=True) method in the AnalysisBase class so that single analysis jobs can be parallelized without using the ParallelProcessor class.
  • Print progressbar to stderr so that stdout can be redirected to file
  • Create a progressbar with percentage of completion, configuration counter and a ETA estimate.
  • Make progressbar optional
  • Prepare tests
  • Add documentation to modules, classes and methods once the design of the class is stable.

Benchmarks

Benchmark performed on a system of N=250 k atoms (294 configurations, size ~1 GB).

scaling

Efficiency computed as the ratio between core speed (cfg s-1 threads-1) and the speed of a serial analysis.

Compatibility issues

Due to the implementation of the parallelization, AnalysisBase needs to take a Universe object as argument.

Last update: Fri Jan 22 11:34:18 CET 2016

…ent test analysis (electromagnetism.py), added parallel_jobs.py module, added test for the parallel_jobs.py module
…s has its own universe, eliminating the problem of accessing to arrays that have been already deallocated. Adapted base.py, parallel_jobs.py and test_parallel_jobs.py according to PEP8 style.
@richardjgowers richardjgowers self-assigned this Jan 13, 2016
@@ -96,10 +92,14 @@ def run(self):
self._prepare()
for i, ts in enumerate(
self._trajectory[self.start:self.stop:self.step]):
self._ts = ts
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason why this line has to go? Could we call _single_frame with ts as the argument?
ie.

for i, ts in enumerate(self._traj[etc]):
    self._single_frame(ts)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, calling _single_frame with ts as the argument looks much cleaner.

@richardjgowers
Copy link
Member

This is looking good, at the efficiency is much higher than I was expecting.

@mattiafelice-palermo
Copy link
Contributor Author

Yes! If you or anyone have some analysis class that you think it's easy to adapt (by adding the __iadd__ method as shown in #617 ), let me know if you get similar results.

@mattiafelice-palermo
Copy link
Contributor Author

Hi everybody!

So, as @richardjgowers suggested, I modified the AnalysisBase class so that the run() method can be invoked with the optional arguments parallel=True and threads=<nthreads>. I tried benchmarking if two analyses run with run(parallel=True) are slower than computing them in batch with ParallelProcessor but did not find a relevant difference - even though this seems strange to me. When running analyses with some our internal program, reading dcds usually takes a non negligible amount of time. Are we 100% sure that when running two _singleframe() methods of two AnalysisBase objects within the same ts in trajectory loop, the coordinates are not re-read? If that is the case, then I would opt for removing the ParallelProcessor class and just leave the AnalysisBase run(parallel=True)method.

Let me know what you think!

P.s.: I also added a basic progressbar in a folder utils within package/MDAnalysis/analysis, I hope that is fine.

@@ -0,0 +1,102 @@
""" Add docstring later
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should all go into lib.log --- there's already ProgressMeter. ProgressBar would nicely complement it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done, moved code to lib.log and also modified the bar with threading so that it supports working with serial jobs. Hopefully that's what you meant.

@kain88-de
Copy link
Member

BTW did any of you have a look at joblib for this?

self._conclude()

else:
if threads is None:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd prefer if all this code was contained inside _parallel_run, the run function can just end up as:

def run(self, parallel=False, nthreads=None):
    if not parallel:
        self._serial_run()
    else:
        self._parallel_run(nthreads)

So _serial_run just contains the "original" run code, and parallel run has your new code

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Make sense, it's much cleaner this way. I modified the code accordingly.

…ppropriare serial or parallel method. Adapted AnalysisBase and ParallelProcessor to run with new progressbar, which now works also for serial analyses
@mattiafelice-palermo
Copy link
Contributor Author

@kain88-de seems a very interesting lib, thanks for pointing it out! It could probably solve the problem of allocating/deallocating array with positions/masses etc when using a single DCDReader object that we used to discuss a while ago (?), even though the actual implementation circumvent this by having each AnalysisBase object initialized and use its own DCDReader object.

@richardjgowers richardjgowers added this to the 0.14 milestone Jan 20, 2016
@richardjgowers
Copy link
Member

Hey Matti,

This is looking good, I think next step is to fix all the things we've broke:

https://travis-ci.org/MDAnalysis/mdanalysis/jobs/103233684

I think these are because you changed the signature of _setup_frames to want a Universe not a Reader.

You should be able to run those tests locally by going to <your repo>/testsuite/MDAnalysisTests/analysis and running nosetests

…ich now takes a universe and not a trajectory, and requires method _single_frame to take a timestep as argument
@mattiafelice-palermo
Copy link
Contributor Author

Good day Richard,

I modified the modules that conflicted with the new signatures of _setup_frames and _single_frame methods. I run nosetests without any errors.

There is one big issue, though. I noticed that e.g. the rdf analysis takes as input two selections g1 and g2, from which the Universe is accessed through g1.universe. I'm pretty sure this is not going to work with the paralell implementation. This is because when creating deep copies of the original analysis object, the self._universe/self._trajectory attributes are stripped away from the __getstate__ method. Then, each copied object is initialized again from the topology and coordinate filenames. If you run the analysis in parallel, what happens is that the original selections you passed as input are still a copy of the original ones, which still reference to the original universe, which is not the one being used in the copied object. Thus, the frames won't be cycled in for ts in trajectory loop because the loop cycles on the new Universe and not the original one. The nosetests are passed because, I think, they just work on a single frame, the first one, so no cycling is done. I did not test this issue with the new code but I've already stumbled on it some time ago when trying to pass selections as input.

Anyway, this issue can be easily solved by making sure that selections are made WITHIN the analysis object, solely in the _prepare method, which is called after each copied object is initialized. Thus the user would be asked to pass a "selection_string" and not the selection itself when initializing an analysis object. I understand this would require some changes in the already existing code, but it's the only easy approach that came to my mind. Any ideas?

@richardjgowers
Copy link
Member

Yeah I had the same thought too. I think a way around this is to store the AtomGroups within the class not as AtomGroups, but as indices (use ag.indices to get this). Then when the objects are initialised again you use the indices to slice the "local" version of Universe.atoms and you've reconstructed the AtomGroups.

Would this work?

@mattiafelice-palermo
Copy link
Contributor Author

I see what you mean, but still the we do not know a priori how many selections an analysis object is going to need, nor we know which name they were given by the developer of a new analysis object, thus we wouldn't know how to "reference" them when we want to use the indices to slice a local version of Universe.atom. Don't you think?

@richardjgowers
Copy link
Member

Maybe, but if we stored them as a list called self._ags or something, then we could put something like:

# Storing
try:
    ag_indices = [ag.indices for ag in self._ags]
except AttributeError:  # catches when _ags didn't exist
    pass

# Rebuilding
self._ags = [self.universe.atoms[ag_ids] for ag_ids in ag_indices]

So then InterRDF would use self._ags[0] & self._ags[1] instead of g1 & g2

@mattiafelice-palermo
Copy link
Contributor Author

@mnmelo Thank you very much! I think the shared memory approach is definitely the way to go in the future. As for now, in my opinion perhaps it would be better to test the present working version so that we can get hold of how it works with the already existing analysis routines and if there are any issues that we overlooked.

Anyway, I'm just a contributor to the project, I guess the final word on how to proceed should be of a coredev. @richardjgowers what do you think?

@mnmelo
Copy link
Member

mnmelo commented Jan 21, 2016

Yup, let's see what others say. Now it'd be a good time to hit it while the iron is hot (and admittedly, it'd push me towards actually contributing to the parallelization, which I have been lazily putting off!)

@richardjgowers
Copy link
Member

Maybe simple tests for the parallel analysis. So for the serial AnalysisBase there's the stupid FrameAnalysis which just records frame numbers....

We could make a FrameSummer(AnalysisBase) which just does sum(ts.frame for ts in u.trajectory). Then try and find race conditions by doing that in parallel. Then to check that correct frames are being selected, extend this to sum(ts.frame for ts in u.trajectory[start:stop:step]), with a variety of slice indices.

Edit:
And maybe do FrameAnalysis in serial to make sure that the list gets concatenated in the correct order, so you get [0, 1, 2, 3, 4, 5, 6] and not [4, 5, 6, 0, 1, 2, 3]

@dotsdl
Copy link
Member

dotsdl commented Jan 21, 2016

I think once #363 is finished, we can then do for example with three threads:

Universe1                       Universe2                             Universe3
   |                               |                                     |
   |_______________________________|_____________________________________|
                                   |
                                   v
                                Topology

where each of our Universes shares literally the same Topology object, but they each have their own of everything else, including their own trajectory Reader so they can visit frames independently. I think that's the thing that would save the lion's share of memory and would require very little initialization. Universe objects are very thin in the new scheme; the Topology object is where the actual system information is contained.

In fact, @richardjgowers, we could even start playing with shared Topology objects now, since everything should work as is except for needing the ability to add a Topology object to a Universe in place of a topology file.

@mnmelo
Copy link
Member

mnmelo commented Jan 22, 2016

@dotsdl While encapsulation is certainly a good thing, I think it brings little extra to this case. We may have a lighter Universe, but the Topology must then also somehow make its way to the parallel workers. We end up with the same question, whether to pass the Topology by pickling or by shared memory. Or am I missing your point?

@mattiafelice-palermo
Copy link
Contributor Author

Ok so while we decide on how to share the topology between the processes, I might start working on the tests suggested by Richard, so we can make sure future versions of the parallel code work properly. On a side node, perhaps in this case it makes more sense to use a shared memory paradigm, even though if the new universe objects are really "thin", then passing copies of the objects between main process and the workers shouldn't impact performance much?

@richardjgowers
Copy link
Member

Yeah the tests should be independent of how we choose to make it work

@mattiafelice-palermo
Copy link
Contributor Author

I edited FrameAnalysis and tests so that now it works also in parallel. The order of the frames is checked by the tests. I didn't get the last part about FrameSummer though. I think that as long as the parallelization strategy works in "read-only" mode, no race conditions should happen, right?

@richardjgowers
Copy link
Member

I was trying to expose the race conditions that we were having at the start of this issue, when we only had 1 reader for many threads. You're right that our current strategy will avoid this, but as we play with different ideas, we need to know we've not accidentally gone backwards.

Maybe it's a little overkill for now though.

@mattiafelice-palermo
Copy link
Contributor Author

So do you think we should wait for the topology refactor or perhaps make this first working version available and then modify it if future changes affect it? In the second case, I might start working on docstrings.

@richardjgowers
Copy link
Member

I think finish #670, merge the new version of develop into this branch and finish this before the refactor. The refactor won't break anything here, it will just make progressing easier. So having this as an experimental start point is a good idea.

@dotsdl
Copy link
Member

dotsdl commented Feb 18, 2016

Retarget to 0.15?

@mnmelo
Copy link
Member

mnmelo commented Feb 18, 2016

I vote so.

@kain88-de
Copy link
Member

Yeah sure

@dotsdl dotsdl modified the milestones: Topology refactor - 0.15, 0.14 Feb 18, 2016
@dotsdl
Copy link
Member

dotsdl commented Feb 18, 2016

Retargeted. I think with the discussion in #719 this will need to bake a bit longer than we have before 0.14 is released.

@mattiafelice-palermo
Copy link
Contributor Author

Fine by me :) I'll be available once we decide to start working on this again!

@orbeckst
Copy link
Member

Once we pick it up again: a good candidate for parallelization is the density module: it's slow but pleasingly parallel and you can get bragging rights if you are faster than eg VMD's VolMap.

@orbeckst
Copy link
Member

See also https://www.mdanalysis.org/pmda/

@orbeckst orbeckst added the close? Evaluate if issue/PR is stale and can be closed. label Sep 28, 2018
@orbeckst
Copy link
Member

With @yuxuanzhuang 's serialization of universes now possible since PR #2723 , this approach should be re-evaluated in a new PR.

@orbeckst orbeckst closed this Aug 16, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants