Replace KDTree library with alternative #383

kain88-de · 2015-08-05T19:26:39Z

Is there something against switching the KDTree library?

Currently you are using a pretty old C implementation. It would for example be possible
to use the scikit-learn KDTree.

hainm · 2015-08-05T19:56:41Z

what's wrong with 'old'?

richardjgowers · 2015-08-05T20:09:35Z

There's also scipy.spatial.KDTree which wouldn't introduce a new dependency

So I guess things to prove would be

benchmarks of 250,000 3d coordinates, both creating the Tree and querying it
checking all methods have a counterpart (same functionality)

The other thing is all we currently use KDTree for is distance based selections I think... And the downside with it is it doesn't handle PBC. So a better solution might be to introduce a capped distance search method (using domain decomposition into cells). I was playing around with making one last month, https://github.com/richardjgowers/cellgrid

This would also be awesome for lots of common stuff in very large systems

kain88-de · 2015-08-05T20:09:47Z

It's SWIG interface currently doesn't support Python3. scikit-learn would support py2 and py3. I also have good experience with it's performance. It would also mean one less thing which has to be maintained.

kain88-de · 2015-08-05T20:12:27Z

@richardjgowers what would you suggest is a good test system?

richardjgowers · 2015-08-06T08:19:41Z

Timeit in ipython is good enough, and we can assume that particles are randomly space, so something like...

import numpy as np
from scipy import spatial
from MDAnalysis.lib import KDTree

coords = np.random.random(750000).reshape(250000, 3).astype(np.float32)

t = KDTree.KDTree(3, 10)
%timeit t.set_coords(coords)

%timeit t2 = spatial.KDTree(coords)

I get this:

In [13]: %timeit t.set_coords(coords)
1 loops, best of 3: 833 ms per loop

In [16]: %timeit t2 = spatial.KDTree(coords)
1 loops, best of 3: 805 ms per loop

I think the setup cost of the KDTree will be the largest cost, but it's also worth checking that all methods are available in the replacement. We don't directly use the KDTree, but wrap it in a MDAnalysis.lib.KDTree.NeighbourSearch.

kain88-de · 2015-08-06T08:42:22Z

http://nbviewer.ipython.org/gist/kain88-de/3617d6ec20aaf3cb1ef4

It looks like the sklearn implementation is a lot faster then the one from scipy.

richardjgowers · 2015-08-06T09:48:24Z

Hmm ok, and the sklearn KDTree has 96%+ test coverage and supports py3.

For fairness the mda kdtree search was 2x as fast, so after 700,000 searches it's quicker :D

Ok, if you can rewrite NeighborSearch to use the sklearn kdtree then this should be ok unless there's something I've mussed

orbeckst · 2015-08-06T17:05:03Z

One reason for bundling KDTree was that we didn't really like introducing scipy or sklearn as hard dependencies for core MDAnalysis because in the past this was awful to just pip install.

By default, KDTree is only used for the around selection; point uses distance_matrix if I remember correctly. However, building manual KDTree selections has been pretty useful.

Issue #137 was looking at an implementation of KDTree with PBC.

richardjgowers · 2015-08-06T18:54:10Z

Re: #137, I don't think there's a way to add PBC at the C level. My best solution to that was to add too many coordinates to the KDTree (replicate the images), and this is done outside of the KDTree. So having our own KDTree doesn't affect the issue

kain88-de · 2015-08-09T12:58:36Z

I'm just porting the selection object to use the sklearn KDTree. Some of the tests are failing right now. But just seeing that the tests in test_analysis:TestHydrogenBondingAnalysisChecking fail doesn't tell me much what the problem is due to the test design. It is not even that easy to figure out that the test
does exactly.

mnmelo · 2015-08-09T13:33:19Z

I hit a similar test failure in my half-way cleanup of the selection module (I left that on stand-by pending the development of the new AtomGroup model and use of pyparsing; see #371).

I had reflowed much of the point and around code, to reduce duplication, and was not surprised I broke something. But now you find a failure in the same tests. Maybe

Those are the only tests that really use the KDTree feature, and both our fixes are indeed borked; or
That module has a bug somewhere.

I'm not sure which I prefer to be true...

Here's my test error log (notice that something else also fails for test_atomgroup.py).
Let me know if you'd like me to upload my half-done branch.

kain88-de · 2015-08-09T13:44:14Z

Well then we should first refactor the tests to be readable. I have serious problems understanding what the current tests are doing exactly

kain88-de · 2015-08-09T14:29:15Z

Ok the tests pass now. the test_atomgroup.py tests are much better to finding out what happens and what is wrong.

mnmelo · 2015-08-09T16:32:12Z

So, was it something in your code, then?

kain88-de · 2015-08-09T16:55:49Z

yup. didn't understand the old original code. now everything is working. you can check out my progress here. https://github.com/kain88-de/mdanalysis/tree/sklearn-kdtree

dotsdl · 2015-08-16T19:00:38Z

Did we decide we actually wanted to include scikit-learn as a dependency for a core component? I'm not sure we resolved the question yet.

orbeckst · 2015-08-17T03:14:52Z

On 16 Aug, 2015, at 12:00, David Dotson wrote:

Did we decide we actually wanted to include scikit-learn as a dependency for a core component? I'm not sure we resolved the question yet.

I am against introducing such a heavy dependency unless there's a REALLY good reason.

If the code is well encapsulated, we could simply take the code and replace the KD-tree one.

Alternatively, trying a grid-search type algorithm (and writing it in cython ourselves) might be a good alternative.

Oliver Beckstein * orbeckst@gmx.net
skype: orbeckst * orbeckst@gmail.com

kain88-de · 2015-08-17T07:12:29Z

I am against introducing such a heavy dependency unless there's a REALLY good reason.

It's faster then the BioPython version and very well maintained.

If the code is well encapsulated, we could simply take the code and replace the KD-tree one.

I don't like this approach. It just puts burden on MDAnalysis to maintain the forked code and merge upstream bug fixes. It is a better and more pythonic approach to use the other libraries directly.

I have switched the KDTree in my branch to use the BioPython KDTree directly. This will also work
for Python3

mnmelo · 2015-08-17T11:22:14Z

Then, if switching in place is that easy I'd say that we depend on either BioPython or scikit-learn. If none are present we install the lighter BioPython. If scikit-learn is available we use that. I'm not sure, though, how to teach these kind of dependencies to setuptools or pip.

The code then uses scikit-learn if available, otherwise it falls back to BioPython. We write a big notice on the install instructions, and maybe even issue a warning when BioPython is used saying speed is to be gained if scikit-learn is installed.

This is all assuming it's easy to just switch frameworks. @kain88-de might share more insight on that (I imagine it won't be as straightforward as having a try: except: on the imports...).

kain88-de · 2015-08-17T11:56:16Z

@mnmelo it is not that easy since the BioPython KDTree has a different API then the scikit-learn KDTree. Besides I would prefer to use just one to keep the code simple.

mnmelo · 2015-08-17T12:13:00Z

Ah, that's a shame. But I understand that putting effort into making wrappers around equivalent foreign libraries shouldn't really be our priority here.

orbeckst · 2015-08-19T23:44:17Z

Introducing sklearn as a dependency requires some serious discussion. I definitely don't want to rip out the old KD-Tree for 0.11.0.

For a start, one thing that vanilla Python (I mean pip/setuptools) is decidedly not good at is making sure that you can easily install packages—have you ever tried to get a package to work somewhere that does not have the full scientific python stack installed? In order to make this installation-hell sort-of-manageable you actually need 3rd party stuff like EPD or conda or at least a good understanding of your distribution's package management system. To be facetious: In some cases "pythonic" does not always equal "user friendly". Therefore, I think that inclusion of 3rd party libs into more complicated packages is actually the right thing to do in many instances because it shifts load from users to developers. And ultimately, I feel that firstly, MDAnalysis must make it easy for users to get work done.

I don't think that inclusion of sklearn as a hard dependency is justified at the moment because we are using only a tiny part of it to provide a tiny part of MDAnalysis functionality. If you look at the other hard dependencies, then they are either essential and provide a lot of functionality (numpy) or small-ish and/or install cleanly (networkx, GridDataFormats) or optional to provide very specific functionality (netCDF4 for Amber NCDF format). You could make a case that sklearn is similar to biopython, which also only contributes a fraction of functionality (one version of the PDBReader, sequence handling), but at least we never had any problems with biopython as a dependency as it always builds cleanly in my experience. (But I would like to get rid of that dependency, too, especially with a view towards Py3k!) Basically, I want to avoid one more stumbling block that prevents a user from easily installing the package (because these are the users that never come back and leave no feedback on what went wrong).

I am eager to listen to everyone's opinions.

hainm · 2015-08-19T23:49:46Z

I agree with @orbeckst about not to depend too much about sklearn. There is no point to replace the old lib if it works fine. Not all users is expert in Python and not all of them (currently) knows about conda.

dotsdl · 2015-08-20T00:14:56Z

I agree with @orbeckst as well. Although I very much respect scikit-learn and its high development standards, I don't consider it to be low-level enough in the python stack to make sense as a piece of the core of mdanalysis. It provides high-quality implementations of common machine learning algorithms with a consistent API; that is great, but I don't think that fits for this case.

I think a Cythonized KDTree that is easily able to handle periodic boundary conditions is something we would find desirable. That is something we will have to tackle, and I think that's the only reasonable choice.

richardjgowers · 2015-08-20T07:36:42Z

We've already got scipy as a dependency, and we could use the KDTree there, but @kain88-de has already shown it's slightly slower (http://nbviewer.ipython.org/gist/kain88-de/3617d6ec20aaf3cb1ef4)

I didn't see a problem with using sklearn, it looks like installing it has the same dependencies as numpy/scipy, which would fall into @orbeckst 's "install cleanly" clause. I've just tried making a virtual environment and installing sklearn and it looks painless enough?

@dotsdl the periodic KDTree will be a wrapper around a regular KDTree, so I'd rather use an external (and hopefully optimised) solution and wrap around that

hainm · 2015-08-20T07:42:19Z

I 've looked at benchmark for
'Test finding in radius of time'
mda has the lowest time.

On Aug 20, 2015, at 3:36 AM, Richard Gowers notifications@github.com wrote:

We've already got scipy as a dependency, and we could use the KDTree there, but @kain88-de has already shown it's slightly slower (http://nbviewer.ipython.org/gist/kain88-de/3617d6ec20aaf3cb1ef4)

I didn't see a problem with using sklearn, it looks like installing it has the same dependencies as numpy/scipy, which would fall into @orbeckst 's "install cleanly" clause. I've just tried making a virtual environment and installing sklearn and it looks painless enough?

@dotsdl the periodic KDTree will be a wrapper around a regular KDTree, so I'd rather use an external (and hopefully optimised) solution and wrap around that

—
Reply to this email directly or view it on GitHub.

kain88-de · 2015-08-20T07:54:32Z

The PR #395 is currently using the BioPython KDTree. That solution is slightly slower then the scikit-learn version but that shouldn't be noticed to much unless one is doing a lot of different selections
in a loop. (scikit-learn will be overall faster since it builds up the Tree incredibly fast, so a little slower search doesn't matter that much here). As far as Py3 is concerned BioPython supports python 2.6 - 3.4 so we are good there as well.

I personally have scikit-learn installed by default anyway. They offer a huge variety of dimension reduction algorithms and clusterings. But I also understand if you want to keep the dependencies
for MDAnalysis light.

With regard to installing things. In my experience pip works pretty great today if you are using it consistently. I install pip packages always only for my user with (--user) and have any system-wide
installation handled by my distribution package-manager. But I also have seen people who have a messed up environment when they used pip to install packages system-wide and then updated with the distribution package manager. But that's a problem people will have independent of MDAnalysis.

orbeckst · 2015-08-20T16:35:59Z

Scipy is not a hard dependency, the core runs without it. I think a fast grid search would be a viable alternative.

…

We've already got scipy as a dependency,

orbeckst · 2015-08-20T16:37:30Z

I 've looked at benchmark for
'Test finding in radius of time'
mda has the lowest time.

Yes, but tree setup time really matters, as @richardjgowers pointed out in an earlier comment.

orbeckst · 2015-08-20T17:51:12Z

@kain88-de , using the BioPython KD-tree implementation seems a good compromise. Can you clean up #395 (or submit a new clean PR); at the moment it does not pass the tests.

However, the NeighborSearch wrapper is being used in MDAnalysis.analysis.density and other tools dependent on MDAnalysis so I'd like to keep it. (I am mentioning this because in #395 you asked about it).

kain88-de · 2015-08-21T13:30:00Z

Only the atomneighbor search is used there. This class still exists but in its own module now. @richardjgowers just has to tell me where that module should ideally live and then I'll fix the failing test.

richardjgowers · 2015-08-21T14:22:24Z

Keep it in lib, analysis is more for whole tools rather than components.

kain88-de · 2015-08-23T18:19:35Z

However, the NeighborSearch wrapper is being used in MDAnalysis.analysis.density and other tools dependent on MDAnalysis so I'd like to keep it. (I am mentioning this because in #395 you asked about it).

Oh I missed analysis.density because It doesn't have unit-tests. I change the API and I'm quite confident that it still works but I can't test that currently. I wanted to quickly run the example given
in the doc string but that doesn't work anymore.

Further standardized XDR error message reporting and exceptions (now mostly IOError). Removed reference to the old SWIG numpy.i in the credits.

- changed "Maintained by Oliver Beckstein and Elizabeth Denning" to "Maintained by the MDAnalysis Core Devs Team" - added Wouter Boomsma (for PR #976) - removed explicit mentions for KDTree (Hamelryck) because since 0.11.0 we do no longer include the actual code (see PR #395 and #383) - removed explicit mentions for helanal (Hall) because Ben Hall is a contributing author

JmhX7IYP+2pnu4iv7mzp+xy8W5x1pzFOZhzfS4LslB2Ac+4SENM1eAZebiZQYSYl qyYkj9bLS2RqC40dQVfAA4dMrAuWpNfgZkTeCCU7pm29TodN0hmljNG/rLoGG/iH ieA+pSultKwyryY0UZXHAMmNRIViEfy1VsjE2Hswc/AIiY9W1yfLONXVjtELjFsU jv4u1ljR+lhuP6ZllEwucKeD5ojQDyXBU9WXPi2bUwFEFpNdn5zXVRTV/kntb86d sbOE1TQpy+WeSNTyqmEtdjQRuLH21Jm9kTnWFjkKjw2lL+odUOAoOEC4tLSin30i 6KK2lVvSfXJi9e5XOL+nXWqvwr+jU5jqNV+PC60aEe9N1xhVapYroY84bIJwcuYQ 2Axu0vJWNLOpnLGAH7kYYaeEFJdsZt283kDy1paagWmXqyy3p3hh1VD8NEWWcXQ4 Da07VzZh61RvLHFWIzyXkizPjaCMCfCPIpwRACht5OJdRH0LwEhGRhJhSHEUdh7p WGNEQr5XVnd2Xb+dHfpBeq6a0vPmOv8oDXzRCnGpDeTjGE3kEQy8PGeDqm4dsD+B wOmeCYN/NmUnurpZrW2x151bKgCbcjwiTksCsIhZmN1kRHqAR+NkoHSUOiji6INw LvpyU91X7ji1yyYlhPTu =N0P/ -----END PGP SIGNATURE----- updated AUTHORS - changed "Maintained by Oliver Beckstein and Elizabeth Denning" to "Maintained by the MDAnalysis Core Devs Team" - added Wouter Boomsma (for PR MDAnalysis#976) - removed explicit mentions for KDTree (Hamelryck) because since 0.11.0 we do no longer include the actual code (see PR MDAnalysis#395 and MDAnalysis#383) - removed explicit mentions for helanal (Hall) because Ben Hall is a contributing author

richardjgowers added Python3 Component-Analysis labels Aug 6, 2015

orbeckst added the Component-Selections label Aug 6, 2015

orbeckst added the Component-Core label Aug 6, 2015

kain88-de mentioned this issue Aug 16, 2015

WIP: Sklearn kdtree #395

Merged

4 tasks

richardjgowers added this to the 0.12 milestone Aug 20, 2015

richardjgowers closed this as completed in #395 Aug 24, 2015

orbeckst mentioned this issue Oct 27, 2015

replace netCDF4 with scipy.io.netcdf for the Amber NCDFReader and NCDFWriter #506

Closed

orbeckst referenced this issue Feb 1, 2016

Updated LICENSE and AUTHOR info regarding xdrfile (closes #679)

7b1bc13

Further standardized XDR error message reporting and exceptions (now mostly IOError). Removed reference to the old SWIG numpy.i in the credits.

orbeckst mentioned this issue Feb 11, 2016

Improved, Best-like contact analysis (Issue #702) #708

Closed

mnmelo mentioned this issue Sep 7, 2016

"around" selection is slow and memory-intensive #974

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Replace KDTree library with alternative #383

Replace KDTree library with alternative #383

kain88-de commented Aug 5, 2015

hainm commented Aug 5, 2015

richardjgowers commented Aug 5, 2015

kain88-de commented Aug 5, 2015

kain88-de commented Aug 5, 2015

richardjgowers commented Aug 6, 2015

kain88-de commented Aug 6, 2015

richardjgowers commented Aug 6, 2015

orbeckst commented Aug 6, 2015

richardjgowers commented Aug 6, 2015

kain88-de commented Aug 9, 2015

mnmelo commented Aug 9, 2015

kain88-de commented Aug 9, 2015

kain88-de commented Aug 9, 2015

mnmelo commented Aug 9, 2015

kain88-de commented Aug 9, 2015

dotsdl commented Aug 16, 2015

orbeckst commented Aug 17, 2015

kain88-de commented Aug 17, 2015

mnmelo commented Aug 17, 2015

kain88-de commented Aug 17, 2015

mnmelo commented Aug 17, 2015

orbeckst commented Aug 19, 2015

hainm commented Aug 19, 2015

dotsdl commented Aug 20, 2015

richardjgowers commented Aug 20, 2015

hainm commented Aug 20, 2015

kain88-de commented Aug 20, 2015

orbeckst commented Aug 20, 2015 via email

orbeckst commented Aug 20, 2015

orbeckst commented Aug 20, 2015

kain88-de commented Aug 21, 2015

richardjgowers commented Aug 21, 2015

kain88-de commented Aug 23, 2015

Replace KDTree library with alternative #383

Replace KDTree library with alternative #383

Comments

kain88-de commented Aug 5, 2015

hainm commented Aug 5, 2015

richardjgowers commented Aug 5, 2015

kain88-de commented Aug 5, 2015

kain88-de commented Aug 5, 2015

richardjgowers commented Aug 6, 2015

kain88-de commented Aug 6, 2015

richardjgowers commented Aug 6, 2015

orbeckst commented Aug 6, 2015

richardjgowers commented Aug 6, 2015

kain88-de commented Aug 9, 2015

mnmelo commented Aug 9, 2015

kain88-de commented Aug 9, 2015

kain88-de commented Aug 9, 2015

mnmelo commented Aug 9, 2015

kain88-de commented Aug 9, 2015

dotsdl commented Aug 16, 2015

orbeckst commented Aug 17, 2015

kain88-de commented Aug 17, 2015

mnmelo commented Aug 17, 2015

kain88-de commented Aug 17, 2015

mnmelo commented Aug 17, 2015

orbeckst commented Aug 19, 2015

hainm commented Aug 19, 2015

dotsdl commented Aug 20, 2015

richardjgowers commented Aug 20, 2015

hainm commented Aug 20, 2015

kain88-de commented Aug 20, 2015

orbeckst commented Aug 20, 2015 via email

orbeckst commented Aug 20, 2015

orbeckst commented Aug 20, 2015

kain88-de commented Aug 21, 2015

richardjgowers commented Aug 21, 2015

kain88-de commented Aug 23, 2015