-
Notifications
You must be signed in to change notification settings - Fork 648
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Replace KDTree library with alternative #383
Comments
what's wrong with 'old'? |
There's also scipy.spatial.KDTree which wouldn't introduce a new dependency So I guess things to prove would be
The other thing is all we currently use KDTree for is distance based selections I think... And the downside with it is it doesn't handle PBC. So a better solution might be to introduce a capped distance search method (using domain decomposition into cells). I was playing around with making one last month, https://github.com/richardjgowers/cellgrid This would also be awesome for lots of common stuff in very large systems |
It's SWIG interface currently doesn't support Python3. scikit-learn would support py2 and py3. I also have good experience with it's performance. It would also mean one less thing which has to be maintained. |
@richardjgowers what would you suggest is a good test system? |
Timeit in ipython is good enough, and we can assume that particles are randomly space, so something like... import numpy as np
from scipy import spatial
from MDAnalysis.lib import KDTree
coords = np.random.random(750000).reshape(250000, 3).astype(np.float32)
t = KDTree.KDTree(3, 10)
%timeit t.set_coords(coords)
%timeit t2 = spatial.KDTree(coords) I get this: In [13]: %timeit t.set_coords(coords) In [16]: %timeit t2 = spatial.KDTree(coords) I think the setup cost of the KDTree will be the largest cost, but it's also worth checking that all methods are available in the replacement. We don't directly use the KDTree, but wrap it in a |
http://nbviewer.ipython.org/gist/kain88-de/3617d6ec20aaf3cb1ef4 It looks like the sklearn implementation is a lot faster then the one from scipy. |
Hmm ok, and the sklearn KDTree has 96%+ test coverage and supports py3. For fairness the mda kdtree search was 2x as fast, so after 700,000 searches it's quicker :D Ok, if you can rewrite NeighborSearch to use the sklearn kdtree then this should be ok unless there's something I've mussed |
One reason for bundling KDTree was that we didn't really like introducing scipy or sklearn as hard dependencies for core MDAnalysis because in the past this was awful to just By default, KDTree is only used for the Issue #137 was looking at an implementation of KDTree with PBC. |
Re: #137, I don't think there's a way to add PBC at the C level. My best solution to that was to add too many coordinates to the KDTree (replicate the images), and this is done outside of the KDTree. So having our own KDTree doesn't affect the issue |
I'm just porting the selection object to use the sklearn KDTree. Some of the tests are failing right now. But just seeing that the tests in |
I hit a similar test failure in my half-way cleanup of the selection module (I left that on stand-by pending the development of the new I had reflowed much of the
I'm not sure which I prefer to be true... Here's my test error log (notice that something else also fails for |
Well then we should first refactor the tests to be readable. I have serious problems understanding what the current tests are doing exactly |
Ok the tests pass now. the |
So, was it something in your code, then? |
yup. didn't understand the old original code. now everything is working. you can check out my progress here. https://github.com/kain88-de/mdanalysis/tree/sklearn-kdtree |
Did we decide we actually wanted to include scikit-learn as a dependency for a core component? I'm not sure we resolved the question yet. |
On 16 Aug, 2015, at 12:00, David Dotson wrote:
I am against introducing such a heavy dependency unless there's a REALLY good reason. If the code is well encapsulated, we could simply take the code and replace the KD-tree one. Alternatively, trying a grid-search type algorithm (and writing it in cython ourselves) might be a good alternative. Oliver Beckstein * orbeckst@gmx.net |
It's faster then the BioPython version and very well maintained.
I don't like this approach. It just puts burden on MDAnalysis to maintain the forked code and merge upstream bug fixes. It is a better and more pythonic approach to use the other libraries directly. I have switched the KDTree in my branch to use the BioPython KDTree directly. This will also work |
Then, if switching in place is that easy I'd say that we depend on either BioPython or scikit-learn. If none are present we install the lighter BioPython. If scikit-learn is available we use that. I'm not sure, though, how to teach these kind of dependencies to The code then uses scikit-learn if available, otherwise it falls back to BioPython. We write a big notice on the install instructions, and maybe even issue a warning when BioPython is used saying speed is to be gained if scikit-learn is installed. This is all assuming it's easy to just switch frameworks. @kain88-de might share more insight on that (I imagine it won't be as straightforward as having a |
@mnmelo it is not that easy since the BioPython KDTree has a different API then the scikit-learn KDTree. Besides I would prefer to use just one to keep the code simple. |
Ah, that's a shame. But I understand that putting effort into making wrappers around equivalent foreign libraries shouldn't really be our priority here. |
Introducing For a start, one thing that vanilla Python (I mean I don't think that inclusion of I am eager to listen to everyone's opinions. |
I agree with @orbeckst about not to depend too much about |
I agree with @orbeckst as well. Although I very much respect scikit-learn and its high development standards, I don't consider it to be low-level enough in the python stack to make sense as a piece of the core of mdanalysis. It provides high-quality implementations of common machine learning algorithms with a consistent API; that is great, but I don't think that fits for this case. I think a Cythonized KDTree that is easily able to handle periodic boundary conditions is something we would find desirable. That is something we will have to tackle, and I think that's the only reasonable choice. |
We've already got scipy as a dependency, and we could use the KDTree there, but @kain88-de has already shown it's slightly slower (http://nbviewer.ipython.org/gist/kain88-de/3617d6ec20aaf3cb1ef4) I didn't see a problem with using sklearn, it looks like installing it has the same dependencies as numpy/scipy, which would fall into @orbeckst 's "install cleanly" clause. I've just tried making a virtual environment and installing sklearn and it looks painless enough? @dotsdl the periodic KDTree will be a wrapper around a regular KDTree, so I'd rather use an external (and hopefully optimised) solution and wrap around that |
I 've looked at benchmark for
|
The PR #395 is currently using the BioPython KDTree. That solution is slightly slower then the scikit-learn version but that shouldn't be noticed to much unless one is doing a lot of different selections I personally have scikit-learn installed by default anyway. They offer a huge variety of dimension reduction algorithms and clusterings. But I also understand if you want to keep the dependencies With regard to installing things. In my experience pip works pretty great today if you are using it consistently. I install pip packages always only for my user with ( |
Scipy is not a hard dependency, the core runs without it.
I think a fast grid search would be a viable alternative.
… We've already got scipy as a dependency,
|
Yes, but tree setup time really matters, as @richardjgowers pointed out in an earlier comment. |
@kain88-de , using the BioPython KD-tree implementation seems a good compromise. Can you clean up #395 (or submit a new clean PR); at the moment it does not pass the tests. However, the |
Only the atomneighbor search is used there. This class still exists but in its own module now. @richardjgowers just has to tell me where that module should ideally live and then I'll fix the failing test. |
Keep it in lib, analysis is more for whole tools rather than components. |
Oh I missed analysis.density because It doesn't have unit-tests. I change the API and I'm quite confident that it still works but I can't test that currently. I wanted to quickly run the example given |
Further standardized XDR error message reporting and exceptions (now mostly IOError). Removed reference to the old SWIG numpy.i in the credits.
- changed "Maintained by Oliver Beckstein and Elizabeth Denning" to "Maintained by the MDAnalysis Core Devs Team" - added Wouter Boomsma (for PR #976) - removed explicit mentions for KDTree (Hamelryck) because since 0.11.0 we do no longer include the actual code (see PR #395 and #383) - removed explicit mentions for helanal (Hall) because Ben Hall is a contributing author
JmhX7IYP+2pnu4iv7mzp+xy8W5x1pzFOZhzfS4LslB2Ac+4SENM1eAZebiZQYSYl qyYkj9bLS2RqC40dQVfAA4dMrAuWpNfgZkTeCCU7pm29TodN0hmljNG/rLoGG/iH ieA+pSultKwyryY0UZXHAMmNRIViEfy1VsjE2Hswc/AIiY9W1yfLONXVjtELjFsU jv4u1ljR+lhuP6ZllEwucKeD5ojQDyXBU9WXPi2bUwFEFpNdn5zXVRTV/kntb86d sbOE1TQpy+WeSNTyqmEtdjQRuLH21Jm9kTnWFjkKjw2lL+odUOAoOEC4tLSin30i 6KK2lVvSfXJi9e5XOL+nXWqvwr+jU5jqNV+PC60aEe9N1xhVapYroY84bIJwcuYQ 2Axu0vJWNLOpnLGAH7kYYaeEFJdsZt283kDy1paagWmXqyy3p3hh1VD8NEWWcXQ4 Da07VzZh61RvLHFWIzyXkizPjaCMCfCPIpwRACht5OJdRH0LwEhGRhJhSHEUdh7p WGNEQr5XVnd2Xb+dHfpBeq6a0vPmOv8oDXzRCnGpDeTjGE3kEQy8PGeDqm4dsD+B wOmeCYN/NmUnurpZrW2x151bKgCbcjwiTksCsIhZmN1kRHqAR+NkoHSUOiji6INw LvpyU91X7ji1yyYlhPTu =N0P/ -----END PGP SIGNATURE----- updated AUTHORS - changed "Maintained by Oliver Beckstein and Elizabeth Denning" to "Maintained by the MDAnalysis Core Devs Team" - added Wouter Boomsma (for PR MDAnalysis#976) - removed explicit mentions for KDTree (Hamelryck) because since 0.11.0 we do no longer include the actual code (see PR MDAnalysis#395 and MDAnalysis#383) - removed explicit mentions for helanal (Hall) because Ben Hall is a contributing author
Is there something against switching the KDTree library?
Currently you are using a pretty old C implementation. It would for example be possible
to use the scikit-learn KDTree.
The text was updated successfully, but these errors were encountered: