-
Notifications
You must be signed in to change notification settings - Fork 648
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
replace netCDF4 with scipy.io.netcdf for the Amber NCDFReader and NCDFWriter #506
Comments
FWIW, I originally had an implementation of NetCDF readers and writers that was backend-agnostic (supporting scipy, netCDF4, and ScientificPython). The main annoyance there was actually in Python 3, where some packages returned I have since ripped out that code and replaced it with the scipy backend (and I pulled out |
@tylerjereddy do you have a little bit of time to benchmark the implementation in PR #503 so that we have a better idea of the real read/write performance? (– I just can't do it at the moment.) It is unlikely that the timings from the test suite are a good measure. Perhaps take the longest/biggest Amber trajectory that we have as a testfile and just time a couple of read/writes. |
@swails thanks for the pointer; my feeling is that simplicity would be better than supporting multiple libraries. The main decision would be if to bundle |
I have performed some initial benchmarks (summarized in plot below). As I have never used AMBER it is probably sensible for at least one other person to check my benchmark code, with the Jupyter notebook displayed in the github repo here: https://github.com/tylerjereddy/netcdf_bench/blob/master/netcdf_bench_MDA.ipynb You should be able to clone the test files I used as well, but do exercise caution with kernel restarts when building different versions of MDA (mentioned in the notebook as well). Also, does it make sense for MDAnalysis to write a larger |
can you annotate those lines? which belongs to which (scipy or netCDF) In [8]:
#MDA develop branch WRITE testing (commit hash: aac24f9)
%timeit -n 10 -r 5 test_write_ncdf()
10 loops, best of 5: 75 ms per loop
In [4]:
#MDA scipy-ncdf-debug branch WRITE testing (commit hash: 565aa0)
%timeit -n 10 -r 5 test_write_ncdf()
10 loops, best of 5: 173 ms per loop |
Afaik the file sizes should be identical, which might explain (some of) the difference in speed you're seeing. The filesize looks roughly double, so it could be either
|
|
Ok, I'll chip away at these things and update the benchmark post above as I go along. I also need to remove universe object generation from |
I've updated the plot above with the revised writing benchmarks. I can confirm that MDAnalysis is writing Also, if @orbeckst wants benchmarks with However, I think it is sensible to make sure the writer is behaving properly before investing any more time into refining the benchmarks. |
@tylerjereddy raise an issue for the format mix-up. I haven't heard any complaints but we should stay as close to the standards as possible. Eventually I'd like to see the effect of |
@tylerjereddy are we getting any closer to deciding what we want to do? It would be good to figure out for 0.13.0 which way to go. With the benchmarks so far #506 (comment) , we would seem to be sacrificing speed for ease of installation but perhaps these results were confounded by #518? |
may be profiling speed (%prun in ipython) and memory (%memit from memory_profiler in ipython) to see what really happened here? |
@orbeckst It would be sensible to repeat the benchmarks now that the netCDF writer actually works properly. We also wanted to expand the benchmarks to include |
So I played around with some benchmarks, I made a 1001 frame trajectory from another format. Reading this (
So I'm seeing the 5-6x speedup that was previously mentioned in #488 For writing, I see that scipy is slower than netcdf, 1-2 minutes for scipy and 30secs for netcdf. But I'd say that reading is more important than writing. I think all our slower benchmarks have come from devs using OSX(?) and I'm using Ubuntu. So it might be that the scipy implementation is a little gimped on OSX? |
if this happens, just use two back-ends (same api) for read (scipy) and write (netcdf)? :)) |
ah, just FYI that if using mmap=False Filename: ./memory/netcdf_file_scipy.py
Line # Mem usage Increment Line Contents
================================================
9 45.6 MiB 0.0 MiB @profile
10 def load_(filename=filename):
11 3384.6 MiB 3339.0 MiB with netcdf_file(filename, mmap=False) as fh:
12 3384.6 MiB 0.0 MiB c= fh.variables['coordinates'] mmap=True Filename: ./memory/netcdf_file_scipy.py
Line # Mem usage Increment Line Contents
================================================
9 45.6 MiB 0.0 MiB @profile
10 def load_(filename=filename):
11 81.2 MiB 35.6 MiB with netcdf_file(filename, mmap=True) as fh:
12 81.2 MiB 0.0 MiB c= fh.variables['coordinates'] |
@richardjgowers, did you do your tests with @hainm, we'd like to get rid of dependencies ;-) --- so supporting two different ncdf backends is not very attractive. |
@richardjgowers Yeah, the early stage benchmarks I did were on Mac OS X. |
- API is different (less feature rich in the pure python implementation) so it is not straightforward to support both netCDF4 and fall back on scipy.io.netcdf (or perhaps even pupyere, which is a single-file package from which the scipy code originated) - uses float32 for - positions - velocities - forces (port of resolution of issue #518) - change NCDFReader, NCDFWriter and test_netcdf - see discussion of issue #506 for performance benchmarks
Just as a note to ourselves: This issue is stalled because the writing performance of the pure-python (scipy) implementation is awful. Most of the discussion is in the PR #503 but has also been noted at ParmEd/ParmEd#619. A potential solution (short of an upstream fix... one can hope) is to use scipy (or the bundled netcdf) for reading, use netcdf for writing if available and fall back to the crappy scipy ncdf writer if nothing faster is available. (see #503 (comment)) |
ParmEd did at one point have support for the 3 most popular NetCDF libraries (scipy, ScientificPython, and netCDF4). The APIs were all quite similar. It won't be hard to add it back (once I find the time). I'll write back here when I do that in case it's any use to MDAnalysis. |
As a ParmEd update here -- I finally got around to optionally optimizing NetCDF writing with netCDF4 while still using scipy's implementation for reading (and writing when netCDF4 is not available). It was remarkably easy to do that: ParmEd/ParmEd#722 shows everything I had to do. Here's another argument in favor of using this solution: Many Amber users will probably have a hard time getting netCDF4 to work. This is what happens to me when I try:
The reason for this (which I figured out via But many users will have neither the patience or expertise to hunt that problem down and figure out how to fix it. If scipy's NetCDF module is available (which doesn't link against NetCDF libraries and so will never have this problem), then that adds convenience to a potentially large section of your user base (basically anyone with Amber installed). |
@swails thanks so much, ParmEd/ParmEd#722 does not look too awful so I think for the time being we'll do the same. The Amber |
I think do the same is good since |
Yes, default to reading fast with scipy.io.netcdf and writing fast with netCDF4 but fall back to writing slowly with scipy.io.netcdf (and maybe @tylerjreddy can raise an issue with the scipy folks to address the problem upstream). Oliver Beckstein * orbeckst@gmx.net |
You could also just cut out OTOH, anybody who's anybody doing data analysis will have scipy installed, so may not be worth it. |
In the PR it’s already bundled and used as a fall-back when scipy is not available.
|
- removed all imports of netCDF4 - API is different (less feature rich in the pure python implementation) so it is not straightforward to support both netCDF4 and fall back on scipy.io.netcdf (or perhaps even pupyere, which is a single-file package from which the scipy code originated) - uses float32 for - positions - velocities - forces (port of resolution of issue #518) - change NCDFReader, NCDFWriter and test_netcdf - see discussion of issue #506 for performance benchmarks
- removed all imports of netCDF4 - API is different (less feature rich in the pure python implementation) so it is not straightforward to support both netCDF4 and fall back on scipy.io.netcdf (or perhaps even pupyere, which is a single-file package from which the scipy code originated) - uses float32 for - positions - velocities - forces (port of resolution of issue #518) - change NCDFReader, NCDFWriter and test_netcdf - see discussion of issue #506 for performance benchmarks
- netcdf reads MUCH faster than netCDF4 but writes MUCH, MUCH slower than netCDF4: always read with netcdf but write with netCDF4 if available, otherwise use slow netcdf and warn - implements @swails 's solution from ParmEd/ParmEd#722 -- thank you!! - minimal testing: write the same trajectory with netcdf and with netCDF4 - NOTE: netCDF4 is not installed by default, use pip install MDAnalysis[AMBER] to request its installation but Amber users should also be aware of potential issues with the bundled netcdf library of Amber; see #506 (comment) for details - closes #506
- removed all imports of netCDF4 - API is different (less feature rich in the pure python implementation) so it is not straightforward to support both netCDF4 and fall back on scipy.io.netcdf (or perhaps even pupyere, which is a single-file package from which the scipy code originated) - uses float32 for - positions - velocities - forces (port of resolution of issue #518) - change NCDFReader, NCDFWriter and test_netcdf - see discussion of issue #506 for performance benchmarks
- netcdf reads MUCH faster than netCDF4 but writes MUCH, MUCH slower than netCDF4: always read with netcdf but write with netCDF4 if available, otherwise use slow netcdf and warn - implements @swails 's solution from ParmEd/ParmEd#722 -- thank you!! - minimal testing: write the same trajectory with netcdf and with netCDF4 - NOTE: netCDF4 is not installed by default, use pip install MDAnalysis[AMBER] to request its installation but Amber users should also be aware of potential issues with the bundled netcdf library of Amber; see #506 (comment) for details - closes #506
- removed all imports of netCDF4 - API is different (less feature rich in the pure python implementation) so it is not straightforward to support both netCDF4 and fall back on scipy.io.netcdf (or perhaps even pupyere, which is a single-file package from which the scipy code originated) - uses float32 for - positions - velocities - forces (port of resolution of issue #518) - change NCDFReader, NCDFWriter and test_netcdf - see discussion of issue #506 for performance benchmarks
- netcdf reads MUCH faster than netCDF4 but writes MUCH, MUCH slower than netCDF4: always read with netcdf but write with netCDF4 if available, otherwise use slow netcdf and warn - implements @swails 's solution from ParmEd/ParmEd#722 -- thank you!! - minimal testing: write the same trajectory with netcdf and with netCDF4 - NOTE: netCDF4 is not installed by default, use pip install MDAnalysis[AMBER] to request its installation but Amber users should also be aware of potential issues with the bundled netcdf library of Amber; see #506 (comment) for details - closes #506
- removed all imports of netCDF4 - API is different (less feature rich in the pure python implementation) so it is not straightforward to support both netCDF4 and fall back on scipy.io.netcdf (or perhaps even pupyere, which is a single-file package from which the scipy code originated) - uses float32 for - positions - velocities - forces (port of resolution of issue #518) - change NCDFReader, NCDFWriter and test_netcdf - see discussion of issue #506 for performance benchmarks - fixed skipif in netcdf tests
- netcdf reads MUCH faster than netCDF4 but writes MUCH, MUCH slower than netCDF4: always read with netcdf but write with netCDF4 if available, otherwise use slow netcdf and warn - implements @swails 's solution from ParmEd/ParmEd#722 -- thank you!! - minimal testing: write the same trajectory with netcdf and with netCDF4 - NOTE: netCDF4 is not installed by default, use the re-introduced [AMBER] install target pip install MDAnalysis[AMBER] to request its installation but Amber users should also be aware of potential issues with the bundled netcdf library of Amber; see #506 (comment) for details - closes #506
- removed all imports of netCDF4 - API is different (less feature rich in the pure python implementation) so it is not straightforward to support both netCDF4 and fall back on scipy.io.netcdf (or perhaps even pupyere, which is a single-file package from which the scipy code originated) - uses float32 for - positions - velocities - forces (port of resolution of issue #518) - change NCDFReader, NCDFWriter and test_netcdf - see discussion of issue #506 for performance benchmarks - fixed skipif in netcdf tests
- netcdf reads MUCH faster than netCDF4 but writes MUCH, MUCH slower than netCDF4: always read with netcdf but write with netCDF4 if available, otherwise use slow netcdf and warn - implements @swails 's solution from ParmEd/ParmEd#722 -- thank you!! - minimal testing: write the same trajectory with netcdf and with netCDF4 - NOTE: netCDF4 is not installed by default, use the re-introduced [AMBER] install target pip install MDAnalysis[AMBER] to request its installation but Amber users should also be aware of potential issues with the bundled netcdf library of Amber; see #506 (comment) for details - closes #506
- netcdf reads MUCH faster than netCDF4 but writes MUCH, MUCH slower than netCDF4: always read with netcdf but write with netCDF4 if available, otherwise use slow netcdf and warn - implements @swails 's solution from ParmEd/ParmEd#722 -- thank you!! - minimal testing: write the same trajectory with netcdf and with netCDF4 - NOTE: netCDF4 is not installed by default, use the re-introduced [AMBER] install target pip install MDAnalysis[AMBER] to request its installation but Amber users should also be aware of potential issues with the bundled netcdf library of Amber; see #506 (comment) for details - removed zlib and cmplevel kwargs from NETCDFWriter (only used with netCDF4) - closes #506
- netcdf reads MUCH faster than netCDF4 but writes MUCH, MUCH slower than netCDF4: always read with netcdf but write with netCDF4 if available, otherwise use slow netcdf and warn - implements @swails 's solution from ParmEd/ParmEd#722 -- thank you!! - minimal testing: write the same trajectory with netcdf and with netCDF4 - NOTE: netCDF4 is not installed by default, use the re-introduced [AMBER] install target pip install MDAnalysis[AMBER] to request its installation but Amber users should also be aware of potential issues with the bundled netcdf library of Amber; see #506 (comment) for details - removed zlib and cmplevel kwargs from NETCDFWriter (only used with netCDF4) - closes #506
- netcdf reads MUCH faster than netCDF4 but writes MUCH, MUCH slower than netCDF4: always read with netcdf but write with netCDF4 if available, otherwise use slow netcdf and warn - implements @swails 's solution from ParmEd/ParmEd#722 -- thank you!! - minimal testing: write the same trajectory with netcdf and with netCDF4 - NOTE: netCDF4 is not installed by default, use the re-introduced [AMBER] install target pip install MDAnalysis[AMBER] to request its installation but Amber users should also be aware of potential issues with the bundled netcdf library of Amber; see #506 (comment) for details - removed zlib and cmplevel kwargs from NETCDFWriter (only used with netCDF4) - closes #506 - updated CHANGELOG (@tylerjereddy and @orbeckst as authors)
- netcdf reads MUCH faster than netCDF4 but writes MUCH, MUCH slower than netCDF4: always read with netcdf but write with netCDF4 if available, otherwise use slow netcdf and warn - implements @swails 's solution from ParmEd/ParmEd#722 -- thank you!! - minimal testing: write the same trajectory with netcdf and with netCDF4 - NOTE: netCDF4 is not installed by default, use the re-introduced [AMBER] install target pip install MDAnalysis[AMBER] to request its installation but Amber users should also be aware of potential issues with the bundled netcdf library of Amber; see #506 (comment) for details - removed zlib and cmplevel kwargs from NETCDFWriter (only used with netCDF4) - closes #506 - updated CHANGELOG (@tylerjereddy and @orbeckst as authors)
- closes #506 - netcdf reads MUCH faster than netCDF4 but writes MUCH, MUCH slower than netCDF4: always read with netcdf but write with netCDF4 if available, otherwise use slow netcdf and warn - implements @swails 's solution from ParmEd/ParmEd#722 -- thank you!! - minimal testing: write the same trajectory with netcdf and with netCDF4 - NOTE: netCDF4 is not installed by default, use the re-introduced [AMBER] install target pip install MDAnalysis[AMBER] to request its installation but Amber users should also be aware of potential issues with the bundled netcdf library of Amber; see #506 (comment) for details - removed zlib and cmplevel kwargs from NETCDFWriter (only used with netCDF4) - comprehensive docs with background and reference to #506 - updated CHANGELOG (@tylerjereddy and @orbeckst as authors)
closes #506: Replacing netCDF4 with scipy.io.netcdf
Possibly related netcdf discussion in scipy: scipy/scipy#9157 |
As discussed in #488, it is possible to read Amber netcdf trajectories with the
scipy.io.netcdf
reader. According to @swails and others, reading performance is better with thescipy
reader. Additionally, thescipy
implementation is pure python (based on pupynere) and does not require installation of the netcdf libraries.@orbeckst and @tylerjereddy make PR #503 available, in which netCDF4 was replaced with scipy.io.netcdf. The question is: Should we replace
netCDF4
withscipy.io.netcdf
?Content of PR #503
The PR #503 replaces netCDF4 with
scipy.io.netcdf
and also bundles scipynetcdf.py
in order to fall back to it ifscipy
is not available. This makes MDAnalysis netcdf file handling fully self-contained.Performance issues
Preliminary benchmarking shows that reading is faster with
scipy.netcdf
than with netcdf (~5x) but writing is much slower (and the slowdown becomes worse for longer trajectories, which seems to be due to the underlying implementation in scipy.netcdf).The main question is if we want to trade read speed against write speed (with the possibility that it becomes infeasible to write long trajectories).
Points that were considered
scipy
as part of the installation (although we list it as an optional dependency for analysis – andgridDataFormats
also wants it); it is likely that anyone installing MDAnalysis will also install scipy.mmap=True
andmmap=False
, see next point)mmap=True
by default.netcdf_file
usesmmap()
so the trajectory has to remain open while arrays are accessed. This might make it more difficult to implement e.g. File descriptors that are opened and closed as needed for Universes #239 (and might confuse users). Usingmmap=False
could be an option if performance is not degraded (but need to check).Scientific.IO.NetCDF
API) so maintaining both libraries is cumbersome (but possible).History
The text was updated successfully, but these errors were encountered: