File descriptors that are opened and closed as needed for Universes #239
So what's the limit on open file handles in Python? I guess an easy way to do this would be to keep a record of the file position at the end of each operation, then seek there the next time a file operation happens, i.e.:

```python
with open(self.filename, 'r') as f:
    f.seek(self._lastpos)
    # do file stuff
    self._lastpos = f.tell()
```

I'd just want to make sure that there's no performance hit for this, and if there is, make this behaviour optional. I'm not familiar with ChainReader, but it shouldn't really require all the files open at once anyway?
I suppose one could specifically rewrite the ChainReader to open and close underlying trajectory chunks on demand, but I think all of this will require a proper re-thinking of the way trajectory objects are currently handled. The current pattern is to open the file descriptor in …
If all file access were done via context managers, it would solve a lot of #173; `__setstate__` would then just load the last accessed frame and everything would be as it was. A bigger-hammer job would be to have Readers keep track of file positions between frames and build an array of offsets for each frame, i.e. #208 but format-independent and built as frames are accessed. This would make reading the nth frame of a large file much quicker the second time round, even in a new Python session.
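The format-independent offset array described above could be sketched roughly as follows. This is an illustrative toy, not the MDAnalysis implementation: the class name, `_offsets` attribute, and the use of `readline()` as a stand-in for real frame parsing are all assumptions made for the example.

```python
# Hypothetical sketch of a frame-offset cache built lazily as frames
# are read. One line of the file stands in for one "frame".
class OffsetCachingReader:
    def __init__(self, filename):
        self.filename = filename
        self._offsets = [0]  # byte offset at which frame i starts

    def read_frame(self, n):
        with open(self.filename, 'rb') as f:
            # resume from the deepest cached offset at or before frame n
            known = min(n, len(self._offsets) - 1)
            f.seek(self._offsets[known])
            frame = None
            for i in range(known, n + 1):
                frame = f.readline()  # stand-in for real frame parsing
                if i + 1 == len(self._offsets):
                    # record where the next frame starts
                    self._offsets.append(f.tell())
            return frame
```

The first pass pays the cost of scanning forward, but any later access to an already-visited frame becomes a single `seek`, which is the speed-up being discussed.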
Uses decorator to handle file handles (addresses Issue #239)
So I've had a play with this, and it looks very possible: I've got Readers to be picklable!
The current problem is that …
Looks like progress, and picklable Readers would be of interest for all the other issues surrounding persistence.

```python
with self.open():
    for ts in self:
        yield ts
```

… I assume that your changes heavily rely on the fact that you can do random access to frames? If this is the case then we should …
The way it currently works is that the next method (in base) is wrapped in a decorator which opens the file at the previous position, calls _read_next_timestep (in the subclass of base), returns the ts, stores the current file-handle position, and closes the file. I just need to check this approach doesn't kill performance! Streams and compressed files are indeed headaches to solve.
0414ca1#diff-99815fce73651366394eb2dd8fe2f872R1211 So all the magic happens in the base class; subclasses then just write their _read_next assuming that they have a file handle positioned where they left it last.
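The decorator pattern described in the last two comments could look something like the sketch below. This is a minimal illustration under assumed names (`reopen_for_read`, `_lastpos`, `_file`); it mirrors the discussion, not the actual MDAnalysis code in the linked commit.

```python
import functools

# Sketch: wrap a frame-reading method so the file is opened at the
# previously recorded position, read from, and closed again on return.
def reopen_for_read(method):
    @functools.wraps(method)
    def wrapper(self, *args, **kwargs):
        with open(self.filename, 'rb') as f:
            self._file = f
            f.seek(self._lastpos)          # resume where we left off
            try:
                result = method(self, *args, **kwargs)
                self._lastpos = f.tell()   # remember position for next call
            finally:
                self._file = None          # no handle survives the call
        return result
    return wrapper

class Reader:
    def __init__(self, filename):
        self.filename = filename
        self._lastpos = 0
        self._file = None

    @reopen_for_read
    def next(self):
        return self._read_next_timestep()

    def _read_next_timestep(self):
        # a subclass would parse a real frame; here one line is a "frame"
        line = self._file.readline()
        if not line:
            raise StopIteration
        return line
```

Between calls no file descriptor is held open, which is the point of the issue; the cost is one open/seek/close per frame read.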
- Uses decorator to handle file handles (addresses Issue #239)
- Iterating DLPoly reader only opens a single file handle
- Migrated TRZReader to MultiframeReader
- Still TODO: test on 2 GB+ file (where seeking involves long ints)
- XYZReader migrated to new style
What's the status on this one? Do we want to postpone it to a later milestone? It's not one of the big API-breaking ones, is it?
It won't be ready for a while, but I think it is doable. It shouldn't break anything when it does arrive though, so we can leave it out of 0.11. It will actually dovetail very nicely with #314, because this requires knowledge of offsets to make it work cleanly.
Ok, postponed.
Possibly move to 0.14?
It has come up in #850 that opening and closing file descriptors for every read is a problem when I run many (tens to thousands of) processes on an HPC filesystem like GPFS. Rapid opening and closing of files on these filesystems has a negative impact on their performance. It seems we have a trade-off: number of open Universe objects vs. running a lot of MDAnalysis processes in parallel. My use case is that I usually have only 2–3 open Universes per process, but a lot of processes during analysis.
So are we (@kain88-de) sure that opening files a lot is a bad idea? If so we can abandon this idea.
Can we keep the file descriptor open when using the trajectory as an iterator? That is, doing:

```python
for ts in u.trajectory:
    # do stuff, and keep file descriptor open
    # using trajectory as iterator doesn't have to re-seek, open/close fds
    pass
```

vs. doing:

```python
for i in range(1000):
    # trajectory indexing opens file, seeks to frame, reads, then closes file descriptor
    u.trajectory[i]
```

Is this a nice compromise? The issue with a file descriptor open for each trajectory is especially a problem when working with the …
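The compromise suggested above can be sketched as follows. This is a toy model under assumed names (`Trajectory`, `_offsets`), with lines of a text file standing in for trajectory frames; it is not the MDAnalysis API.

```python
# Sketch: iteration holds one file handle open for its whole duration,
# while random indexing opens, seeks, reads and closes per access.
class Trajectory:
    def __init__(self, filename):
        self.filename = filename
        self._offsets = None

    def _build_offsets(self, f):
        # byte offset at which each frame (line) begins
        self._offsets = [0]
        while f.readline():
            self._offsets.append(f.tell())
        self._offsets.pop()  # drop the trailing EOF offset

    def __iter__(self):
        # one open/close for the entire iteration
        with open(self.filename, 'rb') as f:
            for line in f:
                yield line

    def __getitem__(self, i):
        # open, seek, read, close on every indexed access
        with open(self.filename, 'rb') as f:
            if self._offsets is None:
                self._build_offsets(f)
            f.seek(self._offsets[i])
            return f.readline()
```

Sequential iteration (the common case) then touches the filesystem only twice, while random access pays the per-call open/close cost, which matches the GPFS concern raised earlier in the thread.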
Maybe this warrants a …
@richardjgowers is this as simple as just making the …
@dotsdl off the top of my head... I suspect each Reader implementation uses a random name for its file handle, so that will require squashing out. DCD and other C Readers might expect a name which is non-obvious from the Python. I think every "front door" to the base.Reader will require a …
@dotsdl your suggestion would be possible to implement. It actually sounds like a nice compromise if it also works for sliced iterators. But since you and I both reach different filesystem limits, and others might reach them too, we should definitely document this behaviour, maybe in a doc section "MDAnalysis on HPC clusters: do's and don'ts". @richardjgowers I don't think the ProtoReader needs to know the file handle names.
@kain88-de yeah, and in addition to responding to slicing in this way, we should make it so that this also works for fancy indexing. Basically, anything fed into …
You'll need to make sure that the "front doors" are only entered once too; for example, currently …
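One way to read the "single front door" idea from the last few comments is that every indexing style (int, slice, fancy index) should funnel through one guarded frame-reading entry point. The sketch below illustrates that dispatch; the class and attribute names are hypothetical, and an in-memory list stands in for on-disk frames.

```python
# Sketch: all indexing styles dispatch through one entry point,
# so open/seek/close logic lives in exactly one place.
class IndexingReader:
    def __init__(self, frames):
        self._frames = frames      # stand-in for on-disk frames
        self._open_count = 0       # counts entries through the front door

    def _read_frame(self, i):
        # the single guarded entry point: this is where a real reader
        # would open, seek, read and close the file
        self._open_count += 1
        return self._frames[i]

    def __getitem__(self, item):
        if isinstance(item, int):
            return self._read_frame(item)
        if isinstance(item, slice):
            return [self._read_frame(i)
                    for i in range(*item.indices(len(self._frames)))]
        # fancy indexing: an iterable of ints
        return [self._read_frame(i) for i in item]
```

With this layout, making slicing or fancy indexing keep a handle open for the whole request only requires changing `_read_frame` (or wrapping the loops), not every caller.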
- Added NewReader which doesn't maintain an open filehandle
- XYZReader changed to use NewReader
In an effort to make MDAnalysis scale well, it would be useful to make file descriptors open and close only as needed by a Universe. This would make it possible to have hundreds of Universes present in a single Python session without triggering an

```
IOError: [Errno 24] Too many open files
```

exception. For trajectories this would involve wrapping all methods that directly access trajectory details with a decorator to open the file, seek to the frame, and close the file on return. This wouldn't come with much of a performance hit if we have readers iterate by chunking (Issue #238).

This issue especially affects the `ChainReader`, since any number of trajectory files can be loaded and each keeps an open file descriptor while the Universe is present.