Skip to content
This repository has been archived by the owner on Jun 21, 2022. It is now read-only.

Add close method to ROOTDictionary. Fixes #438 #439

Merged
merged 4 commits into from
Jan 16, 2020
Merged

Add close method to ROOTDictionary. Fixes #438 #439

merged 4 commits into from
Jan 16, 2020

Conversation

tamasgal
Copy link
Contributor

When working with multiple files, the GC is taking too long to release some big memory chunks which are created by the ROOTDictionary sources due to cyclic reference (see #438). A manual call to gc.collect() is a workaround, however closing the source of the ROOTDictionary also triggers the clean-up reliable.

This pull request adds the close() method to ROOTDictionary which is also called in the __exit__ method when used with a context manager.

@sbinet
Copy link

sbinet commented Jan 16, 2020

s/ROOTDictionary/ROOTDirectory/ ?

@tamasgal
Copy link
Contributor Author

s/ROOTDictionary/ROOTDirectory/ ?

Yes 🤦‍♂ thanks 😉

@jpivarski
Copy link
Member

In doing this, we must remove

https://github.com/scikit-hep/uproot/blob/2ac123fa14df312c14f7807eaa5219f099f26e7c/uproot/source/memmap.py#L35-L36

because otherwise, the following would cause a segfault:

with uproot.open("...") as f:
    t = f["sometree"]

t.arrays(...)

All the other sources should have good "close" behavior. Somebody needs to review this and make sure it's sane.

I just looked now and I can't find the HTTP preload function. I could have sworn there used to be one there (that's why HTTP has a "parallel" parameter). Does anybody know what happened to this?

Somebody needs to "own" the uproot sources. I haven't been "owning" them sufficiently: I've been letting you guys (@tamasgal, @nsmith-, @chrisburr, @PerilousApricot) make changes that benefit downstream applications, because after all, that's what uproot exists for, but without some centralized coordination, what's good for one application might cause harm to another application. It doesn't help that I am not as knowledgeable about HTTP and XRootD as you are. Does anybody have a suggestion about how to make sure the "sources" part of uproot gets pwned?

@tamasgal
Copy link
Contributor Author

Hmm, I am of course aware of the segfault situation, however I usually do not even try things like this, I mean accessing objects created inside a context where memory mapping may happen in the "outer scope". It technically isn't another scope but lazy data will inevitably cause problems.

In such cases I usually grab the data inside the context in a way that I know that I am not depending on the resources which are about to be closed.

I think transferring the ownership is one way out of this, but it still feels unnatural to me since I would assume that my file is closed outside of the with context.

Maybe it's just me 🦆

@jpivarski
Copy link
Member

Educating users about proper context management is okay, but only if it happens via an exception with a good error message, not a segfault. The problem with segfaults is that they could mean a wide variety of things, and the general contract with Python is that segfaults simply shouldn't happen—users shouldn't have to debug them, ever. (The only cases I know of that allow segfaults in Python go through NumPy, the standard library ctypes module, or other extension modules. In most of these cases, the user has to actively seek a segfault, unlike the closed memmap, which could happen accidentally.)

So there's another alternative: MemmapSource.close can be allowed to close files—I think someone requested that once—as long as any attempts to use the memmap afterward raise exceptions. I can assert that all attempts to read the memmap go through the MemmapSource.data method:

https://github.com/scikit-hep/uproot/blob/2ac123fa14df312c14f7807eaa5219f099f26e7c/uproot/source/memmap.py#L38-L49

so if this method can check for a closed file (I don't know if it has a flag, but we can add one), it can raise an exception if the file has been closed, narrowly escaping the segfault. It should also be relatively lightweight, as nearly all file-reading goes through this data method.

The issue of setting good, consistent policies for all uproot sources (i.e. "owning" them) is a bigger issue, though.

@tamasgal
Copy link
Contributor Author

Yes, I fully agree that we should avoid segfaults by any means, don't get me wrong! I am not satisfied with my PR either, but I hope we figure out some nice way to get this in since calling the GC manually also feels awkward 😉

@tamasgal
Copy link
Contributor Author

I pushed another commit which avoids the segfault. Maybe this goes into the right direction?

@tamasgal
Copy link
Contributor Author

Here is it in action:

░ tgal@staticbox:~/tmp/uproot-mem km3io 
░ 17:50:50 1 > cat own.py 
import uproot

with uproot.open("testfile.root") as f:
    t = f['E']
    a = t['Evt']['frame_index'].array()
    b = t['Evt']['frame_index'].lazyarray()

print(a)
print(b)

░ tgal@staticbox:~/tmp/uproot-mem km3io 
░ 17:50:54 > python own.py
[   558    577    577 ... 152999 153000 153000]
Traceback (most recent call last):
  File "own.py", line 9, in <module>
    print(b)
  File "/home/tgal/Dev/km3io/venv/lib/python3.8/site-packages/awkward/array/chunked.py", line 273, in __str__
    return super(ChunkedArray, self).__str__()
  File "/home/tgal/Dev/km3io/venv/lib/python3.8/site-packages/awkward/array/base.py", line 101, in __str__
    first = self[:3]
  File "/home/tgal/Dev/km3io/venv/lib/python3.8/site-packages/awkward/array/chunked.py", line 428, in __getitem__
    chunk = self._chunks[chunkid][(slice(local_start, local_stop, step),)]
  File "/home/tgal/Dev/km3io/venv/lib/python3.8/site-packages/awkward/array/virtual.py", line 369, in __getitem__
    return self.array[where]
  File "/home/tgal/Dev/km3io/venv/lib/python3.8/site-packages/awkward/array/virtual.py", line 295, in array
    return self.materialize()
  File "/home/tgal/Dev/km3io/venv/lib/python3.8/site-packages/awkward/array/virtual.py", line 326, in materialize
    array = self._util_toarray(self._generator(*self._args, **self._kwargs), self.DEFAULTTYPE)
  File "/home/tgal/Dev/uproot/uproot/tree.py", line 1952, in __call__
    return self.branch.array(interpretation=self.interpretation, entrystart=entrystart, entrystop=entrystop, flatten=self.flatten, awkwardlib=self.awkwardlib, cache=None, basketcache=self.basketcache, keycache=self.keycache, executor=self.executor, blocking=True)
  File "/home/tgal/Dev/uproot/uproot/tree.py", line 1381, in array
    basket_itemoffset = self._basket_itemoffset(interpretation, basketstart, basketstop, keycache)
  File "/home/tgal/Dev/uproot/uproot/tree.py", line 1334, in _basket_itemoffset
    for j, key in enumerate(self._threadsafe_iterate_keys(keycache, True, basketstart, basketstop)):
  File "/home/tgal/Dev/uproot/uproot/tree.py", line 1040, in _threadsafe_iterate_keys
    key = self._basketkey(keysource, i, complete)
  File "/home/tgal/Dev/uproot/uproot/tree.py", line 1757, in _basketkey
    return self._BasketKey(source.parent(), Cursor(self._fBasketSeek[i]), self.compression, complete)
  File "/home/tgal/Dev/uproot/uproot/tree.py", line 1635, in __init__
    self._fNbytes, self._fVersion, self._fObjlen, self._fDatime, self._fKeylen, self._fCycle, self._fSeekKey, self._fSeekPdir = cursor.fields(source, TBranchMethods._BasketKey._format_small)
  File "/home/tgal/Dev/uproot/uproot/source/cursor.py", line 46, in fields
    return format.unpack(source.data(start, stop))
  File "/home/tgal/Dev/uproot/uproot/source/memmap.py", line 49, in data
    if stop > len(self.source):
  File "/home/tgal/Dev/uproot/uproot/source/memmap.py", line 26, in source
    raise IOError("The file handler has already been closed.")
OSError: The file handler has already been closed.

@tamasgal
Copy link
Contributor Author

Sorry for the spam, I just read your comment about the MemmapSource.data method being the only entry point. I thought it might be better to be closer to the "source", but if you think it's a better location I have no objections!

@jpivarski
Copy link
Member

A property also works; it also protects against users who dig down to the Source level and try to use it in unapproved ways.

There's a tiny question about performance because this is hit so often, but I'm willing to forego that.

Anyway, I don't want to lose the thread about ownership of uproot sources, since they're tweaked so often like this. Maybe this PR isn't the best place to do that—maybe an issue, though "ownership" is not a very concrete thing, unlike a bug.

@tamasgal
Copy link
Contributor Author

tamasgal commented Jan 16, 2020

I think the ownership is a separate issue since it would only shift the memory problem to another place. I also tried to do del self._context.source inside the __exit__ method of the ROOTDictionary which decreases the refcount, but the assignment to another variable inside the with context of course increases it. The source is alive after the context closing but the memory management stays the same, so it does not seem to have any positive effect on the GC behaviour.

@nsmith-
Copy link
Member

nsmith- commented Jan 16, 2020

Well to transfer ownership, it's worth thinking again about the size of the API. I'd re-iterate here that having a almost-trivial buffer interface as the extent of the source API seems like the best plan, where the uproot code would have to find some other mechanism to supply optimization hints to the concrete Source. Regardless, let's make an issue to discuss the Source thing.

@jpivarski
Copy link
Member

Yes, let's move the question of ownership to an issue. Before we do that, however, I think there's a (very understandable!) confusion about what I meant: not "which object in the code owns references to the data" but "which human being is responsible for making sure that reasonable-looking changes to one thing don't break another thing."

I brought it up here because the segfault thing would have been a disaster, and I don't know what other disasters lurk. Another example was that I couldn't even find the HTTPSource.preload that I'm sure had been there once and I don't know how it got lost. Despite being the author, I'm pretty far removed from experience with remote file access, particularly XRootD.

All in all, it's not a lot of code:

> cd uproot/source
> wc source.py memmap.py chunked.py file.py http.py xrootd.py
   46   114  1144 source.py
   49   112  1266 memmap.py
  150   477  5158 chunked.py
   50   148  1751 file.py
   59   188  2172 http.py
  109   344  4189 xrootd.py
  463  1383 15680 total

But let's move that to an issue and just make this PR about ROOTDirectory.__exit__ and closing MemmapSource. The tests are failing because mmap.mmap doesn't have a closed attribute in Python 2.7. Just add a boolean to the MemmapSource object and use that instead.

@tamasgal
Copy link
Contributor Author

Yes, let's move the question of ownership to an issue. Before we do that, however, I think there's a (very understandable!) confusion about what I meant: not "which object in the code owns references to the data" but "which human being is responsible for making sure that reasonable-looking changes to one thing don't break another thing."

Ok now I see your point. I fully agree with the segfault potentially causing a disaster.
I don't know either where or how the responsibilities should be handled. You offer a quite extensive test suite with uproot and also an awesome documentation. Other than common sense and taking advantage of previously gained experience, there is in my point of view not much one can do. There is also this thing that users should be aware of problems which may occur when software updates happen (that's why we maintain changelogs, which I count as part of the documentation). I am also author of low level libraries (of course not at all comparable with the outreach of your work) and try my best to think of every possible use case which may be harmful or cost time (=money). I really feel bad that I did not bring up the nullpointer issue earlier since I was aware of it. So in this case I messed it up, however, you jumped in and brought it up 🙈 which brings me to the last component of good software: team work and open source development 😉

...but yes, the responsibility of a low-level library architect is a quite hot topic...

@tamasgal
Copy link
Contributor Author

Btw. where should I put the tests? I am a bit unsure...

@jpivarski
Copy link
Member

If you have tests, they wouldn't be too out of place in tests/test_issues.py. The tests in there are numbered by GitHub issue number, which are disjoint from GitHub PR number, so that would be a good place for a little test.

That said, I don't see much need for a test for something like this, especially now that the closed attribute is on MemmapSource, where we can see it and control it. A test of that would be almost a tautology. So if you don't already have a test that you want to add, I can merge this now because the tests that we do have have passed. Let me know explicitly that you're done before I merge it (so that we don't cross paths).

@tamasgal
Copy link
Contributor Author

Alright, let me just add one test for the exception, I think that should be ok.

@jpivarski jpivarski merged commit 8f93f04 into scikit-hep:master Jan 16, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants