Iterate over two trees (with same entries numbering) over many files ? #545
-
Hi,

I have a bunch of files, where each file contains two trees: a

How do I iterate over those two trees over many files? Should I iterate over one tree and get the entry limits of each batch to read the second file (assuming there's some naming convention to infer the name of the second file from the first one)? Or is there a magic

Thanks,
Replies: 1 comment 1 reply
-
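Since the two trees are aligned by construction and you don't have to match values of one of TTree A's branches with one of TTree B's branches (as a `JOIN` key, as in SQL), you can just use Python's `zip`. I was about to say "`itertools.izip`" because you absolutely want the two `uproot.iterate` iterators to be evaluated lazily through the iteration (so the memory use doesn't explode), but I'm behind the times: Python 3's built-in `zip` does that.

Iteration goes over batches of data, not individual entries, so the main thing is that you'll want both `uproot.iterate` iterators to give you the same number of entries in every batch. You can do that by passing an integer to `step_size` (see the documentation).

Here's a small example:

>>> import uproot, skhep_testdata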
>>> left_iterator = uproot.iterate(
...     skhep_testdata.data_path("uproot-Zmumu-zlib.root") + ":events",
...     filter_name=["p[xyz][12]", "E[12]"],
...     step_size=1000,
... )
>>> right_iterator = uproot.iterate(
...     skhep_testdata.data_path("uproot-Zmumu-lzma.root") + ":events",
...     filter_name=["p[xyz][12]", "E[12]"],
...     step_size=1000,
... )
>>> for left, right in zip(left_iterator, right_iterator):
...     print(repr(left.px1))
...     print(repr(right.px1))
...     print()
...
<Array [-41.2, 35.1, 35.1, ... -5.52, -26.7] type='1000 * float64'>
<Array [-41.2, 35.1, 35.1, ... -5.52, -26.7] type='1000 * float64'>

<Array [26, 26, 26, ... -39.2, -39.2, 32.3] type='1000 * float64'>
<Array [26, 26, 26, ... -39.2, -39.2, 32.3] type='1000 * float64'>

<Array [-43.4, -43.4, -43.2, ... 32.4, 32.5] type='304 * float64'>
<Array [-43.4, -43.4, -43.2, ... 32.4, 32.5] type='304 * float64'>

If you use `uproot.iterate` to iterate over many files, rather than `uproot.TTree.iterate` to iterate over one TTree, you'll need to be careful that the files themselves align. If you use a wildcard (which goes to Python's `glob.glob`), you're at the mercy of whatever order the filesystem decides to give you the files. I personally wouldn't trust that; I'd either generate two explicit lists (evaluate `glob.glob` explicitly) and sort them, or generate one set of filenames from the other. If you pass explicit lists of strings (no wildcards) to `uproot.iterate`, it will use them in the order you request.

However, you can also just do your own loop over files and call `uproot.TTree.iterate` on each one. Be sure to put the `uproot.open` calls in a
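For the do-it-yourself route, a minimal sketch might look like the following. The helper names `paired_files` and `iterate_pairs`, the wildcard patterns, and the tree names in the usage comment are all hypothetical, and pairing files by sorted position is only an assumption about your naming convention. Opening the files in `with` statements is my choice here, so that each file is closed once its batches are exhausted.

```python
import glob


def paired_files(left_pattern, right_pattern):
    """Expand two wildcards, sort each list, and pair the files by position.

    Hypothetical helper: it assumes that sorting both lists lines the
    files up, which depends entirely on your naming convention.
    """
    left = sorted(glob.glob(left_pattern))
    right = sorted(glob.glob(right_pattern))
    if len(left) != len(right):
        raise ValueError(
            f"{len(left)} files match {left_pattern!r} but "
            f"{len(right)} match {right_pattern!r}"
        )
    return list(zip(left, right))


def iterate_pairs(pairs, left_tree, right_tree, step_size=1000, **options):
    """Yield (left_batch, right_batch) for every batch of every file pair.

    Hypothetical helper: extra keyword options (e.g. filter_name) are
    passed on to uproot.TTree.iterate for both trees.
    """
    import uproot  # imported here so paired_files works without uproot

    for left_path, right_path in pairs:
        with uproot.open(left_path) as left_file, \
                uproot.open(right_path) as right_file:
            left = left_file[left_tree]
            right = right_file[right_tree]
            # Aligned trees must have the same number of entries.
            if left.num_entries != right.num_entries:
                raise ValueError(
                    f"{left_path} and {right_path} have different lengths"
                )
            # An integer step_size gives both iterators identical batch
            # boundaries; zip consumes them lazily, one batch at a time.
            yield from zip(
                left.iterate(step_size=step_size, **options),
                right.iterate(step_size=step_size, **options),
            )


# Usage (hypothetical file patterns and tree names):
#
# pairs = paired_files("left_*.root", "right_*.root")
# for left, right in iterate_pairs(pairs, "events", "events"):
#     ...  # left and right cover the same entries
```

Note that `sorted` orders names lexicographically, so `file_10.root` comes before `file_2.root`; zero-padded numbers (or generating one filename from the other) avoid that trap. If both trees live in the same file, the same path can appear on both sides of a pair.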