Fastest way to check content of many files with uproot? #539

MoAly98 · 2022-01-17T16:40:38Z

Hi,

I'm writing a framework which will process ROOT files using uproot 4. In the framework, the user can request that TBranches with names branches = [br1, br2] are read from a TTree tree in N ROOT files. Before I pass these branches and the tree to uproot for reading, I would like to check that the branches and the tree are available in each of the files, and skip the files where the branches or the tree are missing. What would be the fastest way of doing this with uproot? At the moment I'm doing this check in a for loop with uproot.open(file) and then examining the available keys, but this is rather slow. I thought about using uproot.lazy and examining the keys, but somehow this is slower than using uproot.open.

I also wonder if you have suggestions on handling this at file reading time without breaking the code, in a way that would avoid the for loop altogether so that the project scales.

Thank you very much.

jpivarski (Maintainer) · 2022-01-17T17:28:46Z

Avoiding for loops only improves performance of numerical data processing when the data are in an array (flat memory buffer, not Python objects). It also can only apply if deriving the data doesn't do a lot of Pythonic for loops itself.

Finding out what TObjects and TBranches exist in a ROOT file is non-numeric, operating on data that are not array-like at all. In Uproot, this is implemented the only way it can be in Python: with loops, classes, dict-lookups, etc. After all of that work has been done, avoiding a for loop when inspecting it (which isn't possible, the way it's set up) wouldn't provide any performance advantage.

This fits the intended performance model of Uproot anyway: the idea is that there's a small number of TBranches (only thousands or tens of thousands) but a large number of numerical entries in the TTree (millions or billions). Pythonic code is used for the small stuff and NumPy casting is used for the big stuff, wherever possible (i.e. as long as the data types are numeric).

uproot.lazy is the same thing plus more work, setting up data-reading on demand. It's well-suited to exploration, in which it would be annoying to have to specify up front which TBranches you need. If you know which TBranches you need (for instance, because you're writing a framework), then uproot.lazy is only counter-productive. (This is especially true for remote files: lazy patterns generate many more individual requests as they discover what you're interested in, whereas latency-limited remote access is optimized by specifying what you want to read in as few requests as possible.)

If you know the name of the TTree you want to read, ask for it in square brackets rather than looping over the names of TObjects in the file:

with uproot.open(filename) as file:
    tree = file["name_of_ttree"]

This will cause Uproot to read the file header, use that to find the root directory, read the root directory (but none of the TObjects or subdirectories), put the "key names → TKey objects" into a dict for fast lookup, use the dict once to get the TKey corresponding to your TTree, and then read the TTree header, which includes the names and types of all the TBranches. At this point, none of the numerical data has been read.

If you instead loop over the TObject names in the file:

with uproot.open(filename) as file:
    for keyname in file:
        if keyname == "name_of_ttree":
            tree = file[keyname]
            break

then (I think) it does a recursive walk, which unnecessarily reads all the subdirectories. If your file is on a latency-limited remote network and you have a lot of subdirectories, the square-bracket lookup would save you a round-trip request-response for each subdirectory. If you want to ensure that the TTree exists, there's the in keyword (which I think reuses the "key names → TKey objects" dict). Catching KeyErrors can be slow because an Uproot KeyError collects the full list of key names and sorts them by similarity in spelling to what you asked for, to try to improve the user experience.
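
As a minimal sketch of the in-keyword check described above: the helper name get_tree_or_none is made up for illustration and is not part of Uproot. It only relies on the directory object supporting in and square brackets.

```python
def get_tree_or_none(file, tree_name):
    """Return file[tree_name] if that key exists, otherwise None.

    Hypothetical helper, not an Uproot function. `file` is expected to be
    the object returned by uproot.open, but anything that supports `in`
    and square-bracket lookup works.
    """
    # Test membership first instead of catching KeyError, which (as noted
    # above) can be slow because of how Uproot builds its error message.
    if tree_name in file:
        return file[tree_name]
    return None


# Usage with a real file (path and tree name are placeholders):
#
#     import uproot
#     with uproot.open("some_file.root") as file:
#         tree = get_tree_or_none(file, "name_of_ttree")
#         if tree is None:
#             ...  # skip this file
```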

If your ROOT files contain only one object and it's the TTree of interest, then the above isn't going to help at all.

Also, be sure that you're not checking for the existence of TBranches by reading the TBranches. You can do

"branch_name" in tree

to see if the TBranch exists by querying the already-read TTree metadata, rather than

array = tree["branch_name"].array()

to actually read the numerical data from the file.
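
For a list of requested names, the same membership test can be applied to each one. This is only a sketch: missing_branches is a hypothetical helper name, and it assumes nothing about the tree beyond support for the in keyword shown above.

```python
def missing_branches(tree, wanted):
    """List the requested branch names that are absent from `tree`.

    Hypothetical helper, not an Uproot function. Only membership tests
    are used; .array() and .arrays() are never called, so this does not
    ask for any numerical data.
    """
    return [name for name in wanted if name not in tree]


# Usage with an already-opened TTree (branch names are placeholders):
#
#     absent = missing_branches(tree, ["br1", "br2"])
#     if absent:
#         ...  # skip this file, or report which names were not found
```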

Anyway, that's the bottom of the barrel; there isn't much faster it could go, given what you're trying to do. ROOT (C++) and UnROOT (Julia) avoid the cost of interpreting the ROOT file headers in Python, which could be an issue if and only if you're not limited by disk access or remote network latency, which are independent of programming language.

Rereading your original question, it sounds like you're doing a first pass to see which files exist and have the TTrees and TBranches you need, then a second pass over the good ones. Since there is an unavoidable amount of hardware (disk and/or network) and software (Python interpretation) work involved in opening a file and reading TTree metadata, why do it twice? Why not skip missing files/missing TTrees/missing TBranches in the same loop that does the final processing of the data, so that you only read it once? Is it because a user might decide not to do the processing based on whether all the expected data exist or not?
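
A rough sketch of what such a single pass could look like follows. The function name process_files, the process callback, and the opener parameter are all hypothetical; opener exists only so the logic can be exercised without real ROOT files, and it defaults to uproot.open. The sketch also assumes that an unreadable or missing file surfaces as an OSError, which holds for a missing local file but may not cover every remote failure.

```python
def process_files(filenames, tree_name, branches, process, opener=None):
    """Open each file once; process the good ones and record why others were skipped.

    Returns (results, skipped), both dicts keyed by filename.
    """
    if opener is None:
        import uproot  # imported lazily so a stand-in opener can be used instead

        opener = uproot.open

    results, skipped = {}, {}
    for filename in filenames:
        try:
            with opener(filename) as file:
                if tree_name not in file:
                    skipped[filename] = "missing tree"
                    continue
                tree = file[tree_name]
                absent = [name for name in branches if name not in tree]
                if absent:
                    skipped[filename] = "missing branches: " + ", ".join(absent)
                    continue
                # Only at this point is numerical data requested, and only
                # for the branches that were asked for.
                results[filename] = process(tree.arrays(branches, library="np"))
        except OSError as err:
            # Assumption: a missing or unreadable file raises OSError.
            skipped[filename] = f"cannot open: {err}"
    return results, skipped
```

The skipped dict can be reported to the user at the end, so the metadata of each file is still read only once.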

Alternatively, if you have some control over your set of files, maybe you'd want to inspect them all once and put the relevant metadata into a database? Databases are a much faster format for querying than the headers of ROOT files. There would even be a potential to read the TTree metadata zero times in the user processing loop by also storing the TBranch interpretations and seek locations of the TBaskets, but that would be an advanced project (see Coffea).
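
To illustrate the simpler half of the database idea (names only, not interpretations or TBasket seek locations), here is a small sketch using SQLite from the Python standard library. The table layout and the function names build_index and files_with_all are invented for this example. How the metadata dict gets filled, for instance from a one-off scan of the files, is left out.

```python
import sqlite3


def build_index(connection, metadata):
    """Store {filename: {tree_name: [branch_name, ...]}} as one row per branch."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS branches (filename TEXT, tree TEXT, branch TEXT)"
    )
    rows = [
        (filename, tree, branch)
        for filename, trees in metadata.items()
        for tree, branch_names in trees.items()
        for branch in branch_names
    ]
    connection.executemany("INSERT INTO branches VALUES (?, ?, ?)", rows)
    connection.commit()


def files_with_all(connection, tree, wanted):
    """Return the filenames whose `tree` contains every branch in `wanted`."""
    wanted = list(dict.fromkeys(wanted))  # drop duplicates, keep order
    placeholders = ", ".join("?" for _ in wanted)
    query = (
        "SELECT filename FROM branches "
        f"WHERE tree = ? AND branch IN ({placeholders}) "
        "GROUP BY filename HAVING COUNT(DISTINCT branch) = ? "
        "ORDER BY filename"
    )
    parameters = [tree, *wanted, len(wanted)]
    return [row[0] for row in connection.execute(query, parameters)]
```

With such an index, choosing which files to process becomes a single query, and the ROOT files themselves are only opened for the final read.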
