Replies: 1 comment
-
Finding out what TObjects and TBranches exist in a ROOT file is non-numeric work, operating on data that are not array-like at all. In Uproot, this is implemented the only way it can be in Python: with loops, classes, dict-lookups, etc. In any case, this fits the intended performance model of Uproot: the idea is that there's a small number of TBranches (only thousands or tens of thousands) but a large number of numerical entries in the TTree (millions or billions). Pythonic code is used for the small stuff and NumPy casting is used for the big stuff, wherever possible (i.e. as long as the data types are numeric).
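To make the "Pythonic code for the small stuff, NumPy casting for the big stuff" distinction concrete, here is a minimal sketch. It is not Uproot's actual code: the byte string is a made-up stand-in for the numerical payload of a TBasket, and the variable names are only illustrative.

```python
import struct

import numpy as np

# Stand-in for the payload of a TBasket: five big-endian float64 values.
# (ROOT serializes numbers big-endian; the values here are made up.)
raw = struct.pack(">5d", 1.0, 2.0, 3.0, 4.0, 5.0)

# "Small stuff" style: a Python-level loop, one interpreter step per value.
values_loop = [
    struct.unpack_from(">d", raw, 8 * i)[0] for i in range(len(raw) // 8)
]

# "Big stuff" style: reinterpret the whole buffer in one call,
# with no per-value work in the Python interpreter.
values_cast = np.frombuffer(raw, dtype=">f8")
```

Both give the same numbers, but the cost of the loop grows with the number of entries in Python, while the cast is a single call regardless of how many entries there are.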
If you know the name of the TTree you want to read, ask for it in square brackets rather than looping over the names of TObjects in the file:

    import uproot

    with uproot.open(filename) as file:
        tree = file["name_of_ttree"]

This will cause Uproot to read the file header, use that to find the root directory, read the root directory (but none of the TObjects or subdirectories), put the "key names → TKey objects" into a dict for fast lookup, use the dict once to get the TKey corresponding to your TTree, and then read the TTree header, which includes the names and types of all the TBranches. At this point, none of the numerical data has been read.

If you instead loop over the TObject names in the file:

    with uproot.open(filename) as file:
        for keyname in file:
            if keyname == "name_of_ttree":
                tree = file[keyname]
                break

then (I think) it does a recursive walk, which unnecessarily reads all the subdirectories. If your file is on a latency-limited remote network and you have a lot of subdirectories, asking for the TTree by name would save you a round-trip request-response for each subdirectory. If you want to ensure that the TTree exists, there's the [...]

If your ROOT files contain only one object and it's the TTree of interest, then the above isn't going to help at all.

Also, be sure that you're not checking for the existence of TBranches by reading the TBranches. You can do "branch_name" in tree to see if the TBranch exists by querying the already-read TTree metadata, rather than array = tree["branch_name"].array(), which actually reads the numerical data from the file.

Anyway, that's the bottom of the barrel; there isn't much faster it could go, given what you're trying to do. ROOT (C++) and UnROOT (Julia) avoid the cost of interpreting the ROOT file headers in Python, which could be an issue if and only if you're not limited by disk access or remote network latency, which are independent of programming language.

Rereading your original question, it sounds like you're doing a first pass to see which files exist and have the TTrees and TBranches you need, then a second pass over the good ones. Since there is an unavoidable amount of hardware (disk and/or network) and software (Python interpretation) work involved in opening a file and reading TTree metadata, why do it twice? Why not skip missing files, missing TTrees, and missing TBranches in the same loop that does the final processing of the data, so that you only read each file once? Is it because a user might decide not to do the processing based on whether all the expected data exist or not?

Alternatively, if you have some control over your set of files, maybe you'd want to inspect them all once and put the relevant metadata into a database? Databases are a much faster format for querying than the headers of ROOT files.
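The single-pass suggestion (check and process in the same loop) could look roughly like the sketch below. The helper names find_tree and process_files are hypothetical, not part of Uproot. I'm assuming that a missing key raises a KeyError (or a subclass of it) and that a missing or unreadable file raises an OSError such as FileNotFoundError; treat both as assumptions to verify against your Uproot version. The opener argument exists only so that the logic can be exercised without real ROOT files.

```python
def find_tree(directory, tree_name, branch_names):
    """Return the TTree if it and all requested TBranches exist, else None.

    Only metadata is consulted: the lookup by name and the `in` checks
    do not read any numerical data.
    """
    try:
        tree = directory[tree_name]  # direct lookup, no loop over keys
    except KeyError:  # assumption: a missing key raises KeyError
        return None
    if all(name in tree for name in branch_names):
        return tree
    return None


def process_files(filenames, tree_name, branch_names, opener=None):
    """Open each file once; skip it if anything requested is missing."""
    if opener is None:
        import uproot  # imported lazily so the helper above works without it

        opener = uproot.open

    results = {}
    for filename in filenames:
        try:
            with opener(filename) as directory:
                tree = find_tree(directory, tree_name, branch_names)
                if tree is None:
                    continue  # missing TTree or TBranch: skip this file
                # The numerical data are read here, and only here.
                results[filename] = tree.arrays(branch_names, library="np")
        except OSError:  # assumption: missing/unreadable files raise OSError
            continue
    return results
```

A call such as process_files(filenames, "name_of_ttree", ["br1", "br2"]) would then return arrays only for the files that passed every check, having opened each file exactly once.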
There would even be a potential to read the TTree metadata zero times in the user processing loop by also storing the TBranch interpretations and seek locations of the TBaskets, but that would be an advanced project (see Coffea).
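For the simpler version of the database idea (storing only which TBranches each file has, not interpretations or seek locations), a minimal sketch with the standard-library sqlite3 module might look like this. The table layout and the function names create_index, record_file, and files_with are all hypothetical; the one-time scan that fills the table would get the branch names from the TTree metadata, e.g. tree.keys().

```python
import sqlite3


def create_index(conn):
    """Create a (filename, tree, branch) table; the schema is illustrative."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS branches ("
        "filename TEXT, tree TEXT, branch TEXT, "
        "PRIMARY KEY (filename, tree, branch))"
    )


def record_file(conn, filename, tree_name, branch_names):
    """Store the branch names found in one TTree of one file."""
    conn.executemany(
        "INSERT OR REPLACE INTO branches VALUES (?, ?, ?)",
        [(filename, tree_name, branch) for branch in branch_names],
    )


def files_with(conn, tree_name, branch_names):
    """Return the files whose TTree has every requested branch."""
    wanted = sorted(set(branch_names))
    if not wanted:
        return []  # nothing requested; this sketch returns no files
    placeholders = ", ".join("?" for _ in wanted)
    rows = conn.execute(
        "SELECT filename FROM branches "
        f"WHERE tree = ? AND branch IN ({placeholders}) "
        "GROUP BY filename HAVING COUNT(DISTINCT branch) = ? "
        "ORDER BY filename",
        [tree_name, *wanted, len(wanted)],
    )
    return [row[0] for row in rows]
```

After the one-time scan, files_with(conn, "tree", ["br1", "br2"]) answers the "which files are usable?" question without opening any ROOT file. The price is that the index has to be refreshed whenever the set of files changes.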
-
Hi,
I'm writing a framework which will process ROOT files using Uproot 4. In the framework, the user can request that TBranches with names branches = [br1, br2] are read from a TTree tree in N ROOT files. Before I hand these branches and the tree to Uproot to do the reading, I would like to check that the branches and the tree are available in each of the files and skip the files where the branches or the tree are missing. What would be the fastest way of doing this with Uproot? At the moment I'm doing this check in a for loop with uproot.open(file), then examining the available keys, but this is rather slow. I thought about using uproot.lazy and examining the keys, but somehow this is slower than using uproot.open.

I also wonder if you have suggestions on handling this at file reading time without breaking the code, in a way that would avoid the for loop altogether so that the project scales.

Thank you very much.