Hi all, I have been reading the Uproot documentation carefully, but I can't find exactly what I want to do. I need to read a ROOT file, define some new branches using the data in existing branches, and apply some cuts to this tree. Using RDataFrame, for example, this is very easy because I can read the tree, .Define(---) variables, and .Filter(---) them, but I cannot do the same in Uproot. I have tried to read the files using … Thanks in advance for your time; this might be a stupid question.
Not a stupid question, of course! I was going to say that a similar thing was asked on Gitter, but I see that was you. Since you've been working on this for a few days, I'll write out a detailed explanation.

The only hard/not currently possible part of what you want to do is adding branches to an existing TTree. The problem there is that Uproot would have to overwrite the TTree metadata (the part of the file that specifies what branches exist and what their types are) but not the preexisting data. That function just hasn't been written, and I can tell you that such a function wouldn't be a small extension of what already exists. It would be a large project.

If you want to write the results to a different file (or not write the results at all, but histogram them or something), then that's a completely different thing. I'll address that.

Instead of jumping to the answer, I'm going to walk through the thought process that I'd use to approach a problem like this. Suppose that we have a bunch of ROOT files. I'd start by picking one of them and opening it in an interactive prompt. I use the basic Python prompt directly, but IPython and Jupyter are also good. The main thing is to avoid putting all of your code in a script and then puzzling over why it doesn't run or produce any output. I don't have your sample, so let's say that the file is uproot-Zmumu.root:

>>> import uproot, skhep_testdata
>>> skhep_testdata.data_path("uproot-Zmumu.root")
'/home/jpivarski/.local/skhepdata/uproot-Zmumu.root'

First, I'll open it without even using a "with" statement:

>>> infile = uproot.open(skhep_testdata.data_path("uproot-Zmumu.root"))
>>> events = infile["events"]
>>> events.show()
name                 | typename                 | interpretation
---------------------+--------------------------+-------------------------------
Type                 | char*                    | AsStrings()
Run                  | int32_t                  | AsDtype('>i4')
Event                | int32_t                  | AsDtype('>i4')
E1                   | double                   | AsDtype('>f8')
px1                  | double                   | AsDtype('>f8')
py1                  | double                   | AsDtype('>f8')
pz1                  | double                   | AsDtype('>f8')
pt1                  | double                   | AsDtype('>f8')
eta1                 | double                   | AsDtype('>f8')
phi1                 | double                   | AsDtype('>f8')
Q1                   | int32_t                  | AsDtype('>i4')
E2                   | double                   | AsDtype('>f8')
px2                  | double                   | AsDtype('>f8')
py2                  | double                   | AsDtype('>f8')
pz2                  | double                   | AsDtype('>f8')
pt2                  | double                   | AsDtype('>f8')
eta2                 | double                   | AsDtype('>f8')
phi2                 | double                   | AsDtype('>f8')
Q2                   | int32_t                  | AsDtype('>i4')
M                    | double                   | AsDtype('>f8')

What I want to do next is going to come in two stages: I want to read data and then perform calculations. RDataFrame "lazily" gets data when you've defined some actions for it, but Uproot is "eager": it does what you say when you tell it to. (There's an uproot.lazy, but we're rethinking how it ought to work because we'd like to incorporate Dask, so I won't talk about that.)

One way to do this is to read all the data from the file into memory and then use what you want for calculations, but this is wasteful. Another is to read each array as you need it (what uproot.lazy currently does), but that's not a good pattern, either. (RDataFrame's pattern of reading all of the branches you need when you say "go" is what Dask does, which is why we're thinking about that.) Since you know that you want to calculate things with px1, py1, and E1, you can ask for exactly those branches:

>>> events.arrays(["px1", "py1", "E1"])
<Array [{px1: -41.2, py1: 17.4, ... E1: 81.6}] type='2304 * {"px1": float64, "py...'>

In a realistic situation, you'd probably want to read "everything that starts with 'Muon_'," which would be laborious to type this way. Instead of a list of exact names, you can pass wildcard patterns to the "filter_name" argument:

>>> events.arrays(filter_name=["p*1", "E1"])
<Array [{E1: 82.2, ... phi1: 0.037}] type='2304 * {"E1": float64, "px1": float64...'>

However, you might have just attempted to read too much from disk. What if there are a lot of TBranches starting with "p" and ending with "1"? You can check this more carefully by passing exactly the same "filter_name" to uproot.TTree.keys, which doesn't read any data:

>>> events.keys(filter_name=["p*1", "E1"])
['E1', 'px1', 'py1', 'pz1', 'pt1', 'phi1']

That filter is too broad. Let's narrow it to "p[xy]1":

>>> events.keys(filter_name=["p[xy]1", "E1"])
['E1', 'px1', 'py1']

Better. In the two-step procedure, the first step is to read the data. Let's put it in a variable (suggestively) named "batch":

>>> batch = events.arrays(filter_name=["p[xy]1", "E1"])
>>> batch
<Array [{E1: 82.2, px1: -41.2, ... py1: 1.2}] type='2304 * {"E1": float64, "px1"...'>

This is an Awkward Array (because we didn't ask for NumPy with library="np"). Its type says that each of the 2304 entries is a record with three fields:

>>> batch.type
2304 * {"E1": float64, "px1": float64, "py1": float64}

but it's very easy (computationally inexpensive) to pull out a purely numerical array for each field individually:

>>> batch.E1
<Array [82.2, 62.3, 62.3, ... 81.3, 81.3, 81.6] type='2304 * float64'>
>>> batch.px1
<Array [-41.2, 35.1, 35.1, ... 32.4, 32.5] type='2304 * float64'>
>>> batch.py1
<Array [17.4, -16.6, -16.6, ... 1.2, 1.2, 1.2] type='2304 * float64'>

Furthermore, these can be used in calculations to make new arrays:

>>> import numpy as np
>>> pt1 = np.sqrt(batch.px1**2 + batch.py1**2)
>>> pt1
<Array [44.7, 38.8, 38.8, ... 32.4, 32.4, 32.5] type='2304 * float64'>

As in NumPy, expressions involving full arrays apply to each element of the arrays. The result is the same as if we had applied the expression to just the first element:

>>> np.sqrt(batch.px1[0]**2 + batch.py1[0]**2)
44.73220000003612

but doing it one array ("column") at a time is much faster in Python. The cut is just another expression, though its result is an array of booleans:

>>> cut = (pt1 > 50) & ((batch.E1 > 100) | (batch.E1 < 90))
>>> cut
<Array [False, False, False, ... False, False] type='2304 * bool'>

An array of booleans applied to an Awkward Array performs the cut:

>>> batch[cut]
<Array [{E1: 133, px1: 71.1, ... py1: -26.1}] type='268 * {"E1": float64, "px1":...'>

(Note that the length of this array, 268, is about a tenth of the original 2304.)

Now I should talk about the "aliases" keyword argument. The uproot.TTree.arrays function can compute quantities, and the "aliases" argument defines subexpressions that can be used in other expressions. The first argument of uproot.TTree.arrays is interpreted as a list of expressions to compute:

>>> events.arrays(["sqrt(px1**2 + py1**2)"])
<Array [{'sqrt(px1**2 + py1**2)': 44.7, ... ] type='2304 * {"sqrt(px1**2 + py1**...'>

You probably don't want the field name of that Awkward Array to be "sqrt(px1**2 + py1**2)", so you can use "aliases" to give it a better name:

>>> events.arrays(["pt1"], aliases={"pt1": "sqrt(px1**2 + py1**2)"})
<Array [{pt1: 44.7}, ... {pt1: 32.4}] type='2304 * {"pt1": float64}'>

So far, so good. You also wanted to apply a cut, and the "cut" argument takes an expression that can make use of the aliases:

>>> events.arrays(["pt1"], cut="(pt1 > 50) & ((E1>100) | (E1<90))", aliases={"pt1": "sqrt(px1**2 + py1**2)"})
<Array [{pt1: 77}, ... {pt1: 72.9}] type='269 * {"pt1": float64}'>

And if you really did want to read every single TBranch in the TTree, you could just leave off the first argument:

>>> events.arrays(cut="(pt1 > 50) & ((E1>100) | (E1<90))", aliases={"pt1": "sqrt(px1**2 + py1**2)"})
<Array [{Type: 'GT', Run: 148031, ... M: 96.1}] type='269 * {"Type": string, "Ru...'>

So I could have answered this question by just saying, "Put the letters '…'", but there are reasons that the step-by-step process matters.
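As a rough mental model only (this is not Uproot's actual implementation, and the evaluate helper below is invented for illustration), you can think of "aliases" and "cut" as expression strings evaluated in a namespace of branch arrays:

```python
import numpy as np

def evaluate(expression, arrays):
    # Evaluate an expression string against named arrays (illustration only;
    # Uproot parses and evaluates expressions itself).
    namespace = {"sqrt": np.sqrt}
    namespace.update(arrays)
    return eval(expression, {"__builtins__": {}}, namespace)

arrays = {
    "px1": np.array([-41.2, 35.1, 32.5]),
    "py1": np.array([17.4, -16.6, 1.2]),
    "E1": np.array([82.2, 62.3, 81.6]),
}

# an "alias" defines a derived quantity by name...
arrays["pt1"] = evaluate("sqrt(px1**2 + py1**2)", arrays)

# ...and a "cut" is just a boolean-valued expression in the same namespace
mask = evaluate("(pt1 > 40) & ((E1 > 100) | (E1 < 90))", arrays)
selected = {name: values[mask] for name, values in arrays.items()}
```

With these three illustrative events, only the first passes the cut, so every array in selected has length 1.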
Continuing from there, the next thing to consider is scaling this up to multiple files. The uproot.TTree.iterate function supports the same arguments as uproot.TTree.arrays, but it iterates over batches. Thus, when we get a block of code working for one batch, we can replace "arrays" with "iterate" and put it in a loop:

>>> for batch in events.iterate(filter_name=["p[xy]1", "E1"]):
...     pt1 = np.sqrt(batch.px1**2 + batch.py1**2)
...     cut = (pt1 > 50) & ((batch.E1 > 100) | (batch.E1 < 90))
...     print(batch[cut])
...
[{E1: 133, px1: 71.1, py1: 29.5}, {E1: 88.1, ... {E1: 169, px1: -68, py1: -26.1}]

This finished in one iteration because the file is so small; note that you can tune "step_size" as needed. Now for multiple files: uproot.iterate supports the same arguments as uproot.TTree.iterate, but it takes wildcarded file names (with a TTree name after a colon):

>>> filenames = "~/.local/skhepdata/uproot-Zmumu*.root:events"
>>> for batch in uproot.iterate(filenames, filter_name=["p[xy]1", "E1"]):
...     pt1 = np.sqrt(batch.px1**2 + batch.py1**2)
...     cut = (pt1 > 50) & ((batch.E1 > 100) | (batch.E1 < 90))
...     print(batch[cut])
...
[{E1: 133, px1: 71.1, py1: 29.5}, {E1: 88.1, ... {E1: 169, px1: -68, py1: -26.1}]
[{E1: 133, px1: 71.1, py1: 29.5}, {E1: 88.1, ... {E1: 169, px1: -68, py1: -26.1}]
[{E1: 133, px1: 71.1, py1: 29.5}, {E1: 88.1, ... {E1: 169, px1: -68, py1: -26.1}]
[{E1: 133, px1: 71.1, py1: 29.5}, {E1: 88.1, ... {E1: 169, px1: -68, py1: -26.1}]
[{E1: 133, px1: 71.1, py1: 29.5}, {E1: 88.1, ... {E1: 169, px1: -68, py1: -26.1}]
[{E1: 133, px1: 71.1, py1: 29.5}, {E1: 88.1, ... {E1: 169, px1: -68, py1: -26.1}]

(I happen to have a lot of files in that directory matching the "uproot-Zmumu*.root" pattern.)

Now if you want these results to go into a new file, you can open a file for writing:

>>> outfile = uproot.recreate("outfile.root")
>>> outfile.mktree("events", {"E1": np.float64, "px1": np.float64, "py1": np.float64})
<WritableTree '/events' at 0x7fb540f91fd0>
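(A quick aside on mktree's second argument: it maps branch names to types. For flat numerical branches, anything that NumPy's np.dtype accepts should work as the type specification; the dict below, a sketch with the same branch names, shows three equivalent spellings of a 64-bit float:)

```python
import numpy as np

# Three equivalent ways to spell a 64-bit float branch type:
branch_types = {
    "E1": np.float64,            # a NumPy scalar type
    "px1": np.dtype("float64"),  # an explicit dtype object
    "py1": "f8",                 # a dtype string
}

# all three specifications resolve to the same dtype
resolved = {name: np.dtype(spec) for name, spec in branch_types.items()}
```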
>>> for batch in uproot.iterate(filenames, filter_name=["p[xy]1", "E1"]):
...     pt1 = np.sqrt(batch.px1**2 + batch.py1**2)
...     cut = (pt1 > 50) & ((batch.E1 > 100) | (batch.E1 < 90))
...     outfile["events"].extend(batch[cut])
...

In a script, you would definitely want to use a "with" statement:

with uproot.recreate("outfile.root") as outfile:
    outfile.mktree("events", {"E1": np.float64, "px1": np.float64, "py1": np.float64})
    for batch in uproot.iterate(filenames, filter_name=["p[xy]1", "E1"]):
        pt1 = np.sqrt(batch.px1**2 + batch.py1**2)
        cut = (pt1 > 50) & ((batch.E1 > 100) | (batch.E1 < 90))
        outfile["events"].extend(batch[cut])

Now outfile.root contains a TTree named "events" with filtered TBranches E1, px1, and py1 in it. If that's not what you wanted, I'm sure you can refine the workflow.

The main, overarching point that I wanted to make, though, is the process. You don't start with a code block like the one I've written above: you open an interactive prompt, try out small things, and then scale them up. Uproot's interface was designed around that kind of process: uproot.TTree.arrays, uproot.TTree.iterate, and uproot.iterate take most of the same arguments so that you can try small tests with one and then swap it for the other. The same goes for uproot.TTree.keys and uproot.TTree.arrays: they take the same "filter_name" argument so that you can test a filter without reading data (which might be slow). The expression that you wrote, …