Hi all, I have been reading the Uproot documentation carefully, but I can't find exactly what I want to do. I need to read a ROOT file, define some new branches using the data in existing branches, and apply some cuts to this tree. Using RDataFrame, for example, this is very easy because I can read the tree, .Define(---) variables, and .Filter(---) them, but I cannot do the same in Uproot. I have tried to read the files using … Thanks in advance for your time; this might be a stupid question.
Not a stupid question, of course! I was going to say that a similar thing was asked on Gitter, but I see that was you. Since you've been working on this for a few days, I'll write out a detailed explanation.

The only hard/not currently possible part of what you want to do is adding branches to an existing TTree. The problem there is that Uproot would have to overwrite the TTree metadata (the part of the file that specifies what branches exist and what their types are) but not the preexisting data. That function just hasn't been written, and I can tell you that such a function wouldn't be a small extension of what already exists. It would be a large project.

If you want to write the results to a different file (or not write the results at all, but histogram them or something), then that's a completely different thing. I'll address that.

Instead of jumping to the answer, I'm going to walk through the thought process that I'd use to approach a problem like this. Suppose that we have a bunch of ROOT files. I'd start by picking one of them and opening it in an interactive prompt. I use the basic Python prompt directly, but IPython and Jupyter are also good. The main thing is to avoid putting all of your code in a script and then puzzling over why it doesn't run or produce any output. I don't have your sample, so let's say that the file is uproot-Zmumu.root:

>>> import uproot, skhep_testdata
>>> skhep_testdata.data_path("uproot-Zmumu.root")
'/home/jpivarski/.local/skhepdata/uproot-Zmumu.root'

First, I'll open it without even using a "with" statement:

>>> infile = uproot.open(skhep_testdata.data_path("uproot-Zmumu.root"))
>>> events = infile["events"]
>>> events.show()
name                 | typename                 | interpretation
---------------------+--------------------------+-------------------------------
Type                 | char*                    | AsStrings()
Run                  | int32_t                  | AsDtype('>i4')
Event                | int32_t                  | AsDtype('>i4')
E1                   | double                   | AsDtype('>f8')
px1                  | double                   | AsDtype('>f8')
py1                  | double                   | AsDtype('>f8')
pz1                  | double                   | AsDtype('>f8')
pt1                  | double                   | AsDtype('>f8')
eta1                 | double                   | AsDtype('>f8')
phi1                 | double                   | AsDtype('>f8')
Q1                   | int32_t                  | AsDtype('>i4')
E2                   | double                   | AsDtype('>f8')
px2                  | double                   | AsDtype('>f8')
py2                  | double                   | AsDtype('>f8')
pz2                  | double                   | AsDtype('>f8')
pt2                  | double                   | AsDtype('>f8')
eta2                 | double                   | AsDtype('>f8')
phi2                 | double                   | AsDtype('>f8')
Q2                   | int32_t                  | AsDtype('>i4')
M                    | double                   | AsDtype('>f8')

What I want to do next is going to come in two stages: I want to read data and then perform calculations. RDataFrame "lazily" gets data when you've defined some actions for it, but Uproot is "eager": it does what you say when you tell it to. (There's an uproot.lazy, but we're rethinking how it ought to work because we'd like to incorporate Dask, so I won't talk about that.)

One way to do this is to read all the data from the file into memory and then use what you want for calculations, but this is wasteful. Another is to read each array as you need it (what uproot.lazy currently does), but that's not a good pattern, either. (RDataFrame's pattern of reading all of the branches you need when you say "go" is what Dask does, which is why we're thinking about that.) Since you know that you want to calculate things with px1, py1, and E1, you can ask for exactly those branches:

>>> events.arrays(["px1", "py1", "E1"])
<Array [{px1: -41.2, py1: 17.4, ... E1: 81.6}] type='2304 * {"px1": float64, "py...'>

In a realistic situation, you'd probably want to read "everything that starts with 'Muon_'," which would be laborious to type this way. Instead of a list of exact names, you can pass wildcard patterns to the "filter_name" argument:

>>> events.arrays(filter_name=["p*1", "E1"])
<Array [{E1: 82.2, ... phi1: 0.037}] type='2304 * {"E1": float64, "px1": float64...'>

However, you might have just attempted to read too much from disk. What if there are a lot of TBranches starting with "p" and ending with "1"? You can check this more carefully by passing exactly the same "filter_name" to uproot.TTree.keys, which doesn't read any data:

>>> events.keys(filter_name=["p*1", "E1"])
['E1', 'px1', 'py1', 'pz1', 'pt1', 'phi1']

That filter is too broad. Let's narrow it to "p[xy]1":

>>> events.keys(filter_name=["p[xy]1", "E1"])
['E1', 'px1', 'py1']

Better. In the two-step procedure, the first step is to read the data. Let's put it in a variable (suggestively) named "batch":

>>> batch = events.arrays(filter_name=["p[xy]1", "E1"])
>>> batch
<Array [{E1: 82.2, px1: -41.2, ... py1: 1.2}] type='2304 * {"E1": float64, "px1"...'>

This is an Awkward Array (because we didn't ask for NumPy with library="np"). Its type says that each of the 2304 entries is a record with three fields:

>>> batch.type
2304 * {"E1": float64, "px1": float64, "py1": float64}

but it's very easy (computationally inexpensive) to pull out a purely numerical array for each field individually:

>>> batch.E1
<Array [82.2, 62.3, 62.3, ... 81.3, 81.3, 81.6] type='2304 * float64'>
>>> batch.px1
<Array [-41.2, 35.1, 35.1, ... 32.4, 32.5] type='2304 * float64'>
>>> batch.py1
<Array [17.4, -16.6, -16.6, ... 1.2, 1.2, 1.2] type='2304 * float64'>

Furthermore, these can be used in calculations to make new arrays:

>>> import numpy as np
>>> pt1 = np.sqrt(batch.px1**2 + batch.py1**2)
>>> pt1
<Array [44.7, 38.8, 38.8, ... 32.4, 32.4, 32.5] type='2304 * float64'>

As in NumPy, expressions involving full arrays apply to each element of the arrays. The result is the same as if we had applied the expression to just the first element:

>>> np.sqrt(batch.px1[0]**2 + batch.py1[0]**2)
44.73220000003612

but doing it one array ("column") at a time is much faster in Python. The cut is just another expression, though its result is an array of booleans:

>>> cut = (pt1 > 50) & ((batch.E1 > 100) | (batch.E1 < 90))
>>> cut
<Array [False, False, False, ... False, False] type='2304 * bool'>

An array of booleans applied to an Awkward Array performs the cut:

>>> batch[cut]
<Array [{E1: 133, px1: 71.1, ... py1: -26.1}] type='268 * {"E1": float64, "px1":...'>

(Note that the length of this array, 268, is about a tenth of the original 2304.)

Now I should talk about the "aliases" keyword argument. The uproot.TTree.arrays function can compute quantities, and the "aliases" argument defines subexpressions that can be used in other expressions. The first argument of uproot.TTree.arrays is interpreted as a list of expressions to compute:

>>> events.arrays(["sqrt(px1**2 + py1**2)"])
<Array [{'sqrt(px1**2 + py1**2)': 44.7, ... ] type='2304 * {"sqrt(px1**2 + py1**...'>

You probably don't want the field name of that Awkward Array to be "sqrt(px1**2 + py1**2)", so you can use "aliases" to give it a better name:

>>> events.arrays(["pt1"], aliases={"pt1": "sqrt(px1**2 + py1**2)"})
<Array [{pt1: 44.7}, ... {pt1: 32.4}] type='2304 * {"pt1": float64}'>

So far, so good. You also wanted to apply a cut, and the "cut" argument takes an expression that can make use of the aliases:

>>> events.arrays(["pt1"], cut="(pt1 > 50) & ((E1>100) | (E1<90))", aliases={"pt1": "sqrt(px1**2 + py1**2)"})
<Array [{pt1: 77}, ... {pt1: 72.9}] type='269 * {"pt1": float64}'>

And if you really did want to read every single TBranch in the TTree, you could just leave off the first argument:

>>> events.arrays(cut="(pt1 > 50) & ((E1>100) | (E1<90))", aliases={"pt1": "sqrt(px1**2 + py1**2)"})
<Array [{Type: 'GT', Run: 148031, ... M: 96.1}] type='269 * {"Type": string, "Ru...'>

So I could have answered this question by just saying, "Put the letters '…'", but there are reasons that the step-by-step process matters.
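As a rough mental model only (this is not Uproot's actual implementation, and the evaluate helper below is invented for illustration), you can think of "aliases" and "cut" as expression strings evaluated in a namespace of branch arrays:

```python
import numpy as np

def evaluate(expression, arrays):
    # Evaluate an expression string against named arrays (illustration only;
    # Uproot parses and evaluates expressions itself).
    namespace = {"sqrt": np.sqrt}
    namespace.update(arrays)
    return eval(expression, {"__builtins__": {}}, namespace)

arrays = {
    "px1": np.array([-41.2, 35.1, 32.5]),
    "py1": np.array([17.4, -16.6, 1.2]),
    "E1": np.array([82.2, 62.3, 81.6]),
}

# an "alias" defines a derived quantity by name...
arrays["pt1"] = evaluate("sqrt(px1**2 + py1**2)", arrays)

# ...and a "cut" is just a boolean-valued expression in the same namespace
mask = evaluate("(pt1 > 40) & ((E1 > 100) | (E1 < 90))", arrays)
selected = {name: values[mask] for name, values in arrays.items()}
```

With these three illustrative events, only the first passes the cut, so every array in selected has length 1.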
Continuing from there, the next thing to consider is scaling this up to multiple files. The uproot.TTree.iterate function supports the same arguments as uproot.TTree.arrays, but it iterates over batches. Thus, when we get a block of code working for one batch, we can replace "arrays" with "iterate" and put it in a loop:

>>> for batch in events.iterate(filter_name=["p[xy]1", "E1"]):
...     pt1 = np.sqrt(batch.px1**2 + batch.py1**2)
...     cut = (pt1 > 50) & ((batch.E1 > 100) | (batch.E1 < 90))
...     print(batch[cut])
...
[{E1: 133, px1: 71.1, py1: 29.5}, {E1: 88.1, ... {E1: 169, px1: -68, py1: -26.1}]

This finished in one iteration because the file is so small; note that you can tune "step_size" as needed. Now for multiple files: uproot.iterate supports the same arguments as uproot.TTree.iterate, but it takes wildcarded file names (with a TTree name after a colon):

>>> filenames = "~/.local/skhepdata/uproot-Zmumu*.root:events"
>>> for batch in uproot.iterate(filenames, filter_name=["p[xy]1", "E1"]):
...     pt1 = np.sqrt(batch.px1**2 + batch.py1**2)
...     cut = (pt1 > 50) & ((batch.E1 > 100) | (batch.E1 < 90))
...     print(batch[cut])
...
[{E1: 133, px1: 71.1, py1: 29.5}, {E1: 88.1, ... {E1: 169, px1: -68, py1: -26.1}]
[{E1: 133, px1: 71.1, py1: 29.5}, {E1: 88.1, ... {E1: 169, px1: -68, py1: -26.1}]
[{E1: 133, px1: 71.1, py1: 29.5}, {E1: 88.1, ... {E1: 169, px1: -68, py1: -26.1}]
[{E1: 133, px1: 71.1, py1: 29.5}, {E1: 88.1, ... {E1: 169, px1: -68, py1: -26.1}]
[{E1: 133, px1: 71.1, py1: 29.5}, {E1: 88.1, ... {E1: 169, px1: -68, py1: -26.1}]
[{E1: 133, px1: 71.1, py1: 29.5}, {E1: 88.1, ... {E1: 169, px1: -68, py1: -26.1}]

(I happen to have a lot of files in that directory matching the "uproot-Zmumu*.root" pattern.)

Now if you want these results to go into a new file, you can open a file for writing:

>>> outfile = uproot.recreate("outfile.root")
>>> outfile.mktree("events", {"E1": np.float64, "px1": np.float64, "py1": np.float64})
<WritableTree '/events' at 0x7fb540f91fd0>
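(A quick aside on mktree's second argument: it maps branch names to types. For flat numerical branches, anything that NumPy's np.dtype accepts should work as the type specification; the dict below, a sketch with the same branch names, shows three equivalent spellings of a 64-bit float:)

```python
import numpy as np

# Three equivalent ways to spell a 64-bit float branch type:
branch_types = {
    "E1": np.float64,            # a NumPy scalar type
    "px1": np.dtype("float64"),  # an explicit dtype object
    "py1": "f8",                 # a dtype string
}

# all three specifications resolve to the same dtype
resolved = {name: np.dtype(spec) for name, spec in branch_types.items()}
```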
>>> for batch in uproot.iterate(filenames, filter_name=["p[xy]1", "E1"]):
...     pt1 = np.sqrt(batch.px1**2 + batch.py1**2)
...     cut = (pt1 > 50) & ((batch.E1 > 100) | (batch.E1 < 90))
...     outfile["events"].extend(batch[cut])
...

In a script, you would definitely want to use a "with" statement:

with uproot.recreate("outfile.root") as outfile:
    outfile.mktree("events", {"E1": np.float64, "px1": np.float64, "py1": np.float64})
    for batch in uproot.iterate(filenames, filter_name=["p[xy]1", "E1"]):
        pt1 = np.sqrt(batch.px1**2 + batch.py1**2)
        cut = (pt1 > 50) & ((batch.E1 > 100) | (batch.E1 < 90))
        outfile["events"].extend(batch[cut])

Now outfile.root contains a TTree named "events" with filtered TBranches E1, px1, and py1 in it. If that's not what you wanted, I'm sure you can refine the workflow.

The main, overarching point that I wanted to make, though, is the process. You don't start with a code block like the one I've written above: you open an interactive prompt, try out small things, and then scale them up. Uproot's interface was designed around that kind of process: uproot.TTree.arrays, uproot.TTree.iterate, and uproot.iterate take most of the same arguments so that you can try small tests with one and then swap it for the other. The same goes for uproot.TTree.keys and uproot.TTree.arrays: they take the same "filter_name" argument so that you can test a filter without reading data (which might be slow). The expression that you wrote, …