-
Notifications
You must be signed in to change notification settings - Fork 67
Unable to read ntuple with std::vector after doing 'hadd' #38
Comments
The root files (they are zipped because GitHub doesn't allow .root file type): |
Thanks! I'll try to solve this before the end of the day tomorrow. |
It's not urgent. But thank you in advance! |
Indeed, Nevertheless, files with this issue exist and we need to be able to read them. We can see the lack of most STL streamers in PyROOT: >>> import ROOT
>>> tfile = ROOT.TFile("stlvector_after_hadd.root")
>>> tfile.ShowStreamerInfo()
... # (no STL streamers except vector<double> and vector<string>; compare to the other file) as well as in uproot: >>> import uproot
>>> f = uproot.open("stlvector_after_hadd.root")
>>> f.showstreamers()
... # (same issue) The difference is also apparent when you "show" the branch info (new, undocumented method): >>> t = f["ntupler/tree"]
>>> t.show()
v_int16 (no streamer) None
v_int32 (no streamer) None
v_int64 (no streamer) None
v_uint16 (no streamer) None
v_uint32 (no streamer) None
v_uint64 (no streamer) None
v_bool (no streamer) None
v_float (no streamer) None
v_double TStreamerSTL asstlvector(asdtype('>f8')) Before this bug-fix, the algorithm for assigning interpretations for branches depended on having an associated streamer, and after Somehow, however, ROOT is still able to interpret these branches: >>> ttree = tfile.Get("ntupler/tree") # (back in PyROOT now)
>>> for x in t:
... print list(x.v_int16)
...
[1, 2, 3]
[1, 2, 3]
[1, 2, 3] So I searched through 100% of the data available in the file associated with this branch, and the only type-indication that ROOT might be using is the fact that the # desperation cases, usually because streamer info was dropped by hadd
elif branch.fClassName == b"vector<bool>":
return asstlvector(asdtype("bool"))
elif branch.fClassName == b"vector<char>":
return asstlvector(asdtype("i1"))
elif branch.fClassName == b"vector<unsigned char>":
return asstlvector(asdtype("u1"))
elif branch.fClassName == b"vector<short>":
return asstlvector(asdtype("i2"))
elif branch.fClassName == b"vector<unsigned short>":
return asstlvector(asdtype("u2"))
elif branch.fClassName == b"vector<int>":
return asstlvector(asdtype("i4"))
elif branch.fClassName == b"vector<unsigned int>":
return asstlvector(asdtype("u4"))
elif branch.fClassName == b"vector<long>":
return asstlvector(asdtype("i8"))
elif branch.fClassName == b"vector<unsigned long>":
return asstlvector(asdtype("u8"))
elif branch.fClassName == b"vector<float>":
return asstlvector(asdtype("f4"))
elif branch.fClassName == b"vector<double>":
return asstlvector(asdtype("f8")) but for the reason given above, they should only work if the branch is manually declared to have type Ordinarily, I'd be against adding code that makes a codebase more complex while only stamping out an individual case, moreover, a case generated by another bug elsewhere. (I can't believe that behavior in >>> t.show()
v_int16 (no streamer) asstlvector(asdtype('>i2'))
v_int32 (no streamer) asstlvector(asdtype('>i4'))
v_int64 (no streamer) asstlvector(asdtype('>i8'))
v_uint16 (no streamer) asstlvector(asdtype('>u2'))
v_uint32 (no streamer) asstlvector(asdtype('>u4'))
v_uint64 (no streamer) asstlvector(asdtype('>u8'))
v_bool (no streamer) asstlvector(asdtype('bool'))
v_float (no streamer) asstlvector(asdtype('>f4'))
v_double TStreamerSTL asstlvector(asdtype('>f8')) They don't claim (wrongly) to have a streamer, but they nevertheless have an interpretation (right column) for reasons unrelated to streamers. It's something that I'll be able to ask users to print out to diagnose future bugs. |
The fix is released in GitHub and PyPI ( |
Thanks for the detailed explanations and fixing it so quickly! 😮 I can confirm uproot 2.5.11 is able to read my analysis ntuple (after hadd) correctly. And it's reeaaally fast compared to rootpy, which is what I have been using. It does seems like hadd is optimizing away the streamer info for other types of std::vector , because they are not needed by ROOT. In any case, I'm happy with your solution. |
I'm glad it's working for you! Large performance improvements with respect to PyROOT or rootpy are expected, so that's good (I'd worry if the difference wasn't evident). Labeling something as a bug or a feature is subjective, but I'd say this behavior in In fact, |
Oh, and since you're working with STL vectors, consider installing Numba if you haven't already. When we read STL vectors, we have to touch each event individually (to remove a header), which is done in a Python for loop if you don't have Numba and in compiled code if you do. For large ntuples (where the time spent processing the ntuple exceeds the half second or so spent compiling the code), you should see an additional speedup. But if it's fast enough and installing Numba is a bother, don't bother. |
Ok, I'll make a bug report regarding hadd to the ROOT team and see what they would say. But I won't insist that they fix it. Regarding Numba, what do I have to do to activate it? Do I simply put >>> import numba
>>> print numba.__version__
0.33.0 My next question is off-topic. But I'm trying to port my rootpy script to uproot. I'm analyzing a list of hits, and each hit has its variables like endcap, station, ring. phi, theta, etc. The way I store them in my ntuple is the following:
In the example you gave at LPC about nested structures, you did the following to analyze a list of electrons: for event in events:
mht_px = 0.0
mht_py = 0.0
mht_pz = 0.0
for electron in event.Electron:
mht_px += electron.pt * cosh(electron.eta) * sin(electron.phi)
mht_py += electron.pt * cosh(electron.eta) * cos(electron.phi)
mht_pz += electron.pt * sinh(electron.eta) I would like to do the same, but unfortunately I'm using What is the closest thing I can do? In rootpy, I use this nice function tree.define_collection(name='hits', prefix='vh_', size='vh_size')
for evt in tree:
for hit in evt.hits:
print hit.endcap, hit.station, hit.ring, hit.sim_phi, hit.sim_theta I would appreciate any suggestion, thanks! |
Regarding Numba, there's no explicit activation. If Actually, it's CMSSW that's bundled with Numba; it's not installed by default on CMSLPC. |
The nested structures thing wasn't a difference between TClonesArray and vector. Both of these types of branches are interpreted as JaggedArrays, so they behave exactly the same way. The nested structures talk was actually using two codebases together: uproot to read ROOT data and oamap (object-array mapping). Since then, I've rewritten uproot (so that you can use it with vector now) and I'm in the process of rewriting oamap as well. There isn't a combination of versions of the two packages that (a) interface with one another and (b) give you the vector that you want. rootpy's However, that doesn't have to be a chore: it looks like you want to combine all the branches that start with tree->Branch("vh_size", &vh_size, "vh_size/I");
tree->Branch("vh_endcap", &vh_endcap, "vh_endcap[vh_size]/F");
// etc... But that's a digression. You could get a bunch of branch names starting with hitnames = tree.allkeys(filtername=lambda x: x.startswith("vh_")) And then extract only these arrays with arrays = tree.arrays(hitnames, outputtype=tuple) where the zip(*arrays) which effectively makes each array in >>> for event in zip(*arrays):
... print "new event"
... for name, data in zip(hitnames, event):
... print name, data
...
new event
v_int16 [1 2 3]
v_int32 [1 2 3]
v_int64 [1 2 3]
v_uint16 [1 2 3]
v_uint32 [1 2 3]
v_uint64 [1 2 3]
v_bool [False True]
v_float [ 999. -999.]
v_double [ 999. -999.]
new event
v_int16 [1 2 3]
v_int32 [1 2 3]
v_int64 [1 2 3]
v_uint16 [1 2 3]
v_uint32 [1 2 3]
v_uint64 [1 2 3]
v_bool [False True]
v_float [ 999. -999.]
v_double [ 999. -999.]
new event
v_int16 [1 2 3]
v_int32 [1 2 3]
v_int64 [1 2 3]
v_uint16 [1 2 3]
v_uint32 [1 2 3]
v_uint64 [1 2 3]
v_bool [False True]
v_float [ 999. -999.]
v_double [ 999. -999.] In the future, let's move these discussions to the uproot users Google Group. |
Later, when you need more speed (because you've got your process set up— all exploration done), you'll probably want to put this loop inside of a compiled Numba function, which might complain about the use of |
Thanks a lot for answering all my questions. I'll use the Google Group in the future. I thought the fact that you could loop over Meanwhile, the snippet that you posted will do the job. That's much better than me writing a really long |
Use the JaggedArray type of the methods class passed in!
I ran into a strange case where I could not read a ntuple with
std::vector
after doing 'hadd'. I created a test ntuple to reproduce the issue. It contains one event and has several branches:In uproot 2.5.10, I tried the following and it worked:
However, if I use hadd [1], I got errors (like [2] for
std::vector<int16>
) for doing the same thing as above:I also tried
So it appears that the Streamer info is missing somehow after doing 'hadd'? It's strange... I attach the root files 'stlvector.root' and 'stlvector_after_hadd.root'. I'd appreciate it if you could look into this. Thank you!
Best regards,
Jia Fu
[1]
[2]
[3]
The text was updated successfully, but these errors were encountered: