Reading std::map< std::string, std::vector<float> > #367
Comments
std::map< std::string, std::vector<float> >
The short answer is "no," a type like that has a record structure that can't be deserialized with Numpy (it requires an internal Python loop). Can you read it at all? I don't remember if we covered that type or not. If you can read it, you can pass it into |
I tried reading the map and it doesn't work: |
Can you supply an example of the file? |
I produced a simple file with only the map branch: map.root.zip. The map is of type The error I posted previously was when trying to use a pandas dataframe in the complete file I have. It doesn't appear in the simple file I uploaded, so that specific error could be due to something unrelated and we can ignore that for the moment. In both the simple file and the complete file, the branch can be read successfully but produces an array that is a list of the size of the map, so I tried using Is there a way to access |
See PR #380. The
import uproot
import awkward
f = uproot.open("tests/samples/issue367.root")
t = f["tree"]
t.show()
# weights (no streamer) asdtype('>i4')
# weights.first (no streamer) asgenobj(SimpleArray(STLString()))
# weights.second (no streamer) asgenobj(STLVector(asdtype('>f8')))
keys = awkward.fromiter(t.array("weights.first"))
values = awkward.fromiter(t.array("weights.second"))
keys
# <JaggedArray [[b'expskin_FluxUnisim' b'genie_AGKYpT_Genie' ...
# b'reinteractions_proton_Reinteraction' b'splines_general_Spline']
# [b'expskin_FluxUnisim' b'genie_AGKYpT_Genie' ...
# b'reinteractions_proton_Reinteraction' b'splines_general_Spline']]>
values
# <JaggedArray [[0.9990982157550495 1.0014540015924693 0.9991243403482902 ...
# 1.0010513561615884 1.0007043308873849 0.9998340035940907]
# [0.944759093019904 1.0890682745548674 0.9463594161938594 ...
# 1.0644032852098082 1.0431454388908392 0.9898315011942]]>
keys.counts
# array([43, 43])
values.counts
# array([1000, 1000])
So that's why if you try to zip them together
awkward.JaggedArray.zip(keys, values)
you get an error:
Do you know why there are 43 strings but 1000 values? The data seem to be properly deserialized (strings are readable, weight values are all close to 1.0). |
The length of the vector is 1000 for most string keys, but can also be 10 or 1 depending on the specific string key. Also, indeed, the keys and values seem to be read out appropriately. |
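The count mismatch discussed above can be illustrated without uproot. This is only a minimal sketch in plain Python: the nested lists are made-up stand-ins for what `t.array("weights.first")` and `t.array("weights.second")` return, and `per_event_counts` and `can_zip` are hypothetical helpers, not part of uproot or awkward.

```python
def per_event_counts(jagged):
    """Return the number of items in each event of a nested list."""
    return [len(event) for event in jagged]


def can_zip(keys, values):
    """Keys and values can be paired only if every event has equal counts."""
    return per_event_counts(keys) == per_event_counts(values)


# Made-up stand-in data: 2 events with 3 keys each (the real file has 43).
keys = [[b"a", b"b", b"c"], [b"a", b"b", b"c"]]

# Values read as a single flat vector per event: counts differ from the keys.
flat_values = [[1.0, 1.1, 0.9, 1.2], [0.8, 1.0, 1.3, 0.7]]

# Values read as one vector per key: counts match, inner lengths may vary.
nested_values = [[[1.0, 1.1], [0.9], [1.2]], [[0.8, 1.0], [1.3], [0.7]]]

print(can_zip(keys, flat_values))    # False
print(can_zip(keys, nested_values))  # True
```

If the per-event counts of keys and values disagree, as with 43 versus 1000 here, the values were most likely interpreted at the wrong nesting depth rather than the file being corrupt.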
In this file, it seems to be 43 strings in both of the events. Is that okay or is it another bug? I tried to read the file in ROOT and it just crashed (in PyROOT and also pure-ROOT |
There are 43 strings for both events, indeed. Most of the strings have an associated vector of length 1000 but some are shorter. I can read the file and output the map with ROOT using this macro:
This is a good file, or at least it is not different from other more complete files as far as reading the map in ROOT goes. |
In that case, my interpretation was wrong! Now it's
>>> t.show()
weights (no streamer) asdtype('>i4')
weights.first (no streamer) asgenobj(SimpleArray(STLString()))
weights.second (no streamer) asgenobj(SimpleArray(STLVector(asdtype('>f8'))))
in other words, the
>>> keys = awkward.fromiter(t.array("weights.first"))
>>> values = awkward.fromiter(t.array("weights.second"))
>>> keys
<JaggedArray [[b'expskin_FluxUnisim' b'genie_AGKYpT_Genie' b'genie_AGKYxF_Genie' ...
>>> values
<JaggedArray [[[0.9990982157550495 1.0014540015924693 0.9991243403482902 ...
>>> weights = awkward.JaggedArray.zip(keys, values)
>>> weights
<JaggedArray [[(b'expskin_FluxUnisim', [0.99909822 1.001454 0.99912434 1.00138767...
Both of the events have 43 items, and each of those items has 1000, 10, 1, etc. subitems. |
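Once keys and values line up event by event, they can be turned back into something shaped like the original std::map. The following is a rough sketch in plain Python, assuming the arrays can be iterated event by event like nested lists; the sample data are made up, and `events_to_dicts` is a hypothetical helper, not an uproot or awkward function.

```python
def events_to_dicts(keys, values):
    """Pair keys with their value vectors, returning one dict per event."""
    result = []
    for event_keys, event_values in zip(keys, values):
        if len(event_keys) != len(event_values):
            raise ValueError("keys and values have different counts in an event")
        # Keys arrive as bytes (b'...'), so decode them to str for convenience.
        result.append(
            {k.decode("utf-8"): list(v) for k, v in zip(event_keys, event_values)}
        )
    return result


# Made-up stand-in for one event with two keys and vectors of different length.
keys = [[b"expskin_FluxUnisim", b"splines_general_Spline"]]
values = [[[0.999, 1.001, 0.999], [1.0]]]

weights = events_to_dicts(keys, values)
print(weights[0]["splines_general_Spline"])  # [1.0]
```

This loops in Python, so it is meant for inspection or small samples; for bulk processing the zipped JaggedArray shown above is the better representation.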
Please check that it works on your full file! Thanks! |
I checked and yes this seems to work. |
I have a different input file that was generated in such a way that the map keys and values are respectively only Listing the content of the tree with
Could the fix be extended to this naming scheme? |
I've looked into this, and the unsplit (single-branch) version of this is "semi-columnar" in a way that would require some deep investigation. (There's more header at the front, the strings are all given in a binary blob, followed by the numerical values, but without branch/basket boundaries to make it clear how to interpret it.) uproot will always have better support for split data than unsplit data—even if we decode unsplit STL structures, it would be going through more Python code than Numpy—so if it's possible for you to keep splitting on (i.e. generate |
My group uses TTrees in which we store
std::map< std::string, std::vector<float> >
as weights relative to certain parameters identified by the strings. Is there a way to read it efficiently with uproot, and maybe to convert it into a Jagged Array?