This repository has been archived by the owner on Jun 21, 2022. It is now read-only.

Reading std::map< std::string, std::vector<float> > #367

Closed
nfoppiani opened this issue Oct 4, 2019 · 13 comments

@nfoppiani

My group uses TTrees in which we store std::map< std::string, std::vector<float> > as weights relative to certain parameters identified by the strings.
Is there a way to read it efficiently with uproot, and maybe to convert it into a Jagged Array?

@nfoppiani nfoppiani changed the title Reading std::map< std::string, std::vector<float> > Reading std::map< std::string, std::vector<float> > Oct 4, 2019
@jpivarski
Member

The short answer is "no," a type like that has a record structure that can't be deserialized with Numpy (it requires an internal Python loop).

Can you read it at all? I don't remember if we covered that type or not.

If you can read it, you can pass it into awkward.fromiter to turn it into jagged arrays that can be more efficiently processed. It will create a Table in which every key in the std::map is a column name; if different events have different sets of keys, it will become a UnionArray of all the different types. If different sets of keys are the rule, rather than the exception, then columnar processing would make less sense—you'd essentially have JSON processing.
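
To make the difference between uniform and varying key sets concrete, here is a minimal pure-Python sketch. It does not use awkward and is not how awkward.fromiter is implemented; the helper name to_columns and the sample events are hypothetical. It turns a list of per-event dicts into one column per key, and refuses when the events do not all share the same keys.

```python
def to_columns(events):
    """Turn a list of per-event dicts into a dict of per-key columns.

    Raises ValueError if the events do not all share the same set of keys,
    which is the case where a columnar layout stops making sense.
    """
    if not events:
        return {}
    expected = set(events[0])
    for i, event in enumerate(events):
        if set(event) != expected:
            raise ValueError("event %d has a different set of keys" % i)
    # One column per key; each column holds one entry per event.
    return {key: [event[key] for event in events] for key in sorted(expected)}


# Hypothetical sample: two events with the same keys (made-up weights).
events = [
    {"a": [1.0, 1.1], "b": [0.9]},
    {"a": [1.2, 0.8], "b": [1.0]},
]
columns = to_columns(events)
```

Each column here is itself a list of variable-length lists, which is the shape a jagged array would hold.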

@sebprince

I tried reading the map and it doesn't work: ValueError: cannot interpret branch b'weights' as a Python type in file: [...]. If it can't be read straightforwardly, is there a more involved way to enable this type?

@jpivarski
Member

Can you supply an example of the file?

@sebprince

I produced a simple file with only the map branch: map.root.zip. The map is of type std::map<std::string, std::vector<double>>. To interact with it in ROOT, a dictionary must be generated. The map in the file has 43 string keys, and the length of each std::vector<double> depends on the string key. There are two events in the file, and both have the same string keys.

The error I posted previously was when trying to use a pandas dataframe in the complete file I have. It doesn't appear in the simple file I uploaded, so that specific error could be due to something unrelated and we can ignore that for the moment.

In both the simple file and the complete file, the branch can be read successfully, but the resulting array holds only the size of the map for each event, so [43, 43] in this case.

I tried using awkward.fromiter with the array and while it technically worked (since the array is seen as [43, 43]), it obviously does not produce a meaningful result.

Is there a way to access weights.first and weights.second instead of just the size?

jpivarski added a commit that referenced this issue Oct 17, 2019
@jpivarski
Member

See PR #380.

The weights.second needed to be interpreted as asgenobj(STLVector(asdtype('>f8'))), and now it is. I was trying to set up an example where you can zip the keys (weights.first) and values (weights.second) together, but they have different sizes, which confuses me. If they originally came from a single weights map, I'd think they'd need to have the same lengths.

import uproot
import awkward

f = uproot.open("tests/samples/issue367.root")
t = f["tree"]
t.show()
# weights                  (no streamer)            asdtype('>i4')
# weights.first            (no streamer)            asgenobj(SimpleArray(STLString()))
# weights.second           (no streamer)            asgenobj(STLVector(asdtype('>f8')))

keys = awkward.fromiter(t.array("weights.first"))
values = awkward.fromiter(t.array("weights.second"))

keys
# <JaggedArray [[b'expskin_FluxUnisim' b'genie_AGKYpT_Genie' ...
#                b'reinteractions_proton_Reinteraction' b'splines_general_Spline']
#               [b'expskin_FluxUnisim' b'genie_AGKYpT_Genie' ...
#               b'reinteractions_proton_Reinteraction' b'splines_general_Spline']]>

values
# <JaggedArray [[0.9990982157550495 1.0014540015924693 0.9991243403482902 ...
#                1.0010513561615884 1.0007043308873849 0.9998340035940907]
#               [0.944759093019904 1.0890682745548674 0.9463594161938594 ...
#                1.0644032852098082 1.0431454388908392 0.9898315011942]]>

keys.counts
# array([43, 43])
values.counts
# array([1000, 1000])

So that's why if you try to zip them together

awkward.JaggedArray.zip(keys, values)

you get an error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/jpivarski/miniconda3/lib/python3.7/site-packages/awkward-0.12.12-py3.7.egg/awkward/util.py", line 48, in <lambda>
    return lambda *args, **kwargs: self.fcn(True, typ, *args, **kwargs)
  File "/home/jpivarski/miniconda3/lib/python3.7/site-packages/awkward-0.12.12-py3.7.egg/awkward/array/jagged.py", line 1753, in zip
    columns2[i] = x._tojagged(first._starts, first._stops, copy=False)
  File "/home/jpivarski/miniconda3/lib/python3.7/site-packages/awkward-0.12.12-py3.7.egg/awkward/array/jagged.py", line 900, in _tojagged
    raise ValueError("cannot fit contents of JaggedArray into the given starts and stops arrays")
ValueError: cannot fit contents of JaggedArray into the given starts and stops arrays

Do you know why there are 43 strings but 1000 values? The data seem to be properly deserialized (strings are readable, weight values are all close to 1.0).
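
The mismatch can be caught before zipping by comparing per-event counts. The following is a pure-Python sketch of that check on plain nested lists, not awkward's own logic; the helper names counts and zip_events and the stand-in data are hypothetical, chosen only to reproduce the 43 versus 1000 counts reported above.

```python
def counts(jagged):
    """Number of items in each event of a list-of-lists."""
    return [len(event) for event in jagged]


def zip_events(keys, values):
    """Pair keys[i][j] with values[i][j], refusing mismatched lengths."""
    if counts(keys) != counts(values):
        raise ValueError(
            "per-event counts differ: %r vs %r" % (counts(keys), counts(values))
        )
    return [list(zip(k, v)) for k, v in zip(keys, values)]


# Hypothetical stand-ins with the same counts as in the output above.
keys = [[b"key%d" % j for j in range(43)] for _ in range(2)]
values = [[1.0] * 1000 for _ in range(2)]

try:
    zip_events(keys, values)
    mismatch = None
except ValueError as err:
    # The counts disagree, so the pairing is refused with a readable message.
    mismatch = str(err)
```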

@sebprince

sebprince commented Oct 17, 2019

The length of the vector is 1000 for most string keys, but can also be 10 or 1 depending on the specific string key. Also, indeed, the keys and values seem to be read out appropriately.

@jpivarski
Member

In this file, there seem to be 43 strings in both of the events. Is that okay or is it another bug?

I tried to read the file in ROOT and it just crashed (in PyROOT and also pure-ROOT TTree::Scan). If this is a bad file, could you confirm that PR #380 fixes your issue for a good file? Thanks!

@sebprince

There are 43 strings for both events, indeed. Most of the strings have an associated vector of length 1000 but some are shorter.

I can read the file and output the map with ROOT using this macro:

{
    gInterpreter->GenerateDictionary("map<string,vector<double>>","map");

    TFile file("map.root");

    TTree* tree = nullptr;
    file.GetObject("tree",tree);

    map<string,vector<double>>* weights = nullptr;
    TBranch* b_weights = tree->GetBranch("weights");
    
    b_weights->SetAddress(&weights);
    
    b_weights->GetEntry(0);

    for (auto& kv: *weights){
        for(auto& val: kv.second){
            cout<<kv.first<<" "<<val<<endl;
        }
    }

    gApplication->Terminate();
}

This is a good file, or at least it is not different from other more complete files as far as reading the map in ROOT goes.

@jpivarski
Member

In that case, my interpretation was wrong! Now it's

>>> t.show()
weights              (no streamer)      asdtype('>i4')
weights.first        (no streamer)      asgenobj(SimpleArray(STLString()))
weights.second       (no streamer)      asgenobj(SimpleArray(STLVector(asdtype('>f8'))))

in other words, the weights.second is doubly jagged (SimpleArray of STLVector). With that fix (see PR #380), we can now zip them together to make something like the original weights.

>>> keys = awkward.fromiter(t.array("weights.first"))
>>> values = awkward.fromiter(t.array("weights.second"))
>>> keys
<JaggedArray [[b'expskin_FluxUnisim' b'genie_AGKYpT_Genie' b'genie_AGKYxF_Genie' ...
>>> values
<JaggedArray [[[0.9990982157550495 1.0014540015924693 0.9991243403482902 ...
>>> weights = awkward.JaggedArray.zip(keys, values)
>>> weights
<JaggedArray [[(b'expskin_FluxUnisim', [0.99909822 1.001454   0.99912434 1.00138767...

Both of the events have 43 items, and each of those items has 1000, 10, 1, etc. subitems.
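
For anyone who wants something shaped like the original std::map in each event, the parallel keys and doubly jagged values can be paired into one dict per event. This is a pure-Python sketch on plain nested lists; the helper name rebuild_maps, the key names, and the weight values are made up for illustration, and it assumes the keys arrive as bytes, as in the output above.

```python
def rebuild_maps(keys, values):
    """Rebuild one dict per event from parallel keys and nested values."""
    maps = []
    for event_keys, event_values in zip(keys, values):
        if len(event_keys) != len(event_values):
            raise ValueError("keys and values have different lengths")
        # Decode the bytes keys and copy each vector into a plain list.
        maps.append(
            {k.decode("utf-8"): list(v) for k, v in zip(event_keys, event_values)}
        )
    return maps


# Hypothetical data: two events, two keys, vectors of different lengths.
keys = [[b"name_a", b"name_b"], [b"name_a", b"name_b"]]
values = [[[1.0, 0.99, 1.01], [1.0]], [[0.95, 1.05, 1.0], [1.02]]]

weights = rebuild_maps(keys, values)
# Vector length per key in the first event.
lengths = {name: len(vec) for name, vec in weights[0].items()}
```

Building Python dicts this way loops over every event in Python, so it is better suited to inspection than to processing large files.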

@jpivarski
Member

Please check that it works on your full file! Thanks!

@sebprince

I checked and yes this seems to work.
Thanks a lot for the prompt responses!

jpivarski added a commit that referenced this issue Oct 18, 2019
@sebprince

I have a different input file that was generated in such a way that the map keys and values are named only first and second, respectively, instead of mapname.first and mapname.second; for example, map2.root.zip.

Listing the content of the tree with show gives

>>> t.show()
weights                    TStreamerSTL               None

Could the fix be extended to this naming scheme?

@jpivarski
Member

I've looked into this, and the unsplit (single-branch) version of this is "semi-columnar" in a way that would require some deep investigation. (There's more header at the front, the strings are all given in a binary blob, followed by the numerical values, but without branch/basket boundaries to make it clear how to interpret it.)

uproot will always have better support for split data than unsplit data—even if we decode unsplit STL structures, it would be going through more Python code than Numpy—so if it's possible for you to keep splitting on (i.e. generate first and second branches, as before), I would recommend that.
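
One way to tell the two cases apart before reading is to look at the branch names. The sketch below is pure Python and only inspects a list of names; the helper is_split_map is hypothetical, and the two name lists are copied from the t.show() outputs earlier in this thread rather than read from a file.

```python
def is_split_map(branch_names, map_name):
    """True if both map_name.first and map_name.second appear as branches."""
    # Accept names as either bytes or str.
    names = {n.decode("utf-8") if isinstance(n, bytes) else n for n in branch_names}
    return {map_name + ".first", map_name + ".second"} <= names


# Branch names as listed by t.show() for the two files in this thread.
split_file = [b"weights", b"weights.first", b"weights.second"]
unsplit_file = [b"weights"]

split_ok = is_split_map(split_file, "weights")
unsplit_ok = is_split_map(unsplit_file, "weights")
```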
