-
Notifications
You must be signed in to change notification settings - Fork 67
Fixes #465 by explicitly trimming STL vector size (some ROOT files have unused/padding/junk in each event after the vector's serialized data). #466
Conversation
…ve unused/padding/junk in each event after the vector's serialized data).
Oh, I forgot to say: please stress-test this. Thanks! |
Thanks Jim! I'd like to add tests if it's ok... |
I'm afraid it's going deeper in the rabbit hole... The fix now nicely cuts off the uninitialised values, but the shape is still not right. In [4]: import uproot
In [5]: f = uproot.open("usr-nested.root")
In [6]: f['E']['Evt']['mc_trks']['mc_trks.usr_names'].array()
Out[6]: <ObjectArray [[b'bx', b'by', b'ichan', b'cc'] [b'bx', b'by', b'ichan', b'cc'] [b'bx', b'by', b'ichan', b'cc'] ... [b'bx', b'by', b'ichan', b'cc'] [b'bx', b'by', b'ichan', b'cc'] [b'bx', b'by', b'ichan', b'cc']] at 0x000119099c10>
In [7]: f['E']['Evt']['mc_trks']['mc_trks.usr'].array()[0]
Out[7]: array([0.048692, 0.058846, 3. , 2. ]) In this case, the In [11]: f['E']['Evt']['mc_trks']['mc_trks.pos.x'].array()[0]
Out[11]:
array([32.263, 32.263, 32.263, 32.263, 32.263, 32.263, 32.263, 32.263,
32.263, 32.263, 32.263, 32.263, 32.263, 32.263, 32.263, 32.263,
32.263, 32.263, 32.263, 32.263, 32.263]) So the correct output output of >>> f['E']['Evt']['mc_trks']['mc_trks.usr'].array()[0]
array([0.048692, 0.058846, 3. , 2. ] [63.2413] [] [] [] [] [] [] [] [] [] [] [] [] [] [] [] [] [] [] [])) and the >>> f['E']['Evt']['mc_trks']['mc_trks.usr_names'].array()[0]
[[b'bx', b'by', b'ichan', b'cc'], ['energy_lost_in_can'], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], []] At least in the
Since we expect 19 missing entries (0 length strings) after In [20]: b = b"\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\
...: x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00
...: \x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x0
...: 0\x00\x00\x00\x00\x00\x00\x00\x00\x00"
In [21]: len(b)
Out[21]: 76
In [22]: len(b)/19
Out[22]: 4.0 One question is, how to interpret this as a nested list? |
Btw. another similar structure is the This is from the same file, which contains 636 tracks in the first event and therefore 636 sub-arrays in the
|
Well, I don't see anything in the C structs or the ROOT metadata that suggest that But also, the two branches you're looking at have the same jagged structure (correcting for the >>> mc_trks_usr_names = awkward.fromiter(f['E']['Evt']['mc_trks']['mc_trks.usr_names'].array())
>>> mc_trks_usr = f['E']['Evt']['mc_trks']['mc_trks.usr'].array()
>>> numpy.array_equal(mc_trks_usr_names.counts, mc_trks_usr.counts)
True As a demonstration of this, we can match a string name to each numerical value—they're one-to-one: >>> named_pairs = awkward.JaggedArray.zip(mc_trks_usr_names, mc_trks_usr)
>>> named_pairs
<JaggedArray [[(b'bx', 0.048692) (b'by', 0.058846) (b'ichan', 3.0) (b'cc', 2.0)] [(b'bx', 0.146713) (b'by', 0.400233) (b'ichan', 3.0) (b'cc', 2.0)] [(b'bx', 0.134258) (b'by', 0.077309) (b'ichan', 3.0) (b'cc', 2.0)] ... [(b'bx', 0.0) (b'by', 0.0) (b'ichan', 5.284424934e-315) (b'cc', 1.3742287462714573e-122)] [(b'bx', 5.304989477e-315) (b'by', 5e-324) (b'ichan', 175.476) (b'cc', 0.384255)] [(b'bx', 3.0) (b'by', 2.0) (b'ichan', 2.655811498e-314) (b'cc', -2.0868684099274518e-254)]] at 0x7f15ee0ad850> |
This is very weird 😕 The data is definitely nested. I was always confused about this What do you think? Edit, I mean that it's |
The surprising thing is that ROOT is not trimming the serialized It showed up as a bug in uproot because I made a strong assumption that the serialized The fact that it's never come up before suggests that this is a rare feature-or-bug. Therefore, the fact that it happens in some branches and not others is also not too surprising. |
I see.. what do you suggest? How do we proceed? Should I I fix this upstream and write an own method to parse the |
After it's been written to the file, the dots in the branch names are just a naming convention. When I say I see no evidence that they should be nested, I mean that they're not variable length of variable length objects (only one level of "variable length"). Nesting of same length or fixed-length objects is a matter of convention, whether you consider >>> record_x = numpy.arange(10) + 0.1
>>> record_y = numpy.arange(10) + 0.2
>>> record_z = numpy.arange(10) + 0.3
>>> records = awkward.Table({"x": record_x, "y": record_y, "z": record_z})
>>> records.tolist()
[{'x': 0.1, 'y': 0.2, 'z': 0.3}, {'x': 1.1, 'y': 1.2, 'z': 1.3}, {'x': 2.1, 'y': 2.2, 'z': 2.3}, {'x': 3.1, 'y': 3.2, 'z': 3.3}, {'x': 4.1, 'y': 4.2, 'z': 4.3}, {'x': 5.1, 'y': 5.2, 'z': 5.3}, {'x': 6.1, 'y': 6.2, 'z': 6.3}, {'x': 7.1, 'y': 7.2, 'z': 7.3}, {'x': 8.1, 'y': 8.2, 'z': 8.3}, {'x': 9.1, 'y': 9.2, 'z': 9.3}]
>>> records.x
array([0.1, 1.1, 2.1, 3.1, 4.1, 5.1, 6.1, 7.1, 8.1, 9.1])
>>> records.y
array([0.2, 1.2, 2.2, 3.2, 4.2, 5.2, 6.2, 7.2, 8.2, 9.2])
>>> records.z
array([0.3, 1.3, 2.3, 3.3, 4.3, 5.3, 6.3, 7.3, 8.3, 9.3])
>>> records_x
>>> record_x
array([0.1, 1.1, 2.1, 3.1, 4.1, 5.1, 6.1, 7.1, 8.1, 9.1])
>>> record_y
array([0.2, 1.2, 2.2, 3.2, 4.2, 5.2, 6.2, 7.2, 8.2, 9.2])
>>> record_z
array([0.3, 1.3, 2.3, 3.3, 4.3, 5.3, 6.3, 7.3, 8.3, 9.3]) Uproot reads branches as completely separate objects. If they're supposed to be in a hierarchy of nested fixed-size objects (like the record above, which has the same fixed size for x, y, and z), then you can put them together into a If the inner data have variable sizes, then that's a problem you need Uproot to resolve. Fixed-size things are so fluidly combinable and separable that maybe that shouldn't be a rigid concept anymore, but variable-sized nesting is an irreducible thing. Have we been talking past each other on the use of the word "nested"? |
Other than the wrong |
OK, maybe we are talking past each other regarding the nested thing ;)
The problem is that only the first element of the list of lists is fixed now. But it's still top level. In fact, the Let me try to summarise better. The following array gives the number of tracks in each event (it's a bit unusual that this branch lists numbers, but we got used to it):
Here you see that the first event has 21 tracks, the second 13 and so on. Now, each attribute of this Before the fix, the first entry (for the first event) had a length of 15, which seemed kind of random (instead of 21) and it was a flat list of values. The first 4 entries were in fact values, which one would expect in the first array of the entry ( After the fix, the length is OK, but it's still not nested and is missing the remaining 20 elements (as sub arrays):
The expected output is the following:
Each containing 21 sub-arrays for the first event. Especially the last expected output is kind of easy to check, since the
|
Oh, then I have again misunderstood your intent! Instead of >>> f['E']['Evt']['mc_trks']['mc_trks.usr'].interpretation
asjagged(asdtype('>f8'), 10, 6)
>>> f['E']['Evt']['mc_trks']['mc_trks.usr_names'].interpretation
asgenobj(STLVector(STLString())) you want the interpretation to be >>> branch = f['E']['Evt']['mc_trks']['mc_trks.usr']
>>> branch.array(uproot.asgenobj(uproot.SimpleArray(uproot.STLVector(uproot.asdtype(">f8"))), branch._context, 6))[0]
[[0.048692, 0.058846, 3.0, 2.0], [63.2413], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], []]
>>> branch = f['E']['Evt']['mc_trks']['mc_trks.usr_names']
>>> branch.array(uproot.asgenobj(uproot.SimpleArray(uproot.STLVector(uproot.STLString())), branch._context, 6))[0]
[[b'bx', b'by', b'ichan', b'cc'], [b'energy_lost_in_can'], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], []] so the stuff after the first vector is not "random junk" or "padding" or anything—you were looking for more data. In that case, this PR is entirely mistaken and I'll be closing it, rather than merging it. If the above is right, then your real problem is why the interpretation thought it was singly jagged when it's doubly jagged. However, when I'm looking a the C++ snippets you've given me, I would also assume that they're supposed to be singly jagged. As I understood it, you have an |
Yes! Sorry that might be due to my bad vocabulary. Indeed I meant double jagged ;) OK, so yes, indeed the C++ implementation is confusing. The superclass for every class is basically this (the struct AAObject : public TObject
{
std::vector<double> usr; ///< user data
std::vector<std::string> usr_names; ///< user keys
...
...
TObject* any; ///< Pointer to "any" user data.
ClassDef(AAObject, 6)
}; And then we have struct Trk: public AAObject
{
int id; ///< track identifier
Vec pos; ///< postion of the track at time t
Vec dir; ///< track direction
double t; ///< track time (when the particle is at pos )
double E; ///< Energy (either MC truth or reconstructed)
double len; ///< length, if applicable
double lik; ///< likelihood or lambda value (for aafit, lambda)
int type; ///< MC: particle type in PDG encoding.
int rec_type; ///< identifyer for the overall fitting algorithm/chain/strategy
std::vector<int> rec_stages; ///< list of identifyers of succesfull fitting stages resulting in this track
int status; ///< MC status code
int mother_id; ///< MC id of the parent particle
std::vector<double> fitinf; ///< place to store additional fit info, for jgandalf, see JFitParameters.hh
std::vector<int> hit_ids; ///< list of associated hit-ids (corresponds to Hit::id).
std::vector<double> error_matrix; ///< (5x5) error covariance matrix (stored as linear vector)
std::string comment; ///< use as you like
...
... And finally the struct Evt: public AAObject
{
int id; ///< offline event identifier
int det_id; ///< detector identifier from DAQ
int mc_id; ///< identifier of the MC event (as found in ascii or antcc file).
int run_id; ///< DAQ run identifier
int mc_run_id; ///< MC run identifier
int frame_index; ///< from the raw data
ULong64_t trigger_mask; ///< trigger mask from raw data (i.e. the trigger bits)
ULong64_t trigger_counter; ///< trigger counter
unsigned int overlays; ///< number of overlaying triggered events
TTimeStamp t; ///< UTC time of the start of the timeslice the event came from
//hits and tracks
std::vector<Hit> hits; ///< list of hits
std::vector<Trk> trks; ///< list of reconstructed tracks (can be several because of prefits,showers, etc).
... |
I see, so the I'm closing this PR because it is a wrong fix: the problem is not that there are any "junk bytes" after the single serialized |
@tamasgal This should fix your issue. As described here, the STL vector includes information about its length that we had been ignoring because ROOT also gives the position of the start of each event, so normally the STL vector size is redundant. But you've found a case where there's unused bytes/padding/junk written to the file after the STL vector itself, making the STL vector size not redundant.
Now
uproot.asjagged
has an additional parameter (sizeat
) for the byte position of an integer that specifies each jagged subarray's size and STL vectors set that parameter. (Other jagged arrays do not because they don't have that information.) In most cases, it's still redundant, so we check that it agrees with the sizes inferred from the ROOT event starts and only trim the data if they disagree.