This repository has been archived by the owner on Jun 21, 2022. It is now read-only.

Cache bug while reading array of data #40

Closed
vkuznet opened this issue Dec 21, 2017 · 1 comment

vkuznet commented Dec 21, 2017

If I read the file /afs/cern.ch/user/v/valya/public/nano-RelValTTBar.root with the following code:

def read(fin, branch='Events'):
    normalBranches = []
    with uproot.open(fin) as istream:
        tree = istream[branch]
        branches = [n for _, n in tree.allitems()]
        normalBranches = []
        jaggedBranches = []
        for key, val in tree.allitems():
            data = val.array()
            if not isinstance(data, uproot.interp.jagged.JaggedArray):
                normalBranches.append(key)
            else:
                jaggedBranches.append(key)
        print("\n### Cache error in reading branches")
        cache = {}
        try:
            for key in branches:
                data = tree[key].array(cache=cache)
        except:
            traceback.print_exc()
        print("\n### Cache error in reading normal branches")
        cache = {}
        try:
            for key in normalBranches:
                data = tree[key].array(cache=cache)
        except:
            traceback.print_exc()
        print("\n### Cache error in reading jagged branches")
        cache = {}
        try:
            for key in normalBranches:
                data = tree[key].array(cache=cache)
        except:
            traceback.print_exc()

read('/opt/cms/data/nano-RelValTTBar.root')

I encounter different errors when passing a cache to array(); see the errors below:

### Cache error in reading branches
Traceback (most recent call last):
  File "./cache_bug.py", line 42, in read
    data = tree[key].array(cache=cache)
  File "/Users/vk/Downloads/uproot-2.5.10/uproot/tree.py", line 539, in __getitem__
    return self.get(name)
  File "/Users/vk/Downloads/uproot-2.5.10/uproot/tree.py", line 275, in get
    raise KeyError("not found: {0}".format(repr(name)))
KeyError: "not found: <TBranch 'run' at 0x0001052366d0>"

### Cache error in reading normal branches
Traceback (most recent call last):
  File "./cache_bug.py", line 49, in read
    data = tree[key].array(cache=cache)
  File "/Users/vk/Downloads/uproot-2.5.10/uproot/tree.py", line 938, in array
    cachekey = self._cachekey(interpretation, entrystart, entrystop)
  File "/Users/vk/Downloads/uproot-2.5.10/uproot/tree.py", line 620, in _cachekey
    return "{0};{1};{2};{3};{4}-{5}".format(self._context.sourcepath, self._context.treename, self.name, interpretation.identifier, entrystart, entrystop)
  File "/Users/vk/Downloads/uproot-2.5.10/uproot/interp/numerical.py", line 93, in identifier
    fromdtype = "{0}{1}{2}".format(self._byteorder_transform[self.fromdtype.byteorder], self.fromdtype.kind, self.fromdtype.itemsize)
KeyError: '|'

### Cache error in reading jagged branches
Traceback (most recent call last):
  File "./cache_bug.py", line 56, in read
    data = tree[key].array(cache=cache)
  File "/Users/vk/Downloads/uproot-2.5.10/uproot/tree.py", line 938, in array
    cachekey = self._cachekey(interpretation, entrystart, entrystop)
  File "/Users/vk/Downloads/uproot-2.5.10/uproot/tree.py", line 620, in _cachekey
    return "{0};{1};{2};{3};{4}-{5}".format(self._context.sourcepath, self._context.treename, self.name, interpretation.identifier, entrystart, entrystop)
  File "/Users/vk/Downloads/uproot-2.5.10/uproot/interp/numerical.py", line 93, in identifier
    fromdtype = "{0}{1}{2}".format(self._byteorder_transform[self.fromdtype.byteorder], self.fromdtype.kind, self.fromdtype.itemsize)
KeyError: '|'
@jpivarski
Member

The first KeyError is yours: your branches variable is a list of TBranch objects, not names (strings). The n you used as a dummy variable suggests that you were thinking of them as names. Thus, when you say tree[key] where key is an element of branches, key is not a branch name (a key of the tree) but the branch itself (a value).
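To illustrate the keys-versus-values mix-up without needing the ROOT file, here is a minimal sketch. FakeTree and FakeBranch are hypothetical stand-ins written for this example, not uproot classes; the only assumption carried over from the thread is that allitems() yields (name, branch) pairs and that indexing the tree expects a name.

```python
class FakeBranch(object):
    """Hypothetical stand-in for a branch object."""
    def __init__(self, name):
        self.name = name


class FakeTree(object):
    """Hypothetical stand-in for a tree: maps branch names to branch objects."""
    def __init__(self, names):
        self._branches = dict((n, FakeBranch(n)) for n in names)

    def allitems(self):
        # (name, branch) pairs, like the pairs unpacked in the report
        return list(self._branches.items())

    def __getitem__(self, name):
        if name not in self._branches:
            raise KeyError("not found: {0}".format(repr(name)))
        return self._branches[name]


tree = FakeTree(["run", "event"])

# First element of each pair is the name: valid for tree[...]
names = [name for name, _ in tree.allitems()]

# Second element is the branch itself: use it directly, do not index with it
branches = [branch for _, branch in tree.allitems()]
```

With names, tree[name] succeeds; indexing with an element of branches raises the same kind of KeyError as in the first traceback. Iterating over the pairs and calling the method on the branch directly avoids the lookup altogether.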

The other one, the KeyError: '|', is a bug in my interpretation of NumPy byte orders. NumPy has five characters it could use to specify byte endianness: > (big), < (little), ! (network, which is big), = (native, which depends on your system), and | (doesn't matter because it's a one-byte type). I need to map these into alphanumeric characters for the cache key because the cache key might be used as a filename in some applications (where >, <, and | would be very bad!).
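A rough sketch of such a mapping, for illustration only: the table BYTEORDER_TAG, the chosen letters, and the function dtype_identifier are assumptions made up for this example and are not uproot's actual code. The point is simply that every byte-order character, including |, needs an entry.

```python
import sys

# Hypothetical table (not uproot's): each byte-order character becomes
# a tag that is safe to use inside a filename.
BYTEORDER_TAG = {
    ">": "B",                                     # big-endian
    "!": "B",                                     # network order, which is big-endian
    "<": "L",                                     # little-endian
    "=": "B" if sys.byteorder == "big" else "L",  # native order of this machine
    "|": "",                                      # not applicable: one-byte types
}


def dtype_identifier(byteorder, kind, itemsize):
    """Build a tag such as 'Bi4' from a dtype's byteorder, kind, and itemsize."""
    return "{0}{1}{2}".format(BYTEORDER_TAG[byteorder], kind, itemsize)


big_int32 = dtype_identifier(">", "i", 4)
one_byte = dtype_identifier("|", "u", 1)   # the case that raised KeyError: '|'
```

Leaving out the "|" entry in this table reproduces the failure mode from the traceback: the dictionary lookup raises KeyError for any one-byte type.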

The bug was simple: I missed the | case. It's fixed in 2.5.12, which is now working through the machinations of commit, test, release, and deployment on PyPI.

Oh, and here are the timings with cache (first time it's not loaded, subsequent times it is loaded):

import time
import uproot
tree = uproot.open("~/storage/data/nano-RelValTTBar.root")["Events"]

cache = {}
for i in range(5):
    startTime = time.time()
    normalBranches = []
    jaggedBranches = []
    for name, branch in tree.allitems():
        data = branch.array(cache=cache)
        if not isinstance(data, uproot.interp.jagged.JaggedArray):
            normalBranches.append(name)
        else:
            jaggedBranches.append(name)
    endTime = time.time()
    print("# {0}".format(endTime - startTime))

# 1.10440707207
# 0.0707149505615
# 0.0708758831024
# 0.0705630779266
# 0.0711648464203

In #41 we saw that loading 9000 entries takes 1‒2 seconds; recalling those 9000 entries from the cache now takes about 0.07 seconds.
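The speedup comes from the usual load-on-miss pattern with a plain dict as the cache. The following is a minimal sketch of that pattern, not uproot's implementation: read_with_cache and slow_load are hypothetical helpers, and the key values are made up; only the layout of the key string follows the format string visible in the traceback above.

```python
def cachekey(sourcepath, treename, branchname, identifier, entrystart, entrystop):
    # Same field layout as the format string shown in the traceback
    return "{0};{1};{2};{3};{4}-{5}".format(
        sourcepath, treename, branchname, identifier, entrystart, entrystop)


def read_with_cache(cache, key, load):
    """Return cache[key], calling load() only when the key is missing."""
    if key not in cache:
        cache[key] = load()
    return cache[key]


calls = []

def slow_load():
    # Hypothetical stand-in for reading and decoding a branch from disk
    calls.append(1)
    return [1, 2, 3]


cache = {}
key = cachekey("nano-RelValTTBar.root", "Events", "run", "Bu4", 0, 9000)
first = read_with_cache(cache, key, slow_load)    # miss: calls slow_load
second = read_with_cache(cache, key, slow_load)   # hit: returns the stored object
```

Because the key is a plain string, any dict-like object can serve as the cache, which is also why the key has to avoid characters that are unsafe in filenames.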
