This repository has been archived by the owner on Jun 21, 2022. It is now read-only.

Fix iteration over DataFrames and provide more interfaces #264

Merged: 5 commits, Mar 27, 2019

Conversation

jpivarski
Member

This PR fixes #263 and provides new methods and functions:

  • tree.pandas.iterate is like tree.pandas.df in that it sets some Pandas-friendly defaults, but on tree.iterate, rather than tree.arrays.
  • uproot.pandas.iterate sets those Pandas-friendly defaults on uproot.iterate.
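A minimal usage sketch of the multi-file variant, assuming uproot 3.x (the archived package this PR belongs to) and assuming that uproot.pandas.iterate accepts the same leading positional arguments as uproot.iterate (paths, tree path, branches). The helper name collect_frames and its iterate parameter are hypothetical, added only so the loop can be exercised without ROOT files:

```python
def collect_frames(paths, treepath, branches, iterate=None):
    """Collect one DataFrame per chunk into a list.

    `iterate` is a hypothetical injection point for testing; by default
    the function falls back to uproot.pandas.iterate.
    """
    if iterate is None:
        # Assumption: uproot 3.x is installed. Later major versions of
        # uproot have a different API and do not provide this function.
        import uproot
        iterate = uproot.pandas.iterate

    frames = []
    # Each step yields a DataFrame covering one chunk of entries.
    for df in iterate(paths, treepath, branches):
        frames.append(df)
    return frames
```

With real files this would be called as collect_frames(listoffiles, "SREffi_dsa/cutsTree", ["recoPt", "recoEta"]), using the tree and branch names from the report further down in this conversation.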

Various bugs were fixed. For instance, tree.iterate really wasn't Pandas-ready: it had fallen considerably behind tree.arrays. Fortunately, most of the Pandas-specific logic now lives in a single function call, so bringing tree.iterate up to speed required no code duplication, and it will inherit any future updates.

Also, the globalentrystart offset in uproot.iterate was broken for Pandas DataFrames, both for MultiIndex and for RangeIndex. The original code seemed to expect an Int64Index for some reason.
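To illustrate what the offset has to do, here is a rough sketch (not the code in this PR) of shifting a chunk's entry index by its global starting entry, for both index kinds. The function name shift_entry_index is hypothetical, and the sketch assumes that level 0 of a MultiIndex holds the entry number:

```python
import pandas as pd

def shift_entry_index(df, globalentrystart):
    """Return a copy of df whose entry numbers are offset by globalentrystart."""
    out = df.copy()
    if isinstance(out.index, pd.MultiIndex):
        # Assumption: level 0 is the entry number and deeper levels are
        # subentries, so only level 0 is shifted.
        shifted = out.index.levels[0] + globalentrystart
        out.index = out.index.set_levels(shifted, level=0)
    else:
        # Flat index (for example a RangeIndex): shift every label.
        out.index = out.index + globalentrystart
    return out
```

Applied to each chunk with that chunk's starting entry, this keeps index labels unique and globally meaningful when the chunks are later concatenated.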

All in all, the combination of iterate + Pandas was just out of date with respect to the changes that have been more fully tested on arrays + Pandas.

@jpivarski jpivarski merged commit adba8a5 into master Mar 27, 2019
@jpivarski jpivarski deleted the issue-263 branch March 27, 2019 16:26
@afrankenthal

Hi Jim, I just manually updated my uproot version to 3.4.16 following the master branch, and I'm running into this error now when using uproot.iterate:

AttributeError Traceback (most recent call last)
in ()
4
5 listofarrays = []
----> 6 for arrays in uproot.iterate(listoffiles[0:2], "SREffi_dsa/cutsTree", ['recoPt','recoEta'],outputtype=pd.DataFrame, flatten=True, executor=executor):
7 listofarrays.append(arrays)

/uscms/homes/a/as2872/nobackup/conda/uprootgit/lib/python3.6/site-packages/uproot/tree.py in iterate(path, treepath, branches, entrysteps, outputtype, namedecode, reportpath, reportfile, reportentries, flatten, flatname, awkwardlib, cache, basketcache, keycache, executor, blocking, localsource, xrootdsource, httpsource, **options)
    117         if getattr(outputtype, "__name__", None) == "DataFrame" and getattr(outputtype, "__module__", None) == "pandas.core.frame":
    118             if type(arrays.index).__name__ == "MultiIndex":
--> 119                 index = arrays.index.levels[0].to_numpy()
    120                 awkward.numpy.add(index, globalentrystart, out=index)
    121             elif type(arrays.index).__name__ == "RangeIndex":

AttributeError: 'Int64Index' object has no attribute 'to_numpy'

I notice you mentioned something about Int64Index in the PR note, but I'm not sure how exactly this might relate?

Thanks!

@jpivarski
Member Author

I chose the index.to_numpy() method over the index.values property, where index is a Pandas Index. It's possible that this method is too new: updating your version of Pandas should work around it, and I need to switch to the old interface for better support.

Does a new version of Pandas fix it? Even if it does, I'm going to want to support more than just the latest version...

@jpivarski
Member Author

The to_numpy() method was introduced in Pandas 0.24.0. The new property is array (what I actually want, rather than to_numpy(), which might do a conversion) and the old property is values. I'll make a new version of uproot that tries array before values.
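The "try array before values" fallback could look roughly like the following. This is a sketch of the feature-detection idea, not the code that shipped in uproot. The helper name index_data and the two stand-in classes are hypothetical; the classes only mimic which attributes an Index exposes before and after Pandas 0.24.0, so the example runs without Pandas:

```python
def index_data(index):
    """Return the data behind a Pandas-like Index.

    Prefers the newer .array property and falls back to the older
    .values property when .array does not exist.
    """
    array = getattr(index, "array", None)
    if array is not None:
        return array
    return index.values


class OldStyleIndex:
    """Stand-in for an Index from Pandas < 0.24.0: only .values exists."""
    def __init__(self, data):
        self.values = list(data)


class NewStyleIndex(OldStyleIndex):
    """Stand-in for an Index from Pandas >= 0.24.0: .array also exists."""
    def __init__(self, data):
        super().__init__(data)
        self.array = list(data)
```

Detecting the attribute rather than comparing version strings keeps the check independent of how Pandas numbers its releases.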

@afrankenthal

Cool, I'm trying to install pandas 0.24.2 (current one is 0.23.4). It's probably a good idea to have the latest pandas anyway.

@afrankenthal

Cool, that worked! I think you were faster than me and I ended up getting 3.4.17, so it works with pandas 0.24.2. I can also try with my other conda env that still has pandas 0.23.4 to see if 3.4.17 works as well.

@jpivarski
Member Author

Running another version through is also a good test of Travis. It looks like the system is back up—mostly. Jobs start in decent time now, but "solving environment" for conda in Python 2.7 now takes longer than Travis is willing to wait. I'm thinking this issue is unrelated to the Travis outage; it just happened at the same time.

I don't like not being able to deploy whenever I want. (Grumble.)

@afrankenthal

I feel you! But apart from the conda delay it looks like this was a major Travis outage.

Anyway, I tested 3.4.17 with pandas 0.23.4 and it seems to work as well! I also asked Alexx, who manages a conda package in the LPC, to update uproot when the latest release gets deployed, so others don't have to install their own env.

Thanks a lot for your help and quick problem-solving!

Development

Successfully merging this pull request may close these issues.

MultiIndex pandas dataframe from uproot.iterate