MultiIndex.__contains__ try-except exception types too narrow #24570

xwang777 · 2019-01-02T19:58:55Z

Code Sample

import pandas as pd 
import numpy as np  
tx = pd.timedelta_range('09:30:00','16:00:00', freq='30 min')       
# ys is Series with MultiIndex, first level is Timedelta
ys = pd.DataFrame(np.random.randn(len(tx), 2), index=tx, columns = ('c0','c1')).set_index('c0', append=True)['c1']  
# this is what sklearn.metrics.r2_score would do
hasattr(ys, 'fit')

Problem description

sklearn tries to handle numpy array-like args, including pd.Series and pd.DataFrame. While validating the argument, sklearn calls hasattr(ys, 'fit'), which calls into the _can_hold_identifiers_and_holds_name function of the MultiIndex object. At line 2111 of pandas/core/indexes/base.py, that function checks whether 'fit' is in the MultiIndex, which in turn calls self.get_loc(key) at line 549 of pandas/core/indexes/multi.py. The get_loc call is wrapped in a try-except, but only for LookupError and TypeError, whereas here a ValueError is raised.
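
To make the call chain above easier to follow, here is a minimal stdlib-only sketch of it. ToyLevel, ToyIndex, and ToySeries are hypothetical stand-ins I made up for illustration, not pandas classes, and the string check in ToyLevel.get_loc is only an assumption that mimics the Timedelta parsing failure. The point it shows is that hasattr only swallows AttributeError, so a ValueError escaping from __contains__ propagates all the way out.

```python
class ToyLevel:
    """Hypothetical stand-in for the Timedelta level of the MultiIndex."""

    def get_loc(self, key):
        # Assumption: mimic Timedelta('fit') failing to parse with ValueError.
        if isinstance(key, str) and not key[:1].isdigit():
            raise ValueError("unit abbreviation w/o a number")
        # Anything else is simply "not found" in this toy model.
        raise KeyError(key)


class ToyIndex:
    """Hypothetical stand-in for MultiIndex with a configurable except clause."""

    def __init__(self, extra_exceptions=()):
        self._level = ToyLevel()
        # The reported behaviour catches only LookupError and TypeError.
        self._caught = (LookupError, TypeError) + tuple(extra_exceptions)

    def __contains__(self, key):
        hash(key)
        try:
            self._level.get_loc(key)
            return True
        except self._caught:
            return False


class ToySeries:
    """Hypothetical stand-in for a Series whose __getattr__ consults the index."""

    def __init__(self, index):
        self._index = index

    def __getattr__(self, name):
        # Only reached when normal attribute lookup fails.
        if name.startswith("_"):
            raise AttributeError(name)
        if name in self._index:
            return "value"
        raise AttributeError(name)


def safe_hasattr(obj, name):
    """Return hasattr's result, or the exception type that escaped from it."""
    try:
        return hasattr(obj, name)
    except Exception as exc:
        return type(exc)


narrow = ToySeries(ToyIndex())
wide = ToySeries(ToyIndex(extra_exceptions=(ValueError,)))
print(safe_hasattr(narrow, "fit"))  # ValueError escapes hasattr
print(safe_hasattr(wide, "fit"))    # ValueError is caught, result is False
```

In this toy model, adding ValueError to the caught exception types is enough to make the membership test return False. Whether that is the right fix inside pandas itself is a separate question for the maintainers.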

The old version 0.22.0 works fine.

For now I have worked around the problem by adding a try-except inside the _can_hold_identifiers_and_holds_name function that returns False on any exception. It looks like this:

    def _can_hold_identifiers_and_holds_name(self, name):
        if self.is_object() or self.is_categorical():
            try:
                return name in self
            except Exception:
                return False
        return False

Expected Output

hasattr(ys, 'fit') should return False. Instead it raises:

In [4]: hasattr(ys, 'fit')

ValueError Traceback (most recent call last)
in
----> 1 hasattr(ys, 'fit')

~/app/anaconda3/lib/python3.7/site-packages/pandas/core/generic.py in __getattr__(self, name)
   4372             return object.__getattribute__(self, name)
   4373         else:
-> 4374             if self._info_axis._can_hold_identifiers_and_holds_name(name):
   4375                 return self[name]
   4376             return object.__getattribute__(self, name)

~/app/anaconda3/lib/python3.7/site-packages/pandas/core/indexes/base.py in _can_hold_identifiers_and_holds_name(self, name)
   2109         """
   2110         if self.is_object() or self.is_categorical():
-> 2111             return name in self
   2112         return False
   2113

~/app/anaconda3/lib/python3.7/site-packages/pandas/core/indexes/multi.py in __contains__(self, key)
    547         hash(key)
    548         try:
--> 549             self.get_loc(key)
    550             return True
    551         except (LookupError, TypeError):

~/app/anaconda3/lib/python3.7/site-packages/pandas/core/indexes/multi.py in get_loc(self, key, method)
   2235
   2236         if not isinstance(key, tuple):
-> 2237             loc = self._get_level_indexer(key, level=0)
   2238
   2239             # _get_level_indexer returns an empty slice if the key has

~/app/anaconda3/lib/python3.7/site-packages/pandas/core/indexes/multi.py in _get_level_indexer(self, key, level, indexer)
   2494         else:
   2495
-> 2496             loc = level_index.get_loc(key)
   2497             if isinstance(loc, slice):
   2498                 return loc

~/app/anaconda3/lib/python3.7/site-packages/pandas/core/indexes/timedeltas.py in get_loc(self, key, method, tolerance)
    783
    784         if _is_convertible_to_td(key):
--> 785             key = Timedelta(key)
    786             return Index.get_loc(self, key, method, tolerance)
    787

pandas/_libs/tslibs/timedeltas.pyx in pandas._libs.tslibs.timedeltas.Timedelta.__new__()

pandas/_libs/tslibs/timedeltas.pyx in pandas._libs.tslibs.timedeltas.parse_timedelta_string()

ValueError: unit abbreviation w/o a number

Output of `pd.show_versions()`

INSTALLED VERSIONS

commit: None
python: 3.7.1.final.0
python-bits: 64
OS: Linux
OS-release: 2.6.32-431.el6.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.23.4
pytest: 4.0.2
pip: 18.1
setuptools: 40.6.3
Cython: 0.29.2
numpy: 1.15.4
scipy: 1.1.0
pyarrow: 0.11.1
xarray: 0.11.0
IPython: 7.2.0
sphinx: 1.8.2
patsy: 0.5.1
dateutil: 2.7.5
pytz: 2018.7
blosc: None
bottleneck: 1.2.1
tables: 3.4.4
numexpr: 2.6.8
feather: None
matplotlib: 3.0.2
openpyxl: 2.5.12
xlrd: 1.2.0
xlwt: 1.3.0
xlsxwriter: 1.1.2
lxml: 4.2.5
bs4: 4.6.3
html5lib: 1.0.1
sqlalchemy: 1.2.15
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: 0.1.6
pandas_gbq: None
pandas_datareader: None

sklearn: 0.20.1

jreback · 2019-01-02T22:18:09Z

Can you provide an example that doesn't use an external package?

xwang777 · 2019-01-03T17:09:26Z

Sure. Here's the reduced example that only uses pandas and numpy. The exception is the same.

import pandas as pd 
import numpy as np  
tx = pd.timedelta_range('09:30:00','16:00:00', freq='30 min')       
# ys is Series with MultiIndex, first level is Timedelta
ys = pd.DataFrame(np.random.randn(len(tx), 2), index=tx, columns = ('c0','c1')).set_index('c0', append=True)['c1']  
# this is what sklearn.metrics would do
hasattr(ys, 'fit')

Output