Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MultiIndex.__contains__ try-catch exception types too narrow #24570

Closed
xwang777 opened this issue Jan 2, 2019 · 12 comments · Fixed by #25268
Closed

MultiIndex.__contains__ try-catch exception types too narrow #24570

xwang777 opened this issue Jan 2, 2019 · 12 comments · Fixed by #25268
Labels
Milestone

Comments

@xwang777
Copy link

xwang777 commented Jan 2, 2019

Code Sample

import pandas as pd 
import numpy as np  
tx = pd.timedelta_range('09:30:00','16:00:00', freq='30 min')       
# ys is Series with MultiIndex, first level is Timedelta
ys = pd.DataFrame(np.random.randn(len(tx), 2), index=tx, columns = ('c0','c1')).set_index('c0', append=True)['c1']  
# this is what sklearn.metrics.r2_score would do
hasattr(ys, 'fit')

Problem description

sklearn tries to handle numpy array-like args, including pd.Series and pd.DataFrame. In trying to validate the argument, sklearn calls hasattr(ys, 'fit'), which calls into the _can_hold_identifiers_and_holds_name function of the MultiIndex object. In line 2111 of pandas/core/indexes/base.py, it tries to check whether 'fit' is in the MultiIndex, which in turns calls into .
pandas/core/indexes/multi.py line 539 for 'self.get_loc(key)'. The get_loc is surrounded by try-catch, but only for LookupError and TypeError, whereas here it's ValueError that's thrown.

The old version 0.22.0 works fine.

For now I added a try-catch _can_hold_identifiers_and_holds_name function and return False. It works around the problem. It looks like this:

    def _can_hold_identifiers_and_holds_name(self, name):
        if self.is_object() or self.is_categorical():
            try:
                return name in self
            except Exception:
                return False
        return False

Expected Output

In [4]: hasattr(ys, 'fit')

ValueError Traceback (most recent call last)
in
----> 1 hasattr(ys, 'fit')

~/app/anaconda3/lib/python3.7/site-packages/pandas/core/generic.py in getattr(self, name)
4372 return object.getattribute(self, name)
4373 else:
-> 4374 if self._info_axis._can_hold_identifiers_and_holds_name(name):
4375 return self[name]
4376 return object.getattribute(self, name)

~/app/anaconda3/lib/python3.7/site-packages/pandas/core/indexes/base.py in _can_hold_identifiers_and_holds_name(self, name)
2109 """
2110 if self.is_object() or self.is_categorical():
-> 2111 return name in self
2112 return False
2113

~/app/anaconda3/lib/python3.7/site-packages/pandas/core/indexes/multi.py in contains(self, key)
547 hash(key)
548 try:
--> 549 self.get_loc(key)
550 return True
551 except (LookupError, TypeError):

~/app/anaconda3/lib/python3.7/site-packages/pandas/core/indexes/multi.py in get_loc(self, key, method)
2235
2236 if not isinstance(key, tuple):
-> 2237 loc = self._get_level_indexer(key, level=0)
2238
2239 # _get_level_indexer returns an empty slice if the key has

~/app/anaconda3/lib/python3.7/site-packages/pandas/core/indexes/multi.py in _get_level_indexer(self, key, level, indexer)
2494 else:
2495
-> 2496 loc = level_index.get_loc(key)
2497 if isinstance(loc, slice):
2498 return loc

~/app/anaconda3/lib/python3.7/site-packages/pandas/core/indexes/timedeltas.py in get_loc(self, key, method, tolerance)
783
784 if _is_convertible_to_td(key):
--> 785 key = Timedelta(key)
786 return Index.get_loc(self, key, method, tolerance)
787

pandas/_libs/tslibs/timedeltas.pyx in pandas._libs.tslibs.timedeltas.Timedelta.new()

pandas/_libs/tslibs/timedeltas.pyx in pandas._libs.tslibs.timedeltas.parse_timedelta_string()

ValueError: unit abbreviation w/o a number

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.7.1.final.0
python-bits: 64
OS: Linux
OS-release: 2.6.32-431.el6.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.23.4
pytest: 4.0.2
pip: 18.1
setuptools: 40.6.3
Cython: 0.29.2
numpy: 1.15.4
scipy: 1.1.0
pyarrow: 0.11.1
xarray: 0.11.0
IPython: 7.2.0
sphinx: 1.8.2
patsy: 0.5.1
dateutil: 2.7.5
pytz: 2018.7
blosc: None
bottleneck: 1.2.1
tables: 3.4.4
numexpr: 2.6.8
feather: None
matplotlib: 3.0.2
openpyxl: 2.5.12
xlrd: 1.2.0
xlwt: 1.3.0
xlsxwriter: 1.1.2
lxml: 4.2.5
bs4: 4.6.3
html5lib: 1.0.1
sqlalchemy: 1.2.15
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: 0.1.6
pandas_gbq: None
pandas_datareader: None

, and sklearn: 0.20.1

@xwang777 xwang777 changed the title Index._can_hold_identifiers_and_holds_name should try-catch MultiIndex.__contains__ try-catch exception types too narrow Jan 2, 2019
@jreback
Copy link
Contributor

jreback commented Jan 2, 2019

can you provide an example that doesn't use an external package

@xwang777
Copy link
Author

xwang777 commented Jan 3, 2019

Sure. Here's the reduced example that only uses pandas and numpy. The exception is the same.

import pandas as pd 
import numpy as np  
tx = pd.timedelta_range('09:30:00','16:00:00', freq='30 min')       
# ys is Series with MultiIndex, first level is Timedelta
ys = pd.DataFrame(np.random.randn(len(tx), 2), index=tx, columns = ('c0','c1')).set_index('c0', append=True)['c1']  
# this is what sklearn.metrics would do
hasattr(ys, 'fit')

Output

In [4]: hasattr(ys, 'fit')

ValueError Traceback (most recent call last)
in
----> 1 hasattr(ys, 'fit')

~/app/anaconda3/lib/python3.7/site-packages/pandas/core/generic.py in getattr(self, name)
4372 return object.getattribute(self, name)
4373 else:
-> 4374 if self._info_axis._can_hold_identifiers_and_holds_name(name):
4375 return self[name]
4376 return object.getattribute(self, name)

~/app/anaconda3/lib/python3.7/site-packages/pandas/core/indexes/base.py in _can_hold_identifiers_and_holds_name(self, name)
2109 """
2110 if self.is_object() or self.is_categorical():
-> 2111 return name in self
2112 return False
2113

~/app/anaconda3/lib/python3.7/site-packages/pandas/core/indexes/multi.py in contains(self, key)
547 hash(key)
548 try:
--> 549 self.get_loc(key)
550 return True
551 except (LookupError, TypeError):

~/app/anaconda3/lib/python3.7/site-packages/pandas/core/indexes/multi.py in get_loc(self, key, method)
2235
2236 if not isinstance(key, tuple):
-> 2237 loc = self._get_level_indexer(key, level=0)
2238
2239 # _get_level_indexer returns an empty slice if the key has

~/app/anaconda3/lib/python3.7/site-packages/pandas/core/indexes/multi.py in _get_level_indexer(self, key, level, indexer)
2494 else:
2495
-> 2496 loc = level_index.get_loc(key)
2497 if isinstance(loc, slice):
2498 return loc

~/app/anaconda3/lib/python3.7/site-packages/pandas/core/indexes/timedeltas.py in get_loc(self, key, method, tolerance)
783
784 if _is_convertible_to_td(key):
--> 785 key = Timedelta(key)
786 return Index.get_loc(self, key, method, tolerance)
787

pandas/_libs/tslibs/timedeltas.pyx in pandas._libs.tslibs.timedeltas.Timedelta.new()

pandas/_libs/tslibs/timedeltas.pyx in pandas._libs.tslibs.timedeltas.parse_timedelta_string()

ValueError: unit abbreviation w/o a number

@jreback
Copy link
Contributor

jreback commented Jan 3, 2019

can you edit the example at the top with this

@jreback
Copy link
Contributor

jreback commented Jan 3, 2019

yeah this should be handled here:

~/pandas/pandas/core/indexes/timedeltas.py in get_loc(self, key, method, tolerance)
    546 
    547         if _is_convertible_to_td(key):
--> 548             key = Timedelta(key)
    549             return Index.get_loc(self, key, method, tolerance)
    550 

~/pandas/pandas/_libs/tslibs/timedeltas.pyx in pandas._libs.tslibs.timedeltas.Timedelta.__new__()

where if the conversion fails (e.g. put a try/except ValueError) around this.

PRs welcome.

@xwang777
Copy link
Author

xwang777 commented Jan 3, 2019

OK, I updated the example and output in the first comment.

Regarding the fix, why not try-catch Exception in the _can_hold_identifiers_and_holds_name function? This example is caused by Timedelta, but it could very well be caused by Timestamp, or any other types that fail to be constructed from a str.

@jreback
Copy link
Contributor

jreback commented Jan 3, 2019

yes u can try that see if anything breaks

@xwang777
Copy link
Author

The patch is very simple, I could create pull request easily. However, the tests (python setup.py test) failed right away, but not due to my changes. Anyone please tell me where the problem might be? Thanks

run test output

(pandas-dev) [02:19:47|xwang0-el7 pandas-xw]$ python setup.py test
running test
running egg_info
writing pandas.egg-info/PKG-INFO
writing dependency_links to pandas.egg-info/dependency_links.txt
writing requirements to pandas.egg-info/requires.txt
writing top-level names to pandas.egg-info/top_level.txt
reading manifest file 'pandas.egg-info/SOURCES.txt'
reading manifest template 'MANIFEST.in'
no previously-included directories found matching 'doc/build'
warning: no previously-included files matching '.pyd' found anywhere in distribution
warning: no previously-included files matching '
~' found anywhere in distribution
warning: no previously-included files matching '.DS_Store' found anywhere in distribution
warning: no previously-included files matching '.git*' found anywhere in distribution
warning: no previously-included files matching '#*' found anywhere in distribution
writing manifest file 'pandas.egg-info/SOURCES.txt'
running build_ext
/home/xwang/app/anaconda2/envs/pandas-dev/lib/python3.7/unittest/loader.py:154: FutureWarning: 'pandas.core' is private. Use 'pandas.Categorical'
module = import(module_name)
Traceback (most recent call last):
File "setup.py", line 743, in
**setuptools_kwargs)
File "/home/xwang/app/anaconda2/envs/pandas-dev/lib/python3.7/site-packages/setuptools/init.py", line 143, in setup
return distutils.core.setup(**attrs)
File "/home/xwang/app/anaconda2/envs/pandas-dev/lib/python3.7/distutils/core.py", line 148, in setup
dist.run_commands()
File "/home/xwang/app/anaconda2/envs/pandas-dev/lib/python3.7/distutils/dist.py", line 966, in run_commands
self.run_command(cmd)
File "/home/xwang/app/anaconda2/envs/pandas-dev/lib/python3.7/distutils/dist.py", line 985, in run_command
cmd_obj.run()
File "/home/xwang/app/anaconda2/envs/pandas-dev/lib/python3.7/site-packages/setuptools/command/test.py", line 228, in run
self.run_tests()
File "/home/xwang/app/anaconda2/envs/pandas-dev/lib/python3.7/site-packages/setuptools/command/test.py", line 250, in run_tests
exit=False,
File "/home/xwang/app/anaconda2/envs/pandas-dev/lib/python3.7/unittest/main.py", line 100, in init
self.parseArgs(argv)
File "/home/xwang/app/anaconda2/envs/pandas-dev/lib/python3.7/unittest/main.py", line 124, in parseArgs
self._do_discovery(argv[2:])
File "/home/xwang/app/anaconda2/envs/pandas-dev/lib/python3.7/unittest/main.py", line 244, in _do_discovery
self.createTests(from_discovery=True, Loader=Loader)
File "/home/xwang/app/anaconda2/envs/pandas-dev/lib/python3.7/unittest/main.py", line 154, in createTests
self.test = loader.discover(self.start, self.pattern, self.top)
File "/home/xwang/app/anaconda2/envs/pandas-dev/lib/python3.7/unittest/loader.py", line 347, in discover
tests = list(self._find_tests(start_dir, pattern))
File "/home/xwang/app/anaconda2/envs/pandas-dev/lib/python3.7/unittest/loader.py", line 404, in _find_tests
full_path, pattern, namespace)
File "/home/xwang/app/anaconda2/envs/pandas-dev/lib/python3.7/unittest/loader.py", line 481, in _find_test_path
tests = self.loadTestsFromModule(package, pattern=pattern)
File "/home/xwang/app/anaconda2/envs/pandas-dev/lib/python3.7/site-packages/setuptools/command/test.py", line 54, in loadTestsFromModule
tests.append(self.loadTestsFromName(submodule))
File "/home/xwang/app/anaconda2/envs/pandas-dev/lib/python3.7/unittest/loader.py", line 191, in loadTestsFromName
return self.loadTestsFromModule(obj)
File "/home/xwang/app/anaconda2/envs/pandas-dev/lib/python3.7/site-packages/setuptools/command/test.py", line 54, in loadTestsFromModule
tests.append(self.loadTestsFromName(submodule))
File "/home/xwang/app/anaconda2/envs/pandas-dev/lib/python3.7/unittest/loader.py", line 191, in loadTestsFromName
return self.loadTestsFromModule(obj)
File "/home/xwang/app/anaconda2/envs/pandas-dev/lib/python3.7/site-packages/setuptools/command/test.py", line 54, in loadTestsFromModule
tests.append(self.loadTestsFromName(submodule))
File "/home/xwang/app/anaconda2/envs/pandas-dev/lib/python3.7/unittest/loader.py", line 191, in loadTestsFromName
return self.loadTestsFromModule(obj)
File "/home/xwang/app/anaconda2/envs/pandas-dev/lib/python3.7/site-packages/setuptools/command/test.py", line 54, in loadTestsFromModule
tests.append(self.loadTestsFromName(submodule))
File "/home/xwang/app/anaconda2/envs/pandas-dev/lib/python3.7/unittest/loader.py", line 154, in loadTestsFromName
module = import(module_name)
File "/home/xwang/dev/pandas-xw/pandas/tests/indexes/datetimes/test_misc.py", line 91, in
class TestDatetime64(object):
File "/home/xwang/dev/pandas-xw/pandas/tests/indexes/datetimes/test_misc.py", line 246, in TestDatetime64
None] if tm.get_locales() is None else [None] + tm.get_locales())
File "/home/xwang/dev/pandas-xw/pandas/util/testing.py", line 516, in get_locales
x, encoding=pd.options.display.encoding))
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe5 in position 4: invalid continuation byte

I forked from master, here's the top of my 'git log --name-status' output:

git log --name-status

commit 6f9ab14
Author: xwang777 30608423+xwang777@users.noreply.github.com
Date: Tue Jan 8 13:52:31 2019 -0500

Address issue 24570

M pandas/core/indexes/base.py

commit f57bd72
Author: Simon Hawkins simonjayhawkins@gmail.com
Date: Tue Jan 8 13:07:49 2019 +0000

REF: io/formats/html.py (and io/formats/format.py) (#24651)

M pandas/io/formats/format.py
M pandas/io/formats/html.py

@optilude
Copy link

optilude commented Feb 7, 2019

I was about to raise an issue for a problem with pivot_table(), but I believe it's the same bug. The relevant code is:

        chart_data.pivot_table(
            index='age',
            columns='priority',
            values='key',
            aggfunc='count'
        )

Here chart_data is a DataFrame with columns age (a Timedelta), priority (a string), and key (another string). There are no N/A values in any of those columns.

Before upgrading to Pandas 0.24.1 this worked. It now throws:

Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/jira_agile_metrics/calculator.py", line 56, in run_calculators
    c.write()
  File "/usr/lib/python3.6/site-packages/jira_agile_metrics/calculators/debt.py", line 83, in write
    self.write_debt_age_chart(chart_data, self.settings['debt_age_chart'])
  File "/usr/lib/python3.6/site-packages/jira_agile_metrics/calculators/debt.py", line 132, in write_debt_age_chart
    aggfunc='count'
  File "/usr/lib/python3.6/site-packages/pandas/core/frame.py", line 5759, in pivot_table
    margins_name=margins_name)
  File "/usr/lib/python3.6/site-packages/pandas/core/reshape/pivot.py", line 84, in pivot_table
    agged = agged.dropna(how='all')
  File "/usr/lib/python3.6/site-packages/pandas/core/frame.py", line 4598, in dropna
    result = self.loc(axis=axis)[mask]
  File "/usr/lib/python3.6/site-packages/pandas/core/indexing.py", line 1500, in __getitem__
    return self._getitem_axis(maybe_callable, axis=axis)
  File "/usr/lib/python3.6/site-packages/pandas/core/indexing.py", line 1859, in _getitem_axis
    if is_iterator(key):
  File "/usr/lib/python3.6/site-packages/pandas/core/dtypes/inference.py", line 157, in is_iterator
    return hasattr(obj, '__next__')
  File "/usr/lib/python3.6/site-packages/pandas/core/generic.py", line 5065, in __getattr__
    if self._info_axis._can_hold_identifiers_and_holds_name(name):
  File "/usr/lib/python3.6/site-packages/pandas/core/indexes/base.py", line 3984, in _can_hold_identifiers_and_holds_name
    return name in self
  File "/usr/lib/python3.6/site-packages/pandas/core/indexes/multi.py", line 841, in __contains__
    self.get_loc(key)
  File "/usr/lib/python3.6/site-packages/pandas/core/indexes/multi.py", line 2397, in get_loc
    loc = self._get_level_indexer(key, level=0)
  File "/usr/lib/python3.6/site-packages/pandas/core/indexes/multi.py", line 2652, in _get_level_indexer
    code = level_index.get_loc(key)
  File "/usr/lib/python3.6/site-packages/pandas/core/indexes/timedeltas.py", line 548, in get_loc
    key = Timedelta(key)
  File "pandas/_libs/tslibs/timedeltas.pyx", line 1167, in pandas._libs.tslibs.timedeltas.Timedelta.__new__
  File "pandas/_libs/tslibs/timedeltas.pyx", line 463, in pandas._libs.tslibs.timedeltas.parse_timedelta_string
ValueError: unit abbreviation w/o a number

If I change the pivot_table call by swapping index and columns (so index="priority", columns="age"), it no longer throws.

@jorisvandenbossche
Copy link
Member

@optilude can you try to give a small reproducible example? (eg by creating a dummy dataframe with the specific dtypes you have that shows the problem)

@optilude
Copy link

optilude commented Feb 7, 2019

Sure:

import pandas as pd
df = pd.DataFrame({
    'key': ['A', 'B', 'C'],
    'age': [pd.Timedelta('1 days'), pd.Timedelta('2 days'), pd.Timedelta('2 days')],
    'priority':  ['Low', 'Low', 'High']})
df.pivot_table(index='age', columns='priority', values='key', aggfunc='count')

Throws:

ValueError: unit abbreviation w/o a number

@jorisvandenbossche
Copy link
Member

OK, thanks. I can confirm the regression. It looks similar in underlying reason as #25087, another reported regression.

@xwang777
Copy link
Author

This is still broken in 0.24.2, with the same stack trace.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants