pandas cut() not working as expected in version 0.20.3 #17047

Closed
HugoDLopes opened this issue Jul 21, 2017 · 6 comments
Labels: Duplicate Report, Interval, Regression

HugoDLopes commented Jul 21, 2017

Code Sample

>>> import numpy as np
>>> import pandas as pd
>>>
>>> # Toy data
>>> series = pd.Series([0, 1, 5, 6, 10, None, np.nan])
>>> bins = [0, 5, 8]
>>>
>>> pd.cut(series, bins=bins, include_lowest=True)

Out[23]:
0    (-0.001, 5.0]
1    (-0.001, 5.0]
2    (-0.001, 5.0]
3       (5.0, 8.0]
4              NaN
5              NaN
6              NaN
dtype: category
Categories (2, interval[float64]): [(-0.001, 5.0] < (5.0, 8.0]]

Problem description

I expected the first interval to be [0, 5], closed on the left, for the first three rows. Instead, a lower bound of -0.001 was "invented" (?). This worked in version 0.19.2, which produced the output shown under Expected Output. Is this a bug? Is there anything I have to specify to make it work as before?
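To see where the -0.001 comes from, it can help to inspect the categories directly. The sketch below assumes pandas 0.20 or later, where `pd.cut` returns interval categories; the names `result`, `categories` and `first` are illustrative and not from the original report. It checks which side of the intervals is closed and whether 0 still falls inside the first bin.

```python
import numpy as np
import pandas as pd

# Same toy data as in the report
series = pd.Series([0, 1, 5, 6, 10, None, np.nan])
bins = [0, 5, 8]

result = pd.cut(series, bins=bins, include_lowest=True)

# The categories form an IntervalIndex; all of its intervals share one closed side
categories = result.cat.categories
first = categories[0]

print(categories.closed)        # the closed side shared by every interval
print(first.left, first.right)  # edges of the first bin
print(0 in first)               # membership test for the lowest bin edge
```

Because every interval in the index is closed on the same side, the first bin seems to be widened slightly to the left so that it still contains 0, rather than being displayed as `[0, 5]`.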

Expected Output

0      [0, 5.0]
1      [0, 5.0]
2      [0, 5.0]
3    (5.0, 8.0]
4           NaN
5           NaN
6           NaN
dtype: category
Categories (2, interval[float64]): [[0, 5.0] < (5.0, 8.0]]
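If only the displayed labels matter, one possible workaround is to pass explicit `labels` to `pd.cut`. This is a sketch, not a fix confirmed in this thread: the label strings and the name `labelled` are hypothetical, and the resulting categories are plain strings rather than interval objects.

```python
import numpy as np
import pandas as pd

series = pd.Series([0, 1, 5, 6, 10, None, np.nan])
bins = [0, 5, 8]

# Hypothetical label strings that mimic the 0.19.2-style display
old_style_labels = ["[0, 5]", "(5, 8]"]

labelled = pd.cut(series, bins=bins, include_lowest=True, labels=old_style_labels)
print(labelled)
```

The binning itself is unchanged; only the category names differ, so any code that relies on interval objects would not work with this variant.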

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.5.2.final.0
python-bits: 64
OS: Linux
OS-release: 4.4.0-79-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.20.3
pytest: None
pip: 9.0.1
setuptools: 36.2.0
Cython: None
numpy: 1.13.1
scipy: 0.19.1
xarray: None
IPython: 5.1.0
sphinx: None
patsy: None
dateutil: 2.6.1
pytz: 2017.2
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 2.0.0
openpyxl: 2.4.1
xlrd: 1.0.0
xlwt: 1.2.0
xlsxwriter: 0.9.8
lxml: None
bs4: None
html5lib: 0.999999999
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
pandas_gbq: None
pandas_datareader: None

gfyoung added the Interval and Regression labels Jul 21, 2017
gfyoung (Member) commented Jul 21, 2017

@HugoDLopes: For reference, could you share the output you get with include_lowest=False?

gfyoung (Member) commented Jul 21, 2017

Also, I agree that inventing numbers that disregard the include_lowest=True parameter is a regression. A PR to patch this is welcome!

gfyoung added this to the 0.21.0 milestone Jul 21, 2017
gfyoung (Member) commented Jul 21, 2017

Good news: there aren't too many commits to this file since its inception (or refactoring by @jreback ).

Your culprit appears to be #16466.

cc @economy

HugoDLopes (Author) commented Jul 21, 2017

@gfyoung Output with include_lowest=False:

0       NaN
1    (0, 5]
2    (0, 5]
3    (5, 8]
4       NaN
5       NaN
6       NaN
dtype: category
Categories (2, interval[int64]): [(0, 5] < (5, 8]]
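The two outputs can also be compared by bin membership instead of by label. The sketch below assumes the same toy data as above; the names `with_lowest`, `without_lowest` and `differs` are illustrative. It uses the integer category codes, where -1 marks a value that fell into no bin.

```python
import numpy as np
import pandas as pd

series = pd.Series([0, 1, 5, 6, 10, None, np.nan])
bins = [0, 5, 8]

with_lowest = pd.cut(series, bins=bins, include_lowest=True)
without_lowest = pd.cut(series, bins=bins, include_lowest=False)

# Category codes: position of the bin for each row, -1 for missing or out of range
codes_with = with_lowest.cat.codes
codes_without = without_lowest.cat.codes

# Rows whose bin assignment changes between the two calls
differs = codes_with != codes_without
print(series[differs])
```

With this data, only the row holding the value 0 changes bins, which suggests the assignment itself is as expected and the difference lies in how the first interval is labelled.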

gfyoung (Member) commented Jul 21, 2017

Okay, so at least these results make sense. If you would like to investigate what happened with #16466 and see what we need to change to patch the regression, go for it!

jreback (Contributor) commented Jul 21, 2017

Duplicate of #16276.

Please read the discussion there and comment if you would like. This was a purposeful change.

jreback closed this as completed Jul 21, 2017
jreback added the Duplicate Report label Jul 21, 2017