Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Imprecise intervals/labels by pd.cut #16276

Closed
has2k1 opened this issue May 7, 2017 · 6 comments
Closed

Imprecise intervals/labels by pd.cut #16276

has2k1 opened this issue May 7, 2017 · 6 comments
Labels
Interval Interval data type Usage Question

Comments

@has2k1
Copy link
Contributor

has2k1 commented May 7, 2017

Code Sample

import pandas as pd

x = [0, 1, 2, 3, 4]
res = pd.cut(x, (0, 2, 4), include_lowest=True)
print(res, '\n')

for val in res:
    print(repr(val))

Output

[(-0.001, 2.0], (-0.001, 2.0], (-0.001, 2.0], (2.0, 4.0], (2.0, 4.0]]
Categories (2, interval[float64]): [(-0.001, 2.0] < (2.0, 4.0]] 

Interval(-0.001, 2.0, closed='right')
Interval(-0.001, 2.0, closed='right')
Interval(-0.001, 2.0, closed='right')
Interval(2.0, 4.0, closed='right')
Interval(2.0, 4.0, closed='right')

Problem description

The correct cutting behaviour but the labels are awkward (given the input), they leak the internal adjustments used to do the cutting. The expected output has precise intervals/labels. This is a regression from the previous version 0.19.2.

Expected Output

[[0, 2], [0, 2], [0, 2], (2, 4], (2, 4]]
Categories (2, interval[int64]): [[0, 2] < (2, 4]]

Interval(0, 2, closed='both')
Interval(0, 2, closed='both')
Interval(0, 2, closed='both')
Interval(2, 4, closed='right')
Interval(2, 4, closed='right')

Output of pd.show_versions()

INSTALLED VERSIONS ------------------

commit: None
python: 3.6.1.final.0
python-bits: 64
OS: Linux
OS-release: 4.9.6-gentoo-r1
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: en_US.utf8
LANG: en_US.utf8
LOCALE: en_US.UTF-8

pandas: 0.20.1
pytest: 3.0.7
pip: 9.0.1
setuptools: 35.0.2
Cython: 0.25.2
numpy: 1.12.1
scipy: 0.19.0
xarray: None
IPython: 6.0.0
sphinx: 1.5.5
patsy: 0.4.1
dateutil: 2.6.0
pytz: 2017.2
blosc: None
bottleneck: None
tables: None
numexpr: 2.6.1
feather: None
matplotlib: 2.0.1
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: 4.5.3
html5lib: 0.999999999
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
pandas_gbq: None
pandas_datareader: None

@has2k1
Copy link
Contributor Author

has2k1 commented May 7, 2017

I have seen PR #15309, looked at the code for pd.cut and IntervalIndex, it is quiet thorny and it looks like the change in API for pd.cut is a by product of the difficult implementation. I have also discovered this:

pd.Categorical([pd.Interval(0, 2, 'both'), pd.Interval(2, 4, 'right')])
# ValueError: intervals must all be closed on the same side

So, the expected input posted above does not indicate an unknown issue. Also, the meaning of the right parameter has changed from this:

Indicates whether the bins include the *rightmost* edge or not.

to this:

Indicates whether the bins include the *right* edge or not.

So it seems like to get back the old (and still documentated) meaning, the vectors of the IntervalIndex type must be allowed to have intervals that are not all closed on the same side. This change need not be chaotic, the restriction could be narrowed to "internal invervals must all be closed on the same side".

Would this change come with any disastrous consequences?

@jreback
Copy link
Contributor

jreback commented May 7, 2017

In the prior implementation it was allowed to have mixed intervals, IOW, ones that didn't conform in a uniform way. This is not allowed directly from .cut, though you could construct it manually I suppose. It is completely non-performant.

In [5]: pd.cut(range(5), (0, 2, 4), include_lowest=True)
Out[5]: 
[(-0.001, 2.0], (-0.001, 2.0], (-0.001, 2.0], (2.0, 4.0], (2.0, 4.0]]
Categories (2, interval[float64]): [(-0.001, 2.0] < (2.0, 4.0]]

In [6]: pd.cut(range(5), (0, 2, 4))
Out[6]: 
[NaN, (0, 2], (0, 2], (2, 4], (2, 4]]
Categories (2, interval[int64]): [(0, 2] < (2, 4]]

This is not a regression but eliminates a non-tested / non-documented case, which IMHO is pretty useless.

It is theoretically possible to support it in an index, but you would need a clear cut usecase.

@jreback jreback added Interval Interval data type Usage Question labels May 7, 2017
@jreback
Copy link
Contributor

jreback commented May 7, 2017

If you really really needed to replicate this, you can do this.

In [4]: pd.cut(range(5), (0, 2, 4), include_lowest=True, labels=['[0, 2]', '(2, 4)'])
Out[4]: 
[[0, 2], [0, 2], [0, 2], (2, 4), (2, 4)]
Categories (2, object): [(2, 4) < [0, 2]]

This is supported as much as scalar labels are, meaning you can directly index them, but this is NOT an IntervalIndex.

@jreback jreback added this to the No action milestone May 7, 2017
@has2k1
Copy link
Contributor Author

has2k1 commented May 8, 2017

Here is a minimal usecase

import pandas as pd
from plotnine import *

df = pd.DataFrame({'x': range(5),
                   'y': range(5),
                   })

p = (ggplot(df, aes('x', 'y'))
     + geom_point()
     + facet_wrap('pd.cut(x, 2, include_lowest=True)')
     + lims(x=(0, 4))
   )

print(p)

pdcut-facetwrap-pandas-0 20 0

The labels to the panels are not precise, so you would have to reexamine the data limits to be certain of what data is at the edges. Second, integers that can be otherwise cut along integer lines have floating-point artefacts. These two issues add up when the code that generates the plots is a script not amendable to changes specific to the data currently being plotted.

Another usecase -- planned for the plotting system -- is a tree faceting feature. In this case without automatic precise intervals/labels I envision a more gnarly tree generating algorithm.

@shoyer
Copy link
Member

shoyer commented May 8, 2017

I have seen PR #15309, looked at the code for pd.cut and IntervalIndex, it is quiet thorny and it looks like the change in API for pd.cut is a by product of the difficult implementation.

Yes, I think this summarizes things correctly.

I agree that exposing a floating point lower bound is non-ideal. This was the simplest thing to do for IntervalIndex, but we should have discussed it more before making this change in the 0.20 release.

I think IntervalIndex could indeed be adjusted to accommodate an external interval that is closed on both sides -- and this might be the nicest solution, especially if the more complex indexing behavior is not really needed -- but it would indeed take some work to make this happen.

@gfyoung
Copy link
Member

gfyoung commented Jul 22, 2017

Closing since this was just a usage question.

@gfyoung gfyoung closed this as completed Jul 22, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Interval Interval data type Usage Question
Projects
None yet
Development

No branches or pull requests

4 participants