Imprecise intervals/labels by pd.cut #16276

has2k1 · 2017-05-07T07:05:27Z

Code Sample

import pandas as pd

x = [0, 1, 2, 3, 4]
res = pd.cut(x, (0, 2, 4), include_lowest=True)
print(res, '\n')

for val in res:
    print(repr(val))

Output

[(-0.001, 2.0], (-0.001, 2.0], (-0.001, 2.0], (2.0, 4.0], (2.0, 4.0]]
Categories (2, interval[float64]): [(-0.001, 2.0] < (2.0, 4.0]] 

Interval(-0.001, 2.0, closed='right')
Interval(-0.001, 2.0, closed='right')
Interval(-0.001, 2.0, closed='right')
Interval(2.0, 4.0, closed='right')
Interval(2.0, 4.0, closed='right')

Problem description

The correct cutting behaviour but the labels are awkward (given the input), they leak the internal adjustments used to do the cutting. The expected output has precise intervals/labels. This is a regression from the previous version 0.19.2.

Expected Output

[[0, 2], [0, 2], [0, 2], (2, 4], (2, 4]]
Categories (2, interval[int64]): [[0, 2] < (2, 4]]

Interval(0, 2, closed='both')
Interval(0, 2, closed='both')
Interval(0, 2, closed='both')
Interval(2, 4, closed='right')
Interval(2, 4, closed='right')

Output of `pd.show_versions()`

INSTALLED VERSIONS ------------------

commit: None
python: 3.6.1.final.0
python-bits: 64
OS: Linux
OS-release: 4.9.6-gentoo-r1
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: en_US.utf8
LANG: en_US.utf8
LOCALE: en_US.UTF-8

pandas: 0.20.1
pytest: 3.0.7
pip: 9.0.1
setuptools: 35.0.2
Cython: 0.25.2
numpy: 1.12.1
scipy: 0.19.0
xarray: None
IPython: 6.0.0
sphinx: 1.5.5
patsy: 0.4.1
dateutil: 2.6.0
pytz: 2017.2
blosc: None
bottleneck: None
tables: None
numexpr: 2.6.1
feather: None
matplotlib: 2.0.1
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: 4.5.3
html5lib: 0.999999999
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
pandas_gbq: None
pandas_datareader: None

The text was updated successfully, but these errors were encountered:

has2k1 · 2017-05-07T12:23:48Z

I have seen PR #15309, looked at the code for pd.cut and IntervalIndex, it is quiet thorny and it looks like the change in API for pd.cut is a by product of the difficult implementation. I have also discovered this:

pd.Categorical([pd.Interval(0, 2, 'both'), pd.Interval(2, 4, 'right')])
# ValueError: intervals must all be closed on the same side

So, the expected input posted above does not indicate an unknown issue. Also, the meaning of the right parameter has changed from this:

Indicates whether the bins include the *rightmost* edge or not.

to this:

Indicates whether the bins include the *right* edge or not.

So it seems like to get back the old (and still documentated) meaning, the vectors of the IntervalIndex type must be allowed to have intervals that are not all closed on the same side. This change need not be chaotic, the restriction could be narrowed to "internal invervals must all be closed on the same side".

Would this change come with any disastrous consequences?

jreback · 2017-05-07T13:20:32Z

In the prior implementation it was allowed to have mixed intervals, IOW, ones that didn't conform in a uniform way. This is not allowed directly from .cut, though you could construct it manually I suppose. It is completely non-performant.

In [5]: pd.cut(range(5), (0, 2, 4), include_lowest=True)
Out[5]: 
[(-0.001, 2.0], (-0.001, 2.0], (-0.001, 2.0], (2.0, 4.0], (2.0, 4.0]]
Categories (2, interval[float64]): [(-0.001, 2.0] < (2.0, 4.0]]

In [6]: pd.cut(range(5), (0, 2, 4))
Out[6]: 
[NaN, (0, 2], (0, 2], (2, 4], (2, 4]]
Categories (2, interval[int64]): [(0, 2] < (2, 4]]

This is not a regression but eliminates a non-tested / non-documented case, which IMHO is pretty useless.

It is theoretically possible to support it in an index, but you would need a clear cut usecase.

jreback · 2017-05-07T13:29:07Z

If you really really needed to replicate this, you can do this.

In [4]: pd.cut(range(5), (0, 2, 4), include_lowest=True, labels=['[0, 2]', '(2, 4)'])
Out[4]: 
[[0, 2], [0, 2], [0, 2], (2, 4), (2, 4)]
Categories (2, object): [(2, 4) < [0, 2]]

This is supported as much as scalar labels are, meaning you can directly index them, but this is NOT an IntervalIndex.

has2k1 · 2017-05-08T04:47:59Z

Here is a minimal usecase

import pandas as pd
from plotnine import *

df = pd.DataFrame({'x': range(5),
                   'y': range(5),
                   })

p = (ggplot(df, aes('x', 'y'))
     + geom_point()
     + facet_wrap('pd.cut(x, 2, include_lowest=True)')
     + lims(x=(0, 4))
   )

print(p)

The labels to the panels are not precise, so you would have to reexamine the data limits to be certain of what data is at the edges. Second, integers that can be otherwise cut along integer lines have floating-point artefacts. These two issues add up when the code that generates the plots is a script not amendable to changes specific to the data currently being plotted.

Another usecase -- planned for the plotting system -- is a tree faceting feature. In this case without automatic precise intervals/labels I envision a more gnarly tree generating algorithm.

shoyer · 2017-05-08T05:23:56Z

I have seen PR #15309, looked at the code for pd.cut and IntervalIndex, it is quiet thorny and it looks like the change in API for pd.cut is a by product of the difficult implementation.

Yes, I think this summarizes things correctly.

I agree that exposing a floating point lower bound is non-ideal. This was the simplest thing to do for IntervalIndex, but we should have discussed it more before making this change in the 0.20 release.

I think IntervalIndex could indeed be adjusted to accommodate an external interval that is closed on both sides -- and this might be the nicest solution, especially if the more complex indexing behavior is not really needed -- but it would indeed take some work to make this happen.

gfyoung · 2017-07-22T01:29:41Z

Closing since this was just a usage question.

jreback added Interval Interval data type Usage Question labels May 7, 2017

jreback added this to the No action milestone May 7, 2017

jreback mentioned this issue Jul 21, 2017

pandas cut() not working as expected in version 0.20.3 #17047

Closed

gfyoung closed this as completed Jul 22, 2017

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Imprecise intervals/labels by pd.cut #16276

Imprecise intervals/labels by pd.cut #16276

has2k1 commented May 7, 2017

has2k1 commented May 7, 2017

jreback commented May 7, 2017

jreback commented May 7, 2017

has2k1 commented May 8, 2017

shoyer commented May 8, 2017

gfyoung commented Jul 22, 2017

Imprecise intervals/labels by pd.cut #16276

Imprecise intervals/labels by pd.cut #16276

Comments

has2k1 commented May 7, 2017

Code Sample

Output

Problem description

Expected Output

Output of pd.show_versions()

has2k1 commented May 7, 2017

jreback commented May 7, 2017

jreback commented May 7, 2017

has2k1 commented May 8, 2017

shoyer commented May 8, 2017

gfyoung commented Jul 22, 2017

Output of `pd.show_versions()`