ENH: Add option to make final interval closed for right-open intervals in pd.cut #42212

Closed
jotasi opened this issue Jun 24, 2021 · 4 comments
Labels: Closing Candidate, cut, Enhancement

Comments

@jotasi
Contributor

jotasi commented Jun 24, 2021

Is your feature request related to a problem?

I would like to use pd.cut to sort values into half-open bins whose lower boundary is the closed end (i.e. [0, 5), by setting right=False), while still including the upper bound of the last interval. In other words, the last interval should be closed on both ends, via something like include_highest=True, analogous to include_lowest=True for right=True. I also ran into this with infinite boundaries, where adding a small number to the last edge is not an option (although, as a workaround, one can fillna the result, since the only remaining NaNs are those at the infinite right boundary).

I.e. while I can make the first interval closed for pd.cut:

In [1]: import pandas as pd

In [2]: pd.cut(pd.Series([0, 1, 2, 3]), bins=[0, 1, 2, 3], include_lowest=True, retbins=False)
Out[2]:
0    (-0.001, 1.0]
1    (-0.001, 1.0]
2       (1.0, 2.0]
3       (2.0, 3.0]
dtype: category
Categories (3, interval[float64, right]): [(-0.001, 1.0] < (1.0, 2.0] < (2.0, 3.0]]

I can't do the same for right=False, where include_lowest=True seems to have no effect:

In [3]: pd.cut(pd.Series([0, 1, 2, 3]), bins=[0, 1, 2, 3], right=False, include_lowest=True, retbins=False)
Out[3]:
0    [0.0, 1.0)
1    [1.0, 2.0)
2    [2.0, 3.0)
3           NaN
dtype: category
Categories (3, interval[int64, left]): [[0, 1) < [1, 2) < [2, 3)]

(I would want the last value, 3, to fall into the final bin, i.e. that bin would effectively be [2, 3].)
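The fillna workaround mentioned above for an infinite right boundary could look roughly like the following. This is a minimal sketch with made-up sample data; it assumes the input contains no genuine missing values, since those would be filled into the last bin as well.

```python
import numpy as np
import pandas as pd

# Illustrative data: the last value sits exactly on the infinite upper edge.
values = pd.Series([0.0, 1.5, np.inf])
bins = [0, 1, 2, np.inf]

# With right=False the last bin is [2, inf), so inf itself is not binned.
binned = pd.cut(values, bins=bins, right=False)

# Workaround: put every remaining NaN into the last bin.
# Caveat: NaNs that were already present in `values` would be filled too.
filled = binned.fillna(binned.cat.categories[-1])
print(filled)
```

Note that the category label still reads as right-open; only the assignment of values changes.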

Describe the solution you'd like

One option is to rename include_lowest to final_interval_closed or similar, so that it acts like include_lowest for right=True and like include_highest for right=False (this would break the API, see below). The function would then behave roughly symmetrically for right=True and right=False. Alternatively, such a parameter could be added alongside include_lowest, although as far as I can see that would make include_lowest more or less obsolete. A third option, to make the API more symmetric, is to add a separate parameter include_highest, which does nothing for right=True but closes the last interval on both ends for right=False.
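Until such a parameter exists, the requested behavior can be approximated in user code. The helper below is only a sketch: the name cut_include_highest is hypothetical and not part of pandas, and it emulates the proposal by post-processing the result of pd.cut rather than by changing how the intervals are built.

```python
import pandas as pd


def cut_include_highest(values: pd.Series, bins) -> pd.Series:
    """Hypothetical helper (not a pandas API): right-open bins,
    but values equal to the last edge go into the last bin."""
    binned = pd.cut(values, bins=bins, right=False)
    last_bin = binned.cat.categories[-1]
    # Only touch values that sit exactly on the upper edge, so values
    # beyond the last edge and genuine missing values stay NaN.
    at_upper_edge = values == bins[-1]
    binned.loc[at_upper_edge] = last_bin
    return binned


values = pd.Series([0, 1, 2, 3, 4])
result = cut_include_highest(values, bins=[0, 1, 2, 3])
print(result)
```

Unlike a plain fillna, this leaves out-of-range values (4 in the example) unbinned. The label of the last category is still displayed as [2, 3), because a single categorical cannot mix intervals with different closed sides.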

API breaking implications

Renaming the parameter include_lowest to final_interval_closed or similar would break the API. The alternative solutions (adding either final_interval_closed or include_highest) would add a parameter to pd.cut; if the former were added, include_lowest could potentially be deprecated down the line.

Describe alternatives you've considered

See the three alternatives under "Describe the solution you'd like".

@jotasi added the Enhancement and Needs Triage labels on Jun 24, 2021
@simonjayhawkins
Member

Thanks @jotasi for the report. This looks similar to the discussion in #23164?

@simonjayhawkins added the Closing Candidate and cut labels and removed the Needs Triage label on Jun 25, 2021
@jotasi
Contributor Author

jotasi commented Jun 26, 2021

I think it is somewhat related, but that discussion seems to be about an issue for the case of right=True (the default). I'm arguing for an option similar to include_lowest for right=False, which (AFAICS) doesn't exist. So basically, for left-open intervals, you can (with the caveats discussed in #23164) make the outermost open end closed(-ish) by specifying include_lowest=True, and I propose extending this to allow the same for right-open intervals.

Nonetheless, depending on the solution to #23164 (changing the docs vs. extending IntervalIndex to actually support a single closed interval), it might be a good idea to fix both together.

@attack68
Contributor

Linking #40245 for relevance.

@mroeschke
Member

I think we can loop in the right=True|False behavior of include_lowest in the same issue, #23164, so closing in favor of continuing the discussion there.
