BUG: cut() precision at the left end does not appear as specified (3 digits by default) #33912

Open
2 of 3 tasks
nnworkspace opened this issue May 1, 2020 · 10 comments
Labels
Bug, cut (cut, qcut)

Comments

@nnworkspace

nnworkspace commented May 1, 2020

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • (optional) I have confirmed this bug exists on the master branch of pandas.


Note: Please read this guide detailing how to provide the necessary information for us to reproduce your bug.

Code Sample, a copy-pastable example

...

bins = 10
df_bins = pd.cut(fatalities_asc.fatality_rate, bins, include_lowest=True)
fatalities_asc['fatality_bin'] = df_bins.values
df_counts = fatalities_asc.groupby('fatality_bin', as_index=False).count()
df_counts

You can find the complete code here:
https://github.com/nnworkspace/covid19-insight/blob/master/covid19-insight.ipynb

Problem description

I expected every bin bound to be displayed as a float rounded to 3 digits of precision. Instead, the code above produces the following output (note the left bound of the first interval):

   fatality_bin                   Confirmed  Deaths  Recovered  Active  fatality_rate
0  (0.057999999999999996, 1.632]         39      39         39      39             39
1  (1.632, 3.19]                         33      33         33      33             33
2  (3.19, 4.748]                         27      27         27      27             27
3  (4.748, 6.305]                        18      18         18      18             18
4  (6.305, 7.863]                        12      12         12      12             12
5  (7.863, 9.421]                         2       2          2       2              2
6  (9.421, 10.978]                        3       3          3       3              3
7  (10.978, 12.536]                       6       6          6       6              6
8  (12.536, 14.094]                       2       2          2       2              2
9  (14.094, 15.652]                       3       3          3       3              3

When the lower and upper bounds are used as labels in a plot, it looks like this (note the first x-tick label):

image

Expected Output

The first interval should be displayed as (0.058, 1.632]; its left bound should not be shown with its full floating-point expansion (0.057999999999999996).

Output of pd.show_versions()

pandas version: 1.0.3

@nnworkspace added the Bug and Needs Triage labels May 1, 2020
@nnworkspace
Author

I've just committed a workaround for this problem. However, it involves setting up the labels manually, which is not ideal. I hope this can be fixed in the cut() method itself. Bin labels after my workaround:

image
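The workaround itself is not shown in this thread, so the following is only a minimal sketch of what a manual-label approach could look like. It assumes cut() is called once with retbins=True to obtain the numeric edges, then again with hand-formatted string labels. The sample values and the names `values`, `edges`, `labels`, and `binned` are illustrative, not taken from the notebook.

```python
import pandas as pd

# Placeholder data for illustration only.
values = pd.Series([2.58, 4.79, 5.50, 6.75, 2.65, 6.60, 11.25, 3.78, 4.90, 5.21])

# retbins=True also returns the numeric bin edges that cut() computed.
_, edges = pd.cut(values, bins=6, include_lowest=True, retbins=True)

# Build one string label per bin, formatting every edge to 3 decimals.
labels = [f"({lo:.3f}, {hi:.3f}]" for lo, hi in zip(edges[:-1], edges[1:])]

# Re-run cut() on the same edges, this time supplying the labels explicitly.
binned = pd.cut(values, bins=edges, labels=labels, include_lowest=True)
print(binned.value_counts(sort=False))
```

The trade-off is that the categories become plain strings rather than Interval objects, which is one reason a fix inside cut() would be preferable.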

@mroeschke
Member

Could you post a minimal reproducible example? Ideally one that does not depend on graphics: https://matthewrocklin.com/blog/work/2018/02/28/minimal-bug-reports

@mroeschke added the Needs Info label and removed the Needs Triage label May 2, 2020
@nnworkspace
Author

OK, I will do it, but I need some time, since I have some deadlines to meet at the moment.

@MarcoGorelli
Member

closing as I can't reproduce this - please do let us know if you can provide a reproducible example though and I'll reopen

>>> pd.cut(pd.Series(np.random.randn(100)), bins=10, include_lowest=True)
0     (-0.0938, 0.392]
1      (-1.066, -0.58]
2       (1.851, 2.337]
3      (-1.066, -0.58]
4     (-0.58, -0.0938]
            ...       
95    (-1.553, -1.066]
96    (-0.58, -0.0938]
97    (-0.0938, 0.392]
98    (-0.58, -0.0938]
99      (0.879, 1.365]
Length: 100, dtype: category
Categories (10, interval[float64]): [(-2.045, -1.553] < (-1.553, -1.066] < (-1.066, -0.58] <
                                     (-0.58, -0.0938] ... (0.879, 1.365] < (1.365, 1.851] <
                                     (1.851, 2.337] < (2.337, 2.824]]

@zareami10

@MarcoGorelli I can reproduce this too, one of the many issues with cut I faced in only a single day.

Code to reproduce:

import numpy as np
import pandas as pd

mydata = pd.Series([2.58, 4.79, 5.50, 6.75, 2.65, 6.60, 11.25, 3.78, 4.90, 5.21])
histogram = np.histogram_bin_edges(mydata, bins="sturges", range=(1.195, 12.875))
output = pd.cut(mydata, bins=histogram, right=True, include_lowest=True, precision=3)
print(output.value_counts(sort=False))

Output:
(1.1940000000000002, 3.142]    2
(3.142, 5.088]                 3
(5.088, 7.035]                 4
(7.035, 8.982]                 0
(8.982, 10.928]                0
(10.928, 12.875]               1
dtype: int64

Two important points are right=True and include_lowest=True. Not to mention that include_lowest doesn't actually change '(' to '[' but merely decreases the lower bound, but that's another story, I guess.
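The include_lowest behaviour described here can be checked with a small sketch. The integer data and variable names below are made up for illustration; the point is only that the first category stays right-closed while its left edge moves slightly below the first bin edge.

```python
import pandas as pd

values = pd.Series([1, 2, 3])
edges = [1, 2, 3]

# With include_lowest=True the value 1 is binned, but the first interval
# is still written with '(' and a left edge just below 1.
with_lowest = pd.cut(values, bins=edges, right=True, include_lowest=True)
first = with_lowest.cat.categories[0]
print(first, first.closed)

# Without include_lowest the value 1 falls outside every bin and becomes NaN.
without_lowest = pd.cut(values, bins=edges, right=True, include_lowest=False)
print(without_lowest.isna().tolist())
```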

@MarcoGorelli MarcoGorelli reopened this Nov 25, 2020
@MarcoGorelli
Member

Thanks @zareami10 , that reproduces the issue - much appreciated!

Not to mention that include_lowest doesn't actually change '(' to '[' but merely decreases the lower bound, but that's another story, I guess.

This sounds like a separate issue. Could you open a new one for it, please? A reproducible example (like the one you posted here) would help expedite resolution.

@MarcoGorelli added the cut label and removed the Needs Info label Nov 25, 2020
@zareami10

zareami10 commented Nov 25, 2020

This sounds like a separate issue. Could you open a new one for it, please? A reproducible example (like the one you posted here) would help expedite resolution.

I believe there is a (2-year-old) report for that; see issue #23164. Though I'm not sure how that could be improved without changing IntervalIndex, as it only accepts intervals that are all closed on the same side (I'm quite new to pandas though, so I might be missing something).

Even in the code I provided I would expect an output of [1.195, 3.142] rather than (1.194, 3.142], even if we ignore the current issue.
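As a rough illustration of the constraint mentioned above, an IntervalIndex carries a single `closed` setting for all of its intervals. The sketch below reuses the first few edges from the reproducer; the variable names are illustrative.

```python
import pandas as pd

edges = [1.195, 3.142, 5.088]

# One 'closed' value applies to every interval in the index.
right_closed = pd.IntervalIndex.from_breaks(edges, closed="right")
print(right_closed.closed)
print(1.195 in right_closed[0])  # left edge is excluded when right-closed

# Closing both sides includes the left edge, but then every interval
# is closed on both sides, so neighbouring intervals share their edges.
both_closed = pd.IntervalIndex.from_breaks(edges, closed="both")
print(1.195 in both_closed[0])
print(3.142 in both_closed[0], 3.142 in both_closed[1])
```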

(Sorry for discussing it here, just thought I would explain a bit as I'm pointing to the issue)

@simonjayhawkins
Member

Two important points are right=True and include_lowest=True.

The bug appears to be in the rounding and adjust logic in _format_labels in pandas/core/reshape/tile.py

breaks = [formatter(b) for b in bins]
if right and include_lowest:
    # adjust lhs of first interval by precision to account for being right closed
    breaks[0] = adjust(breaks[0])

@simonjayhawkins simonjayhawkins added this to the Contributions Welcome milestone May 29, 2022
@carbonleakage
Contributor

carbonleakage commented Jun 22, 2022

After a bit of analysis, I think the fix may be simple. The formatter function is used to format the break values on line 582:

breaks = [formatter(b) for b in bins]

However, on line 585 the formatter is not applied again after the starting value is adjusted. So the fix would be to apply formatter on line 585 as well, changing it from

breaks[0] = adjust(breaks[0])

to

breaks[0] = formatter(adjust(breaks[0]))
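To see why re-applying the formatter could help, here is a standalone sketch in plain Python. The `formatter` and `adjust` functions are simplified stand-ins written for this example (assuming rounding to `precision` decimals and a shift of one unit in the last displayed digit); they are not the actual pandas internals.

```python
precision = 3

def formatter(x: float) -> float:
    """Simplified stand-in: round to the requested number of decimals."""
    return round(x, precision)

def adjust(x: float) -> float:
    """Simplified stand-in: shift down by one unit of the last shown digit."""
    return x - 10 ** (-precision)

bins = [1.195, 3.142, 5.088]
breaks = [formatter(b) for b in bins]

# Current behaviour: the subtraction can reintroduce floating-point noise
# (the thread reports 1.1940000000000002 for this edge).
current = adjust(breaks[0])
print(repr(current))

# Proposed behaviour: round again after adjusting.
proposed = formatter(adjust(breaks[0]))
print(repr(proposed))
```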

@simonjayhawkins What do you think, shall I submit a PR?

@carbonleakage
Contributor

take

6 participants