Groupby Array-Type Quantiles Broken in 0.25.0 #27526

Closed
sernst opened this issue Jul 22, 2019 · 11 comments · Fixed by #27827
Labels: Numeric Operations (Arithmetic, Comparison, and Logical operations); Regression (Functionality that used to work in a prior pandas version)
Milestone: 0.25.1

Comments

@sernst

sernst commented Jul 22, 2019

Code Sample

import pandas as pd

df = pd.DataFrame({
    'category': ['A', 'A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'B'],
    'value': [1, 2, 3, 4, 5, 6, 1, 2, 3, 4, 5, 6]
})
quantiles = df.groupby('category').quantile([0.25, 0.5, 0.75])
print(quantiles)

Problem description

In pandas versions before 0.25.0, and according to the documentation, it is possible to pass an array-like of quantiles to the DataFrameGroupBy.quantile() method to return multiple quantile values in a single call. After installing 0.25.0, however, the same call raises the following error:

Traceback (most recent call last):
  File "example.py", line 8, in <module>
    quantiles = df.groupby('category').quantile([0.25, 0.5, 0.75])
  File "/usr/local/lib/python3.7/site-packages/pandas/core/groupby/groupby.py", line 1908, in quantile
    interpolation=interpolation,
  File "/usr/local/lib/python3.7/site-packages/pandas/core/groupby/groupby.py", line 2248, in _get_cythonized_result
    func(**kwargs)  # Call func to modify indexer values in place
  File "pandas/_libs/groupby.pyx", line 69

Expected Output

Using Pandas 0.24.2 the output is:

               value
category
A        0.25   2.25
         0.50   3.50
         0.75   4.75
B        0.25   2.25
         0.50   3.50
         0.75   4.75
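
For anyone who has to stay on 0.25.0 for now, one possible workaround is to compute the quantiles per group with apply, which goes through Series.quantile rather than the groupby quantile path. This is only a sketch, not the fix that was eventually merged; the names qs and workaround are introduced here for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'category': ['A'] * 6 + ['B'] * 6,
    'value': [1, 2, 3, 4, 5, 6] * 2,
})
qs = [0.25, 0.5, 0.75]  # illustrative name, not from the report

# Series.quantile accepts a list of quantiles, so apply it to each
# group's 'value' column. The result is a Series whose index has two
# levels: the group key and the quantile.
workaround = df.groupby('category')['value'].apply(lambda s: s.quantile(qs))

# to_frame() turns it into a one-column DataFrame, similar in shape
# to the 0.24.2 output above.
print(workaround.to_frame())
```

This is likely slower than the cythonized groupby path on large frames, since it calls Python code once per group.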

Output of pd.show_versions()

INSTALLED VERSIONS

commit : None
python : 3.7.4.final.0
python-bits : 64
OS : Linux
OS-release : 4.9.125-linuxkit
machine : x86_64
processor :
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 0.25.0
numpy : 1.16.4
pytz : 2019.1
dateutil : 2.8.0
pip : 19.1.1
setuptools : 41.0.1
Cython : None
pytest : 5.0.1
hypothesis : None
sphinx : 2.1.2
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.10.1
IPython : None
pandas_datareader: None
bs4 : 4.8.0
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : None
matplotlib : 3.1.1
numexpr : 2.6.9
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pytables : None
s3fs : 0.3.0
scipy : 1.3.0
sqlalchemy : None
tables : 3.5.2
xarray : None
xlrd : None
xlwt : None
xlsxwriter : None

@jreback added the Numeric Operations and Regression labels Jul 23, 2019
@jreback added this to the 0.25.1 milestone Jul 23, 2019
@ar4hc

ar4hc commented Aug 7, 2019

I got this error message when using a numpy array (from np.linspace()):

TypeError: only size-1 arrays can be converted to Python scalars

Downgrading to pandas 0.24 solves this.

My test code (snippet):

    percs = (np.linspace(0, 1, num=intervals + 1).round(decimals=3))
    d = df[['x', 'y']]
    g = d.groupby('x')
    quants = g.quantile(percs)

It breaks on the last line with 0.25 and works in 0.24.
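
A self-contained version of the snippet above might look like the following. The data and the value of intervals are made up for illustration, since the original comment does not include them. It should run on versions where a list-like q is supported (0.24.x, or a release that contains the fix).

```python
import numpy as np
import pandas as pd

intervals = 4  # assumed value; not given in the original comment
percs = np.linspace(0, 1, num=intervals + 1).round(decimals=3)

# Made-up data with the same column names as the snippet.
df = pd.DataFrame({
    'x': ['a', 'a', 'a', 'b', 'b', 'b'],
    'y': [1.0, 2.0, 3.0, 10.0, 20.0, 30.0],
})

d = df[['x', 'y']]
g = d.groupby('x')
quants = g.quantile(percs)  # the call that raised TypeError on 0.25.0
print(quants)
```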

@jreback
Contributor

jreback commented Aug 7, 2019

There is a PR, #27473, which solves this and just needs some touching up.

@ghost

ghost commented Aug 7, 2019

That PR was about #20405 not validating inputs. This issue is about #20405 deleting functionality, so they are different bugs.

@TomAugspurger
Contributor

Is the fix to change

    return self._get_cythonized_result(
        "group_quantile",
        self.grouper,
        aggregate=True,
        needs_values=True,
        needs_mask=True,
        cython_dtype=np.float64,
        pre_processing=pre_processor,
        post_processing=post_processor,
        q=q,
        interpolation=interpolation,
    )

to be called once per value in q, when a list of quantiles is provided? Then concat the results together with concat(results, axis=1, keys=q)?
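
The idea can be sketched from the user side, without touching pandas internals: call quantile once per scalar q and concatenate the pieces with the quantiles as keys. This is an illustration of the proposal only, not the code that was merged; the names qs, gb, results and wide are introduced here.

```python
import pandas as pd

df = pd.DataFrame({
    'category': ['A'] * 6 + ['B'] * 6,
    'value': [1, 2, 3, 4, 5, 6] * 2,
})
qs = [0.25, 0.5, 0.75]
gb = df.groupby('category')

# One scalar-q call per quantile; each result is indexed by group key.
results = [gb.quantile(q) for q in qs]

# Concatenating along the columns with keys=qs gives a MultiIndex in
# the columns: (quantile, original column name).
wide = pd.concat(results, axis=1, keys=qs)
print(wide)
```

With axis=1 the group keys stay unique in the index and the quantiles end up in the columns, which differs from the 0.24.2 layout shown in the report.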

@TomAugspurger
Contributor

The output of DataFrameGroupBy.quantile is a DataFrame whose

  • index is the group keys
  • columns are the (numeric) columns

In [68]: df = pd.DataFrame({"A": [0, 1, 2, 3, 4]})

In [69]: df.groupby([0, 0, 1, 1, 1]).quantile(0.25)
Out[69]:
      A
0  0.25
1  2.50

What's the expected output of .quantile(List[float])?

It's not the most useful, but I think the best option is a MultiIndex in the columns.

In [70]: a = df.iloc[:2].quantile([0.25]).unstack()

In [71]: b = df.iloc[2:].quantile([0.25]).unstack()

In [72]: pd.concat([a, b], keys=[0, 1]).unstack([1, 2])
Out[72]:
      A
   0.25
0  0.25
1  2.50

The other option is to have the qs in the index, but that breaks my mental model that the index should be the unique group keys.
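
The two layouts discussed here carry the same information, and unstack converts one into the other. A small sketch, assuming a pandas version where a list-like q is accepted (0.24.x or a release with the fix); the names by_index and by_columns are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"A": [0, 1, 2, 3, 4]})

# Quantiles in the index: rows are labelled by (group key, q).
by_index = df.groupby([0, 0, 1, 1, 1]).quantile([0.25, 0.75])

# Quantiles in the columns: unstack() moves the innermost index level
# (q) into the columns, so the index holds only the unique group keys.
by_columns = by_index.unstack()

print(by_index)
print(by_columns)
```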

@TomAugspurger
Contributor

Oh, whoops, I missed the 0.24 output. We'll match that.

TomAugspurger added a commit to TomAugspurger/pandas that referenced this issue Aug 8, 2019
@dragoljub

Thanks for the fix. I just ran into this!

TomAugspurger added a commit that referenced this issue Aug 22, 2019
* BUG: Fixed groupby quantile for listlike q

Closes #27526
@ar4hc

ar4hc commented Aug 23, 2019

Not sure if this is the right place, but with 0.25.1 and my code from above I now get a different error, but still an error:

    quants = g.quantile(percs)
  File "/usr/local/miniconda3/envs/scipy37/lib/python3.7/site-packages/pandas/core/groupby/groupby.py", line 1953, in quantile
    return result.take(indices)
  File "/usr/local/miniconda3/envs/scipy37/lib/python3.7/site-packages/pandas/core/generic.py", line 3604, in take
    indices, axis=self._get_block_manager_axis(axis), verify=True
  File "/usr/local/miniconda3/envs/scipy37/lib/python3.7/site-packages/pandas/core/internals/managers.py", line 1389, in take
    indexer = maybe_convert_indices(indexer, n)
  File "/usr/local/miniconda3/envs/scipy37/lib/python3.7/site-packages/pandas/core/indexers.py", line 201, in maybe_convert_indices
    raise IndexError("indices are out-of-bounds")
IndexError: indices are out-of-bounds

The DataFrame groupby object looks good, the percs are a list of floats, including 0 and 1, and the data is the same as it was for the 0.24.2 version.

What am I missing?

@TomAugspurger
Contributor

TomAugspurger commented Aug 23, 2019 via email

@TomAugspurger
Contributor

TomAugspurger commented Aug 23, 2019 via email

galuhsahid pushed a commit to galuhsahid/pandas that referenced this issue Aug 25, 2019
* BUG: Fixed groupby quantile for listlike q

Closes pandas-dev#27526
@ar4hc

ar4hc commented Aug 30, 2019

Sorry, I just reverted back to 0.24 and went to fix the other issues I have... :-\
and didn't watch here.
