
PERF: nunique is slower than unique.apply(len) on a groupby #55972

Closed
2 of 3 tasks
arnaudlegout opened this issue Nov 15, 2023 · 3 comments · Fixed by #56061
Labels
Groupby Performance Memory or execution speed performance

Comments

arnaudlegout (Contributor)

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd
import numpy as np
unique_values = np.arange(30, dtype=np.uint32)
data = np.random.choice(unique_values, size=1_000_000)
s = pd.Series(data)

%timeit s.groupby(s).nunique()
%timeit s.groupby(s).unique().apply(len)

Issue Description

I expect nunique() to be at least as fast as unique.apply(len); however, in the reproducible example above it is roughly 3x slower.
On my computer I get the following timings:
%timeit s.groupby(s).nunique()
103 ms ± 942 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit s.groupby(s).unique().apply(len)
31.1 ms ± 603 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Expected Behavior

nunique() is at least as fast as unique.apply(len).

Installed Versions

INSTALLED VERSIONS

commit : 0f43794
python : 3.11.5.final.0
python-bits : 64
OS : Linux
OS-release : 6.5.11-300.fc39.x86_64
Version : #1 SMP PREEMPT_DYNAMIC Wed Nov 8 22:37:57 UTC 2023
machine : x86_64
processor :
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 2.0.3
numpy : 1.24.3
pytz : 2023.3.post1
dateutil : 2.8.2
setuptools : 68.0.0
pip : 23.3
Cython : 3.0.0
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.9.3
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.1.2
IPython : 8.15.0
pandas_datareader: 0.10.0
bs4 : 4.12.2
bottleneck : 1.3.5
brotli : 1.0.9
fastparquet : None
fsspec : 2023.9.2
gcsfs : None
matplotlib : 3.8.0
numba : 0.57.1
numexpr : 2.8.7
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 11.0.0
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : 1.11.3
snappy :
sqlalchemy : 2.0.21
tables : None
tabulate : 0.8.10
xarray : 2023.6.0
xlrd : None
zstandard : None
tzdata : 2023.3
qtpy : 2.2.0
pyqt5 : None

@arnaudlegout arnaudlegout added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Nov 15, 2023

rhshadrach commented Nov 16, 2023

I do think there are performance gains to be had here, but this is only the case for a "small" (for an appropriate definition of small) number of groups:

unique_values = np.arange(30000, dtype=np.int64)
data = np.random.choice(unique_values, size=1_000_000)
s = pd.Series(data)

%timeit s.groupby(s).nunique()
%timeit s.groupby(s).unique().apply(len)

# 167 ms ± 1.68 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
# 1.32 s ± 7.78 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

The performance of .nunique depends mostly on the number of rows, whereas the performance of .unique depends largely on the number of groups.
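That per-group cost can be seen directly: groupby(...).unique() materialises one ndarray per group, and .apply(len) then makes a Python-level call for each of them. A small illustration (not from the thread, toy data):

```python
import numpy as np
import pandas as pd

# groupby(...).unique() builds one ndarray per group, so .apply(len)
# does O(number of groups) Python-level work on an object Series.
s = pd.Series(np.repeat(np.arange(5), 3))  # 5 groups of 3 identical rows
per_group = s.groupby(s).unique()          # object Series: one array per group
print(per_group.dtype)                     # object
print(per_group.apply(len).tolist())       # [1, 1, 1, 1, 1]
```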


rhshadrach commented Nov 16, 2023

Expanding a bit on this, the bottleneck of the current implementation for nunique is sorting the group ids and factorized codes. We can avoid sorting with a hashtable approach: populate a hashtable with (id, code) values, counting as they are added.
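As a rough pure-Python sketch of that idea (the actual fix landed in Cython via #56061; the function name and signature here are made up for illustration):

```python
import numpy as np

def group_nunique(ids, codes, ngroups):
    """Sketch of the sort-free approach: record each (group id, code)
    pair in a hash set and bump the group's count the first time a
    pair is seen, avoiding the sort of ids and codes entirely."""
    seen = set()
    counts = np.zeros(ngroups, dtype=np.int64)
    for gid, code in zip(ids, codes):
        if (gid, code) not in seen:
            seen.add((gid, code))
            counts[gid] += 1
    return counts

# ids = group of each row, codes = factorized value of each row
ids = np.array([0, 0, 1, 1, 1, 2])
codes = np.array([5, 5, 7, 8, 7, 5])
print(group_nunique(ids, codes, 3))  # [1 2 1]
```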

@jorisvandenbossche jorisvandenbossche changed the title BUG: nunique is slower than unique.apply(len) on a groupby PERF: nunique is slower than unique.apply(len) on a groupby Nov 17, 2023
@jorisvandenbossche jorisvandenbossche added Groupby Performance Memory or execution speed performance and removed Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Nov 17, 2023
jorisvandenbossche (Member)

The original PR that added this custom nunique implementation (#10894) has a similar benchmark case with many groups that still shows the benefit (essentially the same as Richard's example):

In [25]: df = pd.DataFrame({'a': np.random.randint(10000, size=100000),
    ...:                    'b': np.random.randint(10, size=100000)})

In [26]: %timeit df.groupby('a')['b'].nunique()
13.1 ms ± 94.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [27]: %timeit df.groupby('a')['b'].unique().apply(len)
329 ms ± 4.18 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
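For completeness, a quick check (not in the thread, arbitrary example data) that the two spellings return identical results, so the difference really is performance only:

```python
import numpy as np
import pandas as pd

# The two expressions should agree element-for-element; only their
# speed differs.
rng = np.random.default_rng(0)
df = pd.DataFrame({"a": rng.integers(0, 100, size=10_000),
                   "b": rng.integers(0, 10, size=10_000)})
fast = df.groupby("a")["b"].nunique()
slow = df.groupby("a")["b"].unique().apply(len)
print(fast.equals(slow))  # True
```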

3 participants