
Extreme performance difference between int and float factorization #28303

Closed
ghost opened this issue Sep 5, 2019 · 13 comments · Fixed by #36729
Labels: Performance (Memory or execution speed performance)

Comments

ghost commented Sep 5, 2019

Code Sample, a copy-pastable example if possible

import pandas as pd
import numpy as np
from time import time


df = pd.date_range(start="1/1/2018", end="1/2/2018", periods=1e6).to_frame()

start = time()
dfr = df.resample("1s").last()
print(time() - start)
print("Length:", len(dfr))
print()

group_index = np.round(df.index.astype(int) / 1e9)
start = time()
dfr = df.groupby(group_index).last()
print(time() - start)
print("Length:", len(dfr))

Problem description

In my current project, I use groupby as well as resample on the same data frames with the same aggregations. I have noticed that resample is much faster than groupby. While I understand that groupby is more flexible, it would still be nice if the performance were comparable. In the example above, resample is more than 50 times faster:

0.023558616638183594
Length: 86401

1.264981746673584
Length: 86401

I am aware that they don't result in the exact same data frames, but this does not matter for this discussion.

Expected Output

Better performance for groupby.

I haven't looked at the groupby implementation, so I don't know if there is a good reason for the difference. Even if there is, some common cases could still be improved a lot. For example, in this case we could first check whether the by-argument is monotonically increasing or decreasing; if it is, the operation can be implemented without a hash map at all.
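For illustration, a minimal sketch of that idea (my own example, not pandas internals; the helper name factorize_monotonic is made up), assuming a non-empty, monotonically increasing key array:

import numpy as np

def factorize_monotonic(values):
    # Mark positions where a new group starts; the first element always does.
    new_group = np.empty(len(values), dtype=bool)
    new_group[0] = True
    new_group[1:] = values[1:] != values[:-1]
    codes = np.cumsum(new_group) - 1   # 0-based group code for each element
    uniques = values[new_group]        # first occurrence of each group
    return codes, uniques

codes, uniques = factorize_monotonic(np.array([1.0, 1.0, 2.0, 2.0, 5.0]))
# codes -> [0, 0, 1, 1, 2], uniques -> [1.0, 2.0, 5.0]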

Output of pd.show_versions()

INSTALLED VERSIONS

commit : None
python : 3.7.3.final.0
python-bits : 64
OS : Linux
OS-release : 4.4.0-17134-Microsoft
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : C.UTF-8
LOCALE : en_US.UTF-8

pandas : 0.25.1
numpy : 1.17.1
pytz : 2019.2
dateutil : 2.8.0
pip : 19.2.3
setuptools : 41.2.0
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.4.1
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.10.1
IPython : 7.8.0
pandas_datareader: 0.7.4
bs4 : None
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : 4.4.1
matplotlib : 3.1.1
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 0.14.1
pytables : None
s3fs : None
scipy : 1.3.1
sqlalchemy : None
tables : None
xarray : None
xlrd : None
xlwt : None
xlsxwriter : None

WillAyd (Member) commented Sep 5, 2019

Can you profile to see where the time is spent? This is a pretty open-ended request, so anything you can do to isolate what you think needs to be improved would be helpful.

WillAyd added the "Needs Info" (Clarification about behavior needed to assess issue) label on Sep 5, 2019
ghost (Author) commented Sep 5, 2019

If I am not mistaken, pretty much all of the groupby time is spent in Float64HashTable.factorize. Are you familiar with this part of pandas? Can you point me to the native code?

TomAugspurger (Contributor) commented:

Not sure how useful it'll be, but that's

def factorize(self, const {{dtype}}_t[:] values, Py_ssize_t na_sentinel=-1,

TomAugspurger (Contributor) commented:

If you use ints instead of floats for your group_index, things are comparable.

    ...: group_index = np.round(df.index.astype(int) / 1e9).astype(int)  # added the astype(int) here
    ...: start = time()
    ...: dfr = df.groupby(group_index).last()
    ...: print(time() - start)
    ...: print("Length:", len(dfr))
    ...:
0.023672103881835938
Length: 86401

0.029394865036010742
Length: 86401

ghost (Author) commented Sep 6, 2019

Very nice, thank you! This solves my particular issue. Do you think it's worth fixing in general?

TomAugspurger (Contributor) commented Sep 6, 2019 via email

ghost (Author) commented Sep 6, 2019

I don't understand why it should be significantly slower. Why can't you just interpret the binary representation of the floats as integers, call the integer factorize, and then undo the mapping? I am not suggesting implementing it exactly like this, but such an implementation would only be slightly slower than the integer one, so I don't see why it has to be this slow.
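For example, a rough sketch of that idea (illustrative only; it glosses over details such as NaN bit patterns and -0.0 vs 0.0, which compare equal as floats but have different bits):

import numpy as np
import pandas as pd

values = np.array([0.5, 1.5, 0.5, 2.5])
as_ints = values.view('uint64')              # same bits, integer dtype
codes, unique_ints = pd.factorize(as_ints)   # integer factorization path
unique_floats = unique_ints.view('float64')  # undo the reinterpretation
# codes -> [0, 1, 0, 2], unique_floats -> [0.5, 1.5, 2.5]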

WillAyd (Member) commented Sep 6, 2019

If you have a more performant float factorization technique, I think we'd take a PR for it.

WillAyd changed the title from "Extreme performance difference between groupby and resample for comparable results" to "Extreme performance difference between int and float factorization" on Sep 6, 2019
WillAyd added the "Performance" (Memory or execution speed performance) label and removed the "Needs Info" (Clarification about behavior needed to assess issue) label on Sep 6, 2019
jschendel (Member) commented:

I think some further investigation is needed here: it doesn't look like pd.factorize is slow in general for floats, but rather that there are certain float values that are slower to factorize than others.

In the example below, the performance loss peaks for values around 10**9 (this might not be the only scenario, just one that I've found):

In [1]: import pandas as pd; import numpy as np

In [2]: # factorize int values for comparison 
   ...: a = np.arange(10**4, 2 * 10**4).repeat(100) 
   ...: %timeit pd.factorize(a)
6.18 ms ± 39 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [3]: # factorize float values on the order of 10**i 
   ...: a = np.arange(10**4, dtype='float64') 
   ...: for i in range(4, 16): 
   ...:     print(f'exponent: {i}') 
   ...:     a2 = (a + 10**i).repeat(100) 
   ...:     %timeit pd.factorize(a2) 
   ...: 
exponent: 4
7.11 ms ± 108 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
exponent: 5
6.77 ms ± 61.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
exponent: 6
6.72 ms ± 348 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
exponent: 7
16.6 ms ± 675 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
exponent: 8
78.6 ms ± 1.58 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
exponent: 9
649 ms ± 5.23 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
exponent: 10
344 ms ± 2.56 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
exponent: 11
74.4 ms ± 953 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
exponent: 12
15 ms ± 429 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
exponent: 13
8.71 ms ± 271 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
exponent: 14
7.46 ms ± 388 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
exponent: 15
7.63 ms ± 314 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

From the above, the performance loss seems worst around 10**9, but it isn't universal for all values on that order of magnitude. As best I can tell from some ad hoc timings, the performance loss occurs for int-like values that are close together:

In [6]: # rerunning the bad example for emphasis 
   ...: a2 = (a + 10**9).repeat(100) 
   ...: %timeit pd.factorize(a2)
633 ms ± 3.39 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [7]: # not int-like but still close together is fine
   ...: a2 = (a + 10**9 + np.random.random(10**4)).repeat(100) 
   ...: %timeit pd.factorize(a2)
6.69 ms ± 40 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [8]: # int-like but spaced out is fine 
   ...: a2 = np.linspace(10**9, 9*10**9, 10**4).round().repeat(100) 
   ...: %timeit pd.factorize(a2)
6.76 ms ± 238 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Finally, it's worth noting that pd.factorize isn't the only thing affected: pd.unique is slow as well in the scenarios above, which makes sense as they share a common codepath. For what it's worth, numpy's np.unique doesn't take a hit, which also makes sense since it's a different implementation, but it's still good to check:

In [9]: # same speed hit for pd.unique with int-like (numpy is the same)
   ...: a2 = (a + 10**9).repeat(100) 
   ...: %timeit pd.unique(a2) 
   ...: %timeit np.unique(a2)
633 ms ± 4.29 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
23.3 ms ± 300 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [10]: # likewise no speed hit when not int-like (numpy is the same)
    ...: a2 = (a + 10**9 + np.random.random(10**4)).repeat(100) 
    ...: %timeit pd.unique(a2) 
    ...: %timeit np.unique(a2)
4.97 ms ± 173 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
23.3 ms ± 156 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

I haven't dug in far enough to actually figure out why this is happening, much less come up with a fix. Hopefully what I wrote is a good starting point for someone else who wants to look into this.

chris-b1 (Contributor) commented:

A little more investigative work: we actually hash our floats by reinterpreting their bits as integers (a Python translation of the C code is below), and that happens to create collisions in the 10**7 to 10**9 range, which is the cause of the performance problem.

Someone more experienced than me in bit shuffling might see an obvious solution here, but ultimately we need a "better" hash function.

import numpy as np
import pandas as pd

def float_hash(key):
    # define kh_float64_hash_func_0_NAN(key) (khint32_t)((asint64(key))>>33^(asint64(key))^(asint64(key))<<11)
    return (key.view('uint64')>>33^key.view('uint64')^key.view('uint64')<<11).astype('uint32')

a = np.arange(10**4, dtype='float64') 
for i in range(4, 16): 
    a2 = (a + 10**i)
    hashes = float_hash(a2)
    num_collisions = len(a2) - len(pd.unique(hashes)) 
    print(f'exponent: {i} num_collisions: {num_collisions}') 

exponent: 4 num_collisions: 0
exponent: 5 num_collisions: 0
exponent: 6 num_collisions: 0
exponent: 7 num_collisions: 5000
exponent: 8 num_collisions: 4992
exponent: 9 num_collisions: 4608
exponent: 10 num_collisions: 0
exponent: 11 num_collisions: 0
exponent: 12 num_collisions: 0
exponent: 13 num_collisions: 0
exponent: 14 num_collisions: 0
exponent: 15 num_collisions: 0
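As for what a "better" mix might look like, here is an illustrative sketch (not the khash code and not a concrete proposal; float_hash_mixed is a made-up name): a murmur3-style fmix64 finalizer applied to the raw float bits before the 32-bit truncation, checked with the same collision count as above:

import numpy as np
import pandas as pd

def float_hash_mixed(key):
    # murmur3 fmix64 finalizer over the raw bits, then truncate to 32 bits
    h = key.view('uint64').copy()   # copy so the in-place ops don't modify `key`
    h ^= h >> np.uint64(33)
    h *= np.uint64(0xff51afd7ed558ccd)
    h ^= h >> np.uint64(33)
    h *= np.uint64(0xc4ceb9fe1a85ec53)
    h ^= h >> np.uint64(33)
    return h.astype('uint32')

a = np.arange(10**4, dtype='float64')
for i in range(4, 16):
    hashes = float_hash_mixed(a + 10**i)
    num_collisions = len(hashes) - len(pd.unique(hashes))
    print(f'exponent: {i} num_collisions: {num_collisions}')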

WillAyd (Member) commented Oct 10, 2019

@chris-b1 do you know what the astype to 32 bit at the end of that function is for? If you remove that there are no collisions, and I think it would match khash perfectly without it.
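One way to check that claim (a quick sketch of my own, not code from the thread; float_hash64 is a made-up name): count collisions of the 64-bit hash values before the uint32 truncation.

import numpy as np
import pandas as pd

def float_hash64(key):
    # same mix as the float_hash translation above, but without astype('uint32')
    u = key.view('uint64')
    return u >> 33 ^ u ^ u << 11

a = np.arange(10**4, dtype='float64')
for i in range(4, 16):
    hashes = float_hash64(a + 10**i)
    print(f'exponent: {i} num_collisions: {len(hashes) - len(pd.unique(hashes))}')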

chris-b1 (Contributor) commented:

khash uses 32-bit hashes throughout, so there's no way to avoid that truncation without modifying the underlying hash table code. Presumably there would be a memory cost to that, but I haven't tried.

WillAyd (Member) commented Jan 13, 2020

A possibly relevant idea from upstream:

attractivechaos/klib#125 (comment)
