
Extreme performance difference between int and float factorization #28303

Closed
ghost opened this issue Sep 5, 2019 · 13 comments · Fixed by #36729
Labels: Performance (Memory or execution speed performance)

Comments

ghost commented Sep 5, 2019

Code Sample, a copy-pastable example if possible

import pandas as pd
import numpy as np
from time import time


df = pd.date_range(start="1/1/2018", end="1/2/2018", periods=1e6).to_frame()

start = time()
dfr = df.resample("1s").last()
print(time() - start)
print("Length:", len(dfr))
print()

group_index = np.round(df.index.astype(int) / 1e9)
start = time()
dfr = df.groupby(group_index).last()
print(time() - start)
print("Length:", len(dfr))

Problem description

In my current project, I use groupby as well as resample on the same data frames with the same aggregations. I have noticed that resample is much faster than groupby. While I understand that groupby is more flexible, it would still be nice if the performance were comparable. In the example above, resample is more than 50 times faster:

0.023558616638183594
Length: 86401

1.264981746673584
Length: 86401

I am aware that they don't result in the exact same data frames, but this does not matter for this discussion.

Expected Output

Better performance for groupby.

I haven't looked at the groupby implementation, so I don't know if there is a good reason for the difference. Even if there is, some common cases could still be improved a lot. For example, in this case we could first check whether the by-argument is monotonically increasing or decreasing; if it is, the operation can be implemented without a hash map at all.
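For illustration, a minimal sketch of that idea (my own example, not pandas internals; the helper name factorize_monotonic is made up), assuming a non-empty, monotonically increasing key array:

import numpy as np

def factorize_monotonic(values):
    # Mark positions where a new group starts; the first element always does.
    new_group = np.empty(len(values), dtype=bool)
    new_group[0] = True
    new_group[1:] = values[1:] != values[:-1]
    codes = np.cumsum(new_group) - 1   # 0-based group code for each element
    uniques = values[new_group]        # first occurrence of each group
    return codes, uniques

codes, uniques = factorize_monotonic(np.array([1.0, 1.0, 2.0, 2.0, 5.0]))
# codes -> [0, 0, 1, 1, 2], uniques -> [1.0, 2.0, 5.0]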

Output of pd.show_versions()

INSTALLED VERSIONS

commit : None
python : 3.7.3.final.0
python-bits : 64
OS : Linux
OS-release : 4.4.0-17134-Microsoft
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : C.UTF-8
LOCALE : en_US.UTF-8

pandas : 0.25.1
numpy : 1.17.1
pytz : 2019.2
dateutil : 2.8.0
pip : 19.2.3
setuptools : 41.2.0
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.4.1
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.10.1
IPython : 7.8.0
pandas_datareader: 0.7.4
bs4 : None
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : 4.4.1
matplotlib : 3.1.1
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 0.14.1
pytables : None
s3fs : None
scipy : 1.3.1
sqlalchemy : None
tables : None
xarray : None
xlrd : None
xlwt : None
xlsxwriter : None

WillAyd (Member) commented Sep 5, 2019

Can you profile to see where the time is spent? This is a pretty open-ended request, so anything you can do to isolate what you think needs to be improved would be helpful.

WillAyd added the "Needs Info" (Clarification about behavior needed to assess issue) label on Sep 5, 2019
ghost (Author) commented Sep 5, 2019

If I am not mistaken, pretty much all of the groupby time is spent in Float64HashTable.factorize. Are you familiar with this part of pandas? Can you point me to the native code?

TomAugspurger (Contributor) commented:

Not sure how useful it'll be, but that's

def factorize(self, const {{dtype}}_t[:] values, Py_ssize_t na_sentinel=-1,

TomAugspurger (Contributor) commented:

If you use ints instead of floats for your group_index, things are comparable.

    ...: group_index = np.round(df.index.astype(int) / 1e9).astype(int)  # added the astype(int) here
    ...: start = time()
    ...: dfr = df.groupby(group_index).last()
    ...: print(time() - start)
    ...: print("Length:", len(dfr))
    ...:
0.023672103881835938
Length: 86401

0.029394865036010742
Length: 86401

ghost (Author) commented Sep 6, 2019

Very nice, thank you! This solves my particular issue. Do you think it's worth fixing in general?

TomAugspurger (Contributor) commented Sep 6, 2019 via email

ghost (Author) commented Sep 6, 2019

I don't understand why it should be significantly slower. Why can't you just interpret the binary representation of the floats as integers, call the integer factorize, and then undo the mapping? I am not suggesting implementing it exactly like this, but such an implementation would only be slightly slower than the integer one, so I don't see why it has to be this slow.
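For example, a rough sketch of that idea (illustrative only; it glosses over details such as NaN bit patterns and -0.0 vs 0.0, which compare equal as floats but have different bits):

import numpy as np
import pandas as pd

values = np.array([0.5, 1.5, 0.5, 2.5])
as_ints = values.view('uint64')              # same bits, integer dtype
codes, unique_ints = pd.factorize(as_ints)   # integer factorization path
unique_floats = unique_ints.view('float64')  # undo the reinterpretation
# codes -> [0, 1, 0, 2], unique_floats -> [0.5, 1.5, 2.5]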

WillAyd (Member) commented Sep 6, 2019

If you have a more performant float factorization technique, I think we'd take a PR for it.

WillAyd changed the title from "Extreme performance difference between groupby and resample for comparable results" to "Extreme performance difference between int and float factorization" on Sep 6, 2019
WillAyd added the "Performance" (Memory or execution speed performance) label and removed the "Needs Info" (Clarification about behavior needed to assess issue) label on Sep 6, 2019
jschendel (Member) commented:

I think some further investigation is needed here: it doesn't look like pd.factorize is slow in general for floats, but rather that there are certain float values that are slower to factorize than others.

In the example below, the performance loss peaks for values around 10**9 (this might not be the only scenario, just one that I've found):

In [1]: import pandas as pd; import numpy as np

In [2]: # factorize int values for comparison 
   ...: a = np.arange(10**4, 2 * 10**4).repeat(100) 
   ...: %timeit pd.factorize(a)
6.18 ms ± 39 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [3]: # factorize float values on the order of 10**i 
   ...: a = np.arange(10**4, dtype='float64') 
   ...: for i in range(4, 16): 
   ...:     print(f'exponent: {i}') 
   ...:     a2 = (a + 10**i).repeat(100) 
   ...:     %timeit pd.factorize(a2) 
   ...: 
exponent: 4
7.11 ms ± 108 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
exponent: 5
6.77 ms ± 61.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
exponent: 6
6.72 ms ± 348 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
exponent: 7
16.6 ms ± 675 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
exponent: 8
78.6 ms ± 1.58 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
exponent: 9
649 ms ± 5.23 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
exponent: 10
344 ms ± 2.56 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
exponent: 11
74.4 ms ± 953 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
exponent: 12
15 ms ± 429 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
exponent: 13
8.71 ms ± 271 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
exponent: 14
7.46 ms ± 388 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
exponent: 15
7.63 ms ± 314 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

From the above, the performance loss seems worst around 10**9, but it isn't universal for all values on that order of magnitude. As best I can tell from some ad hoc timings, the performance loss occurs for int-like values that are close together:

In [6]: # rerunning the bad example for emphasis 
   ...: a2 = (a + 10**9).repeat(100) 
   ...: %timeit pd.factorize(a2)
633 ms ± 3.39 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [7]: # not int-like but still close together is fine
   ...: a2 = (a + 10**9 + np.random.random(10**4)).repeat(100) 
   ...: %timeit pd.factorize(a2)
6.69 ms ± 40 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [8]: # int-like but spaced out is fine 
   ...: a2 = np.linspace(10**9, 9*10**9, 10**4).round().repeat(100) 
   ...: %timeit pd.factorize(a2)
6.76 ms ± 238 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Finally, it's worth noting that pd.factorize isn't the only thing affected: pd.unique is slow as well in the scenarios above, which makes sense as they share a common codepath. For what it's worth, numpy's np.unique doesn't take a hit, which also makes sense since it's a different implementation, but it's still good to check:

In [9]: # same speed hit for pd.unique with int-like (numpy is the same)
   ...: a2 = (a + 10**9).repeat(100) 
   ...: %timeit pd.unique(a2) 
   ...: %timeit np.unique(a2)
633 ms ± 4.29 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
23.3 ms ± 300 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [10]: # likewise no speed hit when not int-like (numpy is the same)
    ...: a2 = (a + 10**9 + np.random.random(10**4)).repeat(100) 
    ...: %timeit pd.unique(a2) 
    ...: %timeit np.unique(a2)
4.97 ms ± 173 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
23.3 ms ± 156 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

I haven't dug in far enough to actually figure out why this is happening, much less come up with a fix. Hopefully what I wrote is a good starting point for someone else who wants to look into this.

chris-b1 (Contributor) commented:

A little more investigative work: we actually hash our floats by reinterpreting their bits as integers (a Python translation of the C code is below), and that happens to create collisions in the 10**7 to 10**9 range, which is the cause of the performance problem.

Someone more experienced than me in bit shuffling might see an obvious solution here, but ultimately we need a "better" hash function.

import numpy as np
import pandas as pd

def float_hash(key):
    # define kh_float64_hash_func_0_NAN(key) (khint32_t)((asint64(key))>>33^(asint64(key))^(asint64(key))<<11)
    return (key.view('uint64')>>33^key.view('uint64')^key.view('uint64')<<11).astype('uint32')

a = np.arange(10**4, dtype='float64') 
for i in range(4, 16): 
    a2 = (a + 10**i)
    hashes = float_hash(a2)
    num_collisions = len(a2) - len(pd.unique(hashes)) 
    print(f'exponent: {i} num_collisions: {num_collisions}') 

exponent: 4 num_collisions: 0
exponent: 5 num_collisions: 0
exponent: 6 num_collisions: 0
exponent: 7 num_collisions: 5000
exponent: 8 num_collisions: 4992
exponent: 9 num_collisions: 4608
exponent: 10 num_collisions: 0
exponent: 11 num_collisions: 0
exponent: 12 num_collisions: 0
exponent: 13 num_collisions: 0
exponent: 14 num_collisions: 0
exponent: 15 num_collisions: 0
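As for what a "better" mix might look like, here is an illustrative sketch (not the khash code and not a concrete proposal; float_hash_mixed is a made-up name): a murmur3-style fmix64 finalizer applied to the raw float bits before the 32-bit truncation, checked with the same collision count as above:

import numpy as np
import pandas as pd

def float_hash_mixed(key):
    # murmur3 fmix64 finalizer over the raw bits, then truncate to 32 bits
    h = key.view('uint64').copy()   # copy so the in-place ops don't modify `key`
    h ^= h >> np.uint64(33)
    h *= np.uint64(0xff51afd7ed558ccd)
    h ^= h >> np.uint64(33)
    h *= np.uint64(0xc4ceb9fe1a85ec53)
    h ^= h >> np.uint64(33)
    return h.astype('uint32')

a = np.arange(10**4, dtype='float64')
for i in range(4, 16):
    hashes = float_hash_mixed(a + 10**i)
    num_collisions = len(hashes) - len(pd.unique(hashes))
    print(f'exponent: {i} num_collisions: {num_collisions}')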

WillAyd (Member) commented Oct 10, 2019

@chris-b1 do you know what the astype to 32 bit at the end of that function is for? If you remove that there are no collisions, and I think it would match khash perfectly without it.
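One way to check that claim (a quick sketch of my own, not code from the thread; float_hash64 is a made-up name): count collisions of the 64-bit hash values before the uint32 truncation.

import numpy as np
import pandas as pd

def float_hash64(key):
    # same mix as the float_hash translation above, but without astype('uint32')
    u = key.view('uint64')
    return u >> 33 ^ u ^ u << 11

a = np.arange(10**4, dtype='float64')
for i in range(4, 16):
    hashes = float_hash64(a + 10**i)
    print(f'exponent: {i} num_collisions: {len(hashes) - len(pd.unique(hashes))}')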

chris-b1 (Contributor) commented:

khash uses 32-bit hashes throughout, so there's no way to avoid that truncation without modifying the underlying hash table code. Presumably there would be a memory cost to that, but I haven't tried.

WillAyd (Member) commented Jan 13, 2020

A possibly relevant idea from upstream:

attractivechaos/klib#125 (comment)
