# Extreme performance difference between int and float factorization #28303
Can you profile to see where the time is spent? This is a pretty open-ended request, so anything you can do to isolate what you think needs to be improved is helpful.

---
If I am not mistaken, pretty much all of the time for groupby is spent in

---

Not sure how useful it'll be, but that's

---
If you use ints instead of floats for your group_index, things are comparable.

---
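For example, a hypothetical sketch of that workaround (the data and names here are illustrative, not from the original report): when the float group keys are exactly representable as integers, casting to int64 before grouping routes groupby through the fast integer-hashing path.

```python
import numpy as np
import pandas as pd

# hypothetical data: float group keys that are really whole numbers
# in the problematic ~10**9 range
df = pd.DataFrame({
    "key": (np.arange(10**4, dtype="float64") + 10**9).repeat(10),
    "val": 1.0,
})

# casting to int64 is only safe when every key is an exact integer;
# here it avoids the slow float-hashing path entirely
out = df.groupby(df["key"].astype("int64"))["val"].sum()
```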
Very nice, thank you! This solves my particular issue. Do you think it's worth fixing in general?

---
What needs to be fixed? Hashing floats is always going to be slower than hashing ints.
---
I don't understand why it should be significantly slower. I mean, why can't you just interpret the binary representation of the floats as an integer, call the integer factorize, and then undo the mapping? I am not suggesting implementing it like this, but that implementation would be only slightly slower than the integer one, so I don't see why it has to be so slow.
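A minimal sketch of that idea (hypothetical, not how pandas implements factorization): reinterpret the float64 bit patterns as int64, factorize the integers, then view the unique bit patterns back as floats. Note that this would treat different NaN payloads and -0.0/0.0 as distinct keys, which real float semantics do not.

```python
import numpy as np
import pandas as pd

a = (np.arange(10**4, dtype="float64") + 10**9).repeat(100)

# reinterpret the float bits as int64 without copying
bits = a.view("int64")

# factorize the integers instead of the floats
codes, unique_bits = pd.factorize(bits)

# undo the mapping: view the unique bit patterns as float64 again
uniques = np.asarray(unique_bits).view("float64")
```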
If you have a more performant float factorization technique, I think we'd take a PR for it.

---
I think some further investigation is needed here: it doesn't look like hashing floats is inherently the problem. In the example below, the performance loss peaks for values around `10**9`.

```python
In [1]: import pandas as pd; import numpy as np

In [2]: # factorize int values for comparison
   ...: a = np.arange(10**4, 2 * 10**4).repeat(100)
   ...: %timeit pd.factorize(a)
6.18 ms ± 39 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [3]: # factorize float values on the order of 10**i
   ...: a = np.arange(10**4, dtype='float64')
   ...: for i in range(4, 16):
   ...:     print(f'exponent: {i}')
   ...:     a2 = (a + 10**i).repeat(100)
   ...:     %timeit pd.factorize(a2)
   ...:
exponent: 4
7.11 ms ± 108 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
exponent: 5
6.77 ms ± 61.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
exponent: 6
6.72 ms ± 348 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
exponent: 7
16.6 ms ± 675 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
exponent: 8
78.6 ms ± 1.58 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
exponent: 9
649 ms ± 5.23 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
exponent: 10
344 ms ± 2.56 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
exponent: 11
74.4 ms ± 953 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
exponent: 12
15 ms ± 429 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
exponent: 13
8.71 ms ± 271 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
exponent: 14
7.46 ms ± 388 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
exponent: 15
7.63 ms ± 314 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
```

From the above, the performance loss seems worst for values around `10**9`:

```python
In [6]: # rerunning the bad example for emphasis
   ...: a2 = (a + 10**9).repeat(100)
   ...: %timeit pd.factorize(a2)
633 ms ± 3.39 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [7]: # not int-like but still close together is fine
   ...: a2 = (a + 10**9 + np.random.random(10**4)).repeat(100)
   ...: %timeit pd.factorize(a2)
6.69 ms ± 40 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [8]: # int-like but spaced out is fine
   ...: a2 = np.linspace(10**9, 9*10**9, 10**4).round().repeat(100)
   ...: %timeit pd.factorize(a2)
6.76 ms ± 238 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
```

Finally, it might be worthwhile to note that it's not only `factorize` that is affected:

```python
In [9]: # same speed hit for pd.unique with int-like (numpy is the same)
   ...: a2 = (a + 10**9).repeat(100)
   ...: %timeit pd.unique(a2)
   ...: %timeit np.unique(a2)
633 ms ± 4.29 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
23.3 ms ± 300 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [10]: # likewise no speed hit when not int-like (numpy is the same)
    ...: a2 = (a + 10**9 + np.random.random(10**4)).repeat(100)
    ...: %timeit pd.unique(a2)
    ...: %timeit np.unique(a2)
4.97 ms ± 173 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
23.3 ms ± 156 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
```

I haven't dug in far enough to figure out why this is happening, much less come up with a fix. Hopefully what I wrote is a good hint for someone else who wants to look into this.

---
A little more investigative work: we actually do hash our floats as if they were integers (translation of the C code below), and that happens to create collisions in the problematic range. Someone more experienced than me in bit shuffling might see an obvious solution here, but ultimately we need a "better" hash function.

```python
def float_hash(key):
    # define kh_float64_hash_func_0_NAN(key) (khint32_t)((asint64(key))>>33^(asint64(key))^(asint64(key))<<11)
    return (key.view('uint64') >> 33 ^ key.view('uint64') ^ key.view('uint64') << 11).astype('uint32')

a = np.arange(10**4, dtype='float64')
for i in range(4, 16):
    a2 = a + 10**i
    hashes = float_hash(a2)
    num_collisions = len(a2) - len(pd.unique(hashes))
    print(f'exponent: {i} num_collisions: {num_collisions}')
```

Output (truncated):

```
exponent: 4 num_collisions: 0
```

---
@chris-b1 do you know what the astype to 32 bit at the end of that function is for? If you remove it there are no collisions, and I think it would match khash perfectly without

---
khash uses 32-bit hashes throughout, so there's no way to avoid that truncation without modifying the underlying hash table code. Presumably there would be a memory cost to that, but I haven't tried.
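For illustration, a stronger 64-bit mix before the 32-bit truncation also removes the systematic collisions. The sketch below uses the Murmur3 `fmix64` finalizer as a hypothetical replacement hash (it is not what khash uses); a handful of random collisions remain theoretically possible after truncating to 32 bits, but the int-like bit patterns no longer collide en masse.

```python
import numpy as np
import pandas as pd

def float_hash_mixed(key):
    # Murmur3 fmix64 finalizer (hypothetical replacement mix),
    # followed by the 32-bit truncation that khash requires
    h = key.view("uint64").copy()
    h ^= h >> np.uint64(33)
    h *= np.uint64(0xFF51AFD7ED558CCD)  # wraps mod 2**64, as intended
    h ^= h >> np.uint64(33)
    h *= np.uint64(0xC4CEB9FE1A85EC53)
    h ^= h >> np.uint64(33)
    return h.astype("uint32")

# re-run the collision count from the earlier comment with the new mix
a = np.arange(10**4, dtype="float64")
for i in range(4, 16):
    a2 = a + 10**i
    num_collisions = len(a2) - len(pd.unique(float_hash_mixed(a2)))
    print(f"exponent: {i} num_collisions: {num_collisions}")
```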
Maybe a relevant idea from upstream:

---
#### Code Sample, a copy-pastable example if possible
#### Problem description
In my current project, I use groupby as well as resample on the same data frames with the same aggregations. I have noticed that resample is much quicker than groupby. While I understand that groupby is more flexible, it would still be nice if the performance were comparable. In the example above, resample is more than 50 times faster:

I am aware that they don't produce exactly the same data frames, but that does not matter for this discussion.
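The original code sample did not survive above; a hypothetical reproduction of the kind of comparison described (all names and sizes here are made up, not the reporter's actual example) would aggregate the same frame once with resample and once with groupby over the same time bins:

```python
import numpy as np
import pandas as pd

# hypothetical stand-in for the missing code sample: one minute-level
# frame aggregated to hourly bins two different ways
idx = pd.date_range("2019-01-01", periods=10**5, freq="min")
df = pd.DataFrame({"val": np.random.random(len(idx))}, index=idx)

# resample takes the fast path; groupby on equivalent keys is slower
by_resample = df.resample("h")["val"].sum()
by_groupby = df.groupby(df.index.floor("h"))["val"].sum()
```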
#### Expected Output
Better performance for groupby.
I haven't looked at the groupby implementation, so I don't know if there is a good reason for the difference. Even if there is, some common cases could still be improved a lot. For example, we could first check whether the by-argument is monotonically increasing or decreasing; if it is, the operation can be implemented without a hash map at all.
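A sketch of that fast path (hypothetical helper, not pandas internals): when the keys are already monotonic, group boundaries fall out of a single vectorized comparison, with no hash table involved.

```python
import numpy as np

def monotonic_group_sizes(keys):
    """Hypothetical helper: unique keys and group sizes for sorted input."""
    keys = np.asarray(keys)
    # the fast path only applies when the keys are monotonic
    assert np.all(keys[1:] >= keys[:-1]), "keys must be monotonically increasing"
    # a new group starts wherever the key changes
    starts = np.flatnonzero(np.concatenate(([True], keys[1:] != keys[:-1])))
    ends = np.concatenate((starts[1:], [len(keys)]))
    return keys[starts], ends - starts

uniq, sizes = monotonic_group_sizes([1.0, 1.0, 2.0, 2.0, 2.0, 5.0])
```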
#### Output of `pd.show_versions()`
INSTALLED VERSIONS
commit : None
python : 3.7.3.final.0
python-bits : 64
OS : Linux
OS-release : 4.4.0-17134-Microsoft
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : C.UTF-8
LOCALE : en_US.UTF-8
pandas : 0.25.1
numpy : 1.17.1
pytz : 2019.2
dateutil : 2.8.0
pip : 19.2.3
setuptools : 41.2.0
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.4.1
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.10.1
IPython : 7.8.0
pandas_datareader: 0.7.4
bs4 : None
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : 4.4.1
matplotlib : 3.1.1
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 0.14.1
pytables : None
s3fs : None
scipy : 1.3.1
sqlalchemy : None
tables : None
xarray : None
xlrd : None
xlwt : None
xlsxwriter : None