Klib upgrade (factorizing performance increase) #8524
thanks @CarstVaartjes
You are welcome to take a shot, otherwise will try to get it in for 0.15.1
We'll take a shot tomorrow afternoon; there were quite a few changes, and I think Wes also added some inline code of his own to the current khash.c, but with some diff comparisons we hope we should be okay. And the performance increase is significant, so it would be nice to make it in time for 0.15.0
ok, if you could get it done in the next day or 2 and everything passes, then we could do it
@jreback we would love to make it happen; unfortunately, with my current knowledge of C, Python and pandas I cannot get it working yet. First error out of many others:

In case you would like to have a look, you can find:

Everything we made is based on the pandas team's work; any tips or suggestions are welcome. I am sorry to say that this won't be possible in the next few days. Thanks
ok no prob. maybe we can do this for 0.15.1. something that might help is seeing what was originally changed, e.g.

So I am not sure what @wesm changed, but maybe we can figure it out from this.
@jreback thanks for the info, actually this is exactly the way we adapted klib 0.2.8, which already seems to work as an independent package, but when I put these changes into pandas there seem to be problems with other files (like parser.c), files that we don't really use in the other project. I suppose I need to take a deeper look at the other files in pandas to understand the interconnections between them before being able to make it work properly. Thanks again for your rapid answer
@FrancescElies np. I think it's appropriate to keep it as an inline package, as we don't want to add an explicit dep (and it is license-compatible and works well that way). That said, pls have a closer look! I imagine the perf boost IS worth it. And FYI, this library is also vastly faster than almost anything else, e.g. see commentary here: http://article.gmane.org/gmane.comp.python.numeric.general/58724/match=unique+performance. Essentially numpy should use klib (or similar) and use hashtables to do unique.
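To make the unique-via-hashtable point concrete, here is a minimal sketch (my illustration, not pandas' actual C code): a single hash-set pass is O(n), whereas np.unique sorts first and is O(n log n).

```python
import numpy as np
import pandas as pd

def hash_unique(values):
    # Illustration only: pandas does this in C via klib hashtables.
    seen = set()
    out = []
    for v in values:
        if v not in seen:
            seen.add(v)
            out.append(v)
    return np.asarray(out, dtype=object)

np.random.seed(0)
arr = np.array(['NY', 'IL', 'OH', 'CA'], dtype=object).take(
    np.random.randint(4, size=10000))

print(hash_unique(arr))  # hash-based, one O(n) pass, keeps first-seen order
print(np.unique(arr))    # sort-based, O(n log n), returns sorted values
print(pd.unique(arr))    # pandas' klib-backed hashtable unique
```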
I might have time to look into this at the weekend as this definitely looks interesting, but I can't guarantee anything.
go @immerrr!
Very interesting, it is good to know that quadratic probing has a practical impact in many cases. I can try to look at the compilation issues if you can't sort them out; let me know.
I've integrated the code from 0.2.8 in #8547, but for some reason the vbench results are not as astounding: the new version is systematically faster, especially in all-unique cases, but the increase is nowhere near 60% (https://gist.github.com/immerrr/98a02338df8cbabb3a86). I did refactor the benchmarks somewhat and made string values 10 chars each to be closer to real-life conditions, so that may have influenced the result.
My example might have been a bit skewed because of the limited number of options and the short strings, I guess. Also, the big increase was only for strings in my example; the rest was 10-20%, but still decent, and good to see that it's still 10%ish in your gist :)
I wonder why your example compares `pd.factorize` with the raw hashtable operation:

```
In [5]: np.random.seed(0); strs = np.array(['AA', 'BB', 'CC', 'DD'], dtype=np.object_).take(np.random.randint(4, size=10000))

In [6]: %timeit pd.factorize(strs)
1000 loops, best of 3: 285 µs per loop

In [7]: %%timeit
   ...: ht = pd.hashtable.StringHashTable(len(strs))
   ...: ht.factorize(strs)
   ...:
10000 loops, best of 3: 171 µs per loop
```

As for the actual performance, I'm struggling to get quadratic probing on par with the old version: in the short-string scenario it's about 5 times slower, and valgrind shows that there's a lot more probing going on and the strcmp calls are rather expensive.

old version:

```
In [2]: strs = pd.util.testing.rands_array(2, 10000)

In [3]: strs[:5]
Out[3]: array(['SV', '1a', 'd7', 'dN', 'jt'], dtype=object)

In [4]: %%timeit
   ...: ht = pd.hashtable.StringHashTable(len(strs))
   ...: ht.factorize(strs)
   ...:
1000 loops, best of 3: 544 µs per loop
```

new version:

```
In [4]: np.random.seed(0); strs = pd.util.testing.rands_array(2, 10000)

In [5]: strs[:5]
Out[5]: array(['SV', '1a', 'd7', 'dN', 'jt'], dtype=object)

In [6]: %%timeit
   ...: ht = pd.hashtable.StringHashTable(len(strs))
   ...: ht.factorize(strs)
   ...:
100 loops, best of 3: 2.55 ms per loop
```
Ok, I've realized that 3.5k two-character strings that only slightly differ from each other may be a corner case for a hash function as simple as x31 and a probing scheme that favours neighboring buckets. I've re-run it on short and duplicated strings, and the newer version is indeed slightly faster.

old version:

```
In [16]: %%timeit np.random.seed(0); strs = np.array(['NY', 'IL', 'OH', 'CA'], dtype=np.object_).take(np.random.randint(4, size=10000))
   ....: ht = pd.hashtable.StringHashTable(len(strs))
   ....: ht.factorize(strs)
   ....:
10000 loops, best of 3: 158 µs per loop

In [17]: %%timeit np.random.seed(0); strs = np.array(['AA', 'BB', 'CC', 'DD'], dtype=np.object_).take(np.random.randint(4, size=10000))
   ....: ht = pd.hashtable.StringHashTable(len(strs))
   ....: ht.factorize(strs)
   ....:
10000 loops, best of 3: 176 µs per loop

In [18]: pd.__version__
Out[18]: '0.15.0-6-g403f38d'
```

new version:

```
In [16]: %%timeit np.random.seed(0); strs = np.array(['NY', 'IL', 'OH', 'CA'], dtype=np.object_).take(np.random.randint(4, size=10000))
   ....: ht = pd.hashtable.StringHashTable(len(strs))
   ....: ht.factorize(strs)
   ....:
10000 loops, best of 3: 155 µs per loop

In [17]: %%timeit np.random.seed(0); strs = np.array(['AA', 'BB', 'CC', 'DD'], dtype=np.object_).take(np.random.randint(4, size=10000))
   ....: ht = pd.hashtable.StringHashTable(len(strs))
   ....: ht.factorize(strs)
   ....:
10000 loops, best of 3: 169 µs per loop

In [18]: pd.__version__
Out[18]: '0.15.0-9-g7d0d41b'
```
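For readers unfamiliar with the hash function named above: x31 is the simple multiply-by-31 string hash khash uses for strings. A minimal Python sketch (illustrative; the real version is C in khash.h) shows why near-identical two-character strings cluster: their hashes land in a tiny dense range, so they map to neighboring buckets and the probe and strcmp counts climb.

```python
def x31_hash(s):
    # X31 string hash as in khash: h = h*31 + byte, truncated to 32 bits.
    # Illustrative Python port; the real implementation is C in khash.h.
    h = 0
    for ch in s.encode('ascii'):
        h = ((h << 5) - h + ch) & 0xFFFFFFFF  # (h << 5) - h == h * 31
    return h

# Two-character strings hash into a small dense range, so in a table whose
# size is a power of two they occupy adjacent buckets:
for s in ['AA', 'AB', 'BA', 'BB']:
    print(s, x31_hash(s), x31_hash(s) % 1024)
# AA -> 2080, AB -> 2081, BA -> 2111, BB -> 2112 (all within 32 of each other)
```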
@CarstVaartjes was a conclusion ever reached about this possible upgrade? |
Hi! Not really, as I missed the later update! It's still an interesting 10% performance increase in a core function, I think. Also, khash is really stable (it's not a moving target), so this would really get the best out of it -> https://github.com/attractivechaos/klib/commits/master/khash.h
This was attempted in #49197 but ultimately reverted due to a performance degradation with isin. The hashing expert @realead has a separate issue open in #39799 to convert isin to using a set implementation rather than a hash-map implementation. Paired together, those changes might get us moving forward on this.
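As a rough sketch of that set-based idea (my illustration, not the actual proposal in #39799): isin only needs membership tests, so a keys-only hash set suffices, with no values array to allocate or touch.

```python
import numpy as np

def isin_via_set(values, candidates):
    # Sketch only: pandas' real isin is C/khash; this just shows that
    # membership needs a set (keys only), not a map (keys + values).
    lookup = set(candidates)
    return np.fromiter((v in lookup for v in values),
                       dtype=bool, count=len(values))

mask = isin_via_set(np.array(['NY', 'IL', 'OH', 'CA'] * 3), {'NY', 'CA'})
print(mask)
```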
@WillAyd we had investigated quadratic probing back while merging #36729. If I remember the results correctly: the downside of quadratic probing was that there were some "normal" sequences for which the performance degenerated quite a bit. It was decided to use the current approach because the probability of degeneration for "normal" sequences is much smaller. We have some such "normal" sequences in the performance suite, which would lead to degenerate behavior of quadratic probing. This is probably what you have observed.
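One plausible shape of such a degenerate "normal" sequence (my illustration; the actual cases live in the pandas benchmark suite): with an identity-style integer hash (khash's 32-bit integer hash is the identity) and a power-of-two table, an arithmetic sequence whose stride is a multiple of the table size piles every key into the same bucket, and quadratic probing then walks one long shared probe path.

```python
# Illustration: bucket clustering for a strided integer sequence in a
# power-of-two table, assuming an identity-style hash (hash(k) == k).
n_buckets = 2 ** 10
keys = range(0, 1024 * 64, 1024)              # stride == table size
buckets = {k & (n_buckets - 1) for k in keys}  # low bits pick the bucket
print(f"{len(buckets)} distinct bucket(s) for 64 keys")  # -> 1
```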
Yeah, I think this comment was a clear example of that: #49197 (comment)
Hi,
I'm working with two colleagues (@FrancescElies & @javiBi4) on seeing if we can add groupby to bcolz. What we did first was to see if we could make Pandas' excellent klib implementation more library-independent; see https://github.com/CarstVaartjes/khash
A while back khash (https://github.com/attractivechaos/klib) was updated to 0.2.8:
2013-05-02 (0.2.8):
* Use quadratic probing. When the capacity is power of 2, stepwise triangular probing is used and is slightly faster.
We updated the klib in pandas to it (which was something of a headache, to be honest), but quadratic probing seems to be quite a bit faster, which might make it interesting to retrofit this into the general Pandas master. See the example below:
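A representative benchmark of the kind being described, as an illustrative sketch (array sizes and setup are assumptions of mine, and timings will vary by build and machine):

```python
# Illustrative sketch of the comparison described above; run under both the
# old and the upgraded klib build and compare the printed timings.
# pd.util.testing matches the 0.15-era API used elsewhere in this thread.
import numpy as np
import pandas as pd
from timeit import timeit

np.random.seed(0)
ints = np.random.randint(0, 1000, size=1_000_000)   # assumed sizes
floats = ints.astype(np.float64)
strs = pd.util.testing.rands_array(10, 100_000)

for name, arr in [('int64', ints), ('float64', floats), ('strings', strs)]:
    t = timeit(lambda arr=arr: pd.factorize(arr), number=10)
    print(f"{name}: {t / 10 * 1000:.2f} ms per factorize")
```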
So a 20%ish improvement for int64 and a 65% improvement for strings in this example. Just thought I'd give you a heads-up about this, as you might have missed 0.2.8 (and it's a bit of a pain to adjust everything, so this might help make Pandas even faster).
Edit: added float64 example too (10-15% improvement)