
Klib upgrade (factorizing performance increase) #8524

Open
CarstVaartjes opened this issue Oct 9, 2014 · 19 comments
Assignees
Labels
Performance Memory or execution speed performance

Comments

@CarstVaartjes

Hi,

I'm working with two colleagues (@FrancescElies & @javiBi4) on adding groupby to bcolz. As a first step, we tried to make pandas' excellent klib implementation more library-independent; see https://github.com/CarstVaartjes/khash
A while back khash (https://github.com/attractivechaos/klib) was updated to 0.2.8:

2013-05-02 (0.2.8):

* Use quadratic probing. When the capacity is power of 2, stepping function
  i*(i+1)/2 guarantees to traverse each bucket. It is better than double
  hashing on cache performance and is more robust than linear probing.

  In theory, double hashing should be more robust than quadratic probing.
  However, my implementation is probably not for large hash tables, because
  the second hash function is closely tied to the first hash function,
  which reduce the effectiveness of double hashing.

Reference: http://research.cs.vt.edu/AVresearch/hashing/quadratic.php
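As a standalone illustration of the changelog's claim (a Python sketch, not klib's actual C code), the triangular-number stepping i*(i+1)/2 really does traverse every bucket when the capacity is a power of two:

```python
def probe_sequence(h, capacity):
    # Triangular-number quadratic probing: probe i is offset by
    # i*(i+1)/2 from the initial hash. When capacity is a power of 2,
    # this visits every bucket exactly once over `capacity` steps.
    assert capacity & (capacity - 1) == 0, "capacity must be a power of 2"
    mask = capacity - 1
    return [(h + i * (i + 1) // 2) & mask for i in range(capacity)]

# The full table is covered, and the early steps are small
# (neighboring buckets), which is what helps cache performance:
assert sorted(probe_sequence(12345, 16)) == list(range(16))
```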

We updated our klib copy to 0.2.8 (which was something of a headache, to be honest), and the quadratic probing seems to be quite a bit faster, which might make it interesting to retrofit into the pandas master. See the example:

import numpy as np
import random
import pandas as pd
from khash.hashtable import Int64HashTable, StringHashTable, Float64HashTable

def new_klib_int(input_array):
    ht = Int64HashTable(len(input_array))
    return ht.get_labels_groupby(input_array)

def new_klib_str(input_array):
    ht = StringHashTable(len(input_array))
    return ht.factorize(input_array)

def new_klib_float(input_array):
    ht = Float64HashTable(len(input_array))
    return ht.factorize(input_array)

a = np.random.randint(100, size=10000000)
b = np.fromiter((random.choice(['NY', 'IL', 'OH', 'CA']) for _ in xrange(10000000)), dtype='S2')
b = pd.Series(b).values
c = np.random.random_sample(10000000) * 10000

%timeit pd.factorize(a)
%timeit new_klib_int(a)
%timeit pd.factorize(b)
%timeit new_klib_str(b)
%timeit pd.factorize(c)
%timeit new_klib_float(c)

In [20]: %timeit pd.factorize(a)
10 loops, best of 3: 122 ms per loop

In [21]: %timeit new_klib_int(a)
10 loops, best of 3: 101 ms per loop

In [22]: %timeit pd.factorize(b)
1 loops, best of 3: 496 ms per loop

In [23]: %timeit new_klib_str(b)
10 loops, best of 3: 165 ms per loop

In [36]: %timeit pd.factorize(c)
1 loops, best of 3: 1.61 s per loop

In [37]: %timeit new_klib_float(c)
1 loops, best of 3: 1.44 s per loop

So a 20%-ish improvement for int64 and a 65% improvement for strings in this example. Just thought to give you a heads-up, as you might have missed 0.2.8 (and it's a bit of a pain to adjust everything, so this might help make pandas even faster).

Edit: added float64 example too (10-15% improvement)
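For readers unfamiliar with the contract being benchmarked: factorize maps an array to integer labels plus an array of uniques. A rough pure-NumPy equivalent (an illustration only; note that pd.factorize assigns labels in order of first appearance, while np.unique sorts the uniques):

```python
import numpy as np

def factorize_numpy(values):
    # Rough sketch of the factorize contract using np.unique.
    # Unlike pd.factorize, np.unique sorts the uniques, so the
    # label numbering differs; the labels/uniques relationship is the same.
    uniques, labels = np.unique(values, return_inverse=True)
    return labels, uniques

labels, uniques = factorize_numpy(np.array([3, 1, 3, 2, 1]))
# uniques == [1, 2, 3], labels == [2, 0, 2, 1, 0]
```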

@jreback
Contributor

jreback commented Oct 9, 2014

thanks @CarstVaartjes

klib is shipped with pandas so should be straightforward to update I think.
just a couple of files.

You are welcome to take a shot at it; otherwise we will try to get it in for 0.15.1.

@jreback jreback added the Performance Memory or execution speed performance label Oct 9, 2014
@jreback jreback added this to the 0.15.1 milestone Oct 9, 2014
@CarstVaartjes
Author

We'll take a shot tomorrow afternoon; there were quite a few changes, and I think Wes also added quite a bit of his own inline stuff to the current khash.c, but with some diff comparisons we hope we should be okay. And the performance increase is significant, so it would be nice to make it in time for 0.15.0.

@jreback
Contributor

jreback commented Oct 9, 2014

ok, if you could get it done in the next day or 2 and everything passes, then we could do it
(run a vbench as well)

@FrancescElies

@jreback we would love to make it happen; unfortunately, with my current knowledge of C, Python and pandas I cannot get python setup.py build_ext --inplace to execute correctly with the new version of klib.

First error out of many others:

In file included from pandas/parser.c:360:
In file included from pandas/src/parser/tokenizer.h:36:
pandas/src/klib/khash.h:187:21: error: redefinition of '__ac_HASH_UPPER'
static const double __ac_HASH_UPPER = 0.77;

In case you would like to have a look you can find:

Everything we made is based on the pandas team's work; any tips or suggestions are welcome.

I am sorry to say that this won't be possible in the next few days.

Thanks

@jreback
Contributor

jreback commented Oct 10, 2014

@FrancescElies

ok no prob. maybe can do this for 0.15.1.

something that might help is seeing what was originally changed, e.g.

git blame pandas/src/klib/khash.h shows what/when things were changed (e.g. like a diff from the original file).

So I am not sure what @wesm changed, but maybe you can figure it out from this.

@FrancescElies

@jreback thanks for the info; this is actually exactly the way we adapted klib 0.2.8, which already seems to work as an independent package. But when I port these changes to pandas, there seem to be problems with other files (like parser.c), files that we don't really use in the other project.

I suppose I need to take a deeper look at the other files in pandas to understand the interconnections between them before being able to make it work properly.

Thanks again for your quick reply.

@jreback
Contributor

jreback commented Oct 10, 2014

@FrancescElies np. I think it's appropriate to keep it as an in-line (vendored) package, as we don't want to add an explicit dependency (and it is license-compatible and works well that way).

That said, please have a closer look! I imagine the perf boost IS worth it.

and FYI, this library is also vastly faster than almost anything else,

e.g. see commentary here: http://article.gmane.org/gmane.comp.python.numeric.general/58724/match=unique+performance

essentially numpy should use klib (or similar) and use hashtables to do unique.
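The hashtable-based unique idea can be sketched in plain Python (a dict is itself a hash table); this is only an illustration of the approach, not numpy or klib code:

```python
def unique_via_hashtable(values):
    # Hash-based unique: one pass, O(n) expected time, versus the
    # O(n log n) sort that np.unique performs. Using a dict also
    # preserves first-seen order, like pd.factorize does.
    seen = {}
    for v in values:
        if v not in seen:
            seen[v] = None
    return list(seen)

result = unique_via_hashtable([3, 1, 3, 2, 1, 3])
# result == [3, 1, 2]
```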

@immerrr
Contributor

immerrr commented Oct 10, 2014

I might have time to look into this at the weekend as this definitely looks interesting, but I can't guarantee anything.

@jreback
Contributor

jreback commented Oct 10, 2014

go @immerrr !

@wesm
Member

wesm commented Oct 12, 2014

Very interesting; it is good to know that quadratic probing has a practical impact in many cases.

I can try to look at the compilation issues if you can't sort them out, let me know.

@immerrr
Contributor

immerrr commented Oct 13, 2014

I've integrated the code from 0.2.8 in #8547, but for some reason vbench results are not as astounding: new version is systematically faster, especially in all-unique cases, but the increase is nowhere near 60% (https://gist.github.com/immerrr/98a02338df8cbabb3a86).

I did refactor the benchmarks somewhat and made string values 10 chars each to be closer to real-life conditions, so that may have influenced the result.

@CarstVaartjes
Author

My example might have been a bit skewed because of the limited number of distinct values and the short strings, I guess. Also, the big increase was only for strings in my example; the rest was 10-20%, but that's still decent, and good to see that it's still 10%-ish in your gist :)
P.s. great work!

@immerrr
Contributor

immerrr commented Oct 20, 2014

I wonder why your example compares StringHashTable(len(x)).factorize(x) to pd.factorize(x); the latter is a lot more involved:

In [5]: np.random.seed(0); strs = np.array(['AA', 'BB', 'CC', 'DD'], dtype=np.object_).take(np.random.randint(4, size=10000))

In [6]: %timeit pd.factorize(strs)
1000 loops, best of 3: 285 µs per loop

In [7]: %%timeit
ht = pd.hashtable.StringHashTable(len(strs))
ht.factorize(strs)
   ...: 
10000 loops, best of 3: 171 µs per loop

As for the actual performance, I'm struggling to get quadratic probing on par with the old version: in the short-string scenario it's about 5 times slower, and valgrind shows that there's a lot more probing going on and that strcmp calls are rather expensive:

old version

In [2]: strs = pd.util.testing.rands_array(2, 10000)

In [3]: strs[:5]
Out[3]: array(['SV', '1a', 'd7', 'dN', 'jt'], dtype=object)

In [4]: %%timeit
   ...: ht = pd.hashtable.StringHashTable(len(strs))
   ...: ht.factorize(strs)
   ...: 
1000 loops, best of 3: 544 µs per loop

new version

In [4]: np.random.seed(0); strs = pd.util.testing.rands_array(2, 10000)

In [5]: strs[:5]
Out[5]: array(['SV', '1a', 'd7', 'dN', 'jt'], dtype=object)

In [6]: %%timeit
ht = pd.hashtable.StringHashTable(len(strs))
ht.factorize(strs)
   ...: 
100 loops, best of 3: 2.55 ms per loop

@immerrr
Contributor

immerrr commented Oct 21, 2014

Ok, I've realized that 3.5k two-character strings that only slightly differ from each other may be a corner case for a hash function as simple as x31 combined with a probing scheme that favours neighboring buckets. I've re-run it on short and heavily duplicated strings, and the newer version is indeed slightly faster.
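The x31 string hash mentioned here is simply h = h*31 + ch. A quick sketch (a Python illustration, not klib's C implementation) shows why similar short strings land in neighboring buckets, where the probe sequence's small early steps make collisions pile up:

```python
def x31_hash(s):
    # X31 string hash, as in klib/khash: h = h*31 + ch per character,
    # kept to 32 bits. Simple and fast, but similar short strings
    # produce adjacent hash values.
    h = 0
    for ch in s:
        h = (h * 31 + ord(ch)) & 0xFFFFFFFF
    return h

# With a power-of-two table, adjacent hashes mean adjacent buckets,
# so near-identical two-character keys cluster together:
mask = 2 ** 14 - 1  # hypothetical table of 16384 buckets
buckets = [x31_hash(s) & mask for s in ('AA', 'AB', 'AC', 'AD')]
# buckets == [2080, 2081, 2082, 2083] -- four consecutive slots
```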

old version

In [16]: %%timeit np.random.seed(0); strs = np.array(['NY', 'IL', 'OH', 'CA'], dtype=np.object_).take(np.random.randint(4, size=10000))
ht = pd.hashtable.StringHashTable(len(strs))
ht.factorize(strs)
   ....: 
10000 loops, best of 3: 158 µs per loop

In [17]: %%timeit np.random.seed(0); strs = np.array(['AA', 'BB', 'CC', 'DD'], dtype=np.object_).take(np.random.randint(4, size=10000))
ht = pd.hashtable.StringHashTable(len(strs))
ht.factorize(strs)
   ....: 
10000 loops, best of 3: 176 µs per loop

In [18]: pd.__version__
Out[18]: '0.15.0-6-g403f38d'

new version

In [16]: %%timeit np.random.seed(0); strs = np.array(['NY', 'IL', 'OH', 'CA'], dtype=np.object_).take(np.random.randint(4, size=10000))
ht = pd.hashtable.StringHashTable(len(strs))
ht.factorize(strs)
   ....: 
10000 loops, best of 3: 155 µs per loop

In [17]: %%timeit np.random.seed(0); strs = np.array(['AA', 'BB', 'CC', 'DD'], dtype=np.object_).take(np.random.randint(4, size=10000))
ht = pd.hashtable.StringHashTable(len(strs))
ht.factorize(strs)
   ....: 
10000 loops, best of 3: 169 µs per loop

In [18]: pd.__version__
Out[18]: '0.15.0-9-g7d0d41b'

@jreback jreback modified the milestones: 0.15.2, 0.16.0 Nov 24, 2014
@jreback jreback modified the milestones: 0.16.0, Next Major Release Mar 6, 2015
@datapythonista datapythonista modified the milestones: Contributions Welcome, Someday Jul 8, 2018
@jbrockmendel
Member

@CarstVaartjes was a conclusion ever reached about this possible upgrade?

@CarstVaartjes
Author

Hi! Not really, as I missed the later update! It's still an interesting 10% performance increase in a core function, I think. Also, khash is really stable (it's not a moving target, so this would really get the best out of it) -> https://github.com/attractivechaos/klib/commits/master/khash.h

@jbrockmendel jbrockmendel self-assigned this Oct 16, 2019
@mroeschke mroeschke removed this from the Someday milestone Oct 13, 2022
@WillAyd
Member

WillAyd commented Apr 11, 2023

This was attempted in #49197 but ultimately reverted due to a performance degradation with isin. The hashing expert @realead has a separate issue open, #39799, to convert isin to a set implementation rather than a hash implementation. Paired together, those changes might get us moving forward on this.

@realead
Contributor

realead commented May 4, 2023

@WillAyd we investigated quadratic probing back while merging #36729. If I remember the results correctly, the downside of quadratic probing was that there were some "normal" sequences for which the performance degenerated quite a bit. It was decided to keep the current approach, because its probability of degenerating on "normal" sequences is much smaller.

We have some such "normal" sequences in the performance suite, which would lead to degenerate behavior with quadratic probing. This is probably what you observed.

@WillAyd
Member

WillAyd commented May 4, 2023

Yeah, I think this comment was a clear example of that: #49197 (comment)
