
Klib upgrade (factorizing performance increase) #8524

Open
CarstVaartjes opened this issue Oct 9, 2014 · 19 comments
Assignees
Labels
Performance Memory or execution speed performance

Comments

@CarstVaartjes

Hi,

I'm working with two colleagues (@FrancescElies & @javiBi4) on adding groupby to bcolz. As a first step, we tried to make pandas' excellent klib implementation more library-independent; see https://github.com/CarstVaartjes/khash
A while back khash (https://github.com/attractivechaos/klib) was updated to 0.2.8:

2013-05-02 (0.2.8):

* Use quadratic probing. When the capacity is power of 2, stepping function
  i*(i+1)/2 guarantees to traverse each bucket. It is better than double
  hashing on cache performance and is more robust than linear probing.

  In theory, double hashing should be more robust than quadratic probing.
  However, my implementation is probably not for large hash tables, because
  the second hash function is closely tied to the first hash function,
  which reduce the effectiveness of double hashing.

Reference: http://research.cs.vt.edu/AVresearch/hashing/quadratic.php
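As a standalone illustration of the changelog's claim (a Python sketch, not klib's actual C code), the triangular-number stepping i*(i+1)/2 really does traverse every bucket when the capacity is a power of two:

```python
def probe_sequence(h, capacity):
    # Triangular-number quadratic probing: probe i is offset by
    # i*(i+1)/2 from the initial hash. When capacity is a power of 2,
    # this visits every bucket exactly once over `capacity` steps.
    assert capacity & (capacity - 1) == 0, "capacity must be a power of 2"
    mask = capacity - 1
    return [(h + i * (i + 1) // 2) & mask for i in range(capacity)]

# The full table is covered, and the early steps are small
# (neighboring buckets), which is what helps cache performance:
assert sorted(probe_sequence(12345, 16)) == list(range(16))
```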

We updated our klib copy to 0.2.8 (which was something of a headache, to be honest), and the quadratic probing seems to be quite a bit faster, which might make it interesting to retrofit into the pandas master. See the example:

import numpy as np
import random
import pandas as pd
from khash.hashtable import Int64HashTable, StringHashTable, Float64HashTable

def new_klib_int(input_array):
    ht = Int64HashTable(len(input_array))
    return ht.get_labels_groupby(input_array)

def new_klib_str(input_array):
    ht = StringHashTable(len(input_array))
    return ht.factorize(input_array)

def new_klib_float(input_array):
    ht = Float64HashTable(len(input_array))
    return ht.factorize(input_array)

a = np.random.randint(100, size=10000000)
b = np.fromiter((random.choice(['NY', 'IL', 'OH', 'CA']) for _ in xrange(10000000)), dtype='S2')
b = pd.Series(b).values
c = np.random.random_sample(10000000) * 10000

%timeit pd.factorize(a)
%timeit new_klib_int(a)
%timeit pd.factorize(b)
%timeit new_klib_str(b)
%timeit pd.factorize(c)
%timeit new_klib_float(c)

In [20]: %timeit pd.factorize(a)
10 loops, best of 3: 122 ms per loop

In [21]: %timeit new_klib_int(a)
10 loops, best of 3: 101 ms per loop

In [22]: %timeit pd.factorize(b)
1 loops, best of 3: 496 ms per loop

In [23]: %timeit new_klib_str(b)
10 loops, best of 3: 165 ms per loop

In [36]: %timeit pd.factorize(c)
1 loops, best of 3: 1.61 s per loop

In [37]: %timeit new_klib_float(c)
1 loops, best of 3: 1.44 s per loop

So a 20%-ish improvement for int64 and a 65% improvement for strings in this example. Just thought to give you a heads-up, as you might have missed 0.2.8 (and it's a bit of a pain to adjust everything, so this might help make pandas even faster).

Edit: added float64 example too (10-15% improvement)
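For readers unfamiliar with the contract being benchmarked: factorize maps an array to integer labels plus an array of uniques. A rough pure-NumPy equivalent (an illustration only; note that pd.factorize assigns labels in order of first appearance, while np.unique sorts the uniques):

```python
import numpy as np

def factorize_numpy(values):
    # Rough sketch of the factorize contract using np.unique.
    # Unlike pd.factorize, np.unique sorts the uniques, so the
    # label numbering differs; the labels/uniques relationship is the same.
    uniques, labels = np.unique(values, return_inverse=True)
    return labels, uniques

labels, uniques = factorize_numpy(np.array([3, 1, 3, 2, 1]))
# uniques == [1, 2, 3], labels == [2, 0, 2, 1, 0]
```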

@jreback
Contributor

jreback commented Oct 9, 2014

thanks @CarstVaartjes

klib is shipped with pandas so should be straightforward to update I think.
just a couple of files.

You are welcome to take a shot at it; otherwise we will try to get it in for 0.15.1.

@jreback jreback added the Performance Memory or execution speed performance label Oct 9, 2014
@jreback jreback added this to the 0.15.1 milestone Oct 9, 2014
@CarstVaartjes
Author

We'll take a shot tomorrow afternoon; there were quite a few changes, and I think Wes also added quite a bit of his own inline stuff to the current khash.c, but with some diff comparisons we hope we should be okay. And the performance increase is significant, so it would be nice to make it in time for 0.15.0.

@jreback
Contributor

jreback commented Oct 9, 2014

ok, if you could get it done in the next day or 2 and everything passes, then we could do it
(run a vbench as well)

@FrancescElies

@jreback we would love to make it happen; unfortunately, with my current knowledge of C, Python and pandas I cannot get python setup.py build_ext --inplace to execute correctly with the new version of klib.

First error out of many others:

In file included from pandas/parser.c:360:
In file included from pandas/src/parser/tokenizer.h:36:
pandas/src/klib/khash.h:187:21: error: redefinition of '__ac_HASH_UPPER'
static const double __ac_HASH_UPPER = 0.77;

In case you would like to have a look you can find:

Everything we made is based on the pandas team's work; any tips or suggestions are welcome.

I am sorry to say that this won't be possible in the next few days.

Thanks

@jreback
Contributor

jreback commented Oct 10, 2014

@FrancescElies

ok no prob. maybe can do this for 0.15.1.

something that might help is seeing what was originally changed, e.g.

git blame pandas/src/klib/khash.h shows what/when things were changed (e.g. like a diff from the original file).

So I am not sure what @wesm changed, but maybe you can figure it out from this.

@FrancescElies

@jreback thanks for the info; this is actually exactly the way we adapted klib 0.2.8, which already seems to work as an independent package. But when I port these changes to pandas, there seem to be problems with other files (like parser.c), files that we don't really use in the other project.

I suppose I need to take a deeper look at the other files in pandas to understand the interconnections between them before being able to make it work properly.

Thanks again for your quick reply.

@jreback
Contributor

jreback commented Oct 10, 2014

@FrancescElies np. I think it's appropriate to keep it as an in-line (vendored) package, as we don't want to add an explicit dependency (and it is license-compatible and works well that way).

That said, please have a closer look! I imagine the perf boost IS worth it.

and FYI, this library is also vastly faster than almost anything else,

e.g. see commentary here: http://article.gmane.org/gmane.comp.python.numeric.general/58724/match=unique+performance

essentially numpy should use klib (or similar) and use hashtables to do unique.
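The hashtable-based unique idea can be sketched in plain Python (a dict is itself a hash table); this is only an illustration of the approach, not numpy or klib code:

```python
def unique_via_hashtable(values):
    # Hash-based unique: one pass, O(n) expected time, versus the
    # O(n log n) sort that np.unique performs. Using a dict also
    # preserves first-seen order, like pd.factorize does.
    seen = {}
    for v in values:
        if v not in seen:
            seen[v] = None
    return list(seen)

result = unique_via_hashtable([3, 1, 3, 2, 1, 3])
# result == [3, 1, 2]
```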

@immerrr
Contributor

immerrr commented Oct 10, 2014

I might have time to look into this at the weekend as this definitely looks interesting, but I can't guarantee anything.

@jreback
Contributor

jreback commented Oct 10, 2014

go @immerrr !

@wesm
Member

wesm commented Oct 12, 2014

Very interesting; it is good to know that quadratic probing has a practical impact in many cases.

I can try to look at the compilation issues if you can't sort them out, let me know.

@immerrr
Contributor

immerrr commented Oct 13, 2014

I've integrated the code from 0.2.8 in #8547, but for some reason vbench results are not as astounding: new version is systematically faster, especially in all-unique cases, but the increase is nowhere near 60% (https://gist.github.com/immerrr/98a02338df8cbabb3a86).

I did refactor the benchmarks somewhat and made string values 10 chars each to be closer to real-life conditions, so that may have influenced the result.

@CarstVaartjes
Author

My example might have been a bit skewed because of the limited number of distinct values and the short strings, I guess. Also, the big increase was only for strings in my example; the rest was 10-20%, but that's still decent, and good to see that it's still 10%-ish in your gist :)
P.s. great work!

@immerrr
Contributor

immerrr commented Oct 20, 2014

I wonder why your example compares StringHashTable(len(x)).factorize(x) to pd.factorize(x); the latter is a lot more involved:

In [5]: np.random.seed(0); strs = np.array(['AA', 'BB', 'CC', 'DD'], dtype=np.object_).take(np.random.randint(4, size=10000))

In [6]: %timeit pd.factorize(strs)
1000 loops, best of 3: 285 µs per loop

In [7]: %%timeit
ht = pd.hashtable.StringHashTable(len(strs))
ht.factorize(strs)
   ...: 
10000 loops, best of 3: 171 µs per loop

As for the actual performance, I'm struggling to get quadratic probing on par with the old version: in the short-string scenario it's about 5 times slower, and valgrind shows that there's a lot more probing going on and that strcmp calls are rather expensive:

old version

In [2]: strs = pd.util.testing.rands_array(2, 10000)

In [3]: strs[:5]
Out[3]: array(['SV', '1a', 'd7', 'dN', 'jt'], dtype=object)

In [4]: %%timeit
   ...: ht = pd.hashtable.StringHashTable(len(strs))
   ...: ht.factorize(strs)
   ...: 
1000 loops, best of 3: 544 µs per loop

new version

In [4]: np.random.seed(0); strs = pd.util.testing.rands_array(2, 10000)

In [5]: strs[:5]
Out[5]: array(['SV', '1a', 'd7', 'dN', 'jt'], dtype=object)

In [6]: %%timeit
ht = pd.hashtable.StringHashTable(len(strs))
ht.factorize(strs)
   ...: 
100 loops, best of 3: 2.55 ms per loop

@immerrr
Contributor

immerrr commented Oct 21, 2014

Ok, I've realized that 3.5k two-character strings that only slightly differ from each other may be a corner case for a hash function as simple as x31 combined with a probing scheme that favours neighboring buckets. I've re-run it on short and heavily duplicated strings, and the newer version is indeed slightly faster.
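The x31 string hash mentioned here is simply h = h*31 + ch. A quick sketch (a Python illustration, not klib's C implementation) shows why similar short strings land in neighboring buckets, where the probe sequence's small early steps make collisions pile up:

```python
def x31_hash(s):
    # X31 string hash, as in klib/khash: h = h*31 + ch per character,
    # kept to 32 bits. Simple and fast, but similar short strings
    # produce adjacent hash values.
    h = 0
    for ch in s:
        h = (h * 31 + ord(ch)) & 0xFFFFFFFF
    return h

# With a power-of-two table, adjacent hashes mean adjacent buckets,
# so near-identical two-character keys cluster together:
mask = 2 ** 14 - 1  # hypothetical table of 16384 buckets
buckets = [x31_hash(s) & mask for s in ('AA', 'AB', 'AC', 'AD')]
# buckets == [2080, 2081, 2082, 2083] -- four consecutive slots
```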

old version

In [16]: %%timeit np.random.seed(0); strs = np.array(['NY', 'IL', 'OH', 'CA'], dtype=np.object_).take(np.random.randint(4, size=10000))
ht = pd.hashtable.StringHashTable(len(strs))
ht.factorize(strs)
   ....: 
10000 loops, best of 3: 158 µs per loop

In [17]: %%timeit np.random.seed(0); strs = np.array(['AA', 'BB', 'CC', 'DD'], dtype=np.object_).take(np.random.randint(4, size=10000))
ht = pd.hashtable.StringHashTable(len(strs))
ht.factorize(strs)
   ....: 
10000 loops, best of 3: 176 µs per loop

In [18]: pd.__version__
Out[18]: '0.15.0-6-g403f38d'

new version

In [16]: %%timeit np.random.seed(0); strs = np.array(['NY', 'IL', 'OH', 'CA'], dtype=np.object_).take(np.random.randint(4, size=10000))
ht = pd.hashtable.StringHashTable(len(strs))
ht.factorize(strs)
   ....: 
10000 loops, best of 3: 155 µs per loop

In [17]: %%timeit np.random.seed(0); strs = np.array(['AA', 'BB', 'CC', 'DD'], dtype=np.object_).take(np.random.randint(4, size=10000))
ht = pd.hashtable.StringHashTable(len(strs))
ht.factorize(strs)
   ....: 
10000 loops, best of 3: 169 µs per loop

In [18]: pd.__version__
Out[18]: '0.15.0-9-g7d0d41b'

@jreback jreback modified the milestones: 0.15.2, 0.16.0 Nov 24, 2014
@jreback jreback modified the milestones: 0.16.0, Next Major Release Mar 6, 2015
@datapythonista datapythonista modified the milestones: Contributions Welcome, Someday Jul 8, 2018
@jbrockmendel
Member

@CarstVaartjes was a conclusion ever reached about this possible upgrade?

@CarstVaartjes
Author

Hi! Not really, as I missed the later update! It's still an interesting 10% performance increase in a core function, I think. Also, khash is really stable (it's not a moving target, so this would really get the best out of it) -> https://github.com/attractivechaos/klib/commits/master/khash.h

@jbrockmendel jbrockmendel self-assigned this Oct 16, 2019
@mroeschke mroeschke removed this from the Someday milestone Oct 13, 2022
@WillAyd
Member

WillAyd commented Apr 11, 2023

This was attempted in #49197 but ultimately reverted due to a performance degradation with isin. The hashing expert @realead has a separate issue open, #39799, to convert isin to a set implementation rather than a hash implementation. Paired together, those changes might get us moving forward on this.

@realead
Contributor

realead commented May 4, 2023

@WillAyd we investigated quadratic probing back while merging #36729. If I remember the results correctly, the downside of quadratic probing was that there were some "normal" sequences for which the performance degenerated quite a bit. It was decided to keep the current approach, because its probability of degenerating on "normal" sequences is much smaller.

We have some such "normal" sequences in the performance suite, which would lead to degenerate behavior with quadratic probing. This is probably what you observed.

@WillAyd
Member

WillAyd commented May 4, 2023

Yeah, I think this comment was a clear example of that: #49197 (comment)
