PERF: using murmur hash for float64 khash-tables #36729
Conversation
Running the example from #28303 is now much faster. Curiously, the measurement actually shows how much better the murmur2-hash is:
Note: the above is just an educated guess; I still have to profile/debug to be sure. For example, it looks as if a rehashing happens while the table is built up. This would explain why 160ms and not 80ms (RAM latency of my machine * 1e6 elements ≈ 80ms) are needed to build up the hash map. This behavior of the old hash function is an advantage in this special case, but not for other cases, as shown by the measurements. Also, compared to Int64 and UInt64, the old Float64 behavior is an outlier (130ms vs 1ms); the explanation is the same as above:
self.s.isin(self.values)
class GH28303Example:
can you give this a descriptive name and then add a comment # see GH#28303
Done.
@realead looks really interesting. let me have a closer look soonish.
Force-pushed from 72eea8e to ee8dd93
is there any chance we can get it via
I'd like to understand this, but it may just be above my pay-grade. Let me know if I've got any of this right: so the naive interpretation of the asv result is that the murmur implementation is slower, but you're saying that interpretation is incorrect. Instead, the murmur implementation is so much faster that the [mumble] is using a slower cache level, making it look like it's slower?
One could replace the hash function with the C++ one and build every pyx-file which cimports these files as C++. However, C++11 is needed, so this option might need to be added as well.
I've said the murmur hash is better (not faster) :) In a nutshell: not having cache misses while building the hash-table only shows that the work isn't done "properly" and the paycheck will come later, when this hash-table is used. (One must know that in the example we are talking about, only the building of the hash-table is measured.) Here are more details, I hope they make things clearer and not worse... From a good hash we would expect the following behavior:
That means accessing buckets for hash(1.0), hash(2.0) and so on with the new function will lead to cache misses, so it is relatively slow. The old hash has a behavior similar to hash(1.0)=1, hash(2.0)=2, hash(3.0)=3 and so on, so we access the buckets in an order which optimizes the utilization of the L1 cache and which is 100 times faster, i.e. latency RAM / latency L1. We basically just copy the memory - you cannot beat that.

However, building the table is only half the story; look-ups matter too. If we look up the series 1.0, 2.0, 3.0, ... the old hash-function would be very fast again: we just look at one bucket after another, utilizing the L1 cache, while the new hash-function would be getting cache misses and be a constant factor of about 100 slower. However, if we are unlucky and have to look up a key which is not in the hash-table but hits the first bucket, the old hash function becomes a problem: the first bucket is not empty, but has the wrong key, so we need to look into the second bucket (that is how khash resolves collisions), which is not empty but also has the wrong key - and so almost all n buckets will be checked, leading to O(n) per look-up.

In the meantime, the new hash function would make the look-up in O(1), because the buckets are occupied in a random pattern and a long chain of occupied buckets is very unlikely. Obviously, we could also construct a series which would trigger the same behavior for the new hash-function, but the performance degradation will not happen for such "usual" series like 1.0, 2.0, and so on.
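For concreteness, a minimal C sketch of the weak hash under discussion (assuming, per the code comment quoted later in this thread, that the pre-PR float64 hash views the double's bytes as an int64 and applies khash's default 64-bit integer hash):

```c
#include <stdio.h>
#include <stdint.h>
#include <string.h>

/* sketch of the pre-PR float64 hash: reinterpret the double's bits as a
 * 64-bit integer and apply khash's default 64-bit hash (no avalanche) */
static uint32_t old_float64_hash(double key) {
    uint64_t v;
    memcpy(&v, &key, sizeof v);                 /* view double bytes as int64 */
    return (uint32_t)(v >> 33 ^ v ^ v << 11);   /* khash kh_int64_hash_func */
}

int main(void) {
    /* khash masks the hash with (table_size - 1); for 0.0, 1.0, 2.0, ...
     * the buckets fall onto a coarse, regular stride (and some values
     * collide outright), which is the GH#28303 problem */
    const uint32_t mask = (1u << 21) - 1;       /* 2^21 buckets */
    for (double x = 0.0; x < 8.0; x += 1.0)
        printf("hash(%.1f) -> bucket %u\n", x, old_float64_hash(x) & mask);
    return 0;
}
```

With a 2^21-bucket table, 0.0 and 2.0 already land in the same bucket, and the remaining values advance in large regular strides instead of spreading randomly.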
Force-pushed from 5cb5b3b to 0b5a7c7
Force-pushed from 0b5a7c7 to 8048f97
After thinking about it for another while: the issue with the C++ hash function is not only that it means switching to another language, but also that the behavior of the hash function will depend on the platform (Linux: gcc + libstdc++, macOS: clang + libc++, Windows: MSVC, which doesn't use murmur-hash IIRC) and the build (different hash-functions for 32-bit and 64-bit), so the behavior of the hash-map can differ between platforms.
thanks @realead lgtm. can you add a note in the Perf section of 1.2, ok to be reasonably specific (e.g. perf increase on groupby of floats vs ints).
cc @jbrockmendel @TomAugspurger ok here?
@realead are the slower-here asvs in indexing.NumericSeriesIndexing measuring the wrong thing? or if not wrong, then incomplete?
Force-pushed from 8048f97 to 821966e
@jbrockmendel In my opinion they are incomplete rather than wrong. However, measuring the performance of building the hash-table alone is a legit test as well. An example, as noted here:
Given the size set in pandas/pandas/_libs/hashtable.pyx (line 59 in a7e39b8), rehashing should not happen for 1e6 elements, yet it apparently does. Once this bug is fixed, these timings should be revisited.
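For reference, a small sketch of the khash sizing rule this relies on (assuming khash's usual load-factor limit __ac_HASH_UPPER = 0.77):

```c
#include <stdio.h>

int main(void) {
    /* khash rehashes once n_occupied exceeds n_buckets * HASH_UPPER, so a
     * table preallocated with enough buckets is never rehashed while filled */
    const double HASH_UPPER = 0.77;          /* khash's load-factor limit */
    unsigned long n_buckets = 1UL << 21;     /* 2^21 = 2097152 buckets */
    unsigned long upper = (unsigned long)(n_buckets * HASH_UPPER + 0.5);
    /* prints roughly 1.6e6, comfortably above 1e6: no rehash expected */
    printf("upper bound for 2^21 buckets: %lu\n", upper);
    return 0;
}
```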
@jreback whatsnew added.
Not really qualified to comment here. My only question would be whether this should be controlled by an option, since (IIUC) the performance depends on the access pattern. But I haven't followed things closely.
Does this render #8524 unnecessary?
I wasn't aware that pandas' khash isn't the latest version and thus uses double hashing (which makes at least some things I said in the above comments wrong). Sorry for the long text which now follows; I hope it will help us reach a decision. The most important points first:
Here are descriptions of what linear probing, double hashing and quadratic probing do. The costs per element for building a hash map or for a look-up are:
While checking for equality can be costly (strings, PyObjects) and can even dominate, for floats and ints it is almost free - we aren't going to consider these costs.

Linear probing: When we create a hash-table / look up a series of elements which are random, the best strategy is linear probing with a very simple hash function - for integers, let's say hash(x) = x.
The behavior is even better for series like 1,2,3,4,5,...,n<m with a subsequent look-up series 1,2,3,...,n: we access one bucket after another, the prefetcher recognizes the pattern and fetches the memory in advance, thus we only pay the latency of the L1 cache - we are on par with just using an index-vector. Good: we have the performance of a vector for very simple (but probably pretty usual for pandas, let's call them "important") patterns like 1,2,3,...,n, but also the flexibility to handle other series quite efficiently. This is also what the khash comments say (pandas/pandas/_libs/src/klib/khash.h, lines 58 to 60 in a7e39b8).
There are, however, some non-random inputs which make the above strategy degenerate to O(n) per operation.

Single hashing: So we could keep linear probing and take a proper/strong hash-function like murmur2, which (almost) guarantees:
The second property is "good" because having multiple occupied buckets next to each other combined with linear probing is a recipe for O(n) probe chains. The costs are now:
So we paid with a cache miss per element for this robustness. However, we aren't 100% happy, because now for "important" series 1,2,3,4,5,... we pay the same price as for random series:
Double hashing (current state in pandas): The idea is to be optimistic about the data and use a simple hash-function without the "avalanche effect", thus 1 and 2 will land in neighboring buckets once again. But then, if our belief in the goodness of the data is betrayed and we get a collision, we go into panic mode: instead of probing the next bucket, we use a second (more or less strong) hash to calculate a step length and jump somewhere for the next probing; if not successful, we take the same step length and jump again (we must and can(!) ensure that the step is chosen in such a way that all buckets can be probed). Good news: we are on par with a vector for "important" series like 1,2,3,4,.... But there is an additional price for random inputs and crowded hash-maps, for which let's say 50% of look-ups are collisions:
So we have paid an additional price for random inputs. We can conclude:
Quadratic probing: Obviously it pains us to pay this price for random series. Quadratic probing makes another trade-off: unlike linear probing, we probe not neighbors 1,2,3,4,... but neighbors 1,3,7,15,31 and so on. The advantages:
So the cost for random series is:
we are on par with a vector for "important" series and we can handle the bad "occupied neighbors" case in logarithmically many steps (see the probe-step sketches below). Before switching to quadratic probing (as proposed in #8524), one should consider the following:
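To make the three strategies concrete, here are minimal C sketches of the probe-step rules discussed above (illustrative, not khash's exact code; a power-of-two table with mask = n_buckets - 1 is assumed):

```c
#include <stdint.h>

/* all probes start at i = hash & mask; on a collision the next bucket
 * to inspect is computed as follows */

/* linear probing: always inspect the immediate neighbor (cache-friendly,
 * but clusters of occupied buckets are crossed one step at a time) */
static uint32_t linear_next(uint32_t i, uint32_t mask) {
    return (i + 1) & mask;
}

/* quadratic probing as described above: offsets 1, 3, 7, 15, 31, ...
 * from the start bucket, i.e. the gap doubles each time (initialize
 * *gap to 1), so clusters are escaped in logarithmically many steps */
static uint32_t quadratic_next(uint32_t i, uint32_t *gap, uint32_t mask) {
    uint32_t next = (i + *gap) & mask;
    *gap <<= 1;
    return next;
}

/* double hashing (the strategy pandas' khash currently uses): a second
 * hash determines the stride; forcing it odd guarantees every bucket of
 * the power-of-two table is eventually reached */
static uint32_t double_hash_next(uint32_t i, uint32_t hash2, uint32_t mask) {
    return (i + (hash2 | 1)) & mask;
}
```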
What we should try:
@realead thanks for the detailed cases and analysis. looking forward to your experimentation. If possible, can you capture the analysis in a doc / notebook / gist that we can stick somewhere, so the next time we need to look at this there will be enough info. i think a notebook might be the best here.
Sorry it took so long; here is the first part of my analysis/experiments: a plea for a stronger hash. In a nutshell:
While for random series the performance of both the current (weak) hash-function and the proposed murmur2 hash-function is on par, the weak hash-function seems to be an order of magnitude faster, e.g. for some asv-tests. However, the picture looks different when we compare the performance across different sizes (asv-tests with varying n):
The second point is easiest to explain when looking at the Int64 case (see HashEmulator.ipynb.zip for an emulation of the khash-lib behavior): we can see that the elements are inserted into buckets in the following order:
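A minimal C sketch of where this pattern comes from (assuming khash's default 64-bit integer hash and a power-of-two table): for small keys k, the hash reduces to k ^ (k << 11) = 2049*k, so insertion walks the table in strides of 2049.

```c
#include <stdio.h>
#include <stdint.h>

int main(void) {
    /* khash's default 64-bit hash is k>>33 ^ k ^ k<<11 (truncated to
     * 32 bits). For small keys the k>>33 term vanishes and no carries
     * occur, so the hash equals k + 2048*k = 2049*k: buckets advance in
     * strides of 2049, giving "heads" at 0, 2049, 2*2049, ... */
    const uint32_t mask = (1u << 21) - 1;    /* 2^21 buckets */
    for (uint64_t k = 0; k < 6; ++k) {
        uint32_t h = (uint32_t)(k >> 33 ^ k ^ k << 11);
        printf("key %llu -> bucket %u\n", (unsigned long long)k, h & mask);
    }
    return 0;
}
```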
The interesting thing is that when the first row is done, the first bucket in the second row is still in the L2 cache (around 64Kb are needed for ca. 1000 such "heads" at 0, 2049, 2*2049 and so on). However, for 8 times as many elements there are 8 times as many heads, which means 512Kb of cache are used => L3 cache, which has 10 times the latency of the L2 cache.

The bigger problem of such regular behavior of the hash is the pattern of the occupied buckets (1 = occupied bucket, 0 = free bucket): there are 512 occupied buckets, then 512 non-occupied buckets, then 512 occupied buckets and so on. This regular pattern means that there are step sizes for which we will iterate over all (or many) such islands until we can resolve a collision, which leads to superlinear behavior of the hash-table. The situation is even worse for Float64: there will be an occupied bucket-island consisting of 5e5 (out of 2e6) buckets (while the other occupied islands are relatively small) - no wonder there are some bugs/series showing degenerate performance.

As a comparison, occupied buckets with the murmur-hash: there is no pattern and there are no large occupied bucket-islands, thus the collision resolution takes at most around 20 steps and is done after 2 steps on average.

This PR currently proposes using the murmur2-hash for float64, both as primary and as secondary hash-function. The performance for this special case is somewhat worse:
i.e.:
but it is not that much worse. On the other hand, the advantage is that the danger of superlinear behavior is much smaller. Right now pandas' khash version uses the most robust probing strategy: double hashing. Using murmur2 as the secondary hash will improve the robustness even more. However, only with a stronger first hash can more performant but less robust probing strategies (such as quadratic and linear) be taken into consideration (which probably should not be part of this PR). Another option is to only replace the second hash with murmur2, which would add robustness and be less intrusive than the proposal of this PR (on the other hand, this PR's proposal gives more safety that the performance will not take a hit for some more or less "usual" series); the results of the run for only-second-hash-murmur2 can be found at the end. Here are the asv-bench timings for this PR (strong first and second hashes):
Only second-hash is strong (murmur2):
Force-pushed from c852b2e to efae65c
Now the comparison of different probing strategies:
The different probing strategies are interesting for cases where comparisons are cheap (thus cache misses play a role), e.g. float64/int64/uint64. For heavier types (PyObject, strings) they only play a role when there are problems with robustness, in such a way that a look-up costs more than O(1) on average.

Linear probing: Linear probing will be really bad for types with a bad first (and only) hash-function, at least for some series. This is what we see (see the whole comparison at the end of the comment):
Quadratic probing: Supposed to be better than linear probing, as it is more robust. This is also what we can observe (all results at the end):
Combined probing: Theoretically the best of both worlds: as robust as double hashing but with as few cache misses as quadratic probing. It looks as expected: there are some cases with about 10% slow-down but many more cases with speed-ups, comparable with the speed-ups from quadratic probing. Also, for weak hashes nothing bad happens. The way it is implemented, combined probing can be switched off/on type-wise (it is only on for float64/(u)int64; in tests for PyObject no noteworthy changes were seen with combined probing). Here are all the comparisons for combined probing:
Other comparisons: Linear probing:
Quadratic probing:
The above long investigation in a nutshell:
I think a stronger hash would be good for float64 (this PR) and PyObject (not (yet) this PR). Probably, for the sake of consistency, it should also be done for (u)int64 (the impact is not yet investigated). This trade-off, however, is not as clear-cut as the first.
Now, it is your call...
can you comment on the implementation complexity / maintainability of these strategies as you have outlined? meaning, assuming they are well-documented, is any one significantly more complex than the others? e.g. for the next thing we need to investigate, how approachable is the code (of course we will have this PR for reference as well). or are they all 'similar' in code complexity?
sorry, the formatting of a previous comment went wrong; here are the needed code changes for different probing strategies:
I would say not really much is going on code-complexity-wise: linear probing and quadratic probing are slightly simpler than the current state; the combined approach adds a little bit of complexity, as there is an additional parameter (how many linear steps), but nothing crucial. Using stronger hash functions for PyObject/(u)int64 would look a lot like this change: 4e89fce#diff-7ac30c345bd6d38838a46337e4c6b5b6feae3e1fd5aea54b5d9e37d20054edf5R29
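For illustration, a sketch of how the combined strategy could look (my reading of the description above, not the actual patch; LINEAR_STEPS is the additional tuning parameter, set here to 3 as mentioned further down in the thread):

```c
#include <stdint.h>

/* combined probing: a few linear steps first, to stay within the cache
 * line, then double-hashing jumps for robustness */
enum { LINEAR_STEPS = 3 };   /* assumed value of the extra parameter */

static uint32_t combined_next(uint32_t i, uint32_t attempt,
                              uint32_t hash2, uint32_t mask) {
    if (attempt < LINEAR_STEPS)
        return (i + 1) & mask;        /* cheap: likely the same cache line */
    return (i + (hash2 | 1)) & mask;  /* odd stride reaches every bucket */
}
```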
Force-pushed from efae65c to dad1ede
lgtm, some doc requests.
@jreback docs are updated. @jbrockmendel @TomAugspurger bump:) In the meantime I took a look at the implementation of CPython's set/dict probing. I have ported their approach to khash (realead@160fcf3): it would solve the issue at hand (#28303) and seems to do better on average than this PR (see the timings at the end of this post). If you like, I could submit the other approach instead of this PR. Overview of pros and cons:
Here is a selection of the most important/relevant (imo) insights for khash and this PR:

1. Quote: "Use cases for sets differ considerably from dictionaries where looked-up keys are more likely to be present. In contrast, sets are primarily about membership testing where the presence of an element is not known in advance. Accordingly, the set implementation needs to optimize for both the found and not-found case."

After reading this, a shortcoming of the current use of khash becomes obvious: the same hash-table is used as a dictionary (Index) and as a set (e.g. isin). This is also the experience I've made: optimizing for the look-up in the case of "key is contained" hurts the performance of the case "key is not contained" and vice versa. For the "key is contained" case, the most important part is not to have hash-collisions - having long chains of occupied buckets isn't a problem at all if there are no collisions. For the "key is not contained" case it is better to take some collisions but not to have long chains of occupied buckets, which can mean catastrophic performance degradation.
Why is it better than double hashing? Double hashing has a problem: if the second hash (modulo the size of the table) is 1, 2, 3, 4 or another small number, it takes many steps to traverse a large chain of occupied buckets. The chances for such a bad hash aren't very high (smaller than 0.1% for 10^6 elements), but when it happens it really hurts performance. Due to the perturbation with the higher bits of the hash, CPython's probing avoids getting stuck in this way.

Why is it better than quadratic probing? The problem with quadratic probing is that it doesn't take the higher bits of the hash into consideration, thus making the hash weaker than it is and leading to more hash-collisions.

Why is it better than combined probing? Theoretically, utilizing the cache better should be a good thing; however, the CPython people found that this actually hurts the performance for the scenarios in which maps are used. I have made the same experience (to my great surprise!); this is the reason why in my "combined probing" implementation I had used only 3 linear steps - after that, the performance in map-scenarios started to deteriorate.
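For reference, the probing recurrence CPython uses (adapted from CPython's dictobject.c; the naming here is mine):

```c
#include <stddef.h>

#define PERTURB_SHIFT 5

/* CPython-style probing: j = 5*j + 1 + perturb, with perturb shifted
 * down each step. 5*j + 1 (mod 2^k) alone already visits every bucket;
 * the perturb term mixes the higher hash bits into the early probes,
 * which is what avoids the degenerate chains discussed above. */
static size_t next_probe(size_t j, size_t *perturb, size_t mask) {
    *perturb >>= PERTURB_SHIFT;
    return (5 * j + 1 + *perturb) & mask;
}

/* usage: j = hash & mask; perturb = (size_t)hash; call next_probe on
 * each collision until an empty or matching bucket is found */
```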
Timings CPython's probing (realead@160fcf3, after) vs this PR (before):
Timings CPython's probing (realead@160fcf3, after) vs master (before):
@realead if you can rebase. These are the last entries on cpython vs your impl.
are these just degenerate?
needs rebase
Force-pushed from 11d9731 to ac16c1e
I cannot really explain these results; when I rerun the tests I get:
However, the "bad" measurements aren't just a fluke in my opinion. A possible explanation is: This PR is less "gentle" with cache than the current state or CPython's approach, thus more sensitive if some background process (my testing machine is a VM sharing hardware with others) claims some part of the cache. This shows why the CPython-approach is tempting: it leaves the memory-access-pattern intact while being more robust for corner cases (because it takes the higher bits of the hash into account while the current implementations doesn't) on the other hand this PR offers more safety against degeneration of the performance, but is less performant for some special series like 1,2,3,4,... However, while CPython's approach makes most sense for the usage of dict's: they probably aren't used very often for more than 1e6 elements, and for 1e6-elements the running time e.g. of |
i am fine merging this with the current approach, small comment, pls merge master and ping on green.
…t unique/stable (obviously depends on hash-values of keys), ensure unique ordering
Force-pushed from 824869f to e743c79
It is failing, but this seems to be unrelated to my changes (TestRegistration.test_pandas_plots_register should be unaffected by them).
Sorry, I don't understand what you are referring to...
that's affecting all builds, don't worry about it.
Best guess is @jreback was referring to the "are these just degenerate" question, which I think you answered.
Unfortunately, it isn't unusual for asv runs to include a lot of noise. I usually do multiple runs and successively narrow down consistent-looking results. That said, the results usually don't look bimodal, which these do if you squint.
@@ -13,25 +13,31 @@
 // is 64 bits the truncation causes collission issues. Given all that, we use our own
 // simple hash, viewing the double bytes as an int64 and using khash's default
 // hash for 64 bit integers.
-// GH 13436
+// GH 13436 showed that _Py_HashDouble doesn't work well with khash
+// GH 28303 showed, that the simple xoring-version isn't good enough
can you add a tiny bit more content here on why this approach vs the CPython approach.
this was a late comment, pls update in a follow-on.
black pandas
git diff upstream/master -u -- "*.py" | flake8 --diff
The currently used hash-function can lead to many collisions (see #28303 or this comment) for series like 0.0, 1.0, 2.0, ... n.
This PR uses a specialization (for a single double value) of the murmur2-hash, which is used in libstdc++ and libc++ and is more or less state of the art.
An alternative would be to use Python's _Py_HashDouble, but because it has the property hash(1.0)=1, hash(2.0)=2 and so on, there is no desirable avalanche effect. This is not an issue for Python's dict implementation, but problematic for khash, as it uses a different strategy for collision handling. See #13436, where _Py_HashDouble was replaced by the simple hash-function used until now.
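For illustration, a sketch of a murmur2 specialization for one 64-bit value, close to what this PR adds for float64 keys (the double's bits are fed in as two 32-bit halves; the seed value is an assumption, the mixing constants are the standard murmur2 ones):

```c
#include <stdint.h>

/* murmur2 specialized to a single 64-bit input, split into two 32-bit
 * halves; each half is mixed with the standard murmur2 constants and the
 * final avalanche spreads every input bit across the output */
static uint32_t murmur2_64to32(uint64_t k) {
    const uint32_t SEED = 0xc70f6907u;   /* assumed seed */
    const uint32_t M = 0x5bd1e995u;      /* murmur2 multiplier */
    const int R = 24;                    /* murmur2 rotation */
    uint32_t k1 = (uint32_t)(k >> 32), k2 = (uint32_t)k;
    uint32_t h = SEED;

    k1 *= M; k1 ^= k1 >> R; k1 *= M;     /* mix high half */
    h *= M; h ^= k1;
    k2 *= M; k2 ^= k2 >> R; k2 *= M;     /* mix low half */
    h *= M; h ^= k2;

    h ^= h >> 13; h *= M; h ^= h >> 15;  /* final avalanche */
    return h;
}
```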