-
-
Notifications
You must be signed in to change notification settings - Fork 1.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Extremely slow and inefficient hash function for openArray[byte] for 64-bit systems #23678
Comments
I copied an implementation of MurmurHash3_x64_128 from @timotheecour In https://github.com/PeterScott/murmur3
@arnetheduck Would you mind checking the performance of |
The 32-bit output is even more an issue than the speed, IMO, as it is surprising. There was originally some concern about having the string hash values from the C/C++ backends match the hash values for the JavaScript backend. The minimum version of NodeJs/js/ECMAscript/whatever and in particular the BigInt support may have evolved since the last time this came up, but I imagine matching still matters? In timing string hash functions for table lookup, per-hash time is often neglected (as shows up even in @ringabout's quote of PeterScott just above!) but keys in Table are often short. This is probably not exhaustive, but you want to at least 1) stay in L1 to avoid confusing CPU/system memory bandwidth interaction effects more sensitive to broader workloads with hashing perf., 2) being sure to measure both per-byte and per-hash calls, 3) do statistical error analysis so when comparing any two you have error bars, 4) measure on multiple generations of backend CPUs, not just whatever your laptop/desktop/whatever is because results vary a lot. One way to do all of the above is loop over substrings of each length of the same string that fits well within L1 in several rounds, taking the minimum times across rounds to filter out system noise and dynamic CPU frequency scaling effects1 and then doing a linear regression on a The most basic reason to use time not throughput is just that "time is additive". "kicks of noise" (visually like Brownian motions of atoms hitting pollen under a microscope seen by Brown) are added in. Throughput numbers/"reciprocal space" add fuel to an already confusing noise fire. Another reason why the In terms of concrete string hasher recommendations, for a long while now I have been patching my Nim installs to A better recommendation might be Google's Farm Hash which does quite well with C backends of the past 5..8-ish years and has better stress tests of diverse deployments at Google. They have decent specializations to 16-32-64- byte strings. So, a 64-bit output farm hash would probably be a good default and my primary recommendation, but as noted whatever you do need both JavaScript and NimVM attention as well as Nim with pointers. As just one deployment example/datapoint (non-PGO gcc, C impls), I can offer on an i7-6700k with times in clock cycles assuming 4.7GHz:
but the fastest of these either have poor distribution (fold is just xor'd 8byte words on top of each other) or use specific-architectural speed-ups like Intel SIMD (Dan Lemire's CL) or gx (SIMD & AES when avail, etc.). gcc used to be worse at compiling farm, though - WY was much faster back in 2017 and now is slower. EDIT: I emailed @ringabout some C farm hash code to evaluate for porting to Nim/Js/NimVM, but it should be noted performance may vary a lot across Nim ports of C codes unless you are shipping assembly. That also applies to just taking PeterScott's analysis of his C impl. Case in point, "hash_mmr" above is some old variant of Murmur. Footnotes
|
It seems Nim's runtime checks doesn't affect the speed of murmurhash so much. nim c -r -d:release benchstrhash.nim
nim c -r -d:danger benchstrhash.nim
Benchmarked with latest devel Nim and GCC 13.2.1. I changed murmurHash code so that backend C compiler can remove a some of Nim's runtime checks: nim c -r -d:release benchstrhash.nim
nim c -r -d:danger benchstrhash.nim
It slightly improved the performance of murmurhash for small size input. |
This table shows hash sizes and probability of random collision: In the case of using hash for
|
Thanks for the contribution, @demotomohiro. I don't think anyone disputes that a limited range of output A maybe interesting alternative application area for a Anyway, more on-point, I have ported that Farm hash to pure Nim that works in the NimVM, JavaScript Backend, and C/C++ backends producing the same values as a C variant that passed all the SMHasher distribution tests. It will never be as fast as things leveraging CPU-specific features, but it uses 64 byte limbs and special cases strings shorter than that to be loop-less and should be a worthwhile improvement. I can only benchmark on bare metal on about 3 or 4 CPU arch's and gcc-13 & clang-17 and might ask @demotomohiro & others to run a measurement program on Looong & maybe ARM/some other arch's before we really decide to go that way or other aspects of the implementation such as whether to |
So, with the below Nim program (and It's just 2 CPUs and (on purpose) almost a "maximally micro"-benchmark with the required caveats that these days CPU resources themselves are a complex network of caches & queues so "costs in context" can be very different than isolated costs. You can clearly see the step functions from the loop structure/structure of the hash function and can also "ball park" the cost as 1/4 to 3/4 CPU cycles/byte which at ~5 GHz is roughly 20 to 7 GB/s. At least for me on Nim-devel it gives the same hash values under the JS and under the As mentioned, one thing that is kind of nice about this one is that it doesn't rely on any CPU specific features for its speed (which of course throttles it in some other senses). What I would recommend is that @demotomohiro runs it on his Looong and someone else runs it on an RPi ARM and someone else runs it on Apple Silicon ARM and yet third/fourth person runs it on AWS or Google Cloud ARM (or maybe some one lucky person generous with their time and curiosity has access to all of those with both gcc & clang?). type B = char # uint8 byte ..?
const k0 = 0xc3a5c85c97cb3127u64 # Primes on (2^63, 2^64) for various uses
const k1 = 0xb492b66fbe98f273u64
const k2 = 0x9ae16a3b2f90404fu64
proc load4e(s: openArray[B], o=0): uint32 {.inline.} =
((uint32(s[o + 3]) shl 24) or (uint32(s[o + 2]) shl 16) or
(uint32(s[o + 1]) shl 8) or uint32(s[o + 0]))
proc load8e(s: openArray[B], o=0): uint64 {.inline.} =
((uint64(s[o + 7]) shl 56) or (uint64(s[o + 6]) shl 48) or
(uint64(s[o + 5]) shl 40) or (uint64(s[o + 4]) shl 32) or
(uint64(s[o + 3]) shl 24) or (uint64(s[o + 2]) shl 16) or
(uint64(s[o + 1]) shl 8) or uint64(s[o + 0]))
proc load4(s: openArray[B], o=0): uint32 {.inline.} =
when nimvm: load4e(s, o)
else:
when defined(js): load4e(s, o)
else: copyMem result.addr, s[o].addr, result.sizeof
proc load8(s: openArray[B], o=0): uint64 {.inline.} =
when nimvm: load8e(s, o)
else:
when defined(js): load8e(s, o)
else: copyMem result.addr, s[o].addr, result.sizeof
proc lenU(s: openArray[B]): uint64 {.inline.} = s.len.uint64
proc shiftMix(v: uint64): uint64 {.inline.} = v xor (v shr 47)
proc rotR(v: uint64; bits: cint): uint64 {.inline.} =
(v shr bits) or (v shl (64 - bits))
proc len16(u: uint64; v: uint64; mul: uint64): uint64 {.inline.} =
var a = (u xor v)*mul
a = a xor (a shr 47)
var b = (v xor a)*mul
b = b xor (b shr 47)
b*mul
proc len0_16(s: openArray[B]): uint64 {.inline.} =
if s.len >= 8:
let mul = k2 + 2*s.lenU
let a = load8(s) + k2
let b = load8(s, s.len - 8)
let c = rotR(b, 37)*mul + a
let d = (rotR(a, 25) + b)*mul
len16 c, d, mul
elif s.len >= 4:
let mul = k2 + 2*s.lenU
let a = load4(s).uint64
len16 s.lenU + (a shl 3), load4(s, s.len - 4), mul
elif s.len > 0:
let a = uint32(s[0])
let b = uint32(s[s.len shr 1])
let c = uint32(s[s.len - 1])
let y = a + (b shl 8)
let z = s.lenU + (c shl 2)
shiftMix(y*k2 xor z*k0)*k2
else: k2 # s.len == 0
proc len17_32(s: openArray[B]): uint64 {.inline.} =
let mul = k2 + 2*s.lenU
let a = load8(s)*k1
let b = load8(s, 8)
let c = load8(s, s.len - 8)*mul
let d = load8(s, s.len - 16)*k2
len16 rotR(a + b, 43) + rotR(c, 30) + d, a + rotR(b + k2, 18) + c, mul
proc len33_64(s: openArray[B]): uint64 {.inline.} =
let mul = k2 + 2*s.lenU
let a = load8(s)*k2
let b = load8(s, 8)
let c = load8(s, s.len - 8)*mul
let d = load8(s, s.len - 16)*k2
let y = rotR(a + b, 43) + rotR(c, 30) + d
let z = len16(y, a + rotR(b + k2, 18) + c, mul)
let e = load8(s, 16)*mul
let f = load8(s, 24)
let g = (y + load8(s, s.len - 32))*mul
let h = (z + load8(s, s.len - 24))*mul
len16 rotR(e + f, 43) + rotR(g, 30) + h, e + rotR(f + a, 18) + g, mul
type Pair = tuple[first, second: uint64]
proc weakLen32withSeeds2(w, x, y, z, a, b: uint64): Pair {.inline.} =
var a = a + w
var b = rotR(b + a + z, 21)
let c = a
a += x
a += y
b += rotR(a, 44)
result[0] = a + z
result[1] = b + c
proc weakLen32withSeeds(s: openArray[B]; o: int; a,b: uint64): Pair {.inline.} =
weakLen32withSeeds2 load8(s, o ), load8(s, o + 8),
load8(s, o + 16), load8(s, o + 24), a, b
proc hashFarm*(s: openArray[B]): uint64 {.inline.} =
if s.len <= 16: return len0_16(s)
if s.len <= 32: return len17_32(s)
if s.len <= 64: return len33_64(s)
const seed = 81u64 # not const to use input `h`
var
o = 0 # s[] ptr arith -> variable origin variable `o`
x = seed
y = seed*k1 + 113
z = shiftMix(y*k2 + 113)*k2
v, w: Pair
x = x*k2 + load8(s)
let eos = ((s.len - 1) div 64)*64
let last64 = eos + ((s.len - 1) and 63) - 63
while true:
x = rotR(x + y + v[0] + load8(s, o+8), 37)*k1
y = rotR(y + v[1] + load8(s, o+48), 42)*k1
x = x xor w[1]
y += v[0] + load8(s, o+40)
z = rotR(z + w[0], 33)*k1
v = weakLen32withSeeds(s, o+0 , v[1]*k1, x + w[0])
w = weakLen32withSeeds(s, o+32, z + w[1], y + load8(s, o+16))
swap z, x
inc o, 64
if o == eos: break
let mul = k1 + ((z and 0xff) shl 1)
o = last64
w[0] += (s.lenU - 1) and 63
v[0] += w[0]
w[0] += v[0]
x = rotR(x + y + v[0] + load8(s, o+8), 37)*mul
y = rotR(y + v[1] + load8(s, o+48), 42)*mul
x = x xor w[1]*9
y += v[0]*9 + load8(s, o+40)
z = rotR(z + w[0], 33)*mul
v = weakLen32withSeeds(s, o+0 , v[1]*mul, x + w[0])
w = weakLen32withSeeds(s, o+32, z + w[1], y + load8(s, o+16))
swap z, x
len16 len16(v[0],w[0],mul) + shiftMix(y)*k0 + z, len16(v[1],w[1],mul) + x, mul
type Hash = int64
proc hash*(x: openArray[B]): Hash {.inline.} = x.hashFarm.Hash
when isMainModule:
when defined testValues:
import std/os
for s in commandLineParams(): echo s.hash,"\t",s.len,"\t",s
elif defined ubench:
import std/[times, monotimes, strformat]
proc hashTm(reps=100, mins=30, offs=0..0, lens=1..128, ghz=4.6) =
## `farm | fitl -cn -cb -b999 1 2 3` -> nsec = perByte*bytes + perHash
## time formula with bootstrapped error bars on the two coefficients.
## CHECK `reps` ALTERS `dt` *TO ENSURE COMPILER DIDN'T DO LOOP->MULTIPLY*.
var s = "" # Hashes fast relative to mem xfer=>stay in L1
for i in 0 .. (offs.b + lens.b): s.add chr(ord('A') + i mod 64)
for n in lens: # Can batch work => study many lens
for o in offs: # Perf can vary across align => study|avg over
let p = cast[ptr UncheckedArray[char]](s[o].addr)
var h {.volatile.}: uint64
var dt = int64.high # min(mins trials) hides CPU ramp up delay..
for trial in 1..mins: #..or interrupts/competing load/etc.
let t0 = getMonoTime()
for r in 1..reps: # Hash fast relative to tm qry => Loop
h += hashFarm(toOpenArray[char](p, 0, n - 1))
dt = min(dt, (getMonoTime() - t0).inNanoseconds)
echo &"{dt.float*ghz/reps.float:.1f} {reps*n} {reps} {n} {o} {h}"
import cligen; dispatch hashTm, help={
"reps": "Repeat Loops for dt", "mins": "num Loops to min(dt) over",
"offs": "offsets to do", "lens": "string lens", "ghz": "cvt ns -> cycles"}
elif B is char:
const x = hash("1234"); echo x # 882600748797058222
let y = hash("1234"); echo y # 882600748797058222 For longer strings the step function more or less continues at 64B boundaries. |
We've moved on since to a different hash function and setup entirely so testing this one would be quite involved. That said, anything is better than the current one, specially if it fixes the 32-bit bug ;) @c-blake looks like he's on top of better measuring performance over a more diverse input range already though - the most important thing is to establish a minimally correct baseline which means that as long as we're getting a full 64-bit range out of it, it's already going to be a massive improvement which should be merged asap. Once that's done, one can investigate more modern options such as https://github.com/ogxd/gxhash/blob/main/article/article.pdf
The specifics of On this note, it's also quite unfortunate that it's an |
I think all agree what is most a problem is the counterintuitive restriction of The loop-less structure for <64B keys (by far the most "meta-common" key sets in my own experiences) alone should make this always faster or at least about the same. For large keys doing 64B/512b whole L1 cache lines at a time should be better from that. I can post a 3-way comparison program here people can run on their favorite backends / CPUs with PGO/not/etc. if there is interest. I should also say, I had taken the @arnetheduck "20-byte key of uniform random distribution" as "merely an example", but IF the keys are already close to pseudorandom bits then you can also cast the first|last 4|8 bytes to a Also, the |
Well, for the performance curious, since this issue did start with performance complaint, a Nim built with the referenced PR, and this little cligen utility program: import std/[times, monotimes, strformat], std/hashes {.all.}
type Fun = enum orig, murmur, farm
proc hashTm(reps=100, mins=50, offs=0..0, lens=1..128, ghz=4.6, funs: seq[Fun])=
## `hashTm | fitl -cn -cb -b999 1 2 3` -> nsec = perByte*bytes + perHash
## time formula with bootstrapped error bars on the two coefficients.
## CHECK `reps` ALTERS `dt` *TO ENSURE COMPILER DIDN'T DO LOOP->MULTIPLY*.
let funs = if funs.len > 0: funs else: @[orig, murmur, farm]
var s = "" # Hashes fast relative to mem xfer=>stay in L1
for i in 0 .. (offs.b + lens.b): s.add chr(ord('A') + i mod 64)
for fun in funs:
for n in lens: # Can batch work => study many lens
for o in offs: # Perf can vary across aligns => study|avg over
let p = cast[ptr UncheckedArray[byte]](s[o].addr)
var h {.volatile.}: uint64
var dt = int64.high # min(mins trials) hides CPU ramp up delay..
for trial in 1..mins: #..or interrupts/competing load/etc.
template doTm(expr) =
let t0 = getMonoTime()
for r in 1..reps: h += expr # Hash fast relative to tm qry => Loop
dt = min(dt, (getMonoTime() - t0).inNanoseconds)
case fun
of orig : doTm cast[uint64](hashData(p, n))
of murmur: doTm cast[uint64](murmurHash(toOpenArray[byte](p, 0, n - 1)))
of farm : doTm hashFarm(toOpenArray[byte](p, 0, n - 1))
echo &"{dt.float*ghz/reps.float:.1f} {reps*n} {reps} {n} {o} {h}"
import cligen; dispatch hashTm, help={
"reps": "Repeat Loops for dt", "mins": "num Loops to min(dt) over",
"offs": "offsets to do", "lens": "string lens", "ghz": "cvt ns -> cycles",
"funs": "hash funs to time"} this little driver script #!/bin/sh
b="chrt 99 taskset 0xC env -i CLIGEN=/n PATH=$PATH"
cpuTag="$1"
xtra="$2"
nim c --cc:gcc -d:danger tmHash && 0 $b ./tmHash $xtra > ${cpuTag}g
nim c --cc:clang -d:danger tmHash && 0 $b ./tmHash $xtra > ${cpuTag}c
# These are two PGO driver scripts I have around; Roll your own.
nim-pgo tmHash './tmHash>/n' --mm:arc && 0 $b ./tmHash $xtra > ${cpuTag}gPG
nim-pgoc tmHash './tmHash>/n' --mm:arc && 0 $b ./tmHash $xtra > ${cpuTag}cPG so that after #!/bin/sh
cpuT0=$1
cpuT1=$2
cat > doit.gpi <<EOF
set term png size 1920,1080 font "Helvetica,10"
set output "/t/3NimHashes${cpuT1}.png"
set yrange [0:156]
set ytics 10
set xtics 16
set grid
set key bottom right
plot "${cpuT0}c" u 4:1 t "${cpuT1} clang", \
"${cpuT0}g" u 4:1 t "${cpuT1} gcc", \
"${cpuT0}cPG" u 4:1 t "${cpuT1} clangPGO", \
"${cpuT0}gPG" u 4:1 t "${cpuT1} gccPGO"
EOF
gnuplot doit.gpi which I also did for a SkyLake with a You can see all 3 Nim string hashes present after the PR as clearly separated clusters of 4 compilation modes (2 compilers * (pgo,not)). The improved performance of @narimiran 's murmur over the original Anyway, I realize that it's only 2 CPUs, but I tried to "show my work" enough that anyone interested could do their own analysis on their own CPUs of interest. I do think that just reading the functions is an ok "first order" explanation of the performance differences (byte-at-a-time for the slowest, word-at-a-time for the middle, cache-line-at-a-time for the fastest). Perf tracking one's intuitions is always a good cross-check. |
@c-blake It output each results to separate files and measure times with 1~256 bytes input. import std/[times, monotimes, strformat], std/hashes {.all.}
import MurmurHash3_x64_128
proc hashTm(reps=100, mins=50, offs=0..0, lens=1..256, ghz=4.6)=
## `hashTm | fitl -cn -cb -b999 1 2 3` -> nsec = perByte*bytes + perHash
## time formula with bootstrapped error bars on the two coefficients.
## CHECK `reps` ALTERS `dt` *TO ENSURE COMPILER DIDN'T DO LOOP->MULTIPLY*.
var s = "" # Hashes fast relative to mem xfer=>stay in L1
for i in 0 .. (offs.b + lens.b): s.add chr(ord('A') + i mod 64)
for n in lens: # Can batch work => study many lens
for o in offs: # Perf can vary across aligns => study|avg over
let p = cast[ptr UncheckedArray[byte]](s[o].addr)
var h {.volatile.}: uint64
var dt = int64.high # min(mins trials) hides CPU ramp up delay..
for trial in 1..mins: #..or interrupts/competing load/etc.
let t0 = getMonoTime()
for r in 1..reps:
h += (when defined(orig):
cast[uint64](hashData(p, n))
elif defined(murmur):
cast[uint64](murmurHash(toOpenArray[byte](p, 0, n - 1)))
elif defined(farm):
hashFarm(toOpenArray[byte](p, 0, n - 1))
elif defined(murmur3x64128):
cast[uint64](MurmurHash3_x64_128.hash3(toOpenArray[byte](p, 0, n - 1)))
else:
{.error: "No define".}
0)
dt = min(dt, (getMonoTime() - t0).inNanoseconds)
echo &"{dt.float*ghz/reps.float:.1f} {reps*n} {reps} {n} {o} {h}"
import cligen; dispatch hashTm, help={
"reps": "Repeat Loops for dt", "mins": "num Loops to min(dt) over",
"offs": "offsets to do", "lens": "string lens", "ghz": "cvt ns -> cycles"} run-bench.sh:
Extended yrange so that I can see the result of Raspberry Pi 3.
I ran on my Ivy Bridge CPU: ./run-bench.sh Ivy -g2.1
sh plot.sh Ivy ivy On Raspberry Pi 3 (Compiled with latest successful build for branch devel in Nim Nightly) ./run-bench.sh RP3 -g1.2
sh plot.sh RP3 rp3 Farm hash is fastest on both CPU and it seems MurmurHash3_x64_128 has performance problems in the loop that hashes remaining bytes. I have a question in your benchmark code (tmHash.nim): let t0 = getMonoTime()
for r in 1..reps: h += expr # Hash fast relative to tm qry => Loop
dt = min(dt, (getMonoTime() - t0).inNanoseconds) In this most inner loop, the hash function is repeatedly called with the same input. |
First, thank you for all your hard work, showing your work and your plots that help confirm my recommendation for Farm hash! (Showing some improvement even for 16 Byte strings and much for 32B, as it should be intuitively, but it is best to confirm intuition with measurement when not too hard!). You had mentioned using a Looong CPU, right? Might be nice to see that, too or at least hear the confirmation in words. I also confirmed the basic recommendations/stylized results on an AMD Zen2. My PR adding that under nimPreviewFarmHash was already merged, but I guess there is some (unexpected by me) openness to fast-tracking this to be the default for Nim-2.2. So, maybe the Second, Re: your question:
In short, yes. A bit more detail - the |
@c-blake, thank you for your reply, great benchmark code and added fast hash function to Nim's stdlib! Ivy Bridge CPU is the same CPU I used to run the benchmark code in my first post on this issue.
I just put extra 'o' to 'Long' just because the input string size is large.
Changing for r in 1..reps:
h += expr
s[o] = r.char But it adds small noise to the measured time. |
Sorry. I guess the CPU is only 2 'o' Loong anyway with a "-son" suffix (often) and I didn't read your code carefully enough. 🤦 Re: |
Fixed with #23793 |
I understand this issue has been closed, but did you consider using rapidhash which is considered |
I find SMHasher's approach to performance..not useful. I have no idea how many people its perf analysis has led astray, really, but I also have no illusions that comments on a stdlib issue for a niche prog.lang will sway it at all. Better approaches with explicit code are mentioned several times in several ways in the above thread. Anyway, I did not port this one to Nim, but did measure the C impl to address your question within my nicer measurement style { that shows "clean-ish" ledges/cliffs/fine structure matching one's intuition about CPU performance even when run by people other than me }, old gcc-13 results had farm already { using @Vindaar's measuremancer }
New gcc-14 results have farm also beating this new rapid hash for both zero length and in average time/bytes slope on 2 diff CPUs, the newer one compiled two different ways:
So, I don't know what compiler / CPUs rapidhash was developed / tested on. Maybe AMD and the readme mentions AArch64. I didn't test those, but it's dubious to me that it's really faster than Farm by enough to be "very worthwhile" given that it's already so much slower on Intel { a full Even if rapidhash were faster on some CPU, I doubt it would be "faster by more than it's slower on Intel", just shown to be in the 1.10x to 1.40x range. So, to move to rapidhash, the stdlib would have to develop "CPU favoritism" since I don't think Nim-core wants That said, to the extent it is not "free", I think the next "performance cliff" meaningfully faster than Farm hash with modern CPUs and modern compilers is much more CPU-specific SIMD techniques like GX hash as already alluded to three earlier times in this thread. By its very nature, though, it will be challenging to get such hashes good distribution & good performance on all relevant backends. |
Description
The murmurhash implemention uses 32-bit limbs (on 64-bit systems) and inserts range checking inside the loop causing it to perform extremely poorly due to branching and inability to optimize the loop for example by unrolling it.
In addition, it returns a 32-bit value for a 64-bit hash effectively biasing the hash function to half of the number space which has the potential to double collisions depending on how it's used.
Nim Version
77c0409
Current Output
No response
Expected Output
No response
Possible Solution
No response
Additional Information
No response
The text was updated successfully, but these errors were encountered: