Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

How smart should Nim sets & tables be? #201

Closed
c-blake opened this issue Mar 16, 2020 · 6 comments
Closed

How smart should Nim sets & tables be? #201

c-blake opened this issue Mar 16, 2020 · 6 comments

Comments

@c-blake
Copy link

c-blake commented Mar 16, 2020

So, I just put up a sort of a sketch of what I've been trying to say abstractly at https://github.com/c-blake/adix in response to a variety of changes @timotheecour has been making to the stdlib tables, e.g. nim-lang/Nim#13440 and nim-lang/Nim#13393 and nim-lang/Nim#13418

It is pretty drop-in compatible with the stdlib sets & tables and I tried to keep the internal names similar as well. I just added a few generic procs for the CountTable functionality and sorting does not yet reconstitute the index.

I didn't commit all my test programs, but I have been using the primary linear probing with Robin Hood re-org for the past week or two as a replacement for the stdlib tables (via that metab.nim module, short for metaTab, that makes compile-time switching easy). I'm not sure I'm happy with a lot of things about this code -- the getKey/setKey, tables written as wrappers around sets, various things in triplicate instead of template bodies, etc.

Nevertheless, it does show concretely how we could do both unordered and ordered tables with just linear probing with run-time per-table optional robin hood hashing and also run-time per-table hash()-rehashing with the latter two modes "automatically activated" on overdeep probes on sparse tables. It also shows another instance of the current algorithms (I believe cleaner via the probeSeq iterator). The weak hash mitigation is mostly checked for in a proc tooFull in the various ??set.nim implementations.

From a mathematical property point of view, I believe the non-tombstone variants let users who know what they are doing suffer no performance penalty in time or space on delete heavy workloads while at the same time automatically adapting for users that are blissfully unaware of hash table performance pitfalls - up to a a point. { Beyond the point we can adapt to, the user really needs to provide inequality element comparison since there just isn't enough diversity in the output of the hash() they're using (by definition of the "up to a point". As far as I can tell that inequality requirement is ineliminable. } I sort of feel like it's almost a theorem that this is a better approach for tables of unknown workloads (unless you want to force performance sensitive people to do their own table impls or use this library). Even so, a broader conversation than just comments on pull requests seemed warranted. Hence this RFC issue.

Beyond the above virtues, probe depth triggered resize (in any of the variants) has a bonus that if the user does have a great hash function then they get to enjoy near 100% space utilization. For example, with my current hash-the-output of hash via this multiply-rotate gizmo, the running example of only high bits varying (like what started this whole arc of development) becomes almost a perfect hash table with Robin Hood and rehash both activated.

In terms of performance, usually Robin Hood wins big when you have a "miss heavy" workload - e.g., building a pair of sets and then doing an intersection of them where the result is mostly empty. For a more "hit heavy" workload like a histogram where the counts average much greater than 1, vanilla linear probing is usually best. I don't believe there is a single algorithmic solution that is fastest for all workloads, though. Timing things simply in hot loops can mislead, especially leading one astray generalizing to not perfectly oiled cache prefetching situations.

Personally, I think all the recent churn in table stuff could be profitably just unwound and replaced with advice to use such a better integer hash function, possibly providing a warning from the run-time to guide them and then possibly a few alternative hashes in the standard lib under different names from just hash so users can go proc hash(x: int)=hash2(x) to activate them. Then we could add hash3, hash4 (or give them better names). The althash module in adix is sort of like that.

Anyway, there are also some NOTES.md and TODO.md and doc comments at the tops of modules.

@c-blake
Copy link
Author

c-blake commented Mar 17, 2020

Just as one concrete example, this program:

import metab
var one = initSet[int]()
for i in 0 ..< (1 shl 23):
  one.incl (i shl 25)
var ds = one.depths
echo ds
echo one.len, "/", one.getCap
echo "Stats: ", one.depthStats

produces this output (RobinHood mode on by default but with no hash() output rehashing on by default as per current vsn of adix):

LPSet: Weak hash/too many dups(d=7); Adapting by re-hashing hash()
@[8388607, 1]
8388608/16777216
Stats: (m1: 1.192092895507812e-07, m2: 1.192092895507812e-07, mx: 2)

So, you can see that early on (when depth is 7) the bad hashing is detected, then hash code rehashing is activated and the end result is a near perfect hash table (only one item has a probe depth greater than the minimum). The rehasher is currently a pretty simple Fibonacci-esque rotate & multiply.

Your runtime results may vary a little from the above because right now the rehasher xors in a hash of the virtual memory address of the data array to be less predictable/independent from table to table. Even one table over its own history could have many hash orders with rehashing activated.

That last is actually a safety feature (see, e.g., https://accidentallyquadratic.tumblr.com/post/153545455987/rust-hash-iteration-reinsertion, though depth triggered resize also fixes that particular problem..So, this would be a "belt & suspenders" solution to that blog post).

It might be better to actually "stage activation" of the per-VM-addr ordering, too, maybe just for reproducibility with a slightly more slow but more scrambling-than-default identity hash. It might also make sense to have another/other fallback hashes. I'm open to suggestions. Mostly I think we can do a lot better than Python's hard-coded fancier-hash-built-into-the-probe sequence (and therefore fixed!) and its associated need of tombstones.

@Araq
Copy link
Member

Araq commented Mar 17, 2020

I don't like anything based on "random" memory addresses as it makes bugs much harder to reproduce. This should only be done via -d:nimPreferSafetyOverDebuggability or similar. IMO.

@c-blake
Copy link
Author

c-blake commented Mar 17, 2020

Yeah. I was leaning toward a user-redefinable hash2 that acted as a secondary hash of the hash if the primary fails to be good enough. The secondary could take an additional "salt" parameter related to VM address which could then ignore the salt or not. I am not opposed to a VM addr hash ignored default. I mostly just did that out of an abundance of caution, really.

@c-blake
Copy link
Author

c-blake commented Mar 17, 2020

(In fact I comitted a comment similar to that idea 10 minutes before your comment: c-blake/adix@508e3b9 though I'm sure you hadn't seen that. I do appreciate the feedback, though. Thanks!)

c-blake added a commit to c-blake/adix that referenced this issue Mar 18, 2020
Can now either `import althash; proc hash(h,salt: Hash): Hash = ...` to
change how hashes are rehashed (when that is even necessary) or compile
with `-d:unstableHashHash` to do so with the built-in hash rehasher.
@Araq
Copy link
Member

Araq commented Apr 16, 2020

Since we now have a good hash function for integers, I consider this RFC settled. The current implementation is good enough, Robin Hood hashing can still be introduced, for example, but I think this RFC can now be closed.

@Araq Araq closed this as completed Apr 16, 2020
@c-blake
Copy link
Author

c-blake commented Apr 16, 2020

Ok. Seems fair.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants