Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

proposal: hash: standardize the hash function #70471

Open
adonovan opened this issue Nov 20, 2024 · 20 comments
Open

proposal: hash: standardize the hash function #70471

adonovan opened this issue Nov 20, 2024 · 20 comments
Labels
Milestone

Comments

@adonovan
Copy link
Member

adonovan commented Nov 20, 2024

Background: Issue #69420 proposes a types.Hash function that provides a hash operator for types.Type values that is consistent with the equivalence relation for types defined by types.Identical. The only real question in that proposal is: what is to be Go's future convention for the signature of a custom hash function?

The naive answer is func(T) uint64, where T stands for types.Type. However, hash tables with only a single hash function are vulnerable to a "hash flooding" denial-of-service attack, in which an adversary provides many inputs that all hash to the same value, causing the table to degenerate into a linked list. The defense is for each process (or particular hash table) to select a different hash function at random so that the values cannot be predicted. The hash/maphash package embodies this approach: a random Seed value determines the hash functions (Hash.Strings and Hash.Bytes), which depend on the seed's opaque random value. (maphash does not provide a Seed-dependent way to hash integers.)

So, perhaps custom hash functions should accept a maphash.Seed parameter. And perhaps Seed should also provide methods for hashing integers.

Proposal: In the hash package, define function types HashFunc and EqFunc that document the conventions for Go hash tables.

Here is the minimal answer, without Seeds:

package hash

// HashFunc defines a hash function for values of type T.
// A conformant HashFunc and EqFunc pair (hash, eq) must observe the invariant
// that eq(x, y) => hash(x) == hash(y).
type HashFunc[T any] func(T) uint64

// EqFunc defines an equivalence relation for values of type T.
type EqFunc[T any] func(x, y T) bool

@ianlancetaylor @griesemer

@rsc
Copy link
Contributor

rsc commented Nov 20, 2024

(The minimal answer without Seeds is definitely the wrong one, as you explained before giving it.)

@ianlancetaylor
Copy link
Member

One approach is to define

type Hasher[T any] interface {
    Hash(T) uint64
    Equal(T, T) bool
}

A generic hash table with keys of type T will take a value of type Hasher[T], and invoke the methods of that value to compute hash values and to compare elements for equality.

Then we can have

// MakeHasher returns a Hasher[T] that invokes the hash and equal functions.
// Note that this does not use an explicit seed.
func MakeHasher[T any](hash(T) uint64, equal(T, T) bool) Hasher[T] { ... }

// MakeSeededHasher returns a Hasher[T] that uses a random seed.
func MakeSeededHasher[T any](hash(T, uint64), equal(T, T) bool) Hasher[T] {
    return &seededHasher[T]{hash: hash, equal: equal, seed: randUint64()}
}

type seededHasher[T any] struct {
    hash func(T, uint64)
    equal func(T, T) bool
    seed uint64
}

func (sh *seededHasher[T]) Hash(v T) uint64 {
    return sh.hash(v, sh.seed)
}

func (sh *seededHasher[T]) Equal(a, b T) bool {
    return sh.equal(a, b)
}

func (sh *seededHasher[T]) SetSeed(seed uint64) {
    sh.seed = seed
}

// MakeMapHashHasher returns a Hasher[T] that uses [maphash.Comparable].
func MakeMapHashHasher[T comparable](seed maphash.Seed) Hasher[T] {
    return mapHashHasher[T]{seed}
}

type mapHashHasher[T comparable] struct {
    seed maphash.Seed
}

func (mhh mapHashHasher[T]) Hash(v T) uint64 {
    return maphash.Comparable(mhh.seed, v)
}

func (mhh mapHashHasher[T]) Equal(a, b T) bool {
    return a == b
}

@adonovan
Copy link
Member Author

adonovan commented Nov 20, 2024

(The minimal answer without Seeds is definitely the wrong one, as you explained before giving it.)

Indeed, I was hoping to confirm Cunningham's law.

func (sh *seededHasher[T]) Hash(v T) uint64 {
return sh.hash(v, sh.seed)
}

This approach doesn't support a per-table seed.

That said, I still don't really understand how even a per-table seed provides a defense against flooding unless it is passed through the hash function itself. If the attacker knows the hash function and can provide all the keys in the table, then it's not enough to transform the hashes at the end, since they will all be transformed in the same way, whatever that is, preserving collisions. It seems to me that the seed would need to be threaded through the hash function so that whether two keys' hashes collide is impossible to predict from the sequence of elements that are incorporated into the hash.

@randall77
Copy link
Contributor

This approach doesn't support a per-table seed.

I don't think that's super important. A per-process seed is probably good enough.

That said, I still don't really understand how even a per-table seed provides a defense against flooding unless it is passed through the hash function itself.

Yes, it must be passed through the hash function itself. Ian's proposed code does that.

@ianlancetaylor
Copy link
Member

I believe that MakeSeededHasher does support a per-table seed, because you can pass a call to MakeSeededHasher to the hash table's Make function. Each hash table will then be using a different seed.

@ianlancetaylor
Copy link
Member

That is, I was thinking in terms of passing a Hasher[T] as a value when making a hash table. But there is another approach, which is to make Hasher[T] a type parameter of the hash table. That approach has the advantage that it permits inlining and the hash and equality methods. That approach still requires that the hash table include a value of type Hasher[T]. And we can arrange for that value to initialize itself with a seed when first called.

@aclements
Copy link
Member

TL;DR: Of the options I present below, option 3 feels like the best balance to me.

Note that a lot of this discussion is.. uh.. rehashing #69559.

I worry that @ianlancetaylor 's API proposed in #70471 (comment) (or one where a hash table is parameterized by the hasher type), makes it easier to write a non-seeded hash function than a seeded one. I'd much rather an API that makes it an obvious mistake to not seed the hash function.

Also, I think it's important to compare what impact different API options would have on a custom hash table type. Otherwise we're operating in a vacuum.

Setting aside for a moment the type of the seed, option 1 is an obvious extension of @adonovan 's API:

type HashFunc[T any] func(Seed, T) uint64

This leads to a hash table API like:

type EqFunc[T any] func(T, T) bool
type HashTable[T any] struct {
    seed Seed
    hash HashFunc[T]
    eq EqFunc[T]
    ...
}
func NewHashTable[T any](hash HashFunc[T], eq EqFunc[T]) *HashTable

@mateusz834 proposed this here.

Option 1 requires the hash table to store three words for its hasher, and calls to the hasher and equal are indirect.

Option 2 is to do this same transformation, but to @ianlancetaylor 's Hasher interface:

type Hasher[T any] interface {
    Hash(Seed, T) uint64
    Equal(T, T) bool
}

// Example hash table
type HashTable[T any, Hash Hasher[T]] struct {
    hash Hash  // Probably zero-sized
    seed Seed
    ...
}

Option 2 most likely stores just the seed in the hash table, which is minimal, and calls to the hasher and equal functions will be direct. However, the type implementing Hasher will almost always be an empty struct, so we're in the land of pure type metaprogramming here, which isn't wrong but sure feels weird. This definitely suffers from the lack of type-type inference.

Option 3 is to make the Hasher store the seed:

type Hasher[T any] interface {
    Hash(T) uint64
    Equal(T, T) bool
    Reset(Seed)
}

// Example hash table
type HashTable[T any, Hash Hasher[T]] struct {
    hash Hash // Probably just struct { Seed }
    ...
}
func NewHashTable[T any, Hash Hasher[T]]() *HashTable[T, Hash] {
    ht := &HashTable[T, Hash]{}
    seed := MakeSeed()
    ht.hash.Reset(seed)
    ...
}

This is similar to a proposal by @Merovius.

Option 3, like option 2, stores just the seed in the hash table, and calls to the hasher are direct, but suffers from the lack of type-type inference. However, it avoids pure metaprogramming. It feels slightly weird that the type implementing Hasher is mutable, but that's not too unusual.

Option 4 avoids mutability by introducing a hash constructor function:

type Hasher[T any] interface {
    Hash(T) uint64
    Equal(T, T) bool
}

type NewHasher[T any, Hash Hasher[T]] func(Seed) Hash

// Example hash table
type HashTable[T any, Hash Hasher[T]] struct{
    h Hash
    ...
}
func NewHashTable[T any, Hash Hasher[T]](newHasher NewHasher[T, Hash]) *HashTable[T, Hash] {
    seed := MakeSeed()
    h := newHasher(seed)
    return &HashTable[T, Hash]{h: h, ...}
}

This is pretty similar to option 3, but it feels like it's getting heavy on the mechanism.

And perhaps Seed should also provide methods for hashing integers.

As for the Seed type, this is an interesting idea, though in practice I think a lot of hash functions would either want to ultimately pass the Seed on to maphash (so it's just opaque) or will want to get transform the seed into a uint64 or some bytes or something they can use directly. You could do that with the interface by passing, e.g., 0, 1, etc, but that feels not very obvious.

Alternatively, the Seed could provide a few different ways to get a concrete seed value. E.g., it could have a Uint64 method to get it as a uint64, and a Bytes method to populate an arbitrarily large []byte. These methods would be idempotent: pure functions of the Seed's hidden value. And there would be no way to construct a Seed from any of these representations.

If we want to support hash functions that derive something from Seed and use that instead of using Seed opaquely, I think we need to have something like Reset from option 3 or NewHasher from option 4 so it has a chance to derive what it needs to derive.

@Merovius
Copy link
Contributor

Merovius commented Nov 21, 2024

I don't want to touch maphash.Seed. I believe maphash.Seed is fine as it is. If we want to enable a hash function to work with different seed types, I think the straight forward way to do that would be to make Seed a type parameter on the concrete implementation. Not to change maphash.Seed to be transformable into a bunch of different types. Then maphash.Seed can be one possible type argument, uint64 can be another, and if a hash function needs more bits, it can use []byte or struct { Lo, Hi uint64 } or whatever.

I think the advantage of @ianlancetaylor's design is that it makes the type of seed opaque to the hash table. I think that is a good thing. It gives maximum flexibility, while staying simple in usage. If we want to make it maximally simple to use safely, we could do something like

type Hasher[T any] interface{
    Hash(T) uint64
    Equal(T, T) bool
}

// exported type, to optionally avoid indirection.
type MapHashHasher[T comparable] struct {
    Seed maphash.Seed
}

func (h *MapHashHasher[T]) Reset() { h.Seed = maphash.MakeSeed() }
func (h MapHashHasher[T]) Hash(v T) uint64 { return maphash.Comparable(h.Seed, v) }
func (h MapHashHasher[T]) Equal(a, b T) bool { return a == b }
func MakeMapHashHasher[T comparable]() MapHashHasher { return MapHashHasher[T]{maphash.MakeSeed()} }

type FuncHasher[T, Seed any] struct {
    hash(Seed, T) uint64
    equal(T, T) bool
    seed Seed
}
func MakeFuncHasher[T, Seed any](hash func(Seed, T) uint64, equal func(T, T) bool, seed Seed) FuncHasher {
    return FuncHasher[T, Seed]{hash, equal, seed}
}
func (h FuncHasher) Hash(v T) uint64 { return h.hash(h.seed, v) }
func (h FuncHasher) Equal(a, b T) bool { return h.equal(a, b) }

To be fair, this still makes it easy to construct a badly seeded hasher for composite types. There are two ways to deal with that:

  1. Not put Seed into MakeFuncHasher at all and instead just make it a fixed uint64. That way, we can choose it at random and leave it up to third parties to provide implementations with other seed types. After all, the fact that the seed type is opaque is exactly the point.
  2. Require that Seed has a Reset(*rand.Rand) method. But… blegh.

With this, a hashtable API can decide itself whether it wants to parameterize over Hasher[T] (avoiding an indirection) or just take a Hasher[T] value in the constructor (avoiding extra type parameters).

I think ultimately, the Hasher interface not knowing about Seed is the maximally general interface. The type of the seed is, ultimately, up to the implementation. So it shouldn't appear in the interface. And I think it's easy to see, that we can add whatever set of concrete implementations we want. Or perhaps don't provide any - in the beginning - putting that up to the community. Or put the base interface into the standard library (so we can use it in our APIs) but then put a bunch of concrete, experimental implementations into x/exp and see which of them get adopted to ultimately get promoted into the stdlib.

@rsc
Copy link
Contributor

rsc commented Dec 4, 2024

Option 4 seems like the right answer for a hash table with custom equality: it needs Hash(T) and Equal(T, T), and the underlying Hash should absolutely be seeded, but you can just instantiate each hash table with a pre-seeded Hasher.

It seems like the interface belongs more in the hash-table package than it does in package hash, which is not really about hash tables. It would be more like sort.Interface defining the interface it needs.

So maybe we have the first part of the hash table package: the interface it will require for processing type T.

@rsc
Copy link
Contributor

rsc commented Dec 4, 2024

This proposal has been added to the active column of the proposals project
and will now be reviewed at the weekly proposal review meetings.
— rsc for the proposal review group

@rsc rsc moved this from Incoming to Active in Proposals Dec 4, 2024
@aclements
Copy link
Member

@Merovius , I worry that entirely taking any seed out of Hasher like in your proposed

type Hasher[T any] interface{
    Hash(T) uint64
    Equal(T, T) bool
}

makes it far too tempting for people to write bad hash functions that either aren't seeded at all, or don't use a per-instance hash. I'd much rather make it hard for people to ignore seeding.

Not to change maphash.Seed to be transformable into a bunch of different types.

I had another thought on this. What if we forced hashers to take a maphash.Seed, but simply didn't provide any new methods to transform it into other things? That would force most hashers to ultimately dispatch to our high-quality hash implementations in maphash. My sense is that what people really want here is not to define wildly different hash functions, but just to customize a bit which fields and how deep in a structure equality and hashing go.

@aclements
Copy link
Member

It seems like the interface belongs more in the hash-table package than it does in package hash, which is not really about hash tables.

I'm not sure I understand this argument. Hashes are useful for lots of things, including other implementations hash tables and data types that are totally unlike hash tables. To me, this seems philosophically more aligned with maphash (give me a hash and I'll do something with it) than a concrete container/unordered type (here's how you provide a hash for this particular implementation of this particular data type).

@adonovan
Copy link
Member Author

That would force most hashers to ultimately dispatch to our high-quality hash implementations in maphash.

maphash.Hash doesn't support integers and floats, only strings. Perhaps it should support all the elementary types, both for efficiency, and to avoid mistakes coercing a number-shaped peg into a string-shaped hole.

@mateusz834
Copy link
Member

@adonovan #54670 ?

// WriteComparable adds x to the data hashed by h. 
func WriteComparable[T comparable](h *Hash, x T)

@adonovan
Copy link
Member Author

Thanks, I had forgotten about that.

@Merovius
Copy link
Contributor

@aclements

I had another thought on this. What if we forced hashers to take a maphash.Seed, but simply didn't provide any new methods to transform it into other things?

If anything, it should probably take a *maphash.Hash then. Especially if we are talking about the possibility of needing to hash recursively.

@prattmic
Copy link
Member

prattmic commented Dec 13, 2024

@Merovius , I worry that entirely taking any seed out of Hasher like in your proposed

type Hasher[T any] interface{
    Hash(T) uint64
    Equal(T, T) bool
}

makes it far too tempting for people to write bad hash functions that either aren't seeded at all, or don't use a per-instance hash. I'd much rather make it hard for people to ignore seeding.

To add to this, in the absence of clear documentation, I think it is a bit unclear where this interface should be implemented. i.e., I think some people would mistakenly implement this directly on the type you want to hash:

type Foo struct { ... }
func (f Foo) Hash(f2 Foo) {}
func (f Foo) Equal(f1, f2 Foo) {}

Hopefully the extra argument (why does Hash have an argument and receiver?) would clue people in. Still, if someone did this, then they would almost certainly not have a per-instance seed as that doesn't really make sense in this context at all.

@aclements
Copy link
Member

If anything, it should probably take a *maphash.Hash then. Especially if we are talking about the possibility of needing to hash recursively.

That's a great point about recursive hashing. To reiterate what I believe @Merovius is getting at: hashes are often built up recursively, mirroring the type hierarchy. If custom hashers take a maphash.Seed, the custom hasher is going to have to create a maphash.Hash to use that seed. That alone isn't a problem, but if that hasher has to call another custom hasher, the only thing it can pass is the same Seed that was passed to it. The child hasher will have to create another maphash.Hash, with the same seed, and ultimately return the Hash.Uint64() value from it, which the parent hasher will then have to mix in to its Hash. That's really wordy, so here's a code example:

type T1 struct { a string; b T2 }
type T2 struct { c string }

type H1 struct {} // T1's hasher
func (H1) Hash(seed maphash.Seed, val T1) uint64 {
    var mh maphash.Hash // Turn seed into a maphash
    mh.SetSeed(seed)
    mh.WriteString(val.a)
    // We pass the same seed because we have no other choice,
    // and then fold the uint64 into our maphash.
    maphash.WriteComparable(&mh, H2{}.Hash(seed, val.b))
    return mh.Uint64()
}

type H2 struct {} // T2's hasher
func (H2) Hash(seed maphash.Seed, val T2) uint64 {
    var mh maphash.Hash // Turn seed into a maphash
    mh.SetSeed(seed)
    mh.WriteString(val.c)
    return mh.Uint64() // Turn maphash into a uint64
}

It would be better to avoid all of this back and forth and just pass a *maphash.Hash, which can easily be passed down to child hashers.

@aclements option 2b

Modifying my "option 2" from above to take a maphash.Hash looks like:

type Hasher[T any] interface {
    Hash(*maphash.Hash, T)
    Equal(T, T) bool
}

// Example hash table
type HashTable[T any, Hash Hasher[T]] struct {
    hash Hash  // Probably zero-sized
    seed maphash.Seed
    ...
}

With this, the equivalent of my above code example looks like:

type T1 struct { a string; b T2 }
type T2 struct { c string }

type H1 struct {} // T1's hasher
func (H1) Hash(mh *maphash.Hash, val T1) {
    mh.WriteString(val.a)
    H2{}.Hash(mh, val.b)
}

type H2 struct {} // T2's hasher
func (H2) Hash(mh *maphash.Hash, val T2) {
    mh.WriteString(val.c)
}

This seems much nicer to me!

Unfortunately, today, this would force the maphash.Hash passed to Hasher.Hash to escape because we don't stencil the methods of HashTable based on the concrete type of Hash. Ideally, each get/put would declare a local maphash.Hash to pass into the Hash method, but today this would mean an allocation on every get/put. Alternatively, HashTable itself could store a maphash.Hash instead of Seed, but that's 152 bytes (compared to 8 for Seed). Or you use a sync.Pool, which is also ugly. This a all a form of #48849.

I don't believe this is a fundamental limitation. I spent a while banging out potential solutions with @cherrymui, and we came up with a few options around either including escape information in the GC shape used for stenciling, or doing a runtime form of "conditional escape" by putting escape information into the generic dictionary. None of these solutions are easy, but they don't seem super hard, and fixing this would lift a lot of boats.

I am loathe to warp basic interfaces like this to the current limitations on our tools, especially when we see a path to fixing those limitations. Though I also recognize that it's frustrating to block one thing on another.

@aclements option 3b

I don't think passing the *maphash.Hash makes sense with the other approaches I proposed above.

type Hasher[T any] interface {
    Hash(T) uint64
    Equal(T, T) bool
    Reset(*maphash.Hash)
}

Here, Reset would have to store the maphash.Hash in the Hasher, which would probably cause it to escaped at a much deeper level. It also means a Hasher cannot possibly be thread-safe.

@aclements option 4b

type Hasher[T any] interface {
    Hash(T) uint64
    Equal(T, T) bool
}

type NewHasher[T any, Hash Hasher[T]] func(*maphash.Hash) Hash

func NewHashTable[T any, Hash Hasher[T]](newHasher NewHasher[T, Hash]) *HashTable[T, Hash] {
    var mh maphash.Hash
    mh.SetSeed(maphash.MakeSeed())
    h := newHasher(&mh)
    return &HashTable[T, Hash]{h: h, ...}
}

There are some variations on this, but no matter what the 152 byte maphash.Hash has to live as long as the HashTable. And like option 3, a Hasher could not possibly be thread-safe.

@Merovius parameterized Seed option

I think @Merovius 's parameterized Seed runs into a lot of similar problems. There, it's up to the Hasher to store either a maphash.Seed or a maphash.Hash or something else entirely. But because the seed state is in the Hasher itself, if you have a hierarchy of hashers, the parent Hasher is going to have to store the Hashers for all of its children, recursively. That makes the same of a Hasher implementation proportional to the total size of the tree of types under it.

If you implement a hash table on this, all of this seed state has to persist as long as the hash table itself. Even if the Hasher stores just a maphash.Seed, each call to a Hash method, including recursively, still has to create its own maphash.Hash from that.

I'm not positive, but I think @Merovius 's solution does avoid the escape problem. If a hash table type is parameterized over the Hasher type, then we will stencil that type definition. I think we might still consider the method calls to escape their receiver, but that would just force a one-time heap allocation of the whole hash table object, not a per-get/put heap allocation.

@adonovan
Copy link
Member Author

I like option 2b; I don't mind waiting for the compiler to catch up before we can do it. This approach will elevate the rather obscure *maphash.Hash type to a role of great prominence across our APIs, but perhaps that's fine. Will we need to add Write methods to it for all the basic types?

I agree that 3b (Reset) introduces state where it needn't exist.

How immovable an obstacle is the 152-byte size of the maphash? Is a buffer smaller than 128 bytes much less efficient? Or is the point that rthash in two chunks delivers a different result from calling it in one chunk?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
Status: Active
Development

No branches or pull requests

10 participants