RFC: Why #[deriving(IterBytes)], not #[deriving(Hash)]? #8038

bblum · 2013-07-25T08:12:29Z

I anticipate a lot of "GEE I WONDER WHY I DIDN'T THINK OF THAT" when newcomers come into the channel asking how to make their types hashable, since the hash trait has some weirdo type signature, and the answer is "Derive this other typeclass you've never heard of before."

Why can't we just make it be #[deriving(Hash)]? It also makes already-written code more legible.

graydon · 2013-07-25T08:16:10Z

Because deriving(Hash) won't be deriving ... a hash function. It'll be deriving a way-to-iterate-over-your-fields to feed to the hash function. The hash function is in a central location and very carefully chosen.

That said, we could probably rename it somehow. A lot of the ugliness in the way hash is implemented had to do with missing features in default methods and inheritence, which I think all work now.

nikomatsakis · 2013-07-25T11:20:48Z

As an aside, I still think IterBytes and Encodable are the same thing.

graydon · 2013-07-25T14:19:52Z

They're certainly related. Not sure if identical? I guess encoding is supposed to hit all the "meaningful" bytes and skip the "meaningless" ones. Maybe we could build a default method on the encode trait that funnels all the other encoding methods (including a prefix byte for each tycon) into a byte stream? Essentially a "default encoder" that delegates to an inner function with a byte slice?

Or, I guess, hash could provide that encoder itself. But I bet it's useful elsewhere.

nikomatsakis · 2013-07-25T17:12:40Z

To put it another way, it seems like iterbytes could be trivially implemented in terms of encodable, but encodable provides higher-level information. The only reason I see to separate them is that perhaps there are bits of data that one wants to encode but which do not affect the value when used as a hash; but I think this is subtle and a bit error-prone. I would personally be inclined to separate out the "hash key" value from the rest, in such cases. Also, this last bit presumes that IterBytes is only used for hashing, which is currently true afaik, but not obvious from its name. I guess I personally would rather just have one way of serializing types into bytes (the Encodable trait) and then use that for multiple purposes than to have many distinct serialization traits.

thestinger · 2013-07-25T17:25:36Z

By implementing IterBytes, you currently give up the ability to supply a custom Hash implementation. I don't think we have the machinery to implement a default method for types implementing a trait at the moment. It might make sense to move it into the hash module and change the naming to reflect it being the standard way of making a type hashable.

erickt · 2013-07-25T17:34:25Z

@nikomatsakis: It shouldn't be that hard to reimplement the IterBytes trait in terms of Encodable. I can think of one big issue with it though. Encodable doesn't guarantee the output is ordered in any fashion. It would make sense for HashMap to implement Encodable, but it probably wouldn't make sense to support generating a hash for that structure because `Encodable doesn't guarantee order in the output.

However, it would be nice for us to add a raw binary Encoder/Decoder. This would be handy for libraries like extra::flatpipes that can guarantee we are encoding/decoding the type so we don't need a binary format that tags each piece of data for type checking deserialization, like ebml.

nikomatsakis · 2013-07-25T19:06:13Z

@erickt I'm not sure what you mean by "doesn't guarantee ordering"? In my head, at least, most of the methods are in fact ordered (i.e., a tuple must try to encode each of its elements in order, a struct each of its fields, etc), but I guess the higher-order methods for emitting a map might not be. I was envisioning creating a "HashEncoder" that just hashed data in response to calls to the encoder methods. Still, now that we have deriving modes, it doesn't hurt to have them be separate interfaces, and I can imagine it might allow for looser or more efficient encoding around maps and other data structures that are not obviously ordered. I cared about this more when there was no deriving for iterbytes.

erickt · 2013-07-26T01:06:00Z

@nikomatsakis: Yeah, I'm talking about the higher ordered methods for emitting a map. For example:

let a = HashMap::new();
a.insert(1, 2);
a.insert(3, 4);
let a_bytes = io::with_bytes_writer(|wr| a.encode(&BytesEncoder::new(&wr)));

let b = HashMap::new();
b.insert(3, 4);
b.insert(1, 2);
let b_bytes = io::with_bytes_writer(|wr| b.encode(&BytesEncoder::new(&wr)));

That hypothetical BytesEncoder won't necessarily guarantee that the keys are outputted in the same order, so while a equals b, a_bytes may or may not equal b_bytes.

If we wanted to prevent someone from using a HashMap as a key, we add a simple Hashable trait:

trait Hashable {
    fn hash(&self) -> uint;
}

We could then either use #[deriving(Hashable)] that depends on an implementation of Encodable, or we could start using default methods to do the same thing. Assuming this would actually work, we could mash together Hashable and Encodable into one HashableEncodable trait with a default method like this:

trait HashableEncodable: Hashable + Encodable {
    fn hash(&self) -> uint {
         let hash_encoder = HashEncoder::new();
         self.encode(&hash_encoder);
         hash_encoder.hash
    }
}

Then to use it, we just write:

#[deriving(Encodable)]
struct Foo { ... }

impl HashableEncodable for Foo {}

This would save us from having to write another deriving syntax extension. HashableEncodable is pretty ugly though, so I'm a little torn.

huonw · 2013-07-26T02:19:19Z

(fwiw, writing new derivings is fairly easy.)

erickt · 2013-07-26T16:20:36Z

@huonw: It's true, but I worry that deriving is a crutch. Writing a new deriving extension isn't that hard, but it is a royal pain to do the snapshot dance if the underlying trait changes. I hope that someday we'll figure out some mechanism to lift deriving out of the compiler and into a library. I expect that would remove the need to do the snapshot dance, and it'd be much easier for third parties to add their own implementations.

huonw · 2013-07-27T02:21:33Z

They're normal syntax extensions, so #1762 should cover that (if/when that happens).

emberian · 2014-01-06T16:40:59Z

This is pretty messy. Should be fixed.

erickt · 2014-01-22T17:27:19Z

I started playing around with a fix for this last night. Since we only seem to use IterBytes with Hash, it makes sense to me to try to merge the two constructs. I came up with two approaches. The first is a bit heavyweight. It entails moving serialize.rs into libstd and creating:

struct HashEncoder<S> { state: S }

impl<S: Streaming> Encoder for HashEncoder<S> { ... }

trait HashEncodable<S: Streaming>: Encodable<HashEncoder<S>> { }

pub fn sip_hash<T: HashEncodable<hash::SipState>>(v: &T) -> u64 {
    let mut encoder = HashEncoder::new(hash::SipState::new(0, 0));
    v.encode(&mut encoder);
    encoder.state.result_u64()
}

All hashable traits need to implement HashEncodable in order to be hashed. This allows HashMap to be encodable but not hashable. @cmr mentioned in IRC that he was not convinced that serialize.rs should live in libstd.

The other approach I had was to have a much lighter weight StreamingHash trait as in:

pub trait StreamingHash<S> {
    fn hash_stream(&self, state: &mut S);
}

impl<S: Streaming> StreamingHash<S> for u8 {
    fn hash_stream(&self, state: &mut S) {
        state.input([*self])
    }
}

pub fn sip_hash<T: StreamingHash<hash::SipState>>(v: &T) -> u64 {
    let mut state = hash::SipState::new(0, 0);
    v.hash_stream(&mut state);
    state.result_u64()
}

This is a much smaller change, but we do end up duplicating a lot of work that serialize.rs does for us.

Here is a prototype implementation of both schemes. Performance is identical for an optimized build. For non-optimized, HashEncodable is a little slower than IterBytes, which is a little slower than StreamingHash.

I expect that this approach would also allow us to fix both #9075 and #5195.

huonw · 2014-01-23T13:57:40Z

The StreamingHash approach certainly seems lighter weight, and I think (on first thought) it would be more appropriate for #5195 (since the general Encodable instance won't try to handle things like the random order of hashmap elements, etc).

erickt · 2014-01-25T19:49:25Z

@huonw: I got an updated version of my hash.rs replacement here. It uses the same trick Encodable uses to break up the impl<T: IterBytes> Hash for T { ... } constraint by instead constraining the Hasher trait on the concrete type:

pub trait Hasher {
    fn result_u64(&mut self) -> u64;
}

pub trait StreamHasher: Hasher {
    fn input(&mut self, bytes: &[u8]);
}

pub trait Hashable<H: Hasher> {
    fn hash2(&self, h: &mut H);
}

impl<H: StreamHasher> Hashable<H> for i8 { ... }

I'm going to start working on a PR to switch over to this approach.

huonw · 2014-01-27T03:00:19Z

@erickt i16 etc. are also FixedHashable, right? So it would be good to be able to have Hashable<H> for u8 etc, because this may make a super-fast hashtable; however, I'm not sure of a good way to handle this.

(Also, some of your casts seem to have suffered from copy-paste. (That looks macro-able, fwiw).)

Alsoalso, @pcwalton was saying that the IterBytes closures only get inlined very rarely, but this scheme has static dispatch and so could feasibly be (much) faster. :)

nikomatsakis mentioned this issue Aug 14, 2013

hash collisions with tuples #5257

Closed

pnkfelix mentioned this issue Sep 9, 2013

cannot implement Hash except by IterBytes ? #9075

Closed

huonw mentioned this issue Jan 12, 2014

implement Hash on tuples and owned containers #5195

Closed

erickt mentioned this issue Jan 28, 2014

First phase of a Hash reimplementation #11863

Closed

3 tasks

bors closed this as completed in 22d3669 Feb 23, 2014

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

RFC: Why #[deriving(IterBytes)], not #[deriving(Hash)]? #8038

RFC: Why #[deriving(IterBytes)], not #[deriving(Hash)]? #8038

bblum commented Jul 25, 2013

graydon commented Jul 25, 2013

nikomatsakis commented Jul 25, 2013

graydon commented Jul 25, 2013

nikomatsakis commented Jul 25, 2013

thestinger commented Jul 25, 2013

erickt commented Jul 25, 2013

nikomatsakis commented Jul 25, 2013

erickt commented Jul 26, 2013

huonw commented Jul 26, 2013

erickt commented Jul 26, 2013

huonw commented Jul 27, 2013

emberian commented Jan 6, 2014

erickt commented Jan 22, 2014

huonw commented Jan 23, 2014

erickt commented Jan 25, 2014

huonw commented Jan 27, 2014

RFC: Why #[deriving(IterBytes)], not #[deriving(Hash)]? #8038

RFC: Why #[deriving(IterBytes)], not #[deriving(Hash)]? #8038

Comments

bblum commented Jul 25, 2013

graydon commented Jul 25, 2013

nikomatsakis commented Jul 25, 2013

graydon commented Jul 25, 2013

nikomatsakis commented Jul 25, 2013

thestinger commented Jul 25, 2013

erickt commented Jul 25, 2013

nikomatsakis commented Jul 25, 2013

erickt commented Jul 26, 2013

huonw commented Jul 26, 2013

erickt commented Jul 26, 2013

huonw commented Jul 27, 2013

emberian commented Jan 6, 2014

erickt commented Jan 22, 2014

huonw commented Jan 23, 2014

erickt commented Jan 25, 2014

huonw commented Jan 27, 2014