Skip to content
This repository has been archived by the owner on Jan 23, 2023. It is now read-only.
/ corefx Public archive

HashCode based on xxHash32 #25013

Merged
merged 11 commits into from
Dec 13, 2017
Merged

HashCode based on xxHash32 #25013

merged 11 commits into from
Dec 13, 2017

Conversation

jcdickinson
Copy link

@jcdickinson jcdickinson commented Nov 2, 2017

Fixes #14354

Description

Works by maintaining the xxHash32 state variables (v1 -> v4, length) as well as a queue of values that fall outside of the block size (16 bytes/4 ints). Seed is assumed to be zero as there is no ctor in the proposal - however, it is possible to add a seed without changing the structure.

It's possible to do this without unsafe, but the code is much more messy. This can be changed if adding AllowUnsafeBlocks to System.Runtime is a problem.

Tests against known xxHash32 vectors are provided. The hashes were calculated by providing chunks from "abcd0123efgh4567ijkl8901mnop2345qrst6789uvwx0123yzab" (increasing in length 4) to this site. The integers are hard-coded in the tests because System.Text.Encoding.ASCII is not available.

Performance

BenchmarkDotNet=v0.10.9, OS=Windows 10 Redstone 2 (10.0.15063)
Processor=Intel Core i7-4800MQ CPU 2.70GHz (Haswell), ProcessorCount=8
Frequency=2630627 Hz, Resolution=380.1375 ns, Timer=TSC
.NET Core SDK=2.0.2
  [Host]     : .NET Core 2.0.0 (Framework 4.6.00001.0), 64bit RyuJIT
  DefaultJob : .NET Core 2.0.0 (Framework 4.6.00001.0), 64bit RyuJIT

Method Mean Error StdDev Scaled
Unsafe 79.98 ns 1.4723 ns 1.3772 ns 1.00
'Calls, No Inlining' 40.34 ns 0.1778 ns 0.1576 ns 0.50
'Unrolled, No Inlining' 12.19 ns 0.0433 ns 0.0384 ns 0.15
'Calls, Inlining' 34.55 ns 0.1372 ns 0.1216 ns 0.43
'Unrolled, Inlining' 12.22 ns 0.0885 ns 0.0785 ns 0.15

Updates (staged)

The test vectors are now normally disabled due to the per-AppDomain seed randomization, define SYSTEM_HASHCODE_TESTVECTORS to enable them (which will disable the seed randomization, setting it to 0).

NB: this has been excluded from netfx builds in the tests csproj.

Deviations from xxHash32

  • Length is stored as the number of hashed fields (not bytes). It is multiplied at the end by 4 so that it behaves exactly like xxHash32.
  • Due to the lazy state initialization (a necessity because HashCode is a struct), it is possible that the structure will re-initialize at field 4,294,967,300 (and multiples thereof) only if the xxHash32 state variables are somehow all zero. This could be deferred by changing the length field to a ulong (but would make the struct larger than a cache line). 4,294,967,300 fields ought to be enough for anyone; you have bigger problems if you are creating hash codes based on that much data (15GB).
  • The code that mixes in individual bytes is omitted - this struct only accepts int.

I'll be online in a few hours for feedback and to sign the CLA.

3rd Party Code

Submission containing materials of a third party:

Author Project License Comments
Cyan4973 xxHash MIT Reference, not copied code

cc: @stephentoub, @KrzysztofCwalina

Works by maintaining the xxHash32 state variables (v1 -> v4, length) as well as a queue of values that fall outside of the block size (16 bytes/4 ints). Seed is assumed to be zero as there is no seed ctor in the proposal - however, it is possible to add a seed without changing the structure.

Tests against known xxHash32 vectors are provided ("abcd1234efgh5678...").
@dnfclas
Copy link

dnfclas commented Nov 2, 2017

CLA assistant check
All CLA requirements met.

Copy link

@gimpf gimpf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for picking up on the PR for the HashCode type!

{
// Initialize

state[0] = Prime1 + Prime2;
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To avoid HashCode consumers to confuse the hash-code to be suitable for persistence, the initialization should make use of a random, per-app-domain, seed. For XX32, The seed is added to the first three state-vars (state[0] += seed, same for idx 1 and 2), and being subtracted from for the last (state[3] = seed - state[3]).

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No problem. I still have the seed code in my polyfill.

This would break the test vector test (which the build server is having problems with anyway). I can't think of any other way to test the output of this function. Suggestions?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At my prototype project Haschisch.Kastriert I implemented a version where the seed could be specified explicitly. The tests would use a known seed, and the overloads for the default seed would use random numbers (from the security.cryptography randomnumbergenerator).


[MethodImpl(MethodImplOptions.AggressiveInlining)]
[TargetedPatchingOptOut("Performance critical for inlining across NGen images.")]
private uint Round(uint seed, uint input)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can be static.

# pragma warning disable 0809
[Obsolete("Use ToHashCode to retrieve the computed hash code.", error: true)]
[EditorBrowsable(EditorBrowsableState.Never)]
public override int GetHashCode() => ToHashCode();
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My (and also @tannergooding 's) interpretation of the proposed API was to throw new NotImplementedException(); here. Usually, and also for this implementation, ToHashCode() destroys state, i.e. calling .ToHashCode() twice will return different results. For .GetHashCode() this would be more than just unexpected, it would lead to wrong results for all the usual consumers (Dictionary etc.). Failing with InvalidOperationException might be even better.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Happy to do this, just need a final say on which Exception and what message.

public static int Combine<T1>(T1 value1)
{
var hc = new HashCode();
hc.Add(value1);
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've once tried combining hash-codes with generic code, but due to performance (factor 2 to 8) ended up manually unrolling all .Combine() overloads. I see that you've elegantly improved handling the state and queue, so this might improve the situation; but do you have some ballpark estimate how much this costs? (For comparison: this was a larger difference than choosing SipHash 1-3, which would be hash-dos resistant).

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll benchmark post the results in a comment below.

Assert.NotEqual(hcs[i], hcs[j]);
}
}

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the issues' proposal, the idea was that building the hash-code using .Add() would result in the same hash-code as using a single .Combine() call; it would be nice to have a test-case for that.

@davidsh
Copy link
Contributor

davidsh commented Nov 2, 2017

cc: @morganbr

{
var val = (uint)value;

fixed (uint* state = _state)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As a convention we called such fixed pointers pXxx (pState in this example). It makes it easier to see in following code where there is a danger of overruns.

fixed (uint* state = _state)
{
var queue = state + Queue;
var position = state[Length] & 0x3;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So this is not really Length but rather LastIndex.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, it seems like this should be LengthIdx, as the array is used as a packed struct (which would be nicer).

@KrzysztofCwalina
Copy link
Member

It's possible to do this without unsafe, but the code is much more messy.

How messy? Could you simply use 4 ulongs? It might actually be faster as the code would not have loops.

@jcdickinson
Copy link
Author

@KrzysztofCwalina much less messy compared to the first iteration before switching to unsafe. Unsafe is still cleaner, but not enough to offset this perf improvement:

Method Mean Error StdDev Scaled
Unsafe 54.64 ns 0.3632 ns 0.3220 ns 1.00
Safe 29.72 ns 0.1424 ns 0.1332 ns 0.54

@gimpf unrolling the loop makes an absurd difference:

Method Mean Error StdDev Scaled
Calls 30.1183 ns 0.3464 ns 0.3240 ns 1.00
Unrolled 0.0043 ns 0.0081 ns 0.0075 ns 0.00

I'll have these changes done by tonight.

Aside: the tests failures are because of [Theory] params. It looks like those tests will fall away due to seed randomization anyway.

@gimpf
Copy link

gimpf commented Nov 2, 2017

@jcdickinson Beware constant-folding. When doing benchmark with constants and unrolled implementations, sometimes the JIT works better than you'd expect. I had to switch to randomly initialized values to make that problem go away (and not static readonly, but initialized per benchmark run).

@jcdickinson
Copy link
Author

@gimpf thanks for the feedback. I updated the benchmark with:

private volatile int _counter;
private int GetCounter() => _counter++;

[Benchmark(Baseline = false)]
public void Unrolled()
{
    Safe.HashCode.Combine(
        GetCounter(),
        GetCounter(),
        GetCounter(),
        GetCounter(),
        GetCounter(),
        GetCounter(),
        GetCounter(),
        GetCounter());
}

It's still twice as fast:

Method Mean Error StdDev Scaled
Calls 29.59 ns 0.2935 ns 0.2451 ns 1.00
Unrolled 14.04 ns 0.1315 ns 0.1166 ns 0.47

It looks like it's still worth it. I was hoping to avoid having to handcraft all those unrolled loops 😛

}

[MethodImpl(MethodImplOptions.AggressiveInlining)]
[TargetedPatchingOptOut("Performance critical for inlining across NGen images.")]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jkotas is "targetedpatchingoptout" still needed in corfx?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if so when do we apply it...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jkotas is "targetedpatchingoptout" still needed in corfx?

No. These attributes can be deleted.

Order of members now matches the proposal, if that matters. Combine methods are unrolled for perf. Removed superfluous attributes.

Test vector tests are now optional (behind SYSTEM_HASHCODE_TESTVECTORS def) as the random seed will break these tests. Defining SYSTEM_HASHCODE_TESTVECTORS will disable the per-AppDomain random seed. Combine is now tested against the behavior of Add.
@jcdickinson
Copy link
Author

❌ NETFX x86 Release Build

System\HashCodeTests.cs(50,22): error CS0246: The type or namespace name 'HashCode' could not be found (are you missing a using directive or an assembly reference?)

What do I need to do to get the new type going in NETFX?

@@ -5,6 +5,7 @@
<ProjectGuid>{56B9D0A9-44D3-488E-8B42-C14A6E30CAB2}</ProjectGuid>
<AssemblyName>System.Runtime</AssemblyName>
<IsPartialFacadeAssembly>true</IsPartialFacadeAssembly>
<AllowUnsafeBlocks>true</AllowUnsafeBlocks>
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have realized that this is no longer needed, I'll put it into the next commit that solves the build problem and any additional feedback.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed.

@@ -3675,6 +3675,56 @@ public sealed partial class WeakReference<T> : System.Runtime.Serialization.ISer
public void SetTarget(T target) { }
public bool TryGetTarget(out T target) { throw null; }
}

[System.Runtime.InteropServices.StructLayoutAttribute(System.Runtime.InteropServices.LayoutKind.Auto)]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is StructLayout specified here in the ref but not the implementation? Is it even necessary?

Copy link
Author

@jcdickinson jcdickinson Nov 3, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have no idea. I had a look at the rest of the file and followed suit. Looking again it seems as though the vast majority include it.

Removed.

{
private uint _v1, _v2, _v3, _v4;
private uint _queue1, _queue2, _queue3;
private uint _length;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Private fields aren't included in the refs for other structs (e.g. DateTime), why are they needed here?

Copy link
Author

@jcdickinson jcdickinson Nov 3, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As per the rest of the file, it seems as though either the fields are present or [StructLayout(Size))] is set. Maybe the length of these types is important?

Removed.

{
var hc1 = (uint)(value1?.GetHashCode() ?? 0);

var hash = MixEmptyState();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

According to the corefx coding style:

  1. We only use var when it's obvious what the variable type is (i.e. var stream = new FileStream(...) not var stream = OpenStandardInput()).

In this case it looks like uint.

Applies throughout.

@morganbr
Copy link

morganbr commented Nov 3, 2017

@jcdickinson, don't worry about adding it to NetFX yet, we'll handle that after it gets into CoreFX. Just turn the test off for it by checking TargetGroup != netfx in the test csproj.

@jcdickinson
Copy link
Author

@morganbr thanks for the info. Must I commit that change?

@morganbr
Copy link

morganbr commented Nov 3, 2017

@jcdickinson, you'll need it sometime before we merge, but I'm working on more feedback right now so no need for a separate commit.

@jcdickinson
Copy link
Author

jcdickinson commented Nov 3, 2017

Can someone confirm that structs in src/System.Runtime/ref/System.Runtime.cs needs to be the correct length. Every struct currently in the file seems to point to that being the case (by means of specified fields or by means of StructLayout).

# if SYSTEM_HASHCODE_TESTVECTORS
private static readonly uint _seed = 0;
# else
private static readonly uint _seed = unchecked((uint)new Random().Next(int.MinValue, int.MaxValue));
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changing to s_seed.

@jkotas
Copy link
Member

jkotas commented Nov 3, 2017

Can someone confirm that structs in src/System.Runtime/ref/System.Runtime.cs needs to be the correct length.

structs in this file do not need to be correct lenght. Delete the private fields and the StructLayoutAttribute

Copy link

@morganbr morganbr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is great progress, @jcdickinson and I'm excited to see this coming together.

I've visually verified that the implementation matches the original xxHash and hopefully most of my other comments should be straightforward to address.

# if SYSTEM_HASHCODE_TESTVECTORS
private static readonly uint _seed = 0;
# else
private static readonly uint _seed = unchecked((uint)new Random().Next(int.MinValue, int.MaxValue));
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please use RandomNumberGenerator.Create().GetBytes() here for high-quality entropy. It's not critical for xxHash, but would be important with other algorithms.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

System.Security is not available in this assembly :).

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bartonjs, can you please help with the incantations for getting proper entropy here?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Initial thought: If seeded Marvin is enabled by default, string.Empty.GetHashCode() is entropic. But there's probably a way of getting at randomness from within System.Runtime if I let myself think for more than 20 seconds.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

new Random().Next(int.MinValue, int.MaxValue) should be good enough source of entropy in .NET Core.

RandomNumberGenerator.Create().GetBytes()

At this level of the stack, you can use SystemNative_GetNonCryptographicallySecureRandomBytesand Windows equivalent if you have a good reason to - look forInterop.GetRandomBytesin CoreLib. TheRandom` global seed is initialized by this function among other things.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The number of reads doesn't matter that much (or at all, I haven't checked the generated code).

And as always, it's best to measure before attempting optimizations. Try it with a fixed (const) seed and compare the result with the current version. If the difference is significant then we can consider alternatives. One possible solution is to avoid static constructors and instead use the good old fashioned if (s_seed == 0) s_seed = ...;

Intrinsic fields might still be an option but to use that we need to ensure that this code is never crossgened, that sounds kind of spooky.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jkotas , I think it's somewhat important to use a cryptographic random number generator here for a couple reasons:

  1. We may change the algorithm in the future to account for DoS resistance, which would make a strong RNG critical
  2. Non-cryptographic RNG can have seeding issues such that multiple AppDomains/processes/machines end up with the same seed.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Having dependency from the core runtime parts on the full cryptographic stack would create ugly dependencies. I do think it is the right tradeoff between complexity and security. We had this discussion multiple times before, e.g. #16652.

We may change the algorithm in the future to account for DoS resistance

If we ever do this, we can re-evaluate what makes sense.

Non-cryptographic RNG can have seeding issues

If there are seeding issues in SystemNative_GetNonCryptographicallySecureRandomBytes, we should fix them.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is no real difference between any approach in terms of performance:

Method Mean Error StdDev Scaled
Seed Read 1 12.12 ns 0.1618 ns 0.1514 ns 0.22
Seed Read 4 12.01 ns 0.1978 ns 0.1850 ns 0.21
Seed Lazy 11.91 ns 0.0165 ns 0.0154 ns 0.21
Seed Thread Lazy 13.85 ns 0.0334 ns 0.0261 ns 0.25

Is there another place where I can see how the secure RNG is used so that I can chamber a change containing it?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After further investigation with @bartonjs, CoreLib's Interop.GetRandomBytes is acceptable on most machines, so I can live with it here. @jcdickinson, I think the best way to get that would be to actually move this change to System.Private.CoreLib in the https://github.com/dotnet/coreclr/ repo after we sign off here (we'll just merge as soon as CI goes green assuming nothing meaningful changes in the move). Make sure to put it in the src/mscorlib/shared directory so that it automatically mirrors to CoreRT. That'll give you access to the API and also let CoreLib use HashCode.

@jkotas, for reference, the issues in GetNonCryptographicallySecureRandomBytes are:

  1. It can use urandom before it's seeded, meaning it may be predictable or even constant between containers/VMs/IoT devices where entropy trickles in slowly and a .NET service may run on boot.
  2. It silently fails over from random/urandom to srand(time(null)) if a machine doesn't support them or they're too slow. If this happens, the seed is both very predictable and could easily be the same across processes or machines. It also means cases like leap seconds and VM rollbacks will cause repeat behavior.

private uint _queue1, _queue2, _queue3;
private uint _length;

[MethodImpl(MethodImplOptions.AggressiveInlining)]
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here and everywhere else you've used AggressiveInlining, do you have data showing it's better? @jkotas, do we have guidance on when to use it?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, it should be supported by data that shows it makes difference.

Copy link

@gimpf gimpf Nov 3, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Last when I checked (with my implementations in Haschisch) it made a difference. It could easily be that (at best) some are redundant (as the JIT would inline anyways), but any single function call that remains will cause performance to drop. Also, the typical .GetHashCode() caller usually will not be inlined anyways (to help against exploding code-size).

I have to admit that I didn't check in detail how much, if any, of the performance improvement remains in more complex code; but again, given that the caller usually will not be inlined, I'd be surprised if this annotation would be detrimental (although I have been suprised many times already, so what gives...)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

supported by data

Just remembered that there's an app for that.

Most of the methods run up against "Native estimate for function size exceeds threshold" (with exception to MixEmptyState). Writing to to arguments (all the mix methods) outright prevented inlining even with AggressiveInlining. Switch statements are a no-no. Curiously Rol doesn't get inlined, I thought it was supposed to be a JIT intrinsic?

The tool doesn't support generic methods at all. Everything else is inlined after the latest batch of changes:

Inlined methods

{
unchecked
{
var hc1 = (uint)(value1?.GetHashCode() ?? 0);
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure we actually defined what null should do in the proposal. I have slightly mixed feelings. Treating it as zero may be more likely to cause collisions since lots of GetHashCodeImplementations may also return zero (for example, that means Combine((Nullable<int>)0) is the same as Combine((Nullable<int>)null)). We could:

  1. Keep using zero.
  2. Throw ArgumentNullException.
  3. Pick a largish non-zero number that might be less likely to collide -- perhaps 0x6E756C6C ("null" in ascii).
  4. Use typeof(T1).GetHashCode() so that we're at least including some information in the hash code and so that Combine((string)null, 1) is different than Combine((object)null, 1). I'm not sure how this would perform though.

The downside to throwing ArgumentNullException is that it means Roslyn can't just generate GetHashCode overrides that look like return Combine(a, b, c, d);

I might lean toward 3 or 4. @jcdickinson, @terrajobst or other folks, what do you think?

Copy link
Author

@jcdickinson jcdickinson Nov 3, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oddly I had this discussion with my brother today. He basically argued what you said. However,

  • If your application differentiates between null and zero/unit, using zero in this case will result in a collision (which will swiftly fail an equals check). This results in marginally decreased performance.
  • If your application does not differentiate (e.g. empty list = null) this will result in the surprising behavior of the "same" value being bucketed separately. This results in surprising behavior.

I don't like surprising other developers, so I'm quite strongly in the 0 camp.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

EqualityComparer.Default agrees with you. zero it is.


switch (position)
{
case 0:
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The queueing could probably use a comment since xxHash's initialization is confusing

}

[MethodImpl(MethodImplOptions.AggressiveInlining)]
public int ToHashCode()
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe this is supposed to check if it's already been finalized and throw since otherwise calling ToHashCode twice in a row would return different answers.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be possible to track that it's been finalized in some combination of the existing fields if you're concerned about cache lines. Something like _length=UInt.Max and _queue1=1 would be otherwise invalid (as long as we clear _queue1 in Add when we get a fourth input).

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ToHashCode doesn't actually change any state - so it is safe to call multiple times. This is not due to any genius on my part, just the ghost in the machine.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd suggest guarding it anyway since the "ghost in the machine" could easily change (and would if we switch algorithms on some platform).

hc.Add(1);
hc.Add(ConstComparer.ConstantValue);

Assert.NotEqual(expected.ToHashCode(), hc.ToHashCode());
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not clear on why these wouldn't be equal. Int32.GetHashCode just returns that int, so we should get 1234 as the second input for both cases. I think you might have a bug above where you're adding to hc in both cases instead of adding to expected.


# if SYSTEM_HASHCODE_TESTVECTORS
[Theory]
[InlineData(0x02cc5d05U)]
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where did you find these test vectors? How can verify them? Are there others we should include?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Check the PR blurb for more info. I used this site with strings copied (increasing lengths of 4) from "abcd0123efgh4567ijkl8901mnop2345qrst6789uvwx0123yzab". My original polyfill test actually self-documents it, but Encoding is not available at this callsite - so I turned them all into uint literals.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I see now. Please add a comment to the code so that we can reproduce this in the future.

{
var hc = new HashCode();
hc.Add(1);
hc.Add(100.2);
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd rather avoid using floats. If this test starts failing on some platform due to floating point precision differences, it'll be awful to debug.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be OK in this case, as the CLR uses IEEE floats, the float is loaded from a literal, the spec says that conversion from R8 to internal float and back results in the same value, and GetHashCode() for floats should only use the bits available from the 32/64 bit of IEEE float, not any platform specific extended precision bits.

But, on the other hand, for testing, it would likely be more interesting to use a non-primitive value than a floating point number type.

@@ -0,0 +1,406 @@
// Licensed to the .NET Foundation under one or more agreements.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please follow the rules from the "Copying Files from Other Projects" section of https://github.com/dotnet/coreclr/blob/master/Documentation/project-docs/contributing.md. In particular, I think that means including the xxHash license here and adding it to the 3rd party notices file. @richlander, please confirm we don't need to do anything else to make this file follow the rules.


namespace System
{
public struct HashCode
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be good to comment here that we're using xxHash32 and link back to it for future reference.

break;
}

_length++;
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I missed the earlier comment about this overflowing. Could you just add an overflow check here and throw rather than silently resetting the state?

@karelz
Copy link
Member

karelz commented Dec 5, 2017

A cursory look from someone on CoreFX team would be helpful - e.g. Numerics experts/owners @eerhardt @ViktorHofer

Copy link
Member

@eerhardt eerhardt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me.


public static class HashCodeTests
{
# if SYSTEM_HASHCODE_TESTVECTORS
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are these tests are compiled out?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also - (nit) the normal code formatting is to have #if FOO with no spaces between the # and if. This will just cause the next person editing this code troubles since VS will attempt to format the code (with the default format settings).

Copy link
Author

@jcdickinson jcdickinson Dec 5, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@eerhardt they are compiled out because the GetRandomBytes() randomization in HashCode removes the determinism. HashCode needs to be adjusted before they can execute correctly.

They are provided as a sort of semi-functional documentation. Any future patches against this type can use the tests to validate against actual vectors.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That sounds fine to me. Any chance you can stick that in a comment in the code?


Assert.Equal(expected, (uint)hc.ToHashCode());
}
# endif
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(nit) Same comment as above - #endif

Copy link
Member

@eerhardt eerhardt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @jcdickinson

Copy link
Member

@bartonjs bartonjs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nothing in this change seems wrong, but a few more tests to shore things up would be good.


Assert.Equal(expected, (uint)hc.ToHashCode());
}
#endif
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be great to see a #else which verifies that the seed is non-zero. E.g. that the empty hash is not 0x02cc5d05.

If the seed is allowed to be randomly zero then it will have the potential for failure; but a 32-bit seed means a 1 in 4 billion chance of failure

hc.Add(8);
Assert.Equal(hc.ToHashCode(), HashCode.Combine(1, 2, 3, 4, 5, 6, 7, 8));
}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I recall correctly, adding any non-null object o and adding o.GetHashCode() (or the deferred-to-an-assistant equivalent) should be the same. If true, tests for that seem good. (If false... tests for that seem not unreasonable).

Copy link
Member

@bartonjs bartonjs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd personally be happier with tests that codify the behaviors of GetHashCode() and Equals(object) (either should be targetable if you box the HashCode object).

But code inspection says they just throw, and this covers the functional aspects of the API, so it's fine.

Other bonus tests: type-specific nulls (e.g. (string)null), showing that things are null-safe.

@jcdickinson
Copy link
Author

@dotnet-bot test OSX x64 Debug Build please

@jcdickinson
Copy link
Author

Alpine.3.6 x64 Debug Build

13:46:43 /mnt/j/workspace/dotnet_corefx/master/alpine-TGroup_netcoreapp+CGroup_Debug+AGroup_x64+TestOuter_false_prtest/external/runtime/runtime.depproj(73,5): error MSB3073: The command "chmod +x /mnt/j/workspace/dotnet_corefx/master/alpine-TGroup_netcoreapp+CGroup_Debug+AGroup_x64+TestOuter_false_prtest/bin/testhost/netcoreapp-Linux-Debug-x64/dotnet" exited with code 1.

@jcdickinson
Copy link
Author

@dotnet-bot test Alpine.3.6 x64 Debug Build please

@morganbr morganbr merged commit bd88e51 into dotnet:master Dec 13, 2017
@morganbr
Copy link

Thank you, @jcdickinson! This is a big improvement for hashing and you did a great job working through complex issues!

Now I can start filing issues to replace all of the other hashes 😄

@jcdickinson jcdickinson deleted the feature-system-hashcode branch December 13, 2017 21:57
@morganbr morganbr added netfx-port-consider and removed * NO MERGE * The PR is not ready for merge yet (see discussion for detailed reasons) labels Dec 13, 2017
dotnet-bot pushed a commit that referenced this pull request Jan 13, 2018
* HashCode based on xxHash32

Works by maintaining the xxHash32 state variables (v1 -> v4, length) as well as a queue of values that fall outside of the block size (16 bytes/4 ints). The seed is initialized to random bytes.

Further details, unit tests and history: #25013

Signed-off-by: dotnet-bot-corefx-mirror <dotnet-bot@microsoft.com>
dotnet-bot pushed a commit that referenced this pull request Jan 13, 2018
* HashCode based on xxHash32

Works by maintaining the xxHash32 state variables (v1 -> v4, length) as well as a queue of values that fall outside of the block size (16 bytes/4 ints). The seed is initialized to random bytes.

Further details, unit tests and history: #25013

Signed-off-by: dotnet-bot-corefx-mirror <dotnet-bot@microsoft.com>
@CyrusNajmabadi
Copy link
Member

Question: if i need to hash more than 8 values, which is the preferred pattern:

var hash = new HashCode();
hash.Add(value1); ... ; hash.Add(valueN);
return hash.ToHashCode();

or

return Hash.Combine(value1, ..., value7, Hash.Combine(value8, ..., value15, Hash.Combine(...

Thanks!

@jkotas
Copy link
Member

jkotas commented Jan 16, 2018

The first one is the right pattern to use:

var hash = new HashCode();
hash.Add(value1); ... ; hash.Add(valueN);
return hash.ToHashCode();

@CyrusNajmabadi
Copy link
Member

Great. That's what i'm using in roslyn's new codegen. Thanks!

safern pushed a commit that referenced this pull request Jan 16, 2018
* HashCode based on xxHash32

Works by maintaining the xxHash32 state variables (v1 -> v4, length) as well as a queue of values that fall outside of the block size (16 bytes/4 ints). The seed is initialized to random bytes.

Further details, unit tests and history: #25013

Signed-off-by: dotnet-bot-corefx-mirror <dotnet-bot@microsoft.com>
safern pushed a commit that referenced this pull request Jan 16, 2018
* HashCode based on xxHash32

Works by maintaining the xxHash32 state variables (v1 -> v4, length) as well as a queue of values that fall outside of the block size (16 bytes/4 ints). The seed is initialized to random bytes.

Further details, unit tests and history: #25013

Signed-off-by: dotnet-bot-corefx-mirror <dotnet-bot@microsoft.com>
picenka21 pushed a commit to picenka21/runtime that referenced this pull request Feb 18, 2022
* HashCode based on xxHash32 (dotnet/corefx#14354)

Tests against known xxHash32 vectors are provided ("abcd1234efgh5678...").

Test vector tests are now optional (behind SYSTEM_HASHCODE_TESTVECTORS def) as the random seed will break these tests. Defining SYSTEM_HASHCODE_TESTVECTORS will disable the per-AppDomain random seed. Combine is now tested against the behavior of Add.

Commit migrated from dotnet/corefx@bd88e51
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.