HashCode based on xxHash32 #25013

jcdickinson · 2017-11-02T09:16:22Z

Fixes #14354

Description

Works by maintaining the xxHash32 state variables (v1 -> v4, length) as well as a queue of values that fall outside of the block size (16 bytes/4 ints). Seed is assumed to be zero as there is no ctor in the proposal - however, it is possible to add a seed without changing the structure.

It's possible to do this without unsafe, but the code is much more messy. This can be changed if adding AllowUnsafeBlocks to System.Runtime is a problem.

Tests against known xxHash32 vectors are provided. The hashes were calculated by providing chunks from "abcd0123efgh4567ijkl8901mnop2345qrst6789uvwx0123yzab" (increasing in length 4) to this site. The integers are hard-coded in the tests because System.Text.Encoding.ASCII is not available.

Performance

BenchmarkDotNet=v0.10.9, OS=Windows 10 Redstone 2 (10.0.15063)
Processor=Intel Core i7-4800MQ CPU 2.70GHz (Haswell), ProcessorCount=8
Frequency=2630627 Hz, Resolution=380.1375 ns, Timer=TSC
.NET Core SDK=2.0.2
  [Host]     : .NET Core 2.0.0 (Framework 4.6.00001.0), 64bit RyuJIT
  DefaultJob : .NET Core 2.0.0 (Framework 4.6.00001.0), 64bit RyuJIT

Method	Mean	Error	StdDev	Scaled
Unsafe	79.98 ns	1.4723 ns	1.3772 ns	1.00
'Calls, No Inlining'	40.34 ns	0.1778 ns	0.1576 ns	0.50
'Unrolled, No Inlining'	12.19 ns	0.0433 ns	0.0384 ns	0.15
'Calls, Inlining'	34.55 ns	0.1372 ns	0.1216 ns	0.43
'Unrolled, Inlining'	12.22 ns	0.0885 ns	0.0785 ns	0.15

Updates (staged)

The test vectors are now normally disabled due to the per-AppDomain seed randomization, define SYSTEM_HASHCODE_TESTVECTORS to enable them (which will disable the seed randomization, setting it to 0).

NB: this has been excluded from netfx builds in the tests csproj.

Deviations from xxHash32

Length is stored as the number of hashed fields (not bytes). It is multiplied at the end by 4 so that it behaves exactly like xxHash32.
Due to the lazy state initialization (a necessity because HashCode is a struct), it is possible that the structure will re-initialize at field 4,294,967,300 (and multiples thereof) only if the xxHash32 state variables are somehow all zero. This could be deferred by changing the length field to a ulong (but would make the struct larger than a cache line). 4,294,967,300 fields ought to be enough for anyone; you have bigger problems if you are creating hash codes based on that much data (15GB).
The code that mixes in individual bytes is omitted - this struct only accepts int.

I'll be online in a few hours for feedback and to sign the CLA.

3rd Party Code

Submission containing materials of a third party:

Author	Project	License	Comments
Cyan4973	xxHash	MIT	Reference, not copied code

cc: @stephentoub, @KrzysztofCwalina

Works by maintaining the xxHash32 state variables (v1 -> v4, length) as well as a queue of values that fall outside of the block size (16 bytes/4 ints). Seed is assumed to be zero as there is no seed ctor in the proposal - however, it is possible to add a seed without changing the structure. Tests against known xxHash32 vectors are provided ("abcd1234efgh5678...").

dnfclas · 2017-11-02T09:16:31Z

All CLA requirements met.

gimpf

Thanks for picking up on the PR for the HashCode type!

gimpf · 2017-11-02T10:21:07Z

src/System.Runtime/src/System/HashCode.cs

+                        {
+                            // Initialize
+
+                            state[0] = Prime1 + Prime2;


To avoid HashCode consumers to confuse the hash-code to be suitable for persistence, the initialization should make use of a random, per-app-domain, seed. For XX32, The seed is added to the first three state-vars (state[0] += seed, same for idx 1 and 2), and being subtracted from for the last (state[3] = seed - state[3]).

No problem. I still have the seed code in my polyfill.

This would break the test vector test (which the build server is having problems with anyway). I can't think of any other way to test the output of this function. Suggestions?

At my prototype project Haschisch.Kastriert I implemented a version where the seed could be specified explicitly. The tests would use a known seed, and the overloads for the default seed would use random numbers (from the security.cryptography randomnumbergenerator).

gimpf · 2017-11-02T10:23:49Z

src/System.Runtime/src/System/HashCode.cs

+
+        [MethodImpl(MethodImplOptions.AggressiveInlining)]
+        [TargetedPatchingOptOut("Performance critical for inlining across NGen images.")]
+        private uint Round(uint seed, uint input)


Can be static.

gimpf · 2017-11-02T10:28:21Z

src/System.Runtime/src/System/HashCode.cs

+#       pragma warning disable 0809
+        [Obsolete("Use ToHashCode to retrieve the computed hash code.", error: true)]
+        [EditorBrowsable(EditorBrowsableState.Never)]
+        public override int GetHashCode() => ToHashCode();


My (and also @tannergooding 's) interpretation of the proposed API was to throw new NotImplementedException(); here. Usually, and also for this implementation, ToHashCode() destroys state, i.e. calling .ToHashCode() twice will return different results. For .GetHashCode() this would be more than just unexpected, it would lead to wrong results for all the usual consumers (Dictionary etc.). Failing with InvalidOperationException might be even better.

Happy to do this, just need a final say on which Exception and what message.

gimpf · 2017-11-02T10:43:09Z

src/System.Runtime/src/System/HashCode.cs

+        public static int Combine<T1>(T1 value1)
+        {
+            var hc = new HashCode();
+            hc.Add(value1);


I've once tried combining hash-codes with generic code, but due to performance (factor 2 to 8) ended up manually unrolling all .Combine() overloads. I see that you've elegantly improved handling the state and queue, so this might improve the situation; but do you have some ballpark estimate how much this costs? (For comparison: this was a larger difference than choosing SipHash 1-3, which would be hash-dos resistant).

I'll benchmark post the results in a comment below.

gimpf · 2017-11-02T10:46:32Z

src/System.Runtime/tests/System/HashCodeTests.cs

+                Assert.NotEqual(hcs[i], hcs[j]);
+            }
+    }
+


In the issues' proposal, the idea was that building the hash-code using .Add() would result in the same hash-code as using a single .Combine() call; it would be nice to have a test-case for that.

davidsh · 2017-11-02T13:52:53Z

cc: @morganbr

KrzysztofCwalina · 2017-11-02T15:21:27Z

src/System.Runtime/src/System/HashCode.cs

+            {
+                var val = (uint)value;
+
+                fixed (uint* state = _state)


As a convention we called such fixed pointers pXxx (pState in this example). It makes it easier to see in following code where there is a danger of overruns.

KrzysztofCwalina · 2017-11-02T15:22:37Z

src/System.Runtime/src/System/HashCode.cs

+                fixed (uint* state = _state)
+                {
+                    var queue = state + Queue;
+                    var position = state[Length] & 0x3;


So this is not really Length but rather LastIndex.

Actually, it seems like this should be LengthIdx, as the array is used as a packed struct (which would be nicer).

KrzysztofCwalina · 2017-11-02T15:25:44Z

It's possible to do this without unsafe, but the code is much more messy.

How messy? Could you simply use 4 ulongs? It might actually be faster as the code would not have loops.

jcdickinson · 2017-11-02T18:53:53Z

@KrzysztofCwalina much less messy compared to the first iteration before switching to unsafe. Unsafe is still cleaner, but not enough to offset this perf improvement:

Method	Mean	Error	StdDev	Scaled
Unsafe	54.64 ns	0.3632 ns	0.3220 ns	1.00
Safe	29.72 ns	0.1424 ns	0.1332 ns	0.54

@gimpf unrolling the loop makes an absurd difference:

Method	Mean	Error	StdDev	Scaled
Calls	30.1183 ns	0.3464 ns	0.3240 ns	1.00
Unrolled	0.0043 ns	0.0081 ns	0.0075 ns	0.00

I'll have these changes done by tonight.

Aside: the tests failures are because of [Theory] params. It looks like those tests will fall away due to seed randomization anyway.

gimpf · 2017-11-02T20:00:15Z

@jcdickinson Beware constant-folding. When doing benchmark with constants and unrolled implementations, sometimes the JIT works better than you'd expect. I had to switch to randomly initialized values to make that problem go away (and not static readonly, but initialized per benchmark run).

jcdickinson · 2017-11-02T20:25:56Z

@gimpf thanks for the feedback. I updated the benchmark with:

private volatile int _counter;
private int GetCounter() => _counter++;

[Benchmark(Baseline = false)]
public void Unrolled()
{
    Safe.HashCode.Combine(
        GetCounter(),
        GetCounter(),
        GetCounter(),
        GetCounter(),
        GetCounter(),
        GetCounter(),
        GetCounter(),
        GetCounter());
}

It's still twice as fast:

Method	Mean	Error	StdDev	Scaled
Calls	29.59 ns	0.2935 ns	0.2451 ns	1.00
Unrolled	14.04 ns	0.1315 ns	0.1166 ns	0.47

It looks like it's still worth it. I was hoping to avoid having to handcraft all those unrolled loops 😛

danmoseley · 2017-11-02T22:01:49Z

src/System.Runtime/src/System/HashCode.cs

+        }
+
+        [MethodImpl(MethodImplOptions.AggressiveInlining)]
+        [TargetedPatchingOptOut("Performance critical for inlining across NGen images.")]


@jkotas is "targetedpatchingoptout" still needed in corfx?

if so when do we apply it...

@jkotas is "targetedpatchingoptout" still needed in corfx?

No. These attributes can be deleted.

Order of members now matches the proposal, if that matters. Combine methods are unrolled for perf. Removed superfluous attributes. Test vector tests are now optional (behind SYSTEM_HASHCODE_TESTVECTORS def) as the random seed will break these tests. Defining SYSTEM_HASHCODE_TESTVECTORS will disable the per-AppDomain random seed. Combine is now tested against the behavior of Add.

jcdickinson · 2017-11-03T01:30:35Z

❌ NETFX x86 Release Build

System\HashCodeTests.cs(50,22): error CS0246: The type or namespace name 'HashCode' could not be found (are you missing a using directive or an assembly reference?)

What do I need to do to get the new type going in NETFX?

jcdickinson · 2017-11-03T01:49:08Z

src/System.Runtime/src/System.Runtime.csproj

@@ -5,6 +5,7 @@
    <ProjectGuid>{56B9D0A9-44D3-488E-8B42-C14A6E30CAB2}</ProjectGuid>
    <AssemblyName>System.Runtime</AssemblyName>
    <IsPartialFacadeAssembly>true</IsPartialFacadeAssembly>
+    <AllowUnsafeBlocks>true</AllowUnsafeBlocks>


I have realized that this is no longer needed, I'll put it into the next commit that solves the build problem and any additional feedback.

justinvp · 2017-11-03T01:54:24Z

src/System.Runtime/ref/System.Runtime.cs

@@ -3675,6 +3675,56 @@ public sealed partial class WeakReference<T> : System.Runtime.Serialization.ISer
        public void SetTarget(T target) { }
        public bool TryGetTarget(out T target) { throw null; }
    }
+
+    [System.Runtime.InteropServices.StructLayoutAttribute(System.Runtime.InteropServices.LayoutKind.Auto)]


Why is StructLayout specified here in the ref but not the implementation? Is it even necessary?

~~I have no idea. I had a look at the rest of the file and followed suit. Looking again it seems as though the vast majority include it.~~

Removed.

justinvp · 2017-11-03T01:54:27Z

src/System.Runtime/ref/System.Runtime.cs

+    {
+        private uint _v1, _v2, _v3, _v4;
+        private uint _queue1, _queue2, _queue3;
+        private uint _length;


Private fields aren't included in the refs for other structs (e.g. DateTime), why are they needed here?

~~As per the rest of the file, it seems as though either the fields are present or [StructLayout(Size))] is set. Maybe the length of these types is important?~~

Removed.

justinvp · 2017-11-03T01:58:57Z

src/System.Runtime/src/System/HashCode.cs

+            {
+                var hc1 = (uint)(value1?.GetHashCode() ?? 0);
+
+                var hash = MixEmptyState();


According to the corefx coding style:

We only use var when it's obvious what the variable type is (i.e. var stream = new FileStream(...) not var stream = OpenStandardInput()).

In this case it looks like uint.

Applies throughout.

morganbr · 2017-11-03T01:59:37Z

@jcdickinson, don't worry about adding it to NetFX yet, we'll handle that after it gets into CoreFX. Just turn the test off for it by checking TargetGroup != netfx in the test csproj.

jcdickinson · 2017-11-03T02:00:54Z

@morganbr thanks for the info. Must I commit that change?

morganbr · 2017-11-03T02:01:55Z

@jcdickinson, you'll need it sometime before we merge, but I'm working on more feedback right now so no need for a separate commit.

jcdickinson · 2017-11-03T02:13:18Z

Can someone confirm that structs in src/System.Runtime/ref/System.Runtime.cs needs to be the correct length. Every struct currently in the file seems to point to that being the case (by means of specified fields or by means of StructLayout).

jcdickinson · 2017-11-03T02:18:43Z

src/System.Runtime/src/System/HashCode.cs

+#   if SYSTEM_HASHCODE_TESTVECTORS
+        private static readonly uint _seed = 0;
+#   else
+        private static readonly uint _seed = unchecked((uint)new Random().Next(int.MinValue, int.MaxValue));


Changing to s_seed.

jkotas · 2017-11-03T02:25:02Z

Can someone confirm that structs in src/System.Runtime/ref/System.Runtime.cs needs to be the correct length.

structs in this file do not need to be correct lenght. Delete the private fields and the StructLayoutAttribute

morganbr

This is great progress, @jcdickinson and I'm excited to see this coming together.

I've visually verified that the implementation matches the original xxHash and hopefully most of my other comments should be straightforward to address.

morganbr · 2017-11-03T01:14:26Z

src/System.Runtime/src/System/HashCode.cs

+#   if SYSTEM_HASHCODE_TESTVECTORS
+        private static readonly uint _seed = 0;
+#   else
+        private static readonly uint _seed = unchecked((uint)new Random().Next(int.MinValue, int.MaxValue));


Please use RandomNumberGenerator.Create().GetBytes() here for high-quality entropy. It's not critical for xxHash, but would be important with other algorithms.

System.Security is not available in this assembly :).

@bartonjs, can you please help with the incantations for getting proper entropy here?

Initial thought: If seeded Marvin is enabled by default, string.Empty.GetHashCode() is entropic. But there's probably a way of getting at randomness from within System.Runtime if I let myself think for more than 20 seconds.

new Random().Next(int.MinValue, int.MaxValue) should be good enough source of entropy in .NET Core.

RandomNumberGenerator.Create().GetBytes()

At this level of the stack, you can use SystemNative_GetNonCryptographicallySecureRandomBytesand Windows equivalent if you have a good reason to - look forInterop.GetRandomBytesin CoreLib. TheRandom` global seed is initialized by this function among other things.

The number of reads doesn't matter that much (or at all, I haven't checked the generated code).

And as always, it's best to measure before attempting optimizations. Try it with a fixed (const) seed and compare the result with the current version. If the difference is significant then we can consider alternatives. One possible solution is to avoid static constructors and instead use the good old fashioned if (s_seed == 0) s_seed = ...;

Intrinsic fields might still be an option but to use that we need to ensure that this code is never crossgened, that sounds kind of spooky.

@jkotas , I think it's somewhat important to use a cryptographic random number generator here for a couple reasons:

We may change the algorithm in the future to account for DoS resistance, which would make a strong RNG critical

Non-cryptographic RNG can have seeding issues such that multiple AppDomains/processes/machines end up with the same seed.

Having dependency from the core runtime parts on the full cryptographic stack would create ugly dependencies. I do think it is the right tradeoff between complexity and security. We had this discussion multiple times before, e.g. #16652.

We may change the algorithm in the future to account for DoS resistance

If we ever do this, we can re-evaluate what makes sense.

Non-cryptographic RNG can have seeding issues

If there are seeding issues in SystemNative_GetNonCryptographicallySecureRandomBytes, we should fix them.

There is no real difference between any approach in terms of performance:

Method Mean Error StdDev Scaled

Seed Read 1 12.12 ns 0.1618 ns 0.1514 ns 0.22

Seed Read 4 12.01 ns 0.1978 ns 0.1850 ns 0.21

Seed Lazy 11.91 ns 0.0165 ns 0.0154 ns 0.21

Seed Thread Lazy 13.85 ns 0.0334 ns 0.0261 ns 0.25

Is there another place where I can see how the secure RNG is used so that I can chamber a change containing it?

After further investigation with @bartonjs, CoreLib's Interop.GetRandomBytes is acceptable on most machines, so I can live with it here. @jcdickinson, I think the best way to get that would be to actually move this change to System.Private.CoreLib in the https://github.com/dotnet/coreclr/ repo after we sign off here (we'll just merge as soon as CI goes green assuming nothing meaningful changes in the move). Make sure to put it in the src/mscorlib/shared directory so that it automatically mirrors to CoreRT. That'll give you access to the API and also let CoreLib use HashCode.

@jkotas, for reference, the issues in GetNonCryptographicallySecureRandomBytes are:

It can use urandom before it's seeded, meaning it may be predictable or even constant between containers/VMs/IoT devices where entropy trickles in slowly and a .NET service may run on boot.

It silently fails over from random/urandom to srand(time(null)) if a machine doesn't support them or they're too slow. If this happens, the seed is both very predictable and could easily be the same across processes or machines. It also means cases like leap seconds and VM rollbacks will cause repeat behavior.

morganbr · 2017-11-03T01:17:25Z

src/System.Runtime/src/System/HashCode.cs

+        private uint _queue1, _queue2, _queue3;
+        private uint _length;
+
+        [MethodImpl(MethodImplOptions.AggressiveInlining)]


Here and everywhere else you've used AggressiveInlining, do you have data showing it's better? @jkotas, do we have guidance on when to use it?

Right, it should be supported by data that shows it makes difference.

Last when I checked (with my implementations in Haschisch) it made a difference. It could easily be that (at best) some are redundant (as the JIT would inline anyways), but any single function call that remains will cause performance to drop. Also, the typical .GetHashCode() caller usually will not be inlined anyways (to help against exploding code-size).

I have to admit that I didn't check in detail how much, if any, of the performance improvement remains in more complex code; but again, given that the caller usually will not be inlined, I'd be surprised if this annotation would be detrimental (although I have been suprised many times already, so what gives...)

supported by data

Just remembered that there's an app for that.

Most of the methods run up against "Native estimate for function size exceeds threshold" (with exception to MixEmptyState). Writing to to arguments (all the mix methods) outright prevented inlining even with AggressiveInlining. Switch statements are a no-no. Curiously Rol doesn't get inlined, I thought it was supposed to be a JIT intrinsic?

The tool doesn't support generic methods at all. Everything else is inlined after the latest batch of changes:

morganbr · 2017-11-03T01:36:02Z

src/System.Runtime/src/System/HashCode.cs

+        {
+            unchecked
+            {
+                var hc1 = (uint)(value1?.GetHashCode() ?? 0);


I'm not sure we actually defined what null should do in the proposal. I have slightly mixed feelings. Treating it as zero may be more likely to cause collisions since lots of GetHashCodeImplementations may also return zero (for example, that means Combine((Nullable<int>)0) is the same as Combine((Nullable<int>)null)). We could:

Keep using zero.

Throw ArgumentNullException.

Pick a largish non-zero number that might be less likely to collide -- perhaps 0x6E756C6C ("null" in ascii).

Use typeof(T1).GetHashCode() so that we're at least including some information in the hash code and so that Combine((string)null, 1) is different than Combine((object)null, 1). I'm not sure how this would perform though.

The downside to throwing ArgumentNullException is that it means Roslyn can't just generate GetHashCode overrides that look like return Combine(a, b, c, d);

I might lean toward 3 or 4. @jcdickinson, @terrajobst or other folks, what do you think?

Oddly I had this discussion with my brother today. He basically argued what you said. However,

If your application differentiates between null and zero/unit, using zero in this case will result in a collision (which will swiftly fail an equals check). This results in marginally decreased performance.

If your application does not differentiate (e.g. empty list = null) this will result in the surprising behavior of the "same" value being bucketed separately. This results in surprising behavior.

I don't like surprising other developers, so I'm quite strongly in the 0 camp.

EqualityComparer.Default agrees with you. zero it is.

morganbr · 2017-11-03T01:43:36Z

src/System.Runtime/src/System/HashCode.cs

+
+                switch (position)
+                {
+                    case 0:


The queueing could probably use a comment since xxHash's initialization is confusing

morganbr · 2017-11-03T01:52:17Z

src/System.Runtime/src/System/HashCode.cs

+        }
+
+        [MethodImpl(MethodImplOptions.AggressiveInlining)]
+        public int ToHashCode()


I believe this is supposed to check if it's already been finalized and throw since otherwise calling ToHashCode twice in a row would return different answers.

It might be possible to track that it's been finalized in some combination of the existing fields if you're concerned about cache lines. Something like _length=UInt.Max and _queue1=1 would be otherwise invalid (as long as we clear _queue1 in Add when we get a fourth input).

ToHashCode doesn't actually change any state - so it is safe to call multiple times. This is not due to any genius on my part, just the ghost in the machine.

I'd suggest guarding it anyway since the "ghost in the machine" could easily change (and would if we switch algorithms on some platform).

morganbr · 2017-11-03T02:15:57Z

src/System.Runtime/tests/System/HashCodeTests.cs

+        hc.Add(1);
+        hc.Add(ConstComparer.ConstantValue);
+
+        Assert.NotEqual(expected.ToHashCode(), hc.ToHashCode());


I'm not clear on why these wouldn't be equal. Int32.GetHashCode just returns that int, so we should get 1234 as the second input for both cases. I think you might have a bug above where you're adding to hc in both cases instead of adding to expected.

morganbr · 2017-11-03T02:28:36Z

src/System.Runtime/tests/System/HashCodeTests.cs

+
+#   if SYSTEM_HASHCODE_TESTVECTORS
+    [Theory]
+    [InlineData(0x02cc5d05U)]


Where did you find these test vectors? How can verify them? Are there others we should include?

Check the PR blurb for more info. I used this site with strings copied (increasing lengths of 4) from "abcd0123efgh4567ijkl8901mnop2345qrst6789uvwx0123yzab". My original polyfill test actually self-documents it, but Encoding is not available at this callsite - so I turned them all into uint literals.

Ah, I see now. Please add a comment to the code so that we can reproduce this in the future.

morganbr · 2017-11-03T02:29:17Z

src/System.Runtime/tests/System/HashCodeTests.cs

+    {
+        var hc = new HashCode();
+        hc.Add(1);
+        hc.Add(100.2);


I'd rather avoid using floats. If this test starts failing on some platform due to floating point precision differences, it'll be awful to debug.

This should be OK in this case, as the CLR uses IEEE floats, the float is loaded from a literal, the spec says that conversion from R8 to internal float and back results in the same value, and GetHashCode() for floats should only use the bits available from the 32/64 bit of IEEE float, not any platform specific extended precision bits.

But, on the other hand, for testing, it would likely be more interesting to use a non-primitive value than a floating point number type.

morganbr · 2017-11-03T02:34:18Z

src/System.Runtime/src/System/HashCode.cs

@@ -0,0 +1,406 @@
+// Licensed to the .NET Foundation under one or more agreements.


Please follow the rules from the "Copying Files from Other Projects" section of https://github.com/dotnet/coreclr/blob/master/Documentation/project-docs/contributing.md. In particular, I think that means including the xxHash license here and adding it to the 3rd party notices file. @richlander, please confirm we don't need to do anything else to make this file follow the rules.

morganbr · 2017-11-03T02:35:57Z

src/System.Runtime/src/System/HashCode.cs

+
+namespace System
+{
+    public struct HashCode


It might be good to comment here that we're using xxHash32 and link back to it for future reference.

morganbr · 2017-11-03T02:45:58Z

src/System.Runtime/src/System/HashCode.cs

+                        break;
+                }
+
+                _length++;


I missed the earlier comment about this overflowing. Could you just add an overflow check here and throw rather than silently resetting the state?

karelz · 2017-12-05T20:15:45Z

A cursory look from someone on CoreFX team would be helpful - e.g. Numerics experts/owners @eerhardt @ViktorHofer

eerhardt

Looks good to me.

eerhardt · 2017-12-05T20:36:53Z

src/System.Runtime/tests/System/HashCodeTests.netcoreapp.cs

+
+public static class HashCodeTests
+{
+#   if SYSTEM_HASHCODE_TESTVECTORS


Why are these tests are compiled out?

Also - (nit) the normal code formatting is to have #if FOO with no spaces between the # and if. This will just cause the next person editing this code troubles since VS will attempt to format the code (with the default format settings).

@eerhardt they are compiled out because the GetRandomBytes() randomization in HashCode removes the determinism. HashCode needs to be adjusted before they can execute correctly.

They are provided as a sort of semi-functional documentation. Any future patches against this type can use the tests to validate against actual vectors.

That sounds fine to me. Any chance you can stick that in a comment in the code?

eerhardt · 2017-12-05T20:38:48Z

src/System.Runtime/tests/System/HashCodeTests.netcoreapp.cs

+
+        Assert.Equal(expected, (uint)hc.ToHashCode());
+    }
+#   endif


(nit) Same comment as above - #endif

eerhardt

Thanks @jcdickinson

bartonjs

Nothing in this change seems wrong, but a few more tests to shore things up would be good.

bartonjs · 2017-12-06T15:38:52Z

src/System.Runtime/tests/System/HashCodeTests.netcoreapp.cs

+
+        Assert.Equal(expected, (uint)hc.ToHashCode());
+    }
+#endif


It would be great to see a #else which verifies that the seed is non-zero. E.g. that the empty hash is not 0x02cc5d05.

If the seed is allowed to be randomly zero then it will have the potential for failure; but a 32-bit seed means a 1 in 4 billion chance of failure

bartonjs · 2017-12-06T15:40:51Z

src/System.Runtime/tests/System/HashCodeTests.netcoreapp.cs

+        hc.Add(8);
+        Assert.Equal(hc.ToHashCode(), HashCode.Combine(1, 2, 3, 4, 5, 6, 7, 8));
+    }
+


If I recall correctly, adding any non-null object o and adding o.GetHashCode() (or the deferred-to-an-assistant equivalent) should be the same. If true, tests for that seem good. (If false... tests for that seem not unreasonable).

bartonjs

I'd personally be happier with tests that codify the behaviors of GetHashCode() and Equals(object) (either should be targetable if you box the HashCode object).

But code inspection says they just throw, and this covers the functional aspects of the API, so it's fine.

Other bonus tests: type-specific nulls (e.g. (string)null), showing that things are null-safe.

jcdickinson · 2017-12-07T21:44:17Z

@dotnet-bot test OSX x64 Debug Build please

jcdickinson · 2017-12-07T21:49:31Z

Alpine.3.6 x64 Debug Build

13:46:43 /mnt/j/workspace/dotnet_corefx/master/alpine-TGroup_netcoreapp+CGroup_Debug+AGroup_x64+TestOuter_false_prtest/external/runtime/runtime.depproj(73,5): error MSB3073: The command "chmod +x /mnt/j/workspace/dotnet_corefx/master/alpine-TGroup_netcoreapp+CGroup_Debug+AGroup_x64+TestOuter_false_prtest/bin/testhost/netcoreapp-Linux-Debug-x64/dotnet" exited with code 1.

jcdickinson · 2017-12-13T19:47:20Z

@dotnet-bot test Alpine.3.6 x64 Debug Build please

morganbr · 2017-12-13T20:26:43Z

Thank you, @jcdickinson! This is a big improvement for hashing and you did a great job working through complex issues!

Now I can start filing issues to replace all of the other hashes 😄

* HashCode based on xxHash32 Works by maintaining the xxHash32 state variables (v1 -> v4, length) as well as a queue of values that fall outside of the block size (16 bytes/4 ints). The seed is initialized to random bytes. Further details, unit tests and history: #25013 Signed-off-by: dotnet-bot-corefx-mirror <dotnet-bot@microsoft.com>

CyrusNajmabadi · 2018-01-15T19:43:16Z

Question: if i need to hash more than 8 values, which is the preferred pattern:

var hash = new HashCode();
hash.Add(value1); ... ; hash.Add(valueN);
return hash.ToHashCode();

or

return Hash.Combine(value1, ..., value7, Hash.Combine(value8, ..., value15, Hash.Combine(...

Thanks!

jkotas · 2018-01-16T03:41:51Z

The first one is the right pattern to use:

var hash = new HashCode();
hash.Add(value1); ... ; hash.Add(valueN);
return hash.ToHashCode();

CyrusNajmabadi · 2018-01-16T04:11:53Z

Great. That's what i'm using in roslyn's new codegen. Thanks!

* HashCode based on xxHash32 Works by maintaining the xxHash32 state variables (v1 -> v4, length) as well as a queue of values that fall outside of the block size (16 bytes/4 ints). The seed is initialized to random bytes. Further details, unit tests and history: #25013 Signed-off-by: dotnet-bot-corefx-mirror <dotnet-bot@microsoft.com>

* HashCode based on xxHash32 (dotnet/corefx#14354) Tests against known xxHash32 vectors are provided ("abcd1234efgh5678..."). Test vector tests are now optional (behind SYSTEM_HASHCODE_TESTVECTORS def) as the random seed will break these tests. Defining SYSTEM_HASHCODE_TESTVECTORS will disable the per-AppDomain random seed. Combine is now tested against the behavior of Add. Commit migrated from dotnet/corefx@bd88e51

gimpf reviewed Nov 2, 2017

View reviewed changes

KrzysztofCwalina reviewed Nov 2, 2017

View reviewed changes

karelz added the area-System.Numerics label Nov 2, 2017

karelz assigned jcdickinson, morganbr, ViktorHofer and eerhardt Nov 2, 2017

danmoseley reviewed Nov 2, 2017

View reviewed changes

jcdickinson commented Nov 3, 2017

View reviewed changes

justinvp reviewed Nov 3, 2017

View reviewed changes

jcdickinson commented Nov 3, 2017

View reviewed changes

morganbr suggested changes Nov 3, 2017

View reviewed changes

morganbr reviewed Nov 3, 2017

View reviewed changes

eerhardt approved these changes Dec 5, 2017

View reviewed changes

Jonathan Dickinson added 2 commits December 5, 2017 13:43

PR Feedback

20913b7

Adding documentation above HashCode_Add.

a9a6d7f

eerhardt approved these changes Dec 5, 2017

View reviewed changes

bartonjs reviewed Dec 6, 2017

View reviewed changes

Additional unit tests

638b55b

bartonjs approved these changes Dec 6, 2017

View reviewed changes

Jonathan Dickinson added 2 commits December 6, 2017 17:58

Additional tests

4b0ca46

Fix compilation issue.

07aa81d

morganbr merged commit bd88e51 into dotnet:master Dec 13, 2017

jcdickinson deleted the feature-system-hashcode branch December 13, 2017 21:57

morganbr added netfx-port-consider and removed * NO MERGE * The PR is not ready for merge yet (see discussion for detailed reasons) labels Dec 13, 2017

morganbr mentioned this pull request Dec 13, 2017

Generate Equals/GetHashCode should use framework helpers if available dotnet/roslyn#17646

Closed

MaStr11 mentioned this pull request Jan 10, 2018

Visual Studio Code-Fix for generating GetHashCode results in OverflowException in checked assembly dotnet/roslyn#24035

Closed

iSazonov mentioned this pull request Jun 25, 2018

Change Utils.CombineHashCodes() to use .NET Core's HashCode.Combine() method PowerShell/PowerShell#7134

Closed

bartonjs mentioned this pull request Sep 29, 2020

Fix seed initialization in Microsoft.Bcl.Hashcode dotnet/runtime#42808

Closed

Method	Mean	Error	StdDev	Scaled
Seed Read 1	12.12 ns	0.1618 ns	0.1514 ns	0.22
Seed Read 4	12.01 ns	0.1978 ns	0.1850 ns	0.21
Seed Lazy	11.91 ns	0.0165 ns	0.0154 ns	0.21
Seed Thread Lazy	13.85 ns	0.0334 ns	0.0261 ns	0.25

		@@ -0,0 +1,406 @@
		// Licensed to the .NET Foundation under one or more agreements.

HashCode based on xxHash32 #25013

HashCode based on xxHash32 #25013

Conversation

jcdickinson commented Nov 2, 2017 • edited Loading

Description

Updates (staged)

Deviations from xxHash32

3rd Party Code

dnfclas commented Nov 2, 2017 • edited Loading

gimpf left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

davidsh commented Nov 2, 2017

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

KrzysztofCwalina commented Nov 2, 2017

jcdickinson commented Nov 2, 2017

gimpf commented Nov 2, 2017

jcdickinson commented Nov 2, 2017

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jcdickinson commented Nov 3, 2017

❌ NETFX x86 Release Build

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jcdickinson Nov 3, 2017 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jcdickinson Nov 3, 2017 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

morganbr commented Nov 3, 2017

jcdickinson commented Nov 3, 2017

morganbr commented Nov 3, 2017

jcdickinson commented Nov 3, 2017 • edited Loading

Choose a reason for hiding this comment

jkotas commented Nov 3, 2017

morganbr left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

gimpf Nov 3, 2017 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jcdickinson Nov 3, 2017 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jcdickinson commented Nov 2, 2017 •

edited

Loading

dnfclas commented Nov 2, 2017 •

edited

Loading

jcdickinson Nov 3, 2017 •

edited

Loading

jcdickinson Nov 3, 2017 •

edited

Loading

jcdickinson commented Nov 3, 2017 •

edited

Loading

gimpf Nov 3, 2017 •

edited

Loading

jcdickinson Nov 3, 2017 •

edited

Loading

karelz commented Dec 5, 2017 •

edited

Loading

jcdickinson Dec 5, 2017 •

edited

Loading