Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Marvin32 to System.IO.Hashing #68616

Closed
Tracked by #64488
bartonjs opened this issue Apr 27, 2022 · 12 comments
Closed
Tracked by #64488

Add Marvin32 to System.IO.Hashing #68616

bartonjs opened this issue Apr 27, 2022 · 12 comments
Labels
api-suggestion Early API idea and discussion, it is NOT ready for implementation area-System.IO.Hashing
Milestone

Comments

@bartonjs
Copy link
Member

bartonjs commented Apr 27, 2022

During the initial standup of System.IO.Hashing there were requests for Marvin32, and a couple more have trickled in from other channels, so we may as well go ahead and add it.

Marvin32 is a 32-bit output hash that uses a 64-bit seed. As such, it generally looks like the XxHash32 type except the data type for seed has been changed from int to long. One major deviation from the shape of the XxHash32 type is that for Marvin32 the seed is a mandatory parameter.

namespace System.IO.Hashing
{
    public sealed partial class Marvin32 : System.IO.Hashing.NonCryptographicHashAlgorithm
    {
        public Marvin32(long seed) : base (sizeof(int)) { }
        public static byte[] Hash(byte[] source, long seed);
        public static byte[] Hash(System.ReadOnlySpan<byte> source, long seed);
        public static int Hash(System.ReadOnlySpan<byte> source, System.Span<byte> destination, long seed);
        public static bool TryHash(System.ReadOnlySpan<byte> source, System.Span<byte> destination, long seed, out int bytesWritten);

        public override void Append(System.ReadOnlySpan<byte> source);
        protected override void GetCurrentHashCore(System.Span<byte> destination);
        public override void Reset();
    }
}
@bartonjs bartonjs added this to the 7.0.0 milestone Apr 27, 2022
@ghost
Copy link

ghost commented Apr 27, 2022

Tagging subscribers to this area: @dotnet/area-system-security, @vcsjones
See info in area-owners.md if you want to be subscribed.

Issue Details

During the initial standup of System.IO.Hashing there were requests for Marvin32, and a couple more have trickled in from other channels, so we may as well go ahead and add it.

Marvin32 is a 32-bit output hash that uses a 64-bit seed. As such, it looks like the XxHash32 type except the data type for seed has been changed from int to long.

The default seed for the Marvin32 algorithm is unchecked((long)0xD53CD9CECD0893B7), as specified in the specification (the little-endian interpretation of the 8 most significant bytes of the SHA-256 value of the UTF-8 encoding of the string "Marvin32"). The proposal includes exposing that as a public const, if nothing else to indicate where the magic value comes from in the defaulted parameters. (Note: While our string.GetHashCode implementation also has a "default seed", that seed is randomly generated once per process)

namespace System.IO.Hashing
{
    public sealed partial class Marvin32 : System.IO.Hashing.NonCryptographicHashAlgorithm
    {
        public const long DefaultSeed = unchecked((long)0xD53CD9CECD0893B7);

        public Marvin32() : base (sizeof(int)) { }
        public Marvin32(long seed) : base (sizeof(int)) { }
        public static byte[] Hash(byte[] source);
        public static byte[] Hash(byte[] source, long seed);
        public static byte[] Hash(System.ReadOnlySpan<byte> source, long seed = DefaultSeed);
        public static int Hash(System.ReadOnlySpan<byte> source, System.Span<byte> destination, long seed = DefaultSeed);
        public static bool TryHash(System.ReadOnlySpan<byte> source, System.Span<byte> destination, out int bytesWritten, long seed = DefaultSeed);

        public override void Append(System.ReadOnlySpan<byte> source);
        protected override void GetCurrentHashCore(System.Span<byte> destination);
        public override void Reset();
    }
}
Author: bartonjs
Assignees: -
Labels:

area-System.Security

Milestone: 7.0.0

@dotnet-issue-labeler dotnet-issue-labeler bot added the untriaged New issue has not been triaged by the area owner label Apr 27, 2022
@teo-tsirpanis teo-tsirpanis added the api-suggestion Early API idea and discussion, it is NOT ready for implementation label Apr 27, 2022
@bartonjs bartonjs removed the untriaged New issue has not been triaged by the area owner label Apr 27, 2022
@krwq
Copy link
Member

krwq commented Apr 28, 2022

would it make sense to expose also .NET randomized seed (same as used for i.e. string.GetHashCode())?

@bartonjs
Copy link
Member Author

I think we'd need a very good reason as to why we'd expose it. The risk of exposing it ends up being that it gets accidentally disclosed via a service and makes it easier for a bad actor to compute string hash collisions.

Not knowing a good reason to expose it I think the balance says not to. (For comparison, we also didn't expose the random seed used for XxHash32 in System.HashCode)

@bartonjs bartonjs added api-ready-for-review API is ready for review, it is NOT ready for implementation and removed api-suggestion Early API idea and discussion, it is NOT ready for implementation labels May 4, 2022
@GrabYourPitchforks
Copy link
Member

I'd propose removing the default seed const. If somebody really cares, they can look it up from the spec. And it removes the temptation for people to say "I'm good because I'm using Marvin!" while still passing in a globally known seed, which would defeat the point of using the algorithm in the first place.

@bartonjs
Copy link
Member Author

bartonjs commented May 4, 2022

Presumably you think we should also get rid of the default ctor and public static byte[] Hash(byte[] source);, then?

I was debating that when drafting the proposal, but since a default seed was specified I went with "keep an existing shape". And the default seed is better than 0, since a 0 seed is bad for the algorithm.

So

namespace System.IO.Hashing
{
    public sealed partial class Marvin32 : System.IO.Hashing.NonCryptographicHashAlgorithm
    {
        public Marvin32(long seed) : base (sizeof(int)) { }
        public static byte[] Hash(byte[] source, long seed);
        public static byte[] Hash(System.ReadOnlySpan<byte> source, long seed);
        public static int Hash(System.ReadOnlySpan<byte> source, System.Span<byte> destination, long seed);
        public static bool TryHash(System.ReadOnlySpan<byte> source, System.Span<byte> destination, long seed, out int bytesWritten);

        public override void Append(System.ReadOnlySpan<byte> source);
        protected override void GetCurrentHashCore(System.Span<byte> destination);
        public override void Reset();
    }
}

@GSPP
Copy link

GSPP commented May 15, 2022

What is the use case for anyone using this hash function? My understanding is that, since Marvin32 was invented, 20 years of fast hash research have occurred. Modern fast hashes should just be better.

@bartonjs
Copy link
Member Author

My understanding is that, since Marvin32 was invented, 20 years of fast hash research have occurred.

I believe Marvin was invented in 2012, or thereabouts; which might make it newer than the XxHash hashes we've already added.

What is the use case for anyone using this hash function?

The concrete requests I've seen seem to stem from "we're already using a copy of the code" and/or "interop".

@GSPP
Copy link

GSPP commented May 21, 2022

In my estimation, this hash algorithm is likely to be legacy tech 5-10 years from now (perhaps even right now). I don't believe it should be eternalized in the framework to be supported forever.

Creative alternative: Refactor the existing code to be easy to copy out. Those users who require that will then have a decent workaround. This strategy avoids the support burden.

@GrabYourPitchforks
Copy link
Member

Something to consider is that the entire System.IO.Hashing package ships out-of-band. It's not part of the inbox SDK. To me, this makes it no more "eternal" than any other OOB package which has shipped over the years, and those packages don't enjoy a "supported forever" policy. (Case in point: all of the OOB System.Web.* releases over the years, most of which are no longer supported.)

@Genbox
Copy link

Genbox commented Jul 10, 2022

A bit of info for those not well versed in non-cryptographic hash functions that might take an interest in this conversation.

Marvin hash was patented back in 2012. It is described as a fast, seeded, and secure non-cryptographic hash function. It is currently used as the default implementation in GetHashCode() on string. It uses a mixer function based on Xor-Shift-Add to achieve avalancing in parallel on x86 systems.

xxHash2 was first released back in 2014, it has since changed a lot. It is also a fast, seeded and secure non-cryptographic hash function. It not only has a 32bit version, but also a 64bit version that is optimized for 64bit systems. It acheves its speed through a mixer function using the Add-Shift-Multiply scheme and a final avalanche function that mixes using Xor-Shift-Multiply. I belive this trick comes from the Murmur2 hash.

For comparison, I've taken HashDepot, uranium62, System.IO.Hashing, and my own implementation (Genbox) and benchmarked them.

Size is the number of bytes of random input.
MiB/s is mibibytes pr. second. Not representative of performance, but close enough for our purposes.

|               Method |     Size |              Mean |          Error |          StdDev |  MiB/s |
|--------------------- |--------- |------------------:|---------------:|----------------:|-------:|
|      Marvin32_Dotnet |       16 |          5.494 ns |      0.0258 ns |       0.0215 ns |  2,778 |
|      xxHash32_Genbox |       16 |          4.516 ns |      0.0951 ns |       0.0843 ns |  3,379 |
|      xxHash64_Genbox |       16 |          4.221 ns |      0.1043 ns |       0.1799 ns |  3,615 |
|   xxHash32_uranium62 |       16 |          4.418 ns |      0.0424 ns |       0.0397 ns |  3,454 |
|   xxHash64_uranium62 |       16 |          4.799 ns |      0.0899 ns |       0.0797 ns |  3,179 |
| xxHash3_64_uranium62 |       16 |          7.647 ns |      0.1503 ns |       0.1406 ns |  1,995 |
|   xxHash32_HashDepot |       16 |          5.953 ns |      0.0488 ns |       0.0381 ns |  2,563 |
|   xxHash64_HashDepot |       16 |          5.475 ns |      0.1021 ns |       0.0955 ns |  2,787 |
|    xxHash32_SystemIo |       16 |         13.532 ns |      0.3177 ns |       0.3262 ns |  1,128 |
|    xxHash64_SystemIo |       16 |         11.298 ns |      0.2744 ns |       0.2567 ns |  1,351 |
|      Marvin32_Dotnet | 33554432 | 10,234,784.062 ns | 14,305.1077 ns |  13,381.0066 ns |  3,127 |
|      xxHash32_Genbox | 33554432 |  5,775,679.688 ns | 97,172.8479 ns | 103,973.8585 ns |  5,540 |
|      xxHash64_Genbox | 33554432 |  3,093,052.505 ns | 58,241.7440 ns |  59,809.9831 ns | 10,346 |
|   xxHash32_uranium62 | 33554432 |  3,917,721.931 ns | 10,430.3703 ns |   9,246.2516 ns |  8,168 |
|   xxHash64_uranium62 | 33554432 |  2,736,906.808 ns | 11,850.7975 ns |  10,505.4233 ns | 11,692 |
| xxHash3_64_uranium62 | 33554432 |  2,018,361.263 ns |  6,866.7623 ns |   5,361.1163 ns | 15,854 |
|   xxHash32_HashDepot | 33554432 |  5,235,107.572 ns | 21,428.2085 ns |  17,893.5246 ns |  6,113 |
|   xxHash64_HashDepot | 33554432 |  5,354,266.707 ns | 48,639.5425 ns |  40,616.2209 ns |  5,977 |
|    xxHash32_SystemIo | 33554432 |  6,314,372.042 ns | 91,408.7424 ns |  81,031.4692 ns |  5,068 |
|    xxHash64_SystemIo | 33554432 |  3,469,453.153 ns | 47,291.9702 ns |  41,923.0998 ns |  9,223 |

Speed is one thing, but another more important factor is random distribution (part of security). Unfortunately, I've not had the time to port it to the SMhasher framework, so I'm unable to backup my claims, but I would belive Marvin would perform fine, as it mixes pretty well with well-studied techniques.

@Genbox
Copy link

Genbox commented Jul 10, 2022

I'm with @GSPP on this one. Marvin hash is not a first-order citizen like many other hash functions (farmHash, xxHash, SipHash), so it has not seen widespread implementation in software. To me, there are only three reasons why anything should be implemented in .NET:

  1. To provide a good, stable, fast default implementation of something that is already widespread. CRC32 is a good example of that.
  2. To provide a new - better than the existing - algorithm or implementation such that people can migrate to it over time. xxHash is an example of that.
  3. To provide something that is helps integration. For example, Microsoft.Extensions.Logging's ILogger interface is now widely used as a foundation API for many different logger implementations.

As Marvin32 is none of the above, I'd say time is better spent optimizing on xxHash, as it seems to be 2-3x slower than other implementations.

@bartonjs bartonjs modified the milestones: 7.0.0, Future Jul 18, 2022
@bartonjs bartonjs modified the milestones: Future, 8.0.0 Aug 26, 2022
@bartonjs
Copy link
Member Author

bartonjs commented Sep 6, 2022

Video

Based on low current demand, and there being arguably better alternatives already implemented (such as xxHash32), this doesn't seem like a thing we should be adding at this time.

(If this makes you sad and you really wanted this, then please speak up.)

@bartonjs bartonjs closed this as completed Sep 6, 2022
@ghost ghost locked as resolved and limited conversation to collaborators Oct 7, 2022
@teo-tsirpanis teo-tsirpanis added api-suggestion Early API idea and discussion, it is NOT ready for implementation and removed api-ready-for-review API is ready for review, it is NOT ready for implementation labels Apr 2, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
api-suggestion Early API idea and discussion, it is NOT ready for implementation area-System.IO.Hashing
Projects
None yet
Development

No branches or pull requests

7 participants