Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add vectorized implementation of hex encoding/decoding #39702

Closed
wants to merge 2 commits into from
Closed

Add vectorized implementation of hex encoding/decoding #39702

wants to merge 2 commits into from

Conversation

tkp1n
Copy link
Contributor

@tkp1n tkp1n commented Jul 21, 2020

History

#37546 introduced a public API to encode/decode bytes from/to hexadecimal strings. It also consolidated a lot of internal implementations of such methods to the common class HexConverter.

The implementation in the PR mentioned above was kept simple to ensure the API gets merged in time for .NET 5 with optimizations left for further PRs.

Summary

This PR now introduces vectorized methods for the encoding/decoding of hex data as a follow-up to the above-mentioned PR.
With relatively little added complexity we achieve more than an order of magnitude in performance improvements for larger data sets (e.g. 2048 bytes).

Perf-Numbers

BenchmarkDotNet=v0.12.1, OS=Windows 10.0.19041.388 (2004/?/20H1)
Intel Core i7-9750H CPU 2.60GHz, 1 CPU, 12 logical and 6 physical cores
Method Job EnvironmentVariables NofBytes Mean Error StdDev
Decode AVX2 COMPlus_EnableAVX2=1 2048 194.46 ns 1.101 ns 0.976 ns
Decode Scalar COMPlus_EnableAVX2=0 2048 4,036.43 ns 21.000 ns 17.536 ns
Encode AVX2 COMPlus_EnableAVX2=1 2048 305.11 ns 2.013 ns 1.785 ns
Encode Scalar COMPlus_EnableAVX2=0 2048 4,584.03 ns 32.259 ns 30.175 ns

Review Input / Open Issues

@tannergooding and @gfoidl you may be interested

This is marked as a draft PR for several reasons:

  • It is unclear whether the optimizations will be deemed worthy to be merged (perf vs. complexity trade-off).
  • The implementation is not yet polished and only good for initial feedback.
  • The build is currently failing, as we would have to exclude (#if) the vectorized methods from builds not supporting intrinsics. I didn't find any code based on intrinsics in the Common directory to base this kind of build restrictions on.

@Dotnet-GitSync-Bot Dotnet-GitSync-Bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Jul 21, 2020
Copy link
Member

@gfoidl gfoidl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did just a quick pass over the code (will have a deeper look a bit later).

Do you plan to provide a SSE2 / SSE41 (for blend) path too?
I think it should be possible to do so...

src/libraries/Common/src/System/HexConverter.cs Outdated Show resolved Hide resolved
src/libraries/Common/src/System/HexConverter.cs Outdated Show resolved Hide resolved
src/libraries/Common/src/System/HexConverter.cs Outdated Show resolved Hide resolved
src/libraries/Common/src/System/HexConverter.cs Outdated Show resolved Hide resolved
src/libraries/Common/src/System/HexConverter.cs Outdated Show resolved Hide resolved
src/libraries/Common/src/System/HexConverter.cs Outdated Show resolved Hide resolved
@gfoidl
Copy link
Member

gfoidl commented Jul 21, 2020

performance improvements for larger data sets (e.g. 2048 bytes).

When the PR isn't a draft anymore, we should collect numbers for differnt input-size to verify that small inputs (common case?) don't regress (too much).

@tkp1n
Copy link
Contributor Author

tkp1n commented Jul 21, 2020

Did just a quick pass over the code (will have a deeper look a bit later).

Thanks for the initial review. I'll make your suggested adjustments.

Do you plan to provide a SSE2 / SSE41 (for blend) path too?
I think it should be possible to do so...

I have the following algorithms ready but decided to start here with AVX2 (biggest bang for the buck).

Encode:

  • AVX2
  • SSSE3
  • SSE2

Decode:

  • AVX2
  • SSE41
  • SSSE3
  • SSE2

Feel free to browse through my HexMate repository to check out those implementations as well.

When the PR isn't a draft anymore, we should collect numbers for differnt input-size to verify that small inputs (common case?) don't regress (too much).

Absolutely! I'll make the effort of creating those numbers, once I've seen some interest in actually adding a vectorized implementation from the team.


private static ReadOnlySpan<byte> LowerHexLookupTable => new byte[]
{
0x30, 0x31, 0x32, 0x33, 0x34, 0x35, 0x36, 0x37, 0x38, 0x39, 0x61, 0x62, 0x63, 0x64, 0x65, 0x66,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

** not relevant yet -- only when SSE2 comes into play **

Had a quick look at HexMate, and saw that (as expected) the constants for AVX2 just doubles SSE2 in the upper lane.

So in order to save some (disk) space for the assembly, one could use this outline:

Vector128<byte> sseConstant = ReadVector(Sse2ContantData);
Vector256<byte> avxConstant = Vector256.Create(sseConstant, sseConstant);

In essence it's about avoiding to have duplicated data in the "text-section" of the DLL.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, I added your suggestion to the HexMate repo (here) 👍

I let this open as a reminder in case we get to SSE2 in this PR.

@tkp1n
Copy link
Contributor Author

tkp1n commented Jul 21, 2020

Tagging @bartonjs, @stephentoub and @GrabYourPitchforks to get a sense of the teams interest in this as you all contributed to the original issue (#17837) and/or the PR that recently added the public API (#37546).

@danmoseley
Copy link
Member

The build errors are because some projects using HexConverter.cs do not reference S.R.Intrinsics and S.N.Vectors. @Anipik from a layering POV is it OK for those other projects to have references to those?

@Anipik
Copy link
Contributor

Anipik commented Aug 10, 2020

@Anipik from a layering POV is it OK for those other projects to have references to those?

We should be fine with using S.R.Intrinsics.
But we will have to do case by case for S.N vectors as it depends on a system.Memory package reference, which i dont think we want to reference from the shared framework.

@sandreenko sandreenko added area-System.Text.Encoding and removed area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI labels Aug 19, 2020
@ghost
Copy link

ghost commented Aug 19, 2020

Tagging subscribers to this area: @tarekgh, @krwq
See info in area-owners.md if you want to be subscribed.

@GrabYourPitchforks
Copy link
Member

Going through old inbox notifications and came across this. Does #44111 render this moot?

@GrabYourPitchforks
Copy link
Member

Now that #44111 is in we should consider how to morph this PR to provide additional benefit. Since this PR focuses on AVX/AVX2, does it provide a significant performance boost over the SSSE3 code paths for the common cases of processing 16/32/64 bytes?

Base automatically changed from master to main March 1, 2021 09:06
@jeffhandley
Copy link
Member

@tkp1n Would you be able to take a look at the changes from #44111 and update this PR with what you think would be the best updated approach? Thanks!

@tkp1n
Copy link
Contributor Author

tkp1n commented Apr 18, 2021

@jeffhandley I‘ll look into it 👍

@tkp1n
Copy link
Contributor Author

tkp1n commented Apr 18, 2021

I've updated my encoding method locally and ran some benchmarks for 32 and 64 byte inputs. (The algorithm I'm proposing works with input buffers of >= 32 bytes for encoding and output buffers of >= 32 bytes for decoding hex characters)

It looks like we could achieve a > 2x improvement using AVX2.

Method Size Mean Error StdDev Ratio
Ssse3 32 12.693 ns 0.0595 ns 0.0528 ns 1.00
Avx2 32 5.108 ns 0.1262 ns 0.1119 ns 0.40
Ssse3 64 16.871 ns 0.1416 ns 0.1105 ns 1.00
Avx2 64 7.450 ns 0.0756 ns 0.0670 ns 0.44

@GrabYourPitchforks, @jeffhandley Let me know if you think the speedup justifies maintaining yet another code path for encoding/decoding hex chars. If you think so, I'll gladly put in the work to update the PR accordingly.

@tkp1n
Copy link
Contributor Author

tkp1n commented Apr 18, 2021

"Converting" the current encoding algorithm to AVX2 allows us to encode input buffers of >= 8 bytes but at a slower speed:

Method Size Mean Error StdDev Ratio RatioSD
Ssse3 16 7.074 ns 0.0693 ns 0.0648 ns 1.00 0.00
Avx2ChunksOf8 16 5.122 ns 0.0371 ns 0.0310 ns 0.72 0.01
Ssse3 32 10.915 ns 0.1368 ns 0.1213 ns 1.00 0.00
Avx2ChunksOf8 32 7.222 ns 0.1629 ns 0.1811 ns 0.66 0.02
Ssse3 64 16.922 ns 0.1938 ns 0.1718 ns 1.00 0.00
Avx2ChunksOf8 64 12.028 ns 0.0490 ns 0.0434 ns 0.71 0.01

@rcollette
Copy link

It is a shame that the internal class HexConverter allows for specifying casing but but the public method Converter.ToHexString() does not.

One real-world example of where lower case must be used is:
https://docs.aws.amazon.com/AmazonS3/latest/API/sig-v4-header-based-auth.html

I know it's not directly related to this optimization, but hopeful that the optimization will handle lowercase and expose it.

@danmoseley
Copy link
Member

@rcollette as far as I can tell, these perf changes do not break the contract of HexConverter.ToString that accepts the desired casing. If you believe that Converter.ToHexString(..) should have an overload to select casing, which seems a quite reasonable desire (given your example use case), it seems that can happen independently. The first step would be for you to create an API proposal here. If approved, we'd then need a volunteer to implement it.

@GrabYourPitchforks
Copy link
Member

Thanks for the investigation! TBH the thing I'm struggling with is that this seems like a lot of code (+ extra review, maintenance, unit tests, etc.) for something that's on the expected high end going to save us around 10 ns per invocation. We don't tend to have frequent use of AVX2 intrinsics from within CoreLib, with the notable exceptions of two really hot code families: (a) strchr / memmov / memset; and (b) base64encode/decode, since these buffers tend to be arbitrarily long. If there were evidence that this savings would manifest in real-world applications, that'd help assuage my concerns.

But let's try for some extra perspectives here.

@tannergooding @EgorBo, since you both have experience in this code, do you think there are scenarios that could benefit from this being AVX2-enlightened? Please feel free to tell me that I'm being too grumpy and that some apps would see clear benefits from this. :)

I'm operating under the assumption that most apps which perform hex processing are doing so with fixed-length payloads, generally no more than 64 bytes (512 bits), and only occasionally with higher lengths. I'm also assuming that any workload which involves hex conversion is dominated by whatever processing takes place over the raw decoded binary buffer.

@tannergooding
Copy link
Member

AVX2 (or more specifically Vector256<T>) support likely depends entirely on the average payload size. If we believe that is small, then V256<T> support is likely not beneficial and may actually be harmful due to additional branches or masking required to handle small payloads.

@danmoseley
Copy link
Member

@EgorBo thoughts about @GrabYourPitchforks question above?

@jeffhandley
Copy link
Member

@EgorBo it looks like we're still waiting for your comments on this.

@terrajobst terrajobst added the community-contribution Indicates that the PR has been added by a community member label Jul 19, 2021
@jeffhandley
Copy link
Member

@GrabYourPitchforks This PR is assigned to you for follow-up/decision before the RC1 snap, but I've also assigned @EgorBo since we're awaiting their feedback.

@jeffhandley
Copy link
Member

I'm updating this PR to target the 7.0.0 milestone as we were not able to finish the review and validation of this change before .NET 6.0 RC1 snapped. I will resolve the conflicts and push to the branch so that we can review as soon as time permits.

@jeffhandley jeffhandley added this to the 7.0.0 milestone Aug 19, 2021
@jeffhandley
Copy link
Member

I was finally able to take another look at this PR, the comment history, and another review of the code. I've decided, largely based on the comments from @GrabYourPitchforks (here) and @tannergooding (here), that we will close this PR without merging it, leaving the changes from #44111 in place.

@tkp1n Thanks for your effort on this and your patience with this PR being open as long as it was before we came to this conclusion.

@jeffhandley jeffhandley closed this Oct 9, 2021
@ghost ghost locked as resolved and limited conversation to collaborators Nov 8, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-System.Runtime community-contribution Indicates that the PR has been added by a community member
Projects
None yet
Development

Successfully merging this pull request may close these issues.