
Improve Ascii (and Utf8) encoding #85266

Merged (9 commits, May 12, 2023)
Conversation

Contributor

@Daniel-Svensson Daniel-Svensson commented Apr 24, 2023

Reduce the overhead of NarrowUtf16ToAscii when Vector128 is hardware accelerated.

  1. Inline NarrowUtf16ToAscii_Intrinsified.
    For the 33-char all-ASCII case the speedup is roughly:
  • On Skylake-class hardware (i7-6700K), the time goes from over 6.6 ns to under 5.9 ns (< 90% of the original time).
  • On Zen 3, from 5.1 ns to 3.9 ns (~77% of the original time).
  2. Use more efficient code to store "half a vector".
    Use a single movsd/movq instruction instead of a long series of 10-16 instructions that spill a temporary to the stack (see the sketch below the list).
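
For illustration only, here is a minimal sketch of point 2, assuming a Vector128<byte> source; the class and method names below are mine, not the PR's actual diff:

using System.Runtime.CompilerServices;
using System.Runtime.Intrinsics;

static class HalfVectorStoreSketch
{
    // Shape of the old path (illustrative): route through Vector64, which on
    // x86/x64 can expand into a longer sequence that spills a temporary to the stack.
    static void StoreLower_ViaVector64(Vector128<byte> value, ref byte destination)
    {
        value.GetLower().StoreUnsafe(ref destination);
    }

    // Shape of the optimized path (illustrative): extract the low 64 bits and
    // write them with one unaligned store.
    static void StoreLower_SingleWrite(Vector128<byte> value, ref byte destination)
    {
        Unsafe.WriteUnaligned(ref destination, value.AsUInt64().ToScalar());
    }
}

The second shape is expected to compile down to a single 8-byte store (movq/movsd on x86/x64), which is the effect the PR is after.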

@ghost added the community-contribution label Apr 24, 2023

ghost commented Apr 24, 2023

Tagging subscribers to this area: @dotnet/area-system-text-encoding
See info in area-owners.md if you want to be subscribed.


Member

EgorBo commented Apr 24, 2023

> Inline NarrowUtf16ToAscii_Intrinsified
> For the 33-char all-ASCII case the speedup is roughly

What about smaller sizes (which are more popular)?

We're seeing more and more inliner problems caused by the AggressiveInlining attribute being placed on large methods (like yours).

Contributor Author

Daniel-Svensson commented May 2, 2023

> > Inline NarrowUtf16ToAscii_Intrinsified
> > For the 33-char all-ASCII case the speedup is roughly
>
> What about smaller sizes (which are more popular)?
>
> We're seeing more and more inliner problems caused by the AggressiveInlining attribute being placed on large methods (like yours).

I did not see any regression for smaller inputs; if anything, there was an unexpected improvement.
I re-ran some benchmarks with smaller inputs on an Intel Core i7-6700K and an AMD Zen 3 machine.

  • "Ascii_Local_NarrowUtf16ToAscii_v1_StoreLower" shows the numbers without inlining, and
  • "Ascii_Local_NarrowUtf16ToAscii_v2_Inline" shows the numbers with inlining. You can ignore the rest (they are experiments that I might turn into separate PRs).
BenchmarkDotNet=v0.13.5, OS=Windows 11 (10.0.22621.1555/22H2/2022Update/SunValley2)
AMD Ryzen 9 5900X, 1 CPU, 24 logical and 12 physical cores
.NET SDK=8.0.100-preview.2.23157.25
  [Host]     : .NET 8.0.0 (8.0.23.12803), X64 RyuJIT AVX2
  Job-BXGTAG : .NET 8.0.0 (8.0.23.12803), X64 RyuJIT AVX2

MaxRelativeError=0.01  IterationTime=300.0000 ms  WarmupCount=1  
| Method | StringLengthInChars | Scenario | Mean | Error | StdDev | Median | Ratio | RatioSD |
|---|---|---|---|---|---|---|---|---|
| Ascii_Local_NarrowUtf16ToAscii_v1_StoreLower | 5 | AsciiOnly | 2.048 ns | 0.0214 ns | 0.0376 ns | 2.038 ns | 1.00 | 0.00 |
| Ascii_Local_NarrowUtf16ToAscii_v2_Inline | 5 | AsciiOnly | 1.713 ns | 0.0161 ns | 0.0143 ns | 1.714 ns | 0.83 | 0.02 |
| Ascii_Local_NarrowUtf16ToAscii_v1_StoreLower | 8 | AsciiOnly | 2.499 ns | 0.0342 ns | 0.0722 ns | 2.465 ns | 1.00 | 0.00 |
| Ascii_Local_NarrowUtf16ToAscii_v2_Inline | 8 | AsciiOnly | 2.096 ns | 0.0245 ns | 0.0229 ns | 2.091 ns | 0.84 | 0.03 |
| Ascii_Local_NarrowUtf16ToAscii_v1_StoreLower | 15 | AsciiOnly | 3.167 ns | 0.0261 ns | 0.0218 ns | 3.168 ns | 1.00 | 0.00 |
| Ascii_Local_NarrowUtf16ToAscii_v2_Inline | 15 | AsciiOnly | 2.723 ns | 0.0326 ns | 0.0694 ns | 2.693 ns | 0.89 | 0.02 |
| Ascii_Local_NarrowUtf16ToAscii_v1_StoreLower | 16 | AsciiOnly | 3.099 ns | 0.0314 ns | 0.0294 ns | 3.109 ns | 1.00 | 0.00 |
| Ascii_Local_NarrowUtf16ToAscii_v2_Inline | 16 | AsciiOnly | 2.736 ns | 0.0387 ns | 0.0414 ns | 2.717 ns | 0.88 | 0.02 |
| Ascii_Local_NarrowUtf16ToAscii_v1_StoreLower | 19 | AsciiOnly | 3.405 ns | 0.0407 ns | 0.0381 ns | 3.401 ns | 1.00 | 0.00 |
| Ascii_Local_NarrowUtf16ToAscii_v2_Inline | 19 | AsciiOnly | 3.097 ns | 0.0338 ns | 0.0299 ns | 3.090 ns | 0.91 | 0.01 |
| Ascii_Local_NarrowUtf16ToAscii_v1_StoreLower | 31 | AsciiOnly | 4.393 ns | 0.0278 ns | 0.0260 ns | 4.388 ns | 1.00 | 0.00 |
| Ascii_Local_NarrowUtf16ToAscii_v2_Inline | 31 | AsciiOnly | 4.280 ns | 0.0536 ns | 0.0573 ns | 4.283 ns | 0.97 | 0.01 |
| Ascii_Local_NarrowUtf16ToAscii_simple_loop | 31 | AsciiOnly | 2.746 ns | 0.0349 ns | 0.0327 ns | 2.738 ns | 0.63 | 0.01 |
| Ascii_Local_NarrowUtf16ToAscii_v3 | 31 | AsciiOnly | 2.868 ns | 0.0378 ns | 0.0405 ns | 2.866 ns | 0.65 | 0.01 |
| Ascii_Local_NarrowUtf16ToAscii_v4_if | 31 | AsciiOnly | 2.900 ns | 0.0245 ns | 0.0230 ns | 2.893 ns | 0.66 | 0.01 |
| Ascii_Local_NarrowUtf16ToAscii_v1_StoreLower | 33 | AsciiOnly | 3.908 ns | 0.0358 ns | 0.0525 ns | 3.881 ns | 1.00 | 0.00 |
| Ascii_Local_NarrowUtf16ToAscii_v2_Inline | 33 | AsciiOnly | 2.916 ns | 0.0284 ns | 0.0266 ns | 2.905 ns | 0.74 | 0.01 |
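
For reference, a minimal BenchmarkDotNet harness of roughly the same shape is sketched below. This is not the author's actual benchmark (which calls the internal NarrowUtf16ToAscii helpers directly); it measures the public ASCII encoding path, which should exercise the same narrowing code, and the [Params] values mirror the lengths in the table above.

using System;
using System.Text;
using BenchmarkDotNet.Attributes;

public class AsciiNarrowBenchmark
{
    private string _source = string.Empty;
    private byte[] _destination = Array.Empty<byte>();

    // Lengths mirror the table above.
    [Params(5, 8, 15, 16, 19, 31, 33)]
    public int StringLengthInChars;

    [GlobalSetup]
    public void Setup()
    {
        _source = new string('a', StringLengthInChars); // all-ASCII input
        _destination = new byte[StringLengthInChars];
    }

    [Benchmark]
    public int AsciiOnly() => Encoding.ASCII.GetBytes(_source, _destination);
}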

Review thread on this diff hunk:

{
    // The code below translates to a single write on x86 (for both 32-bit and 64-bit).
    // We use double instead of long so that the JIT writes directly to memory without an
    // intermediate copy (a register, or the stack in the 32-bit case).
    Unsafe.WriteUnaligned<double>(ref Unsafe.Add(ref bytePtr, elementOffset), byteVector.AsDouble().ToScalar());
Member

LGTM, but I don't think that comment is needed, because it is also expected to be a single instruction on Arm. Also, I'm not sure double is any better than long here; the JIT is expected to do the right thing anyway.

Member

I can believe that the double hack helps on 32-bit x86. We do not pay as much attention to codegen quality on 32-bit x86, and there are definitely issues. I do not think it is worth adding workarounds for 32-bit x86 quality issues like this one to CoreLib. If issues like this one are important to fix, it would be better to fix them in the JIT.

Member

@EgorBo EgorBo May 7, 2023

Ah, I missed the fact that the same code with long generates bad codegen on 32-bit x86; I had hoped the JIT would emit the same code as for double 🙁 @jkotas, do you mean the whole helper call is not needed? (It is needed because the original code was using Vector64, which we don't recommend using on x86/x64.) Or are you fine with changing this to long and removing the notes? (I agree that we'd better fix this in the JIT for 32-bit; we likely already have a few similar patterns in the BCL.)

Member

[image attachment (32bit)]

Member

> @jkotas do you mean the whole helper call is not needed?

I assume that the helper call is needed to avoid Vector64, which does not produce good code on x64. Without the helper call, the alternative sequence that avoids Vector64 would have to be manually inlined in every place.

Member

EgorBo commented May 7, 2023

@Daniel-Svensson can you please do the same for https://github.com/dotnet/runtime/blob/main/src/libraries/System.Private.CoreLib/src/System/Text/Ascii.CaseConversion.cs#LL532C46-L532C46 ? (you can either inline Unsafe.WriteUnaligned there or move your helper to Vector128 as an internal API)

@Daniel-Svensson
Contributor Author

> @Daniel-Svensson can you please do the same for https://github.com/dotnet/runtime/blob/main/src/libraries/System.Private.CoreLib/src/System/Text/Ascii.CaseConversion.cs#LL532C46-L532C46 ? (you can either inline Unsafe.WriteUnaligned there or move your helper to Vector128 as an internal API)

I moved the helper to Vector128 and called it from there.
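
For illustration, the Vector128 helper mentioned here could have a shape roughly like the one below; the name StoreLowerUnsafe and the exact signature are my guesses, not necessarily what the PR actually added.

using System.Runtime.CompilerServices;
using System.Runtime.Intrinsics;

internal static class Vector128Extensions
{
    // Hypothetical internal helper: store the lower 64 bits of a Vector128<byte>
    // at destination + elementOffset with a single unaligned write, so that both
    // call sites (the narrowing code and Ascii.CaseConversion.cs) can share it.
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    internal static void StoreLowerUnsafe(this Vector128<byte> source, ref byte destination, nuint elementOffset)
    {
        Unsafe.WriteUnaligned(
            ref Unsafe.Add(ref destination, elementOffset),
            source.AsUInt64().ToScalar());
    }
}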

I also found an old unused helper (0 references) with the same purpose in that file; do you want me to remove that method or update it to call the new helper?

[MethodImpl(MethodImplOptions.AggressiveInlining)]
private static unsafe void Widen8To16AndAndWriteTo(Vector128<byte> narrowVector, char* pDest, nuint destOffset)
{
    if (Vector256.IsHardwareAccelerated)
    {
        Vector256<ushort> wide = Vector256.WidenLower(narrowVector.ToVector256Unsafe());
        wide.StoreUnsafe(ref *(ushort*)pDest, destOffset);
    }
    else
    {
        Vector128.WidenLower(narrowVector).StoreUnsafe(ref *(ushort*)pDest, destOffset);
        Vector128.WidenUpper(narrowVector).StoreUnsafe(ref *(ushort*)pDest, destOffset + 8);
    }
}

[MethodImpl(MethodImplOptions.AggressiveInlining)]
private static unsafe void Narrow16To8AndAndWriteTo(Vector128<ushort> wideVector, byte* pDest, nuint destOffset)
{
    Vector128<byte> narrow = Vector128.Narrow(wideVector, wideVector);

    if (Sse2.IsSupported)
    {
        // MOVQ is supported even on x86, unaligned accesses allowed
        Sse2.StoreScalar((ulong*)(pDest + destOffset), narrow.AsUInt64());
    }
    else if (Vector64.IsHardwareAccelerated)
    {
        narrow.GetLower().StoreUnsafe(ref *pDest, destOffset);
    }
    else
    {
        Unsafe.WriteUnaligned<ulong>(pDest + destOffset, narrow.AsUInt64().ToScalar());
    }
}
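
For context, a hedged sketch of how a narrowing helper like Narrow16To8AndAndWriteTo is typically driven; the loop below is illustrative only (not code from Ascii.CaseConversion.cs) and assumes the helper above is in scope.

using System.Runtime.Intrinsics;

// Hypothetical driver: narrow chars that are already known to be ASCII into
// bytes, eight chars per iteration; the 0-7 char tail would be handled by
// scalar code that is omitted here.
static unsafe void NarrowKnownAsciiBlock(char* pSrc, byte* pDest, nuint elementCount)
{
    for (nuint i = 0; i + 8 <= elementCount; i += 8)
    {
        // Load 8 UTF-16 code units and write them back as 8 ASCII bytes.
        Vector128<ushort> wide = Vector128.Load((ushort*)pSrc + i);
        Narrow16To8AndAndWriteTo(wide, pDest, i);
    }
}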

Member

@adamsitnik adamsitnik left a comment

Impressive optimization, @Daniel-Svensson!

> I also found an old unused helper (0 references) with the same purpose in that file,

Please just remove the unused helper.

@adamsitnik adamsitnik added this to the 8.0.0 milestone May 11, 2023
@adamsitnik adamsitnik added the tenet-performance label May 11, 2023
@adamsitnik adamsitnik self-assigned this May 11, 2023
Member

@adamsitnik adamsitnik left a comment

LGTM. Again, thank you for your contribution, @Daniel-Svensson!

Labels: area-System.Text.Encoding, community-contribution, tenet-performance