Fix GetIndexOfFirstNonAsciiByte_Vector not taken on ARM64 #90527

Closed
wants to merge 4 commits into from
176 changes: 29 additions & 147 deletions src/libraries/System.Private.CoreLib/src/System/Text/Ascii.Utility.cs
@@ -52,29 +52,6 @@ private static bool AllCharsInUInt64AreAscii<T>(ulong value)
: AllCharsInUInt64AreAscii(value);
}

- [MethodImpl(MethodImplOptions.AggressiveInlining)]
- [CompExactlyDependsOn(typeof(AdvSimd.Arm64))]
- private static int GetIndexOfFirstNonAsciiByteInLane_AdvSimd(Vector128<byte> value, Vector128<byte> bitmask)
- {
-     if (!AdvSimd.Arm64.IsSupported || !BitConverter.IsLittleEndian)
-     {
-         throw new PlatformNotSupportedException();
-     }
-
-     // extractedBits[i] = (value[i] >> 7) & (1 << (12 * (i % 2)));
-     Vector128<byte> mostSignificantBitIsSet = AdvSimd.ShiftRightArithmetic(value.AsSByte(), 7).AsByte();
-     Vector128<byte> extractedBits = AdvSimd.And(mostSignificantBitIsSet, bitmask);
-
-     // collapse mask to lower bits
-     extractedBits = AdvSimd.Arm64.AddPairwise(extractedBits, extractedBits);
-     ulong mask = extractedBits.AsUInt64().ToScalar();
-
-     // calculate the index
-     int index = BitOperations.TrailingZeroCount(mask) >> 2;
-     Debug.Assert((mask != 0) ? index < 16 : index >= 16);
-     return index;
- }
-
/// <summary>
/// Given a DWORD which represents two packed chars in machine-endian order,
/// <see langword="true"/> iff the first char (in machine-endian order) is ASCII.
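Aside: the deleted helper emulates x86 pmovmskb on AdvSimd. The 0x1001 bitmask's byte lanes alternate 0x01 and 0x10, so after AddPairwise each of the 16 input bytes owns one nibble of a 64-bit scalar mask. A minimal scalar sketch of that layout, assuming little-endian (the method name is hypothetical):

```csharp
using System;
using System.Numerics;

static class MaskDemo
{
    // Builds the same 64-bit mask the AddPairwise trick produces: nibble i of
    // the mask gets its low bit set iff input byte i is non-ASCII (>= 0x80).
    public static int FirstNonAsciiIndexScalar(ReadOnlySpan<byte> lane)
    {
        ulong mask = 0;
        for (int i = 0; i < 16; i++)
        {
            if ((lane[i] & 0x80) != 0)
            {
                mask |= 1UL << (4 * i); // byte i lands on bit 4*i
            }
        }

        // TrailingZeroCount(0) == 64, so an all-ASCII lane yields 64 >> 2 == 16,
        // matching the deleted Debug.Assert (index >= 16 means "none found").
        return BitOperations.TrailingZeroCount(mask) >> 2;
    }
}
```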
@@ -101,11 +78,9 @@ internal static unsafe nuint GetIndexOfFirstNonAsciiByte(byte* pBuffer, nuint bufferLength)
// like pmovmskb which we know are optimized, and (b) we can avoid downclocking the processor while
// this method is running.

- if (!Vector512.IsHardwareAccelerated &&
-     !Vector256.IsHardwareAccelerated &&
-     (Sse2.IsSupported || AdvSimd.IsSupported))
+ if (!Vector512.IsHardwareAccelerated && !Vector256.IsHardwareAccelerated && Sse2.IsSupported)
Member

If we do this, then doesn't this change leave a bunch of dead code in GetIndexOfFirstNonAsciiByte_Intrinsified that should be cleaned up? It has multiple blocks dedicated to AdvSimd which would never be reached. At which point the method should also be renamed to be SSE-specific.

But I'd also explicitly asked about this:
#88532 (comment)
and was told that preferring the intrinsics path was by design. Maybe it was only measured on x64 and not on Arm?

cc: @tannergooding, @anthonycanino

Contributor Author

@neon-sunset Aug 14, 2023

> If we do this, then doesn't this change leave a bunch of dead code in GetIndexOfFirstNonAsciiByte_Intrinsified that should be cleaned up? It has multiple blocks dedicated to AdvSimd which would never be reached. At which point the method should also be renamed to be SSE-specific.

Let me fix that.

> #88532 (comment)

The _Vector path is estimated to be ~40% faster on osx-arm64. It is likely the switch was not measured on ARM64.

Member

> The _Vector path is estimated to be ~40% faster on osx-arm64. It is likely the switch was not measured on ARM64.

@neon-sunset can you clarify please? .NET 7 was using the _Intrinsified path for Arm64: https://github.com/dotnet/runtime/blob/v7.0.10/src/libraries/System.Private.CoreLib/src/System/Text/ASCIIUtility.cs#L81-L91

The PR that introduced the _Vector path (#88532) didn't change _Intrinsified at all (outside of using the centralized Vector128.Size constant) and didn't change which path Arm64 was calling.

This then wouldn't seem like a regression, but rather simply a case where the new _Vector code is faster for Arm64. If that's the case, could you share your own numbers showing the .NET 7 vs .NET 8 difference (without this PR) and the difference for _Vector vs _Intrinsified for .NET 8 (this PR vs without this PR)?

Contributor Author

> @neon-sunset can you clarify please? .NET 7 was using the _Intrinsified path for Arm64: https://github.com/dotnet/runtime/blob/v7.0.10/src/libraries/System.Private.CoreLib/src/System/Text/ASCIIUtility.cs#L81-L91

You are correct, my mistake.

> This then wouldn't seem like a regression, but rather simply a case where the new _Vector code is faster for Arm64. If that's the case, could you share your own numbers showing the .NET 7 vs .NET 8 difference (without this PR) and the difference for _Vector vs _Intrinsified for .NET 8 (this PR vs without this PR)?

This is the plan; I'm just waiting for the Release build to finish so I can post detailed numbers 😄
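For reference, a comparison along the lines requested above could be sketched with BenchmarkDotNet. This assumes Ascii.IsValid as the public entry point that funnels into GetIndexOfFirstNonAsciiByte; the buffer lengths are arbitrary:

```csharp
using System;
using System.Linq;
using System.Text;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

public class AsciiBench
{
    private byte[] _data = Array.Empty<byte>();

    [Params(16, 128, 4096)] // illustrative sizes only
    public int Length;

    [GlobalSetup]
    public void Setup() =>
        _data = Enumerable.Repeat((byte)'a', Length).ToArray(); // all-ASCII forces a full scan

    [Benchmark]
    public bool IsValid() => Ascii.IsValid(_data);
}

public static class Program
{
    public static void Main() => BenchmarkRunner.Run<AsciiBench>();
}
```

Running the same project against .NET 7, .NET 8 without this PR, and .NET 8 with this PR would give the three data points being asked for.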

{
- return GetIndexOfFirstNonAsciiByte_Intrinsified(pBuffer, bufferLength);
+ return GetIndexOfFirstNonAsciiByte_SSE2(pBuffer, bufferLength);
}
else
{
@@ -347,22 +322,22 @@ private static bool ContainsNonAsciiByte_AdvSimd(uint advSimdIndex)
return advSimdIndex < 16;
}

- private static unsafe nuint GetIndexOfFirstNonAsciiByte_Intrinsified(byte* pBuffer, nuint bufferLength)
+ [CompExactlyDependsOn(typeof(Sse2))]
+ private static unsafe nuint GetIndexOfFirstNonAsciiByte_SSE2(byte* pBuffer, nuint bufferLength)
{
// JIT turns the below into constants

uint SizeOfVector128 = (uint)sizeof(Vector128<byte>);
nuint MaskOfAllBitsInVector128 = (nuint)(SizeOfVector128 - 1);

- Debug.Assert(Sse2.IsSupported || AdvSimd.Arm64.IsSupported, "Sse2 or AdvSimd64 required.");
- Debug.Assert(BitConverter.IsLittleEndian, "This SSE2/Arm64 implementation assumes little-endian.");
+ Debug.Assert(Sse2.IsSupported);
+ Debug.Assert(BitConverter.IsLittleEndian, "This SSE2 implementation assumes little-endian.");

Vector128<byte> bitmask = BitConverter.IsLittleEndian ?
Vector128.Create((ushort)0x1001).AsByte() :
Vector128.Create((ushort)0x0110).AsByte();

uint currentSseMask = uint.MaxValue, secondSseMask = uint.MaxValue;
- uint currentAdvSimdIndex = uint.MaxValue, secondAdvSimdIndex = uint.MaxValue;
byte* pOriginalBuffer = pBuffer;

// This method is written such that control generally flows top-to-bottom, avoiding
@@ -377,25 +352,10 @@ private static unsafe nuint GetIndexOfFirstNonAsciiByte_Intrinsified(byte* pBuffer, nuint bufferLength)

// Read the first vector unaligned.

- if (Sse2.IsSupported)
- {
-     currentSseMask = (uint)Sse2.MoveMask(Sse2.LoadVector128(pBuffer)); // unaligned load
-     if (ContainsNonAsciiByte_Sse2(currentSseMask))
-     {
-         goto FoundNonAsciiDataInCurrentChunk;
-     }
- }
- else if (AdvSimd.Arm64.IsSupported)
+ currentSseMask = (uint)Sse2.MoveMask(Sse2.LoadVector128(pBuffer)); // unaligned load
+ if (ContainsNonAsciiByte_Sse2(currentSseMask))
{
-     currentAdvSimdIndex = (uint)GetIndexOfFirstNonAsciiByteInLane_AdvSimd(AdvSimd.LoadVector128(pBuffer), bitmask); // unaligned load
-     if (ContainsNonAsciiByte_AdvSimd(currentAdvSimdIndex))
-     {
-         goto FoundNonAsciiDataInCurrentChunk;
-     }
- }
- else
- {
-     throw new PlatformNotSupportedException();
+     goto FoundNonAsciiDataInCurrentChunk;
}

// If we have less than 32 bytes to process, just go straight to the final unaligned
Expand Down Expand Up @@ -432,33 +392,14 @@ private static unsafe nuint GetIndexOfFirstNonAsciiByte_Intrinsified(byte* pBuff

do
{
- if (Sse2.IsSupported)
- {
-     Vector128<byte> firstVector = Sse2.LoadAlignedVector128(pBuffer);
-     Vector128<byte> secondVector = Sse2.LoadAlignedVector128(pBuffer + SizeOfVector128);
+ Vector128<byte> firstVector = Sse2.LoadAlignedVector128(pBuffer);
+ Vector128<byte> secondVector = Sse2.LoadAlignedVector128(pBuffer + SizeOfVector128);

-     currentSseMask = (uint)Sse2.MoveMask(firstVector);
-     secondSseMask = (uint)Sse2.MoveMask(secondVector);
-     if (ContainsNonAsciiByte_Sse2(currentSseMask | secondSseMask))
-     {
-         goto FoundNonAsciiDataInInnerLoop;
-     }
- }
- else if (AdvSimd.Arm64.IsSupported)
+ currentSseMask = (uint)Sse2.MoveMask(firstVector);
+ secondSseMask = (uint)Sse2.MoveMask(secondVector);
+ if (ContainsNonAsciiByte_Sse2(currentSseMask | secondSseMask))
{
-     Vector128<byte> firstVector = AdvSimd.LoadVector128(pBuffer);
-     Vector128<byte> secondVector = AdvSimd.LoadVector128(pBuffer + SizeOfVector128);
-
-     currentAdvSimdIndex = (uint)GetIndexOfFirstNonAsciiByteInLane_AdvSimd(firstVector, bitmask);
-     secondAdvSimdIndex = (uint)GetIndexOfFirstNonAsciiByteInLane_AdvSimd(secondVector, bitmask);
-     if (ContainsNonAsciiByte_AdvSimd(currentAdvSimdIndex) || ContainsNonAsciiByte_AdvSimd(secondAdvSimdIndex))
-     {
-         goto FoundNonAsciiDataInInnerLoop;
-     }
- }
- else
- {
-     throw new PlatformNotSupportedException();
+     goto FoundNonAsciiDataInInnerLoop;
}

pBuffer += 2 * SizeOfVector128;
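The rewritten inner loop keeps the original single-branch shape: two vector loads and two pmovmskb results OR-ed into one test, deferring the question of which of the two vectors tripped until after the loop. A simplified sketch of that shape (hypothetical helper; the real loop uses aligned loads and goto-based control flow):

```csharp
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

static class InnerLoopDemo
{
    // Checks 32 bytes with a single branch per iteration.
    public static unsafe bool AnyNonAsciiIn32Bytes(byte* p)
    {
        Vector128<byte> first = Sse2.LoadVector128(p);       // real code: LoadAlignedVector128
        Vector128<byte> second = Sse2.LoadVector128(p + 16);

        // MoveMask packs each byte's sign bit into one scalar bit, so any set
        // bit marks a byte >= 0x80 (non-ASCII) in either vector.
        uint combined = (uint)Sse2.MoveMask(first) | (uint)Sse2.MoveMask(second);
        return combined != 0;
    }
}
```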
@@ -482,25 +423,10 @@ private static unsafe nuint GetIndexOfFirstNonAsciiByte_Intrinsified(byte* pBuffer, nuint bufferLength)
// At least one full vector's worth of data remains, so we can safely read it.
// Remember, at this point pBuffer is still aligned.

- if (Sse2.IsSupported)
- {
-     currentSseMask = (uint)Sse2.MoveMask(Sse2.LoadAlignedVector128(pBuffer));
-     if (ContainsNonAsciiByte_Sse2(currentSseMask))
-     {
-         goto FoundNonAsciiDataInCurrentChunk;
-     }
- }
- else if (AdvSimd.Arm64.IsSupported)
- {
-     currentAdvSimdIndex = (uint)GetIndexOfFirstNonAsciiByteInLane_AdvSimd(AdvSimd.LoadVector128(pBuffer), bitmask);
-     if (ContainsNonAsciiByte_AdvSimd(currentAdvSimdIndex))
-     {
-         goto FoundNonAsciiDataInCurrentChunk;
-     }
- }
- else
+ currentSseMask = (uint)Sse2.MoveMask(Sse2.LoadAlignedVector128(pBuffer));
+ if (ContainsNonAsciiByte_Sse2(currentSseMask))
{
-     throw new PlatformNotSupportedException();
+     goto FoundNonAsciiDataInCurrentChunk;
}

IncrementCurrentOffsetBeforeFinalUnalignedVectorRead:
@@ -516,27 +442,10 @@ private static unsafe nuint GetIndexOfFirstNonAsciiByte_Intrinsified(byte* pBuffer, nuint bufferLength)

pBuffer += (bufferLength & MaskOfAllBitsInVector128) - SizeOfVector128;

- if (Sse2.IsSupported)
- {
-     currentSseMask = (uint)Sse2.MoveMask(Sse2.LoadVector128(pBuffer)); // unaligned load
-     if (ContainsNonAsciiByte_Sse2(currentSseMask))
-     {
-         goto FoundNonAsciiDataInCurrentChunk;
-     }
-
- }
- else if (AdvSimd.Arm64.IsSupported)
- {
-     currentAdvSimdIndex = (uint)GetIndexOfFirstNonAsciiByteInLane_AdvSimd(AdvSimd.LoadVector128(pBuffer), bitmask); // unaligned load
-     if (ContainsNonAsciiByte_AdvSimd(currentAdvSimdIndex))
-     {
-         goto FoundNonAsciiDataInCurrentChunk;
-     }
-
- }
- else
+ currentSseMask = (uint)Sse2.MoveMask(Sse2.LoadVector128(pBuffer)); // unaligned load
+ if (ContainsNonAsciiByte_Sse2(currentSseMask))
{
-     throw new PlatformNotSupportedException();
+     goto FoundNonAsciiDataInCurrentChunk;
}

pBuffer += SizeOfVector128;
@@ -551,46 +460,19 @@ private static unsafe nuint GetIndexOfFirstNonAsciiByte_Intrinsified(byte* pBuffer, nuint bufferLength)
// instead be the second mask. If so, skip the entire first mask and drain ASCII bytes
// from the second mask.

- if (Sse2.IsSupported)
- {
-     if (!ContainsNonAsciiByte_Sse2(currentSseMask))
-     {
-         pBuffer += SizeOfVector128;
-         currentSseMask = secondSseMask;
-     }
- }
- else if (AdvSimd.IsSupported)
+ if (!ContainsNonAsciiByte_Sse2(currentSseMask))
{
-     if (!ContainsNonAsciiByte_AdvSimd(currentAdvSimdIndex))
-     {
-         pBuffer += SizeOfVector128;
-         currentAdvSimdIndex = secondAdvSimdIndex;
-     }
- }
- else
- {
-     throw new PlatformNotSupportedException();
+     pBuffer += SizeOfVector128;
+     currentSseMask = secondSseMask;
}
FoundNonAsciiDataInCurrentChunk:


- if (Sse2.IsSupported)
- {
-     // The mask contains - from the LSB - a 0 for each ASCII byte we saw, and a 1 for each non-ASCII byte.
-     // Tzcnt is the correct operation to count the number of zero bits quickly. If this instruction isn't
-     // available, we'll fall back to a normal loop.
-     Debug.Assert(ContainsNonAsciiByte_Sse2(currentSseMask), "Shouldn't be here unless we see non-ASCII data.");
-     pBuffer += (uint)BitOperations.TrailingZeroCount(currentSseMask);
- }
- else if (AdvSimd.Arm64.IsSupported)
- {
-     Debug.Assert(ContainsNonAsciiByte_AdvSimd(currentAdvSimdIndex), "Shouldn't be here unless we see non-ASCII data.");
-     pBuffer += currentAdvSimdIndex;
- }
- else
- {
-     throw new PlatformNotSupportedException();
- }
+ // The mask contains - from the LSB - a 0 for each ASCII byte we saw, and a 1 for each non-ASCII byte.
+ // Tzcnt is the correct operation to count the number of zero bits quickly. If this instruction isn't
+ // available, we'll fall back to a normal loop.
+ Debug.Assert(ContainsNonAsciiByte_Sse2(currentSseMask), "Shouldn't be here unless we see non-ASCII data.");
+ pBuffer += (uint)BitOperations.TrailingZeroCount(currentSseMask);

goto Finish;
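A small worked example of the drain described in the comment above (values illustrative):

```csharp
using System;
using System.Numerics;

// Suppose the 16 loaded bytes were "abc" followed by 0xE9 ('é' in Latin-1):
// MoveMask sets one bit per non-ASCII byte, counted from the LSB.
uint mask = 0b1000; // only byte 3 (0xE9) has its high bit set

int index = BitOperations.TrailingZeroCount(mask); // == 3
Console.WriteLine(index); // pBuffer += 3 points at the first non-ASCII byte
```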
