-
Notifications
You must be signed in to change notification settings - Fork 4.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ARM64: Optimize IndexOf(byte) and IndexOf(char) APIs using intrinsics. #37624
Conversation
@dotnet/jit-contrib , @TamarChristinaArm , @tannergooding , @GrabYourPitchforks, @danmosemsft |
@kunalspathak the answer is... it depends.. on older cores such as Cortex-A55 your sequence will probably be faster, on newer cores like Cortex-A76 (which is what AoR is tuned for) the |
Thanks @TamarChristinaArm. So probably, switching to the other version will be sensible to do? #Resolved |
I would say yes, even though you may get a lower increased based on hardware you may be benchmarking on now, on the long run it would the better sequence. #Resolved |
@TamarChristinaArm Is there any table explaining instruction costs? #Resolved |
So far I was referring to Cortex-A55 optimization guide, but looking at Cortex-A76 optimization guide, |
src/libraries/System.Private.CoreLib/src/System/SpanHelpers.Byte.cs
Outdated
Show resolved
Hide resolved
src/libraries/System.Private.CoreLib/src/System/SpanHelpers.Byte.cs
Outdated
Show resolved
Hide resolved
src/libraries/System.Private.CoreLib/src/System/SpanHelpers.Byte.cs
Outdated
Show resolved
Hide resolved
src/libraries/System.Private.CoreLib/src/System/SpanHelpers.Byte.cs
Outdated
Show resolved
Hide resolved
src/libraries/System.Private.CoreLib/src/System/SpanHelpers.Char.cs
Outdated
Show resolved
Hide resolved
Overall LGTM, just a few questions/comments 😄 #Resolved |
Thanks @tannergooding . As Tamar suggested, I am going to switch over to the other implementation and then address your comments. |
src/libraries/System.Private.CoreLib/src/System/SpanHelpers.Byte.cs
Outdated
Show resolved
Hide resolved
Also moved the code in a common method that can be used to `FindFirstMatchedLane` in other APIs as well.
I do not think we need to use a mask at all. On input var zip = AdvSimd.Arm64.AddPairwise(vector.AsByte(), vector.AsByte()); // Sequence of 00 and fe bytes, two identical copies
var scalar = zip.AsUInt64().ToScalar(); // Number like 0x00fe00fefe000000, each char corresponds to 1 byte = 8 bits
return BitOperations.TrailingZeroCount(scalar) >> 3; |
Thanks for your suggestion Anton. Did you mean to also |
72bec3b
to
1ee145e
Compare
@kunalspathak Why do we need two |
Ah, I see what you mean. I was thinking that your code is for var themask = Vector128.Create((byte)1, 16, 1, 16, 1, 16, 1, 16, 1, 16, 1, 16, 1, 16, 1, 16);
var masked = AdvSimd.And(vector, themask);
var zip = AdvSimd.Arm64.AddPairwise(masked, masked);
return BitOperations.TrailingZeroCount(zip.AsUInt64().ToScalar()) >> 2; |
@kunalspathak Yes, I was thinking about the |
src/libraries/System.Private.CoreLib/src/System/SpanHelpers.Char.cs
Outdated
Show resolved
Hide resolved
src/libraries/System.Private.CoreLib/src/System/SpanHelpers.Char.cs
Outdated
Show resolved
Hide resolved
Looks great to me. One related thing I noticed is that the |
|
Found following tracking issues:
|
Optimize
IndexOf()
using ARM64 intrinsics. The only portion to be optimized here is searching 128 bytes at a time for byte/char and if found, finding the first index that matched. SSE/AVX usesMoveMask
which makes it easier to find the matching index, but I didn't find an equivalent instruction/intrinsic to do that in ARM64. Hence I did little masking to find out the matching index. Here is my relevant C# code that selects the first lane that was set inCompareEqual()
:which generates the following code:
I also checked ARM's optimized-routines as well in llvm and it is equivalent to following C# code:
The generated code for the same is:
I don't see why would this code be any better than the one that this PR produces. @TamarChristinaArm, let me know what you think.
Here are the performance improvement results. Performance code borrowed from here.
Performance results
Here is another run for different array. Again, code borrowed from here.
Performance results
Contributes to #33308 and #33707.
Update:
JIT_before_indexof.txt
JIT_after_indexof.txt
Here are the revised performance improvement results. Performance code borrowed from here.
Revised performance results
Here is another run for different array. Again, code borrowed from here.
Revised performance results