-
Notifications
You must be signed in to change notification settings - Fork 4.9k
Make use of CopyBlock for non-overlapping spans #17063
Conversation
src/System.Memory/src/System/Span.cs
Outdated
{ | ||
Unsafe.Add<T>(ref dst, i) = Unsafe.Add<T>(ref src, i); | ||
Unsafe.CopyBlock<T>(ref dst, ref src, (uint)length); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does BlockCopy work for overlapping ranges where the source is after (address wise) the destination?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is no Unsafe.CopyBlock<T>
. I do not see how this can ever work. Note that this implementation is only for slow span, and so you have to go out of your way to build and test it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jkotas This is my first touch of CoreFx and I learned something new today. I was clearly not building what I thought I was building. Apparently, the file I was changing is only included in builds on particular platforms that I wasn't hitting until I submitted this.
I am testing a new commit that addresses the real issue right now. I will update this PR soon.
// Source address greater than or equal to destination address. Can do normal copy. | ||
for (int i = 0; i < length; i++) | ||
{ | ||
Unsafe.Add<T>(ref dst, i) = Unsafe.Add<T>(ref src, i); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this be unrolled?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was initially just fixing the perf for non-overlapping blocks. If you want this unrolled, I can add that as another commit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Either a different commit or a different PR, but I think we should try to make the gap between overlapping and non-overlapping as small as possible. Otherwise it creates annoying performance traps.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The slower path for overlapping exists in the fast Span as well. Fixing it for fast Span should be higher priority than for the slow Span.
Do we have a benchmark for this checked into coreclr? so that the JIT team can work on optimizing code gen for it? cc: @ahsonkhan |
src/System.Memory/src/System/Span.cs
Outdated
Unsafe.CopyBlock<T>(ref dst, ref src, (uint)length); | ||
void* pSrc = Unsafe.AsPointer<T>(ref src); | ||
void* pDst = Unsafe.AsPointer<T>(ref dst); | ||
Unsafe.CopyBlockUnaligned(pDst, pSrc, (uint)(length * Unsafe.SizeOf<T>())); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
length * Unsafe.SizeOf<T>()
can be more than 4GB. It is not ok to cast it to uint.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unsafe.CopyBlock only takes a uint in the 3rd argument.
Should I add a check and fail on overflow or is there a bug in CopyBlock?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You should call CopyBlock multiple times as necessary.
CopyBlock is basically just exposing cpblk
instruction as an API. The cpblk
instruction takes uint size unfortunately.
src/System.Memory/src/System/Span.cs
Outdated
Unsafe.CopyBlock<T>(ref dst, ref src, (uint)length); | ||
void* pSrc = Unsafe.AsPointer<T>(ref src); | ||
void* pDst = Unsafe.AsPointer<T>(ref dst); | ||
Unsafe.CopyBlockUnaligned(pDst, pSrc, (uint)(length * Unsafe.SizeOf<T>())); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
CopyBlockUnaligned only works for elements that do not contain GC pointers.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jkotas Is it okay to call CopyBlock with an unaligned buffer? Or do we need to ensure it is aligned?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Calling CopyBlockUnaligned on unaligned buffer is fine.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jkotas Is it okay to call CopyBlock with an unaligned buffer? Or do we need to ensure it is aligned?
Calling CopyBlockUnaligned on unaligned buffer is fine.
CopyBlock or CopyBlockUnaligned?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
CopyBlockUnaligned. It will do the same thing as CopyBlock on x86 and x64 because of these platforms do not require alignment.
@KrzysztofCwalina, there is a benchmark for Span.CopyTo, however it only tests fast span. |
@ahsonkhan, thanks. Fast span only benchmarks is good enough. I don't think we want to spend much more resources optimizing JIT for the slow span |
src/System.Memory/src/System/Span.cs
Outdated
// Source address greater than or equal to destination address. Can do normal copy. | ||
for (int i = 0; i < length; i++) | ||
IntPtr tailDiff = Unsafe.ByteOffset<T>(ref Unsafe.Add<T>(ref dst, destLength), ref src); | ||
bool isOverlapped = (sizeof(IntPtr) == sizeof(int)) ? tailDiff.ToInt32() < 0 : tailDiff.ToInt64() < 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jkotas, these sizeof checks won't be necessary once we have support for native int, correct?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right.
src/System.Memory/src/System/Span.cs
Outdated
@@ -386,6 +382,25 @@ public bool TryCopyTo(Span<T> destination) | |||
return false; | |||
} | |||
|
|||
[MethodImpl(MethodImplOptions.AggressiveInlining)] | |||
private unsafe void CopyBlocks(ref T dst, ref T src, int length) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it worth having a fast path for when length < uint.MaxValue
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is slow span only and this implementation is already 9x faster than what we had. I could add that optimization if we really want it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not that important 9x is good 😄. That can be up for grabs
src/System.Memory/src/System/Span.cs
Outdated
while (i-- != 0) | ||
IntPtr tailDiff = Unsafe.ByteOffset<T>(ref Unsafe.Add<T>(ref src, length), ref dst); | ||
bool isOverlapped = (sizeof(IntPtr) == sizeof(int)) ? tailDiff.ToInt32() < 0 : tailDiff.ToInt64() < 0; | ||
if (!isOverlapped) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You also need to check for SpanHelpers.IsReferenceOrContainsReferences() and use the slower path if it returns true.
@dotnet-bot test Innerloop OSX10.12 Debug x64 Build and Test |
@jkotas @KrzysztofCwalina Any more comments? |
src/System.Memory/src/System/Span.cs
Outdated
} | ||
|
||
return true; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we refactor out the if(srcGreaterThanDst) block and if(dstGreaterThanSrc) block into helper methods? They are very similar.
Unsafe.Add<T>(ref dst, i) = Unsafe.Add<T>(ref src, i); | ||
// Source address less than destination address. Must do backward copy. | ||
int i = length; | ||
while (i-- != 0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We have for loop above vs while loop here. Is there a reason for this inconsistency?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I never changed that part. Not sure if there is a performance reason it was done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, the loop above walks memory forward. This one walks memory backwards. Walking memory forward is faster on some hardware as the CPU prefetches memory in the former case and not the latter.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@KrzysztofCwalina I think the question was about for vs while? Is there a reason for was used with one loop but while with the backwards loop?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, got it. In general, I think it's better to use a for loop as it makes it more likely that bounds check optimizations will kick in. Otherwise, the optimizer might not realize that the while loop variable is actually the Length.
But, maybe our optimizer is better than I realize? @russellhadley?
src/System.Memory/src/System/Span.cs
Outdated
bool isOverlapped = (sizeof(IntPtr) == sizeof(int)) ? tailDiff.ToInt32() < 0 : tailDiff.ToInt64() < 0; | ||
if (!isOverlapped && !SpanHelpers.IsReferenceOrContainsReferences<T>()) | ||
{ | ||
CopyBlocks(ref dst, ref src, srcLength); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Question: Why can't we call CopyBlocks if SpanHelpers.IsReferenceOrContainsReferences()?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jkotas Also curious
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why can't we call CopyBlocks if SpanHelpers.IsReferenceOrContainsReferences
Because of CopyBlocks does not have the correct GC tracking (does not update cards for generational GC).
If you would like to see the details, take a look at the optimized Span Copy implementation in CoreCLR that updates the cards in bulk (dotnet/coreclr#9999).
src/System.Memory/src/System/Span.cs
Outdated
|
||
bool srcGreaterThanDst = (sizeof(IntPtr) == sizeof(int)) ? srcMinusDst.ToInt32() >= 0 : srcMinusDst.ToInt64() >= 0; | ||
if (srcGreaterThanDst) | ||
if (!TryFastCopy(ref dst, destLength, ref src, length)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any significant perf change because of this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is about the same logic as before but actually slightly more simple. In my tests, there isn't anything noticeably different than the first version.
FYI, the code diff here looks a lot dirtier than it actually is. In Beyond Compare, it is much easier to consume. I now just add 2 new private methods that do the "fast" version of copy checking and work. Then I wrapped the original code in a fast attempt, but with a slow fallback. |
src/System.Memory/src/System/Span.cs
Outdated
// Source address greater than or equal to destination address. Can do normal copy. | ||
for (int i = 0; i < length; i++) | ||
{ | ||
Unsafe.Add<T>(ref dst, i) = Unsafe.Add<T>(ref src, i); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be great to have an issue opened to unroll the loop later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
#17118
src/System.Memory/src/System/Span.cs
Outdated
private unsafe bool TryFastCopy(ref T dst, int dstLength, ref T src, int srcLength) | ||
{ | ||
IntPtr srcMinusDst = Unsafe.ByteOffset<T>(ref dst, ref src); | ||
bool srcGreaterThanDst = (sizeof(IntPtr) == sizeof(int)) ? srcMinusDst.ToInt32() >= 0 : srcMinusDst.ToInt64() >= 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not simply call ToInt64, i.e. don't we properly preserve the sign when we convert 32-bit IntPtr to Int64?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was not certain of this either. The original code did this. I assumed there was a reason. @jkotas?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not simply call ToInt64
There are other places that check for sizeof(IntPtr) == sizeof(int)
in the Span implementation. It is done to avoid performance hit from using long on 32-bit platforms. Workaround for lack of native int type in C#...
src/System.Memory/src/System/Span.cs
Outdated
} | ||
|
||
[MethodImpl(MethodImplOptions.AggressiveInlining)] | ||
private unsafe bool TryFastCopy(ref T dst, int dstLength, ref T src, int srcLength) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be good to add more comments in the method (like you did above) to describe all the pointer arithmetics and checks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
src/System.Memory/src/System/Span.cs
Outdated
|
||
[MethodImpl(MethodImplOptions.AggressiveInlining)] | ||
private unsafe void CopyBlocks(ref T dst, ref T src, int length) | ||
{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It might be worth adding Debug.Assert(length>=0);
And in general, would be good to have more debug assertions in code that deals with pointer arithmetic. It's difficult to assert the correctness something without seeing calling methods.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done. I also added an early exit for zero length span at the top level public method.
src/System.Memory/src/System/Span.cs
Outdated
} | ||
|
||
[MethodImpl(MethodImplOptions.AggressiveInlining)] | ||
private unsafe bool TryFastCopy(ref T dst, int dstLength, ref T src, int srcLength) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the point of splitting this into multiple methods? Are they even going to get inlined? (Note that AggressiveInlining does not guarantee inlining - the JIT often bails if the method is too complex.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It was for readability, but I'm fine with merging them back together.
looks good. |
Part of dotnet/corefxlab#1314 |
Fix for #16840.
For non-overlapping spans, this makes copy about 9-10x faster. For overlapping spans, we still take the slow paths.
[Edit] 9-10x for sufficiently large blocks (> 2 KB). Though a 256 byte block was still about 4x faster. A 16 byte block was only marginally faster.