Skip to content
This repository has been archived by the owner on Jan 23, 2023. It is now read-only.
/ corefx Public archive

Make use of CopyBlock for non-overlapping spans #17063

Merged
merged 8 commits into from
Mar 15, 2017
Merged

Make use of CopyBlock for non-overlapping spans #17063

merged 8 commits into from
Mar 15, 2017

Conversation

shiftylogic
Copy link
Contributor

@shiftylogic shiftylogic commented Mar 13, 2017

Fix for #16840.

For non-overlapping spans, this makes copy about 9-10x faster. For overlapping spans, we still take the slow paths.

[Edit] 9-10x for sufficiently large blocks (> 2 KB). Though a 256 byte block was still about 4x faster. A 16 byte block was only marginally faster.

@shiftylogic
Copy link
Contributor Author

{
Unsafe.Add<T>(ref dst, i) = Unsafe.Add<T>(ref src, i);
Unsafe.CopyBlock<T>(ref dst, ref src, (uint)length);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does BlockCopy work for overlapping ranges where the source is after (address wise) the destination?

Copy link
Member

@jkotas jkotas Mar 13, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is no Unsafe.CopyBlock<T>. I do not see how this can ever work. Note that this implementation is only for slow span, and so you have to go out of your way to build and test it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jkotas This is my first touch of CoreFx and I learned something new today. I was clearly not building what I thought I was building. Apparently, the file I was changing is only included in builds on particular platforms that I wasn't hitting until I submitted this.

I am testing a new commit that addresses the real issue right now. I will update this PR soon.

// Source address greater than or equal to destination address. Can do normal copy.
for (int i = 0; i < length; i++)
{
Unsafe.Add<T>(ref dst, i) = Unsafe.Add<T>(ref src, i);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be unrolled?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was initially just fixing the perf for non-overlapping blocks. If you want this unrolled, I can add that as another commit.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Either a different commit or a different PR, but I think we should try to make the gap between overlapping and non-overlapping as small as possible. Otherwise it creates annoying performance traps.

Copy link
Member

@jkotas jkotas Mar 13, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The slower path for overlapping exists in the fast Span as well. Fixing it for fast Span should be higher priority than for the slow Span.

@KrzysztofCwalina
Copy link
Member

Do we have a benchmark for this checked into coreclr? so that the JIT team can work on optimizing code gen for it?

cc: @ahsonkhan

Unsafe.CopyBlock<T>(ref dst, ref src, (uint)length);
void* pSrc = Unsafe.AsPointer<T>(ref src);
void* pDst = Unsafe.AsPointer<T>(ref dst);
Unsafe.CopyBlockUnaligned(pDst, pSrc, (uint)(length * Unsafe.SizeOf<T>()));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

length * Unsafe.SizeOf<T>() can be more than 4GB. It is not ok to cast it to uint.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unsafe.CopyBlock only takes a uint in the 3rd argument.

Should I add a check and fail on overflow or is there a bug in CopyBlock?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should call CopyBlock multiple times as necessary.

CopyBlock is basically just exposing cpblk instruction as an API. The cpblk instruction takes uint size unfortunately.

Unsafe.CopyBlock<T>(ref dst, ref src, (uint)length);
void* pSrc = Unsafe.AsPointer<T>(ref src);
void* pDst = Unsafe.AsPointer<T>(ref dst);
Unsafe.CopyBlockUnaligned(pDst, pSrc, (uint)(length * Unsafe.SizeOf<T>()));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CopyBlockUnaligned only works for elements that do not contain GC pointers.

Copy link
Contributor Author

@shiftylogic shiftylogic Mar 13, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jkotas Is it okay to call CopyBlock with an unaligned buffer? Or do we need to ensure it is aligned?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Calling CopyBlockUnaligned on unaligned buffer is fine.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jkotas Is it okay to call CopyBlock with an unaligned buffer? Or do we need to ensure it is aligned?

Calling CopyBlockUnaligned on unaligned buffer is fine.

CopyBlock or CopyBlockUnaligned?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CopyBlockUnaligned. It will do the same thing as CopyBlock on x86 and x64 because of these platforms do not require alignment.

@ahsonkhan
Copy link
Member

Do we have a benchmark for this checked into coreclr? so that the JIT team can work on optimizing code gen for it?

@KrzysztofCwalina, there is a benchmark for Span.CopyTo, however it only tests fast span.
https://github.com/dotnet/coreclr/blob/master/tests/src/JIT/Performance/CodeQuality/Span/SpanBench.cs#L597
Issue: https://github.com/dotnet/coreclr/issues/10098

@KrzysztofCwalina
Copy link
Member

@ahsonkhan, thanks. Fast span only benchmarks is good enough. I don't think we want to spend much more resources optimizing JIT for the slow span

// Source address greater than or equal to destination address. Can do normal copy.
for (int i = 0; i < length; i++)
IntPtr tailDiff = Unsafe.ByteOffset<T>(ref Unsafe.Add<T>(ref dst, destLength), ref src);
bool isOverlapped = (sizeof(IntPtr) == sizeof(int)) ? tailDiff.ToInt32() < 0 : tailDiff.ToInt64() < 0;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jkotas, these sizeof checks won't be necessary once we have support for native int, correct?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right.

@@ -386,6 +382,25 @@ public bool TryCopyTo(Span<T> destination)
return false;
}

[MethodImpl(MethodImplOptions.AggressiveInlining)]
private unsafe void CopyBlocks(ref T dst, ref T src, int length)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it worth having a fast path for when length < uint.MaxValue?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is slow span only and this implementation is already 9x faster than what we had. I could add that optimization if we really want it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not that important 9x is good 😄. That can be up for grabs

while (i-- != 0)
IntPtr tailDiff = Unsafe.ByteOffset<T>(ref Unsafe.Add<T>(ref src, length), ref dst);
bool isOverlapped = (sizeof(IntPtr) == sizeof(int)) ? tailDiff.ToInt32() < 0 : tailDiff.ToInt64() < 0;
if (!isOverlapped)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You also need to check for SpanHelpers.IsReferenceOrContainsReferences() and use the slower path if it returns true.

@shiftylogic
Copy link
Contributor Author

@dotnet-bot test Innerloop OSX10.12 Debug x64 Build and Test

@shiftylogic
Copy link
Contributor Author

shiftylogic commented Mar 14, 2017

@jkotas @KrzysztofCwalina Any more comments?

}

return true;
}
Copy link
Member

@ahsonkhan ahsonkhan Mar 14, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we refactor out the if(srcGreaterThanDst) block and if(dstGreaterThanSrc) block into helper methods? They are very similar.

Unsafe.Add<T>(ref dst, i) = Unsafe.Add<T>(ref src, i);
// Source address less than destination address. Must do backward copy.
int i = length;
while (i-- != 0)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have for loop above vs while loop here. Is there a reason for this inconsistency?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I never changed that part. Not sure if there is a performance reason it was done.

Copy link
Member

@KrzysztofCwalina KrzysztofCwalina Mar 14, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, the loop above walks memory forward. This one walks memory backwards. Walking memory forward is faster on some hardware as the CPU prefetches memory in the former case and not the latter.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@KrzysztofCwalina I think the question was about for vs while? Is there a reason for was used with one loop but while with the backwards loop?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, got it. In general, I think it's better to use a for loop as it makes it more likely that bounds check optimizations will kick in. Otherwise, the optimizer might not realize that the while loop variable is actually the Length.

But, maybe our optimizer is better than I realize? @russellhadley?

bool isOverlapped = (sizeof(IntPtr) == sizeof(int)) ? tailDiff.ToInt32() < 0 : tailDiff.ToInt64() < 0;
if (!isOverlapped && !SpanHelpers.IsReferenceOrContainsReferences<T>())
{
CopyBlocks(ref dst, ref src, srcLength);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Question: Why can't we call CopyBlocks if SpanHelpers.IsReferenceOrContainsReferences()?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jkotas Also curious

Copy link
Member

@jkotas jkotas Mar 15, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why can't we call CopyBlocks if SpanHelpers.IsReferenceOrContainsReferences

Because of CopyBlocks does not have the correct GC tracking (does not update cards for generational GC).

If you would like to see the details, take a look at the optimized Span Copy implementation in CoreCLR that updates the cards in bulk (dotnet/coreclr#9999).


bool srcGreaterThanDst = (sizeof(IntPtr) == sizeof(int)) ? srcMinusDst.ToInt32() >= 0 : srcMinusDst.ToInt64() >= 0;
if (srcGreaterThanDst)
if (!TryFastCopy(ref dst, destLength, ref src, length))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any significant perf change because of this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is about the same logic as before but actually slightly more simple. In my tests, there isn't anything noticeably different than the first version.

@shiftylogic
Copy link
Contributor Author

FYI, the code diff here looks a lot dirtier than it actually is. In Beyond Compare, it is much easier to consume. I now just add 2 new private methods that do the "fast" version of copy checking and work. Then I wrapped the original code in a fast attempt, but with a slow fallback.

// Source address greater than or equal to destination address. Can do normal copy.
for (int i = 0; i < length; i++)
{
Unsafe.Add<T>(ref dst, i) = Unsafe.Add<T>(ref src, i);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be great to have an issue opened to unroll the loop later.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#17118

private unsafe bool TryFastCopy(ref T dst, int dstLength, ref T src, int srcLength)
{
IntPtr srcMinusDst = Unsafe.ByteOffset<T>(ref dst, ref src);
bool srcGreaterThanDst = (sizeof(IntPtr) == sizeof(int)) ? srcMinusDst.ToInt32() >= 0 : srcMinusDst.ToInt64() >= 0;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not simply call ToInt64, i.e. don't we properly preserve the sign when we convert 32-bit IntPtr to Int64?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was not certain of this either. The original code did this. I assumed there was a reason. @jkotas?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not simply call ToInt64

There are other places that check for sizeof(IntPtr) == sizeof(int) in the Span implementation. It is done to avoid performance hit from using long on 32-bit platforms. Workaround for lack of native int type in C#...

}

[MethodImpl(MethodImplOptions.AggressiveInlining)]
private unsafe bool TryFastCopy(ref T dst, int dstLength, ref T src, int srcLength)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be good to add more comments in the method (like you did above) to describe all the pointer arithmetics and checks.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done


[MethodImpl(MethodImplOptions.AggressiveInlining)]
private unsafe void CopyBlocks(ref T dst, ref T src, int length)
{
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be worth adding Debug.Assert(length>=0);
And in general, would be good to have more debug assertions in code that deals with pointer arithmetic. It's difficult to assert the correctness something without seeing calling methods.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done. I also added an early exit for zero length span at the top level public method.

}

[MethodImpl(MethodImplOptions.AggressiveInlining)]
private unsafe bool TryFastCopy(ref T dst, int dstLength, ref T src, int srcLength)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the point of splitting this into multiple methods? Are they even going to get inlined? (Note that AggressiveInlining does not guarantee inlining - the JIT often bails if the method is too complex.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was for readability, but I'm fine with merging them back together.

@KrzysztofCwalina
Copy link
Member

looks good.

@shiftylogic shiftylogic merged commit 1014d2f into dotnet:master Mar 15, 2017
@karelz karelz modified the milestone: 2.0.0 Mar 18, 2017
@ahsonkhan
Copy link
Member

Part of dotnet/corefxlab#1314

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants