Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Change bound checking in SSE/AVX intrinsics to avoid pointer overflow #821

Closed
wants to merge 3 commits into from

Conversation

briancylui
Copy link
Contributor

@briancylui briancylui commented Sep 5, 2018

Aims to solve #980

Suggested by @ahsonkhan to avoid integer overflow in bound checking inside SSE/AVX intrinsics implementation, i.e. change all while (pCurrent + 8 OR 4 <= pEnd) into while (pEnd - pCurrent >= 8 OR 4).

Perf tests results before and after the change are shown below:

Before the change:

BenchmarkDotNet=v0.11.1, OS=Windows 10.0.17134.228 (1803/April2018Update/Redstone4)
Intel Core i7-7700 CPU 3.60GHz (Kaby Lake), 1 CPU, 8 logical and 4 physical cores
.NET Core SDK=3.0.100-alpha1-20180720-2
  [Host] : .NET Core 3.0.0-preview1-26710-03 (CoreCLR 4.6.26710.05, CoreFX 4.6.26708.04), 64bit RyuJIT

Toolchain=InProcessToolchain
Type Method Mean Error StdDev
AvxPerformanceTests SumU 159.4 us 1.104 us 0.9784 us
NativePerformanceTests SumU 283.5 us 5.492 us 4.8687 us
SsePerformanceTests SumU 281.2 us 1.472 us 1.3045 us
AvxPerformanceTests AddU 276.1 us 3.018 us 2.520 us
NativePerformanceTests AddU 330.1 us 3.585 us 3.178 us
SsePerformanceTests AddU 325.6 us 6.883 us 7.926 us

After the change:

BenchmarkDotNet=v0.11.1, OS=Windows 10.0.17134.228 (1803/April2018Update/Redstone4)
Intel Core i7-7700 CPU 3.60GHz (Kaby Lake), 1 CPU, 8 logical and 4 physical cores
.NET Core SDK=3.0.100-alpha1-20180720-2
  [Host] : .NET Core 3.0.0-preview1-26710-03 (CoreCLR 4.6.26710.05, CoreFX 4.6.26708.04), 64bit RyuJIT

Toolchain=InProcessToolchain
Type Method Mean Error StdDev
AvxPerformanceTests SumU 183.5 us 3.621 us 3.023 us
NativePerformanceTests SumU 281.6 us 5.261 us 4.921 us
SsePerformanceTests SumU 294.1 us 2.080 us 1.946 us
AvxPerformanceTests AddU 296.3 us 5.185 us 4.850 us
NativePerformanceTests AddU 335.1 us 3.053 us 2.707 us
SsePerformanceTests AddU 345.0 us 2.155 us 1.800 us

Both SSE and AVX implementations are slower by 10-20% after this change.

In my opinion, after seeing the perf results, I may not recommend merging this PR. I may wait until the alternative suggested by @tannergooding in an earlier PR review has been implemented (2nd item under "Functionality" in briancylui#2):

var remainder = count % elementsPerIteration;
float* pEnd = pdst + (count - remainder);
while (pDstCurrent < pEnd)
{ … }

Another question I have is: would pDstCurrent + 8 OR 4 ever have the possibility to result in integer overflow? According to my knowledge, pEnd is initialized as pDstCurrent + count, and there are Contract.Asserts in the wrapper class to check that count does not exceed the original array length. I'm not sure, and am open to any PR comments and advice.

cc: @danmosemsft @eerhardt @tannergooding @ahsonkhan

@eerhardt
Copy link
Member

eerhardt commented Sep 5, 2018

I’m not sure I follow the reasoning. These are pointer operations. So the concern is that we are less than 4 or 8 elements away from the end of the memory? Would the OS ever let us get that close?
I also agree with the point that we are checking the array/span length above these methods, so we are guaranteed to be within the bounds of memory.

@briancylui
Copy link
Contributor Author

@ahsonkhan: I may share the same concern as @eerhardt - would love to learn and hear back.

@ahsonkhan
Copy link
Member

ahsonkhan commented Sep 5, 2018

Another question I have is: would pDstCurrent + 8 OR 4 ever have the possibility to result in integer overflow

Imo, given these are public APIs that anyone can call (with potentially invalid inputs), the inputs should be verified within the method body. Being explicit about assertions like src length == dst length would be good (and also act as self-documentation).
Edit: Nevermind, the class is internal.

According to my knowledge, pEnd is initialized as pDstCurrent + count, and there are Contract.Asserts in the wrapper class to check that count does not exceed the original array length. I'm not sure, and am open to any PR comments and advice.

I am not too familiar with the code base here, but don't contract.asserts run in debug mode only? Are these checks unnecessary in release?

Both SSE and AVX implementations are slower by 10-20% after this change.

How about something like the following:

int i = 0;
for (; i < src.Length - 8; i += 8)
{
    Vector256<float> srcVector = Avx.LoadVector256(pSrcCurrent);
    Vector256<float> dstVector = Avx.LoadVector256(pDstCurrent);

    result256 = Avx.Add(result256, Avx.Multiply(srcVector, dstVector));

    pSrcCurrent += 8;
    pDstCurrent += 8;
}

if (src.Length - i <= 4)
{
    ...
    i += 4;
}
while (i < src.Length)
{
    ...
    i++;
}

The number of assembly instruction is essentially identical here (saves a lea, costs an extra add): https://www.diffchecker.com/MywBeoFH (before in red, after in green)

image

Just my two cents. I would leave it up to others who have more context in this space to validate.

Would the OS ever let us get that close?

No, it won't. I discussed this with @GrabYourPitchforks, and its a general coding guideline to avoid arithmetic overflows like this (on the, albeit unlikely, chance the OS behavior changes in the future).

@briancylui
Copy link
Contributor Author

@ahsonkhan: Thank you for your comments! My apologies that I may have given the wrong hint that the Sse/AvxIntrinsics class is public during our previous conversation - it's actually internal after checking. Thank you very much for pointing that out!

Regarding Contracts.Assert being run in Debug only, it may be a very good point for future follow-up. If the APIs currently do not do any length checking in Release, probably we should do that in the future.

Thank you for the link to the DiffChecker - it looks amazing! Following your logic, would changing while (pCurrent + 8 OR 4 <= pEnd) into while (pCurrent <= pEnd - 8 OR 4) have a similar effect?

@ahsonkhan
Copy link
Member

ahsonkhan commented Sep 5, 2018

Following your logic, would changing while (pCurrent + 8 OR 4 <= pEnd) into while (pCurrent <= pEnd - 8 OR 4) have a similar effect?

That has a similar concern (but with underflow). If src.Length < 8, and in the unlikely chance that the array starts at a memory address close to the beginning of the address space, then (pCurrent <= pEnd - 8) would be true, even though it shouldn't. That is because comparisons are done as if they are unsigned integers.

https://docs.microsoft.com/en-us/dotnet/csharp/programming-guide/unsafe-code-pointers/pointer-comparison

The comparison operators compare the addresses of the two operands as if they are unsigned integers.

For example:

// Assume:
src.Length = 6;
pCurrent = 4;
pEnd = pCurrent + src.Length // 4 + 6 * 4 = 28;

if (pCurrent <= pEnd - 8)
{
   // 4 <= 28 - (8 * 4)
   // 4 <= 28 - 32
   // 4 <= (ulong)-4
   // We don't expect to be here, but we will since -4 as a ulong is a really large number.
}

@GrabYourPitchforks
Copy link
Member

So the concern is that we are less than 4 or 8 elements away from the end of the memory? Would the OS ever let us get that close?

@ahsonkhan and I spoke about this at length yesterday.

With current operating systems, no. But I don't know what OSes we'll be running on in 10 years. Maybe we'll have a fully managed OS like Singularity, and maybe it'll give us access to the full range of addressable memory. I can't predict the future, and I don't want to chance having to chase down logic errors in this code in ten years' time if there's a theoretical overflow condition that we can identify and address now.

@GrabYourPitchforks
Copy link
Member

GrabYourPitchforks commented Sep 6, 2018

FWIW, there is an exception that both C# and C allow when comparing pointers. The element just past the end of the array is always addressable, and it's always guaranteed to compare greater than the address of any element in the array.

That is:

T* ptr; // = pointer to first element in array
size_t numElements; // total number of elements in the array

for (size_t i = 0; i < numElements; i++) {
    assert(&ptr[i] < &ptr[numElements], "This is guaranteed by the language.");
}

A corollary to this is that T* end = ptr + numElements is guaranteed not to integer overflow (so the array can't be placed at the very, very end of addressable memory), but T* end = ptr + numElements + 1 has no such guarantee.

Note: the element just past the end of the array is addressable but not necessarily dereferenceable. That is, T element = ptr[numElements] still has undefined behavior, such as populating the register with garbage or even AVing.

@eerhardt
Copy link
Member

eerhardt commented Sep 6, 2018

Ah, ok. I did some more reading, looked at this code again and I understand the concern. The array may be at the very end of memory space. And when pDstCurrent is less than 8 elements from the end, then while (pDstCurrent + 8 <= pDstEnd) will go out of bounds, and potentially overflow. And if it overflows, it will wrap around making condition true.

I think it makes sense using the for (; i < src.Length - 8; i += 8) pattern that @ahsonkhan proposes above or the remainder pattern suggested by @tannergooding in an earlier PR review.

@briancylui
Copy link
Contributor Author

Thanks for all the comments. After some reading, I agree that the problem is potential illegal memory access due to incrementing the pCurrent pointer out of bounds of the Span<float> object. Accessing out-of-bound object can already cause security concerns, and it would be even worse if the pointer got incremented into illegal memory. I should probably change the title of this PR to "avoid pointer overflow" or "avoid illegal memory access" to avoid any possible confusion, and look forward to implementing this fix tomorrow.

A very good takeaway from @ahsonkhan and @GrabYourPitchforks's comments is that we may not want to increment/decrement any pointer for bound checking, and I think it is also a very good opportunity to follow through on @eerhardt and @tannergooding's suggestion, which will pave the way for implementing double-computing. I will spend some time tomorrow to look into how to implement these fixes concretely.

Thank you very much for all the comments!

@briancylui briancylui changed the title Change bound checking in SSE/AVX intrinsics to avoid integer overflow Change bound checking in SSE/AVX intrinsics to avoid pointer overflow Sep 6, 2018
@briancylui
Copy link
Contributor Author

Pushed a new commit and deleted the previous commit to show you all a sample of the changes I will make globally.

In the new commit, only the AddScalarU intrinsic has been changed in its SSE and AVX implementations. Once you think that the changes look fine to you, or you may some suggestions to edit them, the changes can be made on a global scale to the rest of the intrinsics.

The performance test results are shown below:

After the change:

BenchmarkDotNet=v0.11.1, OS=Windows 10.0.17134.228 (1803/April2018Update/Redstone4)
Intel Core i7-7700 CPU 3.60GHz (Kaby Lake), 1 CPU, 8 logical and 4 physical cores
.NET Core SDK=3.0.100-alpha1-20180720-2
  [Host] : .NET Core 3.0.0-preview1-26710-03 (CoreCLR 4.6.26710.05, CoreFX 4.6.26708.04), 64bit RyuJIT

Toolchain=InProcessToolchain
Type Method Mean Error StdDev Median
AvxPerformanceTests AddScalarU 142.9 us 2.8338 us 7.0045 us 140.1 us
NativePerformanceTests AddScalarU 191.2 us 2.3916 us 1.8672 us 190.9 us
SsePerformanceTests AddScalarU 172.7 us 0.7552 us 0.5896 us 172.6 us

Before the change:

BenchmarkDotNet=v0.11.1, OS=Windows 10.0.17134.228 (1803/April2018Update/Redstone4)
Intel Core i7-7700 CPU 3.60GHz (Kaby Lake), 1 CPU, 8 logical and 4 physical cores
.NET Core SDK=3.0.100-alpha1-20180720-2
  [Host] : .NET Core 3.0.0-preview1-26710-03 (CoreCLR 4.6.26710.05, CoreFX 4.6.26708.04), 64bit RyuJIT

Toolchain=InProcessToolchain
Type Method Mean Error StdDev Median
AvxPerformanceTests AddScalarU 141.8 us 1.623 us 1.439 us 141.5 us
NativePerformanceTests AddScalarU 191.8 us 4.514 us 10.901 us 186.5 us
SsePerformanceTests AddScalarU 180.3 us 3.653 us 7.125 us 177.1 us

The performance results are comparable after the change: SEE got faster by a bit, while AVX got slower by just a bit. The second use of Math.DivRem for countSse and remainderSse may have made the AVX implementation a bit slower.

Look forward to all your comments. Since tomorrow is the last day of my internship, and I have other work items to finish before I go, I don't think I can finish this PR by tomorrow, but I will keep my branch in my fork up so that other contributors can use your previous commits. While I will leave this PR open for the moment, please feel free to close it whenever you feel appropriate. Thank you.

@briancylui
Copy link
Contributor Author

test MachineLearning-CI please

@@ -20,6 +20,10 @@ internal static class AvxIntrinsics
{
private static readonly Vector256<float> _absMask256 = Avx.StaticCast<int, float>(Avx.SetAllVector256(0x7FFFFFFF));

// The count of 32-bit floats in Vector256<T>
private const int AvxAlignment = 8;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This isn't alignment, but rather the number of 32-bit elements that Vector256 can hold...

Maybe Vector256SingleElementCount or something similar...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about Vector256FloatCount? Is there any preference for SingleElement?

}

while (pDstCurrent < pDstEnd)
for (int i = 0; i < remainder; i++)
{
Vector128<float> dstVector = Sse.LoadScalarVector128(pDstCurrent);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would be way more readable as pDstCurrent[i] += scalar, and it should produce the same code.

The various scalar intrinsics are really meant for places where you have to interop between Vector and Scalar code, or where you need some scalar operations which aren't expressible in normal C# code.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, perf test results show that this change improves the runtime significantly:

After all changes, including changing scalar intrinsics to indexed code:

BenchmarkDotNet=v0.11.1, OS=Windows 10.0.17134.228 (1803/April2018Update/Redstone4)
Intel Core i7-7700 CPU 3.60GHz (Kaby Lake), 1 CPU, 8 logical and 4 physical cores
.NET Core SDK=3.0.100-alpha1-20180720-2
  [Host] : .NET Core 3.0.0-preview1-26710-03 (CoreCLR 4.6.26710.05, CoreFX 4.6.26708.04), 64bit RyuJIT

Toolchain=InProcessToolchain
Type Method Mean Error StdDev
AvxPerformanceTests AddScalarU 172.1 us 2.589 us 2.422 us
NativePerformanceTests AddScalarU 216.9 us 1.785 us 1.491 us
SsePerformanceTests AddScalarU 209.7 us 1.492 us 1.246 us

After partial changes (removing the 2nd time of Math.DivRem):

BenchmarkDotNet=v0.11.1, OS=Windows 10.0.17134.228 (1803/April2018Update/Redstone4)
Intel Core i7-7700 CPU 3.60GHz (Kaby Lake), 1 CPU, 8 logical and 4 physical cores
.NET Core SDK=3.0.100-alpha1-20180720-2
  [Host] : .NET Core 3.0.0-preview1-26710-03 (CoreCLR 4.6.26710.05, CoreFX 4.6.26708.04), 64bit RyuJIT

Toolchain=InProcessToolchain
Type Method Mean Error StdDev
AvxPerformanceTests AddScalarU 193.3 us 3.6653 us 3.4285 us
NativePerformanceTests AddScalarU 215.4 us 0.7014 us 0.6218 us
SsePerformanceTests AddScalarU 238.2 us 4.1955 us 3.5034 us

Before the change:

BenchmarkDotNet=v0.11.1, OS=Windows 10.0.17134.228 (1803/April2018Update/Redstone4)
Intel Core i7-7700 CPU 3.60GHz (Kaby Lake), 1 CPU, 8 logical and 4 physical cores
.NET Core SDK=3.0.100-alpha1-20180720-2
  [Host] : .NET Core 3.0.0-preview1-26710-03 (CoreCLR 4.6.26710.05, CoreFX 4.6.26708.04), 64bit RyuJIT

Toolchain=InProcessToolchain
Type Method Mean Error StdDev
AvxPerformanceTests AddScalarU 200.3 us 3.919 us 4.513 us
NativePerformanceTests AddScalarU 249.4 us 4.400 us 4.116 us
SsePerformanceTests AddScalarU 235.3 us 1.393 us 1.235 us

@briancylui
Copy link
Contributor Author

briancylui commented Sep 7, 2018

Pushed a new commit responding to @tannergooding's latest comments. Since the following suggested changes show a significant +10% perf improvement, I will propagate these changes globally to get end-to-end perf results as soon as possible today:

After all changes, including changing scalar intrinsics to indexed code:

BenchmarkDotNet=v0.11.1, OS=Windows 10.0.17134.228 (1803/April2018Update/Redstone4)
Intel Core i7-7700 CPU 3.60GHz (Kaby Lake), 1 CPU, 8 logical and 4 physical cores
.NET Core SDK=3.0.100-alpha1-20180720-2
  [Host] : .NET Core 3.0.0-preview1-26710-03 (CoreCLR 4.6.26710.05, CoreFX 4.6.26708.04), 64bit RyuJIT

Toolchain=InProcessToolchain
Type Method Mean Error StdDev
AvxPerformanceTests AddScalarU 172.1 us 2.589 us 2.422 us
NativePerformanceTests AddScalarU 216.9 us 1.785 us 1.491 us
SsePerformanceTests AddScalarU 209.7 us 1.492 us 1.246 us

After partial changes (removing the 2nd time of Math.DivRem):

BenchmarkDotNet=v0.11.1, OS=Windows 10.0.17134.228 (1803/April2018Update/Redstone4)
Intel Core i7-7700 CPU 3.60GHz (Kaby Lake), 1 CPU, 8 logical and 4 physical cores
.NET Core SDK=3.0.100-alpha1-20180720-2
  [Host] : .NET Core 3.0.0-preview1-26710-03 (CoreCLR 4.6.26710.05, CoreFX 4.6.26708.04), 64bit RyuJIT

Toolchain=InProcessToolchain
Type Method Mean Error StdDev
AvxPerformanceTests AddScalarU 193.3 us 3.6653 us 3.4285 us
NativePerformanceTests AddScalarU 215.4 us 0.7014 us 0.6218 us
SsePerformanceTests AddScalarU 238.2 us 4.1955 us 3.5034 us

Before the change:

BenchmarkDotNet=v0.11.1, OS=Windows 10.0.17134.228 (1803/April2018Update/Redstone4)
Intel Core i7-7700 CPU 3.60GHz (Kaby Lake), 1 CPU, 8 logical and 4 physical cores
.NET Core SDK=3.0.100-alpha1-20180720-2
  [Host] : .NET Core 3.0.0-preview1-26710-03 (CoreCLR 4.6.26710.05, CoreFX 4.6.26708.04), 64bit RyuJIT

Toolchain=InProcessToolchain
Type Method Mean Error StdDev
AvxPerformanceTests AddScalarU 200.3 us 3.919 us 4.513 us
NativePerformanceTests AddScalarU 249.4 us 4.400 us 4.116 us
SsePerformanceTests AddScalarU 235.3 us 1.393 us 1.235 us

@ahsonkhan
Copy link
Member

we may not want to increment/decrement any pointer for bound checking

Right.

Since the following suggested changes show a significant +10% perf improvement

Nice! The implementation can be reasoned about more easily now as well.
+ We removed the risks related to overflow :)

…r all AVX intrinsics, except MatMul's and those involving AbsMask
}

for (int i = 0; i < remainder - 4; i++)
for (int i = 0; i < remainder % 4; i++)
Copy link
Member

@ahsonkhan ahsonkhan Sep 8, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wouldn't use the relatively expensive modulo operator here.

Maybe, do remainder -= 4 in the above if (remainder >= 4) block and then your for loop can just be:

for (int i = 0; i < remainder; i++)
{
    ...
}

@briancylui
Copy link
Contributor Author

The latest commit contains the global replacement of scalar operations by indexed code for almost all AVX intrinsics (except for MatMul's and those methods that involve AbsMask). Changes to SSE intrinsics have not been done yet. The latest perf results are shown at the bottom of briancylui#1. Please note that now it is ready to do end-to-end perf testing using the KMeansAndLogisticRegression benchmark in test\Microsoft.ML.Benchmarks, but it is only that I didn't have enough time to test it by the end of my internship.

Thank you everyone for your feedback! I will keep my branches in my fork up there, and feel free to continue developing using my commits. Have a nice day!

@markusweimer
Copy link
Member

This PR does not reference an issue. Can you please file one and reference it in the PR description?

@briancylui
Copy link
Contributor Author

@ahsonkhan raised the issue and suggested the change. Anyone familiar with the issue is welcome to carry on this project after my internship if time allows.

@briancylui
Copy link
Contributor Author

Created an issue #980, which this PR now references. The issue may need some polishing, but it solves the problem for the time being. Hope it helps other people interested take on this project!

@danmoseley
Copy link
Member

Seems this is almost done. @tannergooding will finish it up in a little bit.

@Zruty0
Copy link
Contributor

Zruty0 commented Oct 5, 2018

@tannergooding , are you working on this?

@tannergooding
Copy link
Member

It's on my backlog, but lower priority than a few other items right now.

@shauheen
Copy link
Contributor

@tannergooding should we close the PR and open it when you think you can update the branch?

@tannergooding
Copy link
Member

That is fine with me.

@ghost ghost locked as resolved and limited conversation to collaborators Mar 29, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants