Change bound checking in SSE/AVX intrinsics to avoid pointer overflow #821
Conversation
I'm not sure I follow the reasoning. These are pointer operations. So the concern is that we are less than 4 or 8 elements away from the end of memory? Would the OS ever let us get that close?

@ahsonkhan: I may share the same concern as @eerhardt - would love to learn and hear back.

I am not too familiar with the code base here, but don't Contract.Asserts run in debug mode only? Are these checks unnecessary in release?
How about something like the following:

int i = 0;
for (; i < src.Length - 8; i += 8)
{
    Vector256<float> srcVector = Avx.LoadVector256(pSrcCurrent);
    Vector256<float> dstVector = Avx.LoadVector256(pDstCurrent);
    result256 = Avx.Add(result256, Avx.Multiply(srcVector, dstVector));
    pSrcCurrent += 8;
    pDstCurrent += 8;
}
if (src.Length - i >= 4)
{
    ...
    i += 4;
}
while (i < src.Length)
{
    ...
    i++;
}

The number of assembly instructions is essentially identical here. Just my two cents. I would leave it up to others who have more context in this space to validate.
No, it won't. I discussed this with @GrabYourPitchforks, and it's a general coding guideline to avoid arithmetic overflows like this (on the, albeit unlikely, chance that the OS behavior changes in the future).
@ahsonkhan: Thank you for your comments! My apologies that I may have given the wrong hint during our previous conversation that the Sse/AvxIntrinsics class is public - after checking, it is actually internal. Thank you very much for pointing that out! Thank you for the link to the DiffChecker - it looks amazing! Following your logic, would changing pCurrent + 8 <= pEnd into pCurrent <= pEnd - 8 run into the same problem?
That has a similar concern (but with underflow). If src.Length < 8, and in the unlikely chance that the array starts at a memory address close to the beginning of the address space, then pEnd - 8 underflows and the comparison incorrectly succeeds. For example:

// Assume:
src.Length = 6;
pCurrent = 4;
pEnd = pCurrent + src.Length; // 4 + 6 * 4 = 28

if (pCurrent <= pEnd - 8)
{
    // 4 <= 28 - (8 * 4)
    // 4 <= 28 - 32
    // 4 <= (ulong)-4
    // We don't expect to be here, but we will, since -4 as a ulong is a really large number.
}
@ahsonkhan and I spoke about this at length yesterday. With current operating systems, no. But I don't know what OSes we'll be running on in 10 years. Maybe we'll have a fully managed OS like Singularity, and maybe it'll give us access to the full range of addressable memory. I can't predict the future, and I don't want to chance having to chase down logic errors in this code in ten years' time if there's a theoretical overflow condition that we can identify and address now.
FWIW, there is an exception that both C# and C allow when comparing pointers. The element just past the end of the array is always addressable, and it's always guaranteed to compare greater than the address of any element in the array. That is:

T* ptr;             // pointer to the first element in the array
size_t numElements; // total number of elements in the array
for (size_t i = 0; i < numElements; i++) {
    assert(&ptr[i] < &ptr[numElements], "This is guaranteed by the language.");
}

A corollary to this is that computing the one-past-the-end address (ptr + numElements) is always legal and never wraps around. Note: the element just past the end of the array is addressable but not necessarily dereferenceable. That is, you may form the address &ptr[numElements], but reading ptr[numElements] may fault.
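To make this concrete for the pattern in question: because the one-past-the-end address is always representable, the bound check can be phrased as a pointer difference, which cannot wrap on either side. The sketch below is hypothetical (not the actual ML.NET kernel; the method name and shape are illustrative) and uses only intrinsics that already appear in this thread, plus Avx.Store:

using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

internal static unsafe class BoundCheckSketch
{
    // Hypothetical example: dst[i] += src[i] for `length` floats, with a bound check
    // that never forms a pointer outside the [pDst, pDstEnd] range.
    public static void AddU(float* pSrc, float* pDst, int length)
    {
        float* pSrcCurrent = pSrc;
        float* pDstCurrent = pDst;
        float* pDstEnd = pDst + length; // one-past-the-end pointer: always representable

        // pDstEnd - pDstCurrent is a non-negative element count, so neither side of
        // the comparison can wrap, unlike the pDstCurrent + 8 <= pDstEnd form.
        while (pDstEnd - pDstCurrent >= 8)
        {
            Vector256<float> srcVector = Avx.LoadVector256(pSrcCurrent);
            Vector256<float> dstVector = Avx.LoadVector256(pDstCurrent);
            Avx.Store(pDstCurrent, Avx.Add(dstVector, srcVector));
            pSrcCurrent += 8;
            pDstCurrent += 8;
        }

        // Scalar tail: pDstCurrent never moves past pDstEnd.
        while (pDstCurrent < pDstEnd)
        {
            *pDstCurrent += *pSrcCurrent;
            pSrcCurrent++;
            pDstCurrent++;
        }
    }
}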
Ah, ok. I did some more reading, looked at this code again, and I understand the concern. The array may be at the very end of the memory space, and adding 4 or 8 elements to the pointer could then overflow. I think it makes sense to use the length-based check instead.
Thanks for all the comments. After some reading, I agree that the problem is potential illegal memory access due to incrementing the pCurrent pointer past the end of the array in the bound check.

A very good takeaway from @ahsonkhan and @GrabYourPitchforks's comments is that we may not want to increment/decrement any pointer for bound checking, and I think it is also a very good opportunity to follow through on @eerhardt and @tannergooding's suggestion, which will pave the way for implementing double-computing. I will spend some time tomorrow to look into how to implement these fixes concretely. Thank you very much for all the comments!
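As a rough illustration of that direction (no pointers in the bound checks; trip counts computed up front from the length), the loop structure might look like the sketch below. This is not the committed code - the Math.DivRem split and the indexed tail mirror what the later commits in this thread describe, while the method itself is hypothetical:

using System;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

internal static unsafe class RemainderLoopSketch
{
    // Illustrative only: the same dst[i] += src[i] kernel, but all trip counts are
    // derived from `length` up front, so no pointer is ever used in a bound check.
    public static void AddU(float* pSrc, float* pDst, int length)
    {
        // quotient = number of full 8-element blocks, remainder = leftover elements.
        int quotient = Math.DivRem(length, 8, out int remainder);

        float* pSrcCurrent = pSrc;
        float* pDstCurrent = pDst;

        for (int block = 0; block < quotient; block++)
        {
            Vector256<float> srcVector = Avx.LoadVector256(pSrcCurrent);
            Vector256<float> dstVector = Avx.LoadVector256(pDstCurrent);
            Avx.Store(pDstCurrent, Avx.Add(dstVector, srcVector));
            pSrcCurrent += 8;
            pDstCurrent += 8;
        }

        // Indexed scalar tail, as suggested earlier in the thread: no pointer comparisons at all.
        for (int i = 0; i < remainder; i++)
        {
            pDstCurrent[i] += pSrcCurrent[i];
        }
    }
}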
Force-pushed from da7b5a7 to 2e0033e.
Pushed a new commit and deleted the previous commit to show you all a sample of the changes I will make globally. In the new commit, only the AddScalarU intrinsics have been changed so far. The performance test results are shown below.

After the change:

BenchmarkDotNet=v0.11.1, OS=Windows 10.0.17134.228 (1803/April2018Update/Redstone4)
Intel Core i7-7700 CPU 3.60GHz (Kaby Lake), 1 CPU, 8 logical and 4 physical cores
.NET Core SDK=3.0.100-alpha1-20180720-2
[Host] : .NET Core 3.0.0-preview1-26710-03 (CoreCLR 4.6.26710.05, CoreFX 4.6.26708.04), 64bit RyuJIT
Toolchain=InProcessToolchain

Before the change:

BenchmarkDotNet=v0.11.1, OS=Windows 10.0.17134.228 (1803/April2018Update/Redstone4)
Intel Core i7-7700 CPU 3.60GHz (Kaby Lake), 1 CPU, 8 logical and 4 physical cores
.NET Core SDK=3.0.100-alpha1-20180720-2
[Host] : .NET Core 3.0.0-preview1-26710-03 (CoreCLR 4.6.26710.05, CoreFX 4.6.26708.04), 64bit RyuJIT
Toolchain=InProcessToolchain
The performance results are comparable after the change: SSE got a bit faster, while AVX got slightly slower. The second use of Math.DivRem could probably be removed as a further optimization.

Looking forward to all your comments. Since tomorrow is the last day of my internship, and I have other work items to finish before I go, I don't think I can finish this PR by tomorrow, but I will keep my branch in my fork up so that other contributors can build on my previous commits. While I will leave this PR open for the moment, please feel free to close it whenever you feel appropriate. Thank you.
test MachineLearning-CI please
@@ -20,6 +20,10 @@ internal static class AvxIntrinsics
 {
     private static readonly Vector256<float> _absMask256 = Avx.StaticCast<int, float>(Avx.SetAllVector256(0x7FFFFFFF));

+    // The count of 32-bit floats in Vector256<T>
+    private const int AvxAlignment = 8;
This isn't alignment, but rather the number of 32-bit elements that Vector256 can hold... Maybe Vector256SingleElementCount or something similar...
What about Vector256FloatCount? Is there any preference for SingleElement?
         }

-        while (pDstCurrent < pDstEnd)
+        for (int i = 0; i < remainder; i++)
         {
             Vector128<float> dstVector = Sse.LoadScalarVector128(pDstCurrent);
This would be way more readable as pDstCurrent[i] += scalar, and it should produce the same code. The various scalar intrinsics are really meant for places where you have to interop between Vector and Scalar code, or where you need some scalar operations which aren't expressible in normal C# code.
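For reference, a sketch of the two tail styles being compared here. These are hypothetical standalone methods rather than the ML.NET code; Sse.AddScalar and Sse.StoreScalar are the matching scalar intrinsics for the load shown in the diff above:

using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

internal static unsafe class TailLoopSketch
{
    // Tail written with scalar intrinsics, roughly as in the diff above.
    public static void TailWithScalarIntrinsics(float* pDstCurrent, int remainder, float scalar)
    {
        // Put the scalar into the low lane of a Vector128 so AddScalar can use it.
        Vector128<float> scalarVector = Sse.LoadScalarVector128(&scalar);

        for (int i = 0; i < remainder; i++)
        {
            Vector128<float> dstVector = Sse.LoadScalarVector128(pDstCurrent);
            dstVector = Sse.AddScalar(dstVector, scalarVector);
            Sse.StoreScalar(pDstCurrent, dstVector);
            pDstCurrent++;
        }
    }

    // Tail written with plain indexing, per the suggestion; the JIT is expected to
    // emit essentially the same machine code.
    public static void TailWithIndexing(float* pDstCurrent, int remainder, float scalar)
    {
        for (int i = 0; i < remainder; i++)
        {
            pDstCurrent[i] += scalar;
        }
    }
}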
Yes, perf test results show that this change improves the runtime significantly:
After all changes, including changing scalar intrinsics to indexed code:
BenchmarkDotNet=v0.11.1, OS=Windows 10.0.17134.228 (1803/April2018Update/Redstone4)
Intel Core i7-7700 CPU 3.60GHz (Kaby Lake), 1 CPU, 8 logical and 4 physical cores
.NET Core SDK=3.0.100-alpha1-20180720-2
[Host] : .NET Core 3.0.0-preview1-26710-03 (CoreCLR 4.6.26710.05, CoreFX 4.6.26708.04), 64bit RyuJIT
Toolchain=InProcessToolchain
Type | Method | Mean | Error | StdDev |
---|---|---|---|---|
AvxPerformanceTests | AddScalarU | 172.1 us | 2.589 us | 2.422 us |
NativePerformanceTests | AddScalarU | 216.9 us | 1.785 us | 1.491 us |
SsePerformanceTests | AddScalarU | 209.7 us | 1.492 us | 1.246 us |
After partial changes (removing the second use of Math.DivRem):
BenchmarkDotNet=v0.11.1, OS=Windows 10.0.17134.228 (1803/April2018Update/Redstone4)
Intel Core i7-7700 CPU 3.60GHz (Kaby Lake), 1 CPU, 8 logical and 4 physical cores
.NET Core SDK=3.0.100-alpha1-20180720-2
[Host] : .NET Core 3.0.0-preview1-26710-03 (CoreCLR 4.6.26710.05, CoreFX 4.6.26708.04), 64bit RyuJIT
Toolchain=InProcessToolchain
Type | Method | Mean | Error | StdDev |
---|---|---|---|---|
AvxPerformanceTests | AddScalarU | 193.3 us | 3.6653 us | 3.4285 us |
NativePerformanceTests | AddScalarU | 215.4 us | 0.7014 us | 0.6218 us |
SsePerformanceTests | AddScalarU | 238.2 us | 4.1955 us | 3.5034 us |
Before the change:
BenchmarkDotNet=v0.11.1, OS=Windows 10.0.17134.228 (1803/April2018Update/Redstone4)
Intel Core i7-7700 CPU 3.60GHz (Kaby Lake), 1 CPU, 8 logical and 4 physical cores
.NET Core SDK=3.0.100-alpha1-20180720-2
[Host] : .NET Core 3.0.0-preview1-26710-03 (CoreCLR 4.6.26710.05, CoreFX 4.6.26708.04), 64bit RyuJIT
Toolchain=InProcessToolchain
Type | Method | Mean | Error | StdDev |
---|---|---|---|---|
AvxPerformanceTests | AddScalarU | 200.3 us | 3.919 us | 4.513 us |
NativePerformanceTests | AddScalarU | 249.4 us | 4.400 us | 4.116 us |
SsePerformanceTests | AddScalarU | 235.3 us | 1.393 us | 1.235 us |
Pushed a new commit responding to @tannergooding's latest comments. Since the suggested changes show a significant +10% perf improvement, I will propagate these changes globally to get end-to-end perf results as soon as possible today.

After all changes, including changing scalar intrinsics to indexed code:

BenchmarkDotNet=v0.11.1, OS=Windows 10.0.17134.228 (1803/April2018Update/Redstone4)
Intel Core i7-7700 CPU 3.60GHz (Kaby Lake), 1 CPU, 8 logical and 4 physical cores
.NET Core SDK=3.0.100-alpha1-20180720-2
[Host] : .NET Core 3.0.0-preview1-26710-03 (CoreCLR 4.6.26710.05, CoreFX 4.6.26708.04), 64bit RyuJIT
Toolchain=InProcessToolchain

After partial changes (removing the second use of Math.DivRem):

BenchmarkDotNet=v0.11.1, OS=Windows 10.0.17134.228 (1803/April2018Update/Redstone4)
Intel Core i7-7700 CPU 3.60GHz (Kaby Lake), 1 CPU, 8 logical and 4 physical cores
.NET Core SDK=3.0.100-alpha1-20180720-2
[Host] : .NET Core 3.0.0-preview1-26710-03 (CoreCLR 4.6.26710.05, CoreFX 4.6.26708.04), 64bit RyuJIT
Toolchain=InProcessToolchain

Before the change:

BenchmarkDotNet=v0.11.1, OS=Windows 10.0.17134.228 (1803/April2018Update/Redstone4)
Intel Core i7-7700 CPU 3.60GHz (Kaby Lake), 1 CPU, 8 logical and 4 physical cores
.NET Core SDK=3.0.100-alpha1-20180720-2
[Host] : .NET Core 3.0.0-preview1-26710-03 (CoreCLR 4.6.26710.05, CoreFX 4.6.26708.04), 64bit RyuJIT
Toolchain=InProcessToolchain
Right.
Nice! The implementation can be reasoned about more easily now as well.
…r all AVX intrinsics, except MatMul's and those involving AbsMask
         }

-        for (int i = 0; i < remainder - 4; i++)
+        for (int i = 0; i < remainder % 4; i++)
I wouldn't use the relatively expensive modulo operator here. Maybe do remainder -= 4 in the above if (remainder >= 4) block, and then your for loop can just be:

for (int i = 0; i < remainder; i++)
{
    ...
}
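Applied to the AddScalarU-style tail from the diffs above, that suggestion might look roughly like the fragment below. This is a sketch, not the committed code; scalarVector128 is assumed to be a local holding the scalar broadcast to all four lanes, and remainder starts as length % 8 after the 8-wide AVX loop:

if (remainder >= 4)
{
    // Handle one 4-wide SSE block...
    Vector128<float> dstVector = Sse.LoadVector128(pDstCurrent);
    Sse.Store(pDstCurrent, Sse.Add(dstVector, scalarVector128));
    pDstCurrent += 4;

    // ...and fold it into the count so the scalar loop below needs no % 4.
    remainder -= 4;
}

for (int i = 0; i < remainder; i++)
{
    pDstCurrent[i] += scalar;
}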
The latest commit contains the global replacement of scalar operations by indexed code for almost all AVX intrinsics (except for MatMul's and those methods that involve AbsMask). Changes to the SSE intrinsics have not been done yet. The latest perf results are shown at the bottom of briancylui#1. Please note that it is now ready for end-to-end perf testing using the KMeansAndLogisticRegression benchmark in the benchmarks project. Thank you everyone for your feedback! I will keep my branches in my fork up there, and feel free to continue developing using my commits. Have a nice day!
This PR does not reference an issue. Can you please file one and reference it in the PR description?

@ahsonkhan raised the issue and suggested the change. Anyone familiar with the issue is welcome to carry on this project after my internship if time allows.

Created issue #980, which this PR now references. The issue may need some polishing, but it should serve the purpose for the time being. Hope it helps other people interested in taking on this project!

Seems this is almost done. @tannergooding will finish it up in a little bit.

@tannergooding, are you working on this?

It's on my backlog, but lower priority than a few other items right now.

@tannergooding should we close the PR and reopen it when you think you can update the branch?

That is fine with me.
Aims to solve #980.

Suggested by @ahsonkhan to avoid integer overflow in the bound checking inside the SSE/AVX intrinsics implementation, i.e. change all

while (pCurrent + 8 OR 4 <= pEnd)

into

while (pEnd - pCurrent >= 8 OR 4)

Perf test results before and after the change are shown below:

Before the change:

After the change:

Both SSE and AVX implementations are slower by 10-20% after this change.

In my opinion, after seeing the perf results, I may not recommend merging this PR. I may wait until the alternative suggested by @tannergooding in an earlier PR review has been implemented (2nd item under "Functionality" in briancylui#2).

Another question I have is: would pDstCurrent + 8 OR 4 ever have the possibility to result in integer overflow? To my knowledge, pEnd is initialized as pDstCurrent + count, and there are Contract.Asserts in the wrapper class to check that count does not exceed the original array length. I'm not sure, and am open to any PR comments and advice.

cc: @danmosemsft @eerhardt @tannergooding @ahsonkhan
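For illustration, the before/after shapes of the bound check described above, written as a schematic fragment (pDstCurrent and pDstEnd as in the intrinsics code; the 8-wide loop bodies are elided):

// Before: the addition itself can wrap when the array ends near the top of the
// address space, so the check may pass even though fewer than 8 elements remain.
while (pDstCurrent + 8 <= pDstEnd)
{
    // ... 8-wide AVX body ...
}

// After (this PR): the pointer difference is a (long) element count that is never
// negative here, so neither side of the comparison can overflow or underflow.
while (pDstEnd - pDstCurrent >= 8)
{
    // ... same 8-wide AVX body ...
}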