-
Notifications
You must be signed in to change notification settings - Fork 4.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Some System.Decimal performance improvements #99212
base: main
Are you sure you want to change the base?
Conversation
…se code than Job=ShortRun IterationCount=3 LaunchCount=1 WarmupCount=3 | Method | a | b | Mean | Error | StdDev | Allocated | |--------------------------- |-- |----------- |---------:|----------:|----------:|----------:| | Mul64By32_New | 3 | 4294967295 | 2.068 ns | 0.0459 ns | 0.0383 ns | - | | Mul64By32_Ori | 3 | 4294967295 | 2.916 ns | 0.0231 ns | 0.0193 ns | - |
- Add comment to BigMul64By32 and make it return nunit to avoid clearing upper 32 bits - Simplify IncreaseScale
src/libraries/System.Private.CoreLib/src/System/Decimal.DecCalc.cs
Outdated
Show resolved
Hide resolved
src/libraries/System.Private.CoreLib/src/System/Decimal.DecCalc.cs
Outdated
Show resolved
Hide resolved
Cc @jkotas for the questions about relative prioritization of 32bit performance. |
It's similar to #99196 (and what I did in https://github.com/huoyaoyuan/runtime/tree/decimal-divrem) |
Tagging subscribers to this area: @dotnet/area-system-numerics Issue DetailsSpeed up decimal Multiplication and Division (mostly on x64, but some minor improvements on x86). It does mostly 3 tricks:
This is small port of the code from #7778 now that DivRem is accessible. Questions
Remarks for better performance
BenchmarksFull benchmark code can be found at https://github.com/Daniel-Svensson/ClrExperiments/tree/master/ClrDecimal/Benchmarks 64bit Divide
64bit Multiply
32bit32bit benchmarks results
|
We do not actively invest into improving 32bit performance specifically. At the same, we avoid regressing 32bit performance unless there is a very good reason. |
// TODO: https://github.com/dotnet/runtime/issues/5213 | ||
ulong tmp, div; | ||
if (bufNum.U2 != 0) | ||
if (X86.X86Base.X64.IsSupported) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a lot of extra code for a relatively small performance increase on modern CPUs.
We currently use Skylake
as the baseline for a lot of our perf score numbers and 3ns savings for 13 lines of new code (+23 more for x86) doesn't really seem worth it.
Ideally any improvements would be shared across all 3 platforms or be significant enough to make the additional complexity worthwhile.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok I removed the x64 specific code and just kept the x86 since it was just a few ns faster than the x86 specific code (as long as there are no branch misspredictions) and the x86 should be faster on Skylake
even with a misspredicted branch.
It stills gives around ~10ns faster division for 96/32 case
I can change it to x64 only code insted (2 if statements and 2 DivRem calls) if you rather like it.
/// <param name="den">64-bit divisor</param> | ||
/// <returns>Returns quotient. Remainder overwrites lower 64-bits of dividend.</returns> | ||
[MethodImpl(MethodImplOptions.AggressiveInlining)] | ||
private static unsafe ulong Div128By64(Buf16* bufNum, ulong den) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same general comment, but this is notably a case where we'd ideally just use UInt128.operator /
and have the JIT recognize the general pattern instead.
Doing general purpose optimizations is almost always a better option than doing one offs. Ideally the JIT recognizes cases like ulong / uint
or uint128 / ulong
and does the optimal thing on platforms that support it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In general that would be god, but in this specific case there we know several things that the compiler cannot know and optimize for.
- denominator has already been shifted left so it has highest bit set
- The metod will probably be called several times in a row so it makes sense to do bit shifting etc before it is called
- den > bufNum->High64
However since both types lives in CoreLib, several helpers here could probably be made internal and useable by both UInt128 and Decimal division code. For example I think both this metod and maybe division by 32bit could be usefull from within UInt128
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I opened #99747 with a similar fix for UInt128.
I did not make this internal and call it from UInt128 at the moment, but if the code works as fast without the divisor having the highest bit set then it might make sense to do so.
/// <returns>hi bits of the result</returns> | ||
/// <remarks>returns nuint instead of uint to skip clearing upper 32bits on 64bit platforms</remarks> | ||
[MethodImpl(MethodImplOptions.AggressiveInlining)] | ||
private static nuint BigMul64By32(ulong a, uint b, out ulong low) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same general comment. This is another case where this should really just be a general purpose opt in the JIT, recognizing ulong * uint
and optimizing it accordingly, rather than requiring a full ulong * ulong
, just opting to do whichever is most efficient.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good if the JIT could recognice it and optimize the code.
Since the currently most popular approach to the get more than 64bit result is to call BigMul I hope that that method can be optimized in the future.
I created an internal overload of BigMul in the time beeing, which can easily be removed when the JIT can optimize 32*64 bit => 128bit multiplications
#if TARGET_32BIT | ||
if (Bmi2.IsSupported) | ||
{ | ||
uint low; | ||
uint high = Bmi2.MultiplyNoFlags(a, b, &low); | ||
return ((ulong)high << 32) | low; | ||
} | ||
#endif |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
AFAIR, the intent was for us to have an overload that returns a tuple and uses the JIT multi-reg return hookups, but we haven't exposed that yet.
We ideally aren't removing opts like this, especially when even with the memory access it is potentially faster or close enough in most cases.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok I added it back, but it contributed to more than 20% slower division (more than 10ns for 1/3) so I removed a of calls and replaced it with the fallback path (ulong)(uint)a * (uint)b
which the JIT recognize and generates optmized code for.
Since the JIT already has specialized codegen for 32 * 32 => 64bit
division would it not make sense to just emit mulx
there (or for normal multiplication as well) if it could lead to more efficient code (maybe when writing to memory or when the register usage gets better).
It seems clang uses mulx
for 32 * 32 => 64bit
in 32bit mode and 64 * 64 => 128bit
in 64bit mode.
I'm, personally, not a huge fan of a lot of the changes here. It's a lot of new code for what appears to be some relatively minor perf increases overall. Where the perf increase is more measurable (specifically |
…it can easily be removed once JIT recognize and optimize "ulong * uint"
This has build failures that need to be addressed. They come from using |
@tannergooding I've opted into preview features for the file, hopefully it should compile again. Hope it is ok to opt into preview features for the whole file instead of doing 3 pairs of suppress/restore |
Rerunning CI. Going to finish reviewing this today after CI finishes |
Speed up decimal Multiplication and Division (mostly on x64, but some minor improvements on x86).
It does mostly 3 tricks:
ulong Math.BigMul(uint, uint)
to not useBMI2.MultiplyNoFlags
for better generated code (it gets rid of the memory write)This is small port of the code from #7778 now that DivRem is accessible.
There is some more fixes that can be done at a later time (such as using 64 by 32 multiply in more places), but many of the original changes are not relevant due either missing low lewel primitives (such as carry) or worse code generated code (big mul)
Questions
IntPtr.Size
and#if TARGET_32BIT
for target conditional code, what is the preferred style ?Remarks for better performance
cmp
andtest
. #76502 or (ADX instrincts)shld / ShiftLeft128
would be useful in a few placesBenchmarks
Full benchmark code can be found at https://github.com/Daniel-Svensson/ClrExperiments/tree/master/ClrDecimal/Benchmarks
Overview
Divide x64
64bit Multiply
x86 (32-bit)
32bit benchmarks results