Improve precision of TP exponentiation functions. #99705
Conversation
Tagging subscribers to this area: @dotnet/area-system-numerics

Are there corresponding changes to be made to VectorMath in corelib?

Not sure. Do these routines have precision issues as well?

It's a copy of the same code, e.g. `runtime/src/libraries/System.Private.CoreLib/src/System/Runtime/Intrinsics/VectorMath.cs`, line 11 at commit 4cf19ee.

I was only running the .NET 9 tests when working on this, so clearly something's off here.
Vector128<double> a0 = MultiplyAddEstimateOperator<double>.Invoke(Vector128.Create(C12), r, Vector128.Create(C11));
Vector128<double> a1 = MultiplyAddEstimateOperator<double>.Invoke(Vector128.Create(C10), r, Vector128.Create(C9));
Vector128<double> a2 = MultiplyAddEstimateOperator<double>.Invoke(Vector128.Create(C8), r, Vector128.Create(C7));
Vector128<double> a3 = MultiplyAddEstimateOperator<double>.Invoke(Vector128.Create(C6), r, Vector128.Create(C5));
Vector128<double> a4 = MultiplyAddEstimateOperator<double>.Invoke(Vector128.Create(C4), r, Vector128.Create(C3));
Vector128<double> a5 = MultiplyAddEstimateOperator<double>.Invoke(a0, r2, a1);
Vector128<double> a6 = MultiplyAddEstimateOperator<double>.Invoke(a2, r2, a3);
Vector128<double> a7 = MultiplyAddEstimateOperator<double>.Invoke(a4, r2, r + Vector128<double>.One);
nit: This many locals isn't really "good" for the JIT and can hinder some optimizations.
It can also make it a little bit harder to follow the intended computation ordering, which can itself limit other optimizations or reorderings the JIT would be allowed to do.
We could reduce the local count significantly if we broke it up like this, and the codegen is then also equivalent to having no locals at all:
Vector128<double> t1 = MultiplyAddEstimate(
MultiplyAddEstimate(Vector128.Create(C12), r, Vector128.Create(C11)),
r2,
MultiplyAddEstimate(Vector128.Create(C10), r, Vector128.Create(C9))
);
Vector128<double> t2 = MultiplyAddEstimate(
MultiplyAddEstimate(Vector128.Create(C8), r, Vector128.Create(C7)),
r2,
MultiplyAddEstimate(Vector128.Create(C6), r, Vector128.Create(C5))
);
t1 = MultiplyAddEstimate(t1, r8, t2);
t2 = MultiplyAddEstimate(Vector128.Create(C4), r, Vector128.Create(C3));
t2 = MultiplyAddEstimate(t2, r2, r + Vector128<double>.One);
t1 = MultiplyAddEstimate(t1, r4, t2);   // final polynomial value
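For readers unfamiliar with the grouping, the suggestion above is an Estrin-style evaluation: coefficient pairs are combined with `r`, pairs of pairs with `r2`, and the two halves are joined with `r4`/`r8`, which shortens the dependency chains compared to a plain Horner evaluation. A scalar sketch of the same ordering, with `Math.FusedMultiplyAdd` standing in for `MultiplyAddEstimateOperator<double>` (the coefficient names mirror the vector code but are parameters here, not the real polynomial constants):

```csharp
// Hypothetical scalar sketch of the Estrin-style grouping used above.
// Math.FusedMultiplyAdd(x, y, z) computes x * y + z with a single rounding.
static double EvalPolyEstrin(
    double r,
    double C3, double C4, double C5, double C6, double C7,
    double C8, double C9, double C10, double C11, double C12)
{
    double r2 = r * r, r4 = r2 * r2, r8 = r4 * r4;

    // High half: coefficients C9..C12 combined over r and r2.
    double t1 = Math.FusedMultiplyAdd(
        Math.FusedMultiplyAdd(C12, r, C11),
        r2,
        Math.FusedMultiplyAdd(C10, r, C9));

    // Middle half: coefficients C5..C8.
    double t2 = Math.FusedMultiplyAdd(
        Math.FusedMultiplyAdd(C8, r, C7),
        r2,
        Math.FusedMultiplyAdd(C6, r, C5));

    t1 = Math.FusedMultiplyAdd(t1, r8, t2);

    // Low terms: C3, C4, plus r + 1.
    t2 = Math.FusedMultiplyAdd(
        Math.FusedMultiplyAdd(C4, r, C3), r2, r + 1.0);

    return Math.FusedMultiplyAdd(t1, r4, t2);
}
```

Both orderings compute the same polynomial; the nested form simply makes the intended evaluation order explicit to the JIT instead of encoding it through many single-use locals.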
Updating the implementations to use FMA where applicable.
NB: tests might require some calibration, pending test results from platforms without FMA support.
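The calibration concern comes down to rounding: a fused multiply-add rounds once, while hardware without FMA falls back to a separate multiply and add, which rounds twice, so results can differ in the low bits. A minimal illustration of the difference (scalar `Math.FusedMultiplyAdd`, which .NET software-emulates when no hardware FMA is available):

```csharp
// Illustrative only: why test tolerances may need calibration on
// platforms without FMA. The fused form rounds once; the unfused
// form rounds the product first, losing the low bits.
double a = 1.0 + Math.Pow(2, -30);   // 1 + 2^-30, exactly representable
double b = 1.0 - Math.Pow(2, -30);   // 1 - 2^-30, exactly representable
double c = -1.0;

double fused   = Math.FusedMultiplyAdd(a, b, c); // exact: -2^-60
double unfused = a * b + c;                      // a*b rounds to 1.0, so 0.0
```

Because the two paths can legitimately disagree by an ulp or so per operation, accuracy tests for these routines typically compare against a high-precision reference with a tolerance rather than expecting bit-exact results across hardware.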