
Improve precision of TP exponentiation functions. #99705

Closed

Conversation

@eiriktsarpalis (Member) commented Mar 13, 2024

Updating the implementations to use FMA where applicable.

NB: tests might require some calibration, pending test results from platforms without FMA support.

Contributor

Tagging subscribers to this area: @dotnet/area-system-numerics
See info in area-owners.md if you want to be subscribed.

@stephentoub (Member)

Are there corresponding changes to be made to VectorMath in corelib?

@eiriktsarpalis (Member, Author)

> Are there corresponding changes to be made to VectorMath in corelib?

Not sure. Do these routines have precision issues as well?

@stephentoub (Member)

> Are there corresponding changes to be made to VectorMath in corelib?
>
> Not sure. Do these routines have precision issues as well?

It's a copy of the same code, e.g.:

public static TVectorDouble ExpDouble<TVectorDouble, TVectorInt64, TVectorUInt64>(TVectorDouble x)

The code being changed here for Exp, for example, only applies to .NET 8; on .NET 9 the code from corelib ends up being used.

@eiriktsarpalis
Copy link
Member Author

I was only running the .NET 9 tests when working on this, so clearly something's off here.

Comment on lines +207 to +214
Vector128<double> a0 = MultiplyAddEstimateOperator<double>.Invoke(Vector128.Create(C12), r, Vector128.Create(C11));
Vector128<double> a1 = MultiplyAddEstimateOperator<double>.Invoke(Vector128.Create(C10), r, Vector128.Create(C9));
Vector128<double> a2 = MultiplyAddEstimateOperator<double>.Invoke(Vector128.Create(C8), r, Vector128.Create(C7));
Vector128<double> a3 = MultiplyAddEstimateOperator<double>.Invoke(Vector128.Create(C6), r, Vector128.Create(C5));
Vector128<double> a4 = MultiplyAddEstimateOperator<double>.Invoke(Vector128.Create(C4), r, Vector128.Create(C3));
Vector128<double> a5 = MultiplyAddEstimateOperator<double>.Invoke(a0, r2, a1);
Vector128<double> a6 = MultiplyAddEstimateOperator<double>.Invoke(a2, r2, a3);
Vector128<double> a7 = MultiplyAddEstimateOperator<double>.Invoke(a4, r2, r + Vector128<double>.One);
Member

nit: This many locals isn't really "good" for the JIT and can hinder some optimizations.

It can also make it a little bit harder to follow the intended computation ordering, which can itself limit other optimizations or reorderings the JIT would be allowed to do.

We could reduce the local count significantly if we broke it up like this; it's then also equivalent in codegen to having no locals at all:

Vector128<double> t1 = MultiplyAddEstimate(
    MultiplyAddEstimate(Vector128.Create(C12), r, Vector128.Create(C11)),
    r2,
    MultiplyAddEstimate(Vector128.Create(C10), r, Vector128.Create(C9))
);

Vector128<double> t2 = MultiplyAddEstimate(
    MultiplyAddEstimate(Vector128.Create(C8), r, Vector128.Create(C7)),
    r2,
    MultiplyAddEstimate(Vector128.Create(C6), r, Vector128.Create(C5))
);

t1 = MultiplyAddEstimate(t1, r8, t2);
t2 = MultiplyAddEstimate(Vector128.Create(C4), r, Vector128.Create(C3));
t2 = MultiplyAddEstimate(t2, r2, r + Vector128<double>.One);

return MultiplyAddEstimate(t1, r4, t2);

You can see it results in smaller codegen, lets the initial multiplies be dispatched in parallel up front, keeps the total register usage lower, etc.: https://sharplab.io/#v2:EYLgxg9gTgpgtADwGwBYA0AXEBDAzgWwB8ABABgAJiBGAOgCUBXAOwwEt8YaBJFqVp3KzC4A3AFgAUGUq1GLdpx4Y+AobhoANABxJxUgMyUATOQDC5AN6TyN8gAc+AN2wYY5SAIzkAJhAbAAGzdTQwBeclIaAFZSWLj4uKotPVt7Jxc3D1wvX38gsxRycMiqJDLyivKqAHYU2wdWZ1d3CE8fP0DgqKKImlIUUsrylABOI2ra63r05qycjvzTJB7I2K19Dc3NqKix6qM6mwamzNbs9rzg6pW+0ip9LUenkaiqIyMUfsO0xoyWttynTMWhuiRGWgGRiQIxQ+ihE1GUW+xz+cwuQNMIx6RhoKC0dzxSCiRnu1RQeP0AFE4KQkVMjjNTgCFsEqBRwjjqjtqkh3lE4VFYaQjFSabp6T8Tv9zoDFlQqNiaDy4UYRnDwUYYi8UNTSJMJKkUbMzvNLmY3orRjD9P19FQXlootUqLrkhKJcRDAA1GBgDDQN5aAA8spgAD5yABZAAUPr9AaMwdDEagAEpLBLUnH/VBAyGWSmTOEoOQAFTkKDfLO+nN55MVwrFkzlqAHTO2bMJpMFisg4uFFsoKsdmtd/OXCPYdlRhgBNh2AIATwAgt5vJTsuwMrHR7nEzRTLBt6Y3qm0BXz5291oD0fXNGT1RU6nhzYr3We9gFeFI7P50vV3XTd8G3d890PGBjzZM8L3IMCbwg48RmfV84N3D8J3IbAixnOdWAXFc1w3NgQPveDb0g+9TC0GCoEvdDwLvGAH2qFD2zfBju0w7Awlw/9CKAkjQM4ijjyQWj6Pja9RKoqI2INVIRykjDOknRs+PwgCiOA4TlMYyjmNMFAJLQvSEKYh99HkxSlNrRNx1UrDuh/P9NIE4it3vKdz1bc8vxfdjTLsrjHOwZYXLwgjAI80jmOwnyjD8qzUPghygkna4Iv46KdK89AK0SityAAaiCsdkxoAB5JgYAChSbMCgB6Rq7AgJcemwboWxBUqwrLBsSqw/UbI4sy0vDew2sXHpf0irTBM8uKkB8/LsFY1DiGuWbsu0oSvKiHytHPVqlzq1IAF93XqyhvU48aIwAOR3Mb6zTDNrurF6e1bHoSxbFK7te9SfpbNsPts8rvr7AbBwBr7MIwb8NKi3bFujQLUm2tycr25jyMQqjTx8yTgpkwz5WfNAMdsXzqZsLGUYW2LntJgnydIEz8YszFn0CurAtS+sMBwhn5pi7c6dF9zcrxkS2YfGjz1grmDJYym6dp8HFKlnG0ZVsSYOVuXubk1M+b0QLEZm1zGfF+9EcO89hbOkbhetubpdxlmuzJh9jKVuiyuk+WQms12RZtsWZejYWEp8wbBYLKqav5rXKC2yPPbRh2GydowXZsS7rolI03ETzCddR5ny8coIADMMBJyHML4ABzAALRug5U9KsLXGAmG8M3rqsNPWDr8towAMRA7hcAAZQYOxWqgVwh8C0eRsUzbyBn7AaEr7xo3rru287vz+8HgvFKLrfAp34+YAb/qz4wdNesv7xviL86gA=

@github-actions github-actions bot locked and limited conversation to collaborators Apr 13, 2024