Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use BigMul for 32x32=64 in decimal #93345

Merged
merged 6 commits into from
Nov 6, 2023
Merged

Conversation

lilinus
Copy link
Contributor

@lilinus lilinus commented Oct 11, 2023

No need to have a separate implementation in DecCalc

@ghost ghost added the community-contribution Indicates that the PR has been added by a community member label Oct 11, 2023
@dotnet-issue-labeler dotnet-issue-labeler bot added the needs-area-label An area label is needed to ensure this gets routed to the appropriate area owners label Oct 11, 2023
@ghost
Copy link

ghost commented Oct 11, 2023

Tagging subscribers to this area: @dotnet/area-system-numerics
See info in area-owners.md if you want to be subscribed.

Issue Details

No need to have a separate implementation in DecCalc

Author: lilinus
Assignees: -
Labels:

area-System.Numerics, community-contribution, needs-area-label

Milestone: -

@@ -184,20 +184,7 @@ private static ulong UInt32x32To64(uint a, uint b)

private static void UInt64x64To128(ulong a, ulong b, ref DecCalc result)
{
ulong low = UInt32x32To64((uint)a, (uint)b); // lo partial prod
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you delete UInt32x32To64 as well?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could yes, but:

  • Unfortunately does not exist any BigMul method for two uint32.
  • It is used in a lot of places where narrowing/casting from ulongs to uints are done, and the code is pretty convoluted already. This shows intent quite nicely

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you measure the performance of this change? Last time I checked, a similar change caused a significant performance regression because the codegen for Bmi2.MultiplyNoFlags is suboptimal.

@@ -184,20 +184,7 @@ private static ulong UInt32x32To64(uint a, uint b)

private static void UInt64x64To128(ulong a, ulong b, ref DecCalc result)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can be replaced, technically speaking, by Math.BigMul(ulong, ulong, out ulong) as well.

Then it can also be deleted.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah sure, NB that overflow check and insertion to DecCalc result are still in body below. Do we prefer to move those around to where this is now called or should i keep this method?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably better to keep the extra checks centralized here and just defer the algorithm to Math.BigMul

@lewing lewing removed the needs-area-label An area label is needed to ensure this gets routed to the appropriate area owners label Oct 11, 2023
@lilinus
Copy link
Contributor Author

lilinus commented Oct 12, 2023

@dotnet-policy-service agree

@@ -381,7 +376,7 @@ private static uint Div96By64(ref Buf12 bufNum, ulong den)

// Compute full remainder, rem = dividend - (quo * divisor).
//
ulong prod = UInt32x32To64(quo, (uint)den); // quo * lo divisor
ulong prod = quo * (den & uint.MaxValue); // quo * lo divisor
Copy link
Member

@tannergooding tannergooding Oct 12, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This probably needs a comment on what it's doing.

But, notably, this may regress 32-bit platforms as it will now do a more expensive 64x64=64 multiplication, rather than doing the cheaper 32x32=64.

In general an internal Math.BigMul(uint a, uint b, out uint low) could be defined that uses Bmi2.MultiplyNoFlags, ArmBase.MultiplyHigh, and otherwise falls back to the naive algorithm of (ulong)a * b

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved all the stuff into an internal ulong Math.BigMul(uint, uint) (since we had a long Math.BigMul(int, int)). I tried using initrinsic for x86 on 32 bit. Doesn't seem to exist a ArmBase.MultiplyHigh yet.

Copy link
Member

@tannergooding tannergooding left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks!

@tannergooding
Copy link
Member

Closing and reopening to requeue CI and try to get the tests all passing. They look unrelated.

@adamsitnik adamsitnik merged commit 4f26ec9 into dotnet:main Nov 6, 2023
173 of 175 checks passed
@adamsitnik adamsitnik added this to the 9.0.0 milestone Nov 6, 2023
if (Bmi2.IsSupported)
{
uint low;
uint high = Bmi2.MultiplyNoFlags(a, b, &low);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you measure the performance for this? It's very likely to be significantly worse than the previous codegen that the JIT automatically generated for uint*uint->ulong multiplication on x86.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is indeed a lot worse, which is strange since it shouldn't be. mulx is the preferred instruction to do this on x86, at least for cases like this.

Guessing the JIT has various work that needs to be done to ensure it does the right thing.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry no i didn't know how to do benchmarks. Probably we should just remove the intrinsic block for x86 in BigMul methods if they are slow? Same for 64 x 64 = 128

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably we should just remove the intrinsic block for x86 in BigMul methods if they are slow

Yes, preferably with an issue around it.

Same for 64 x 64 = 128

No. There isn't a primitive type and no corresponding decomp or other work that makes the naive thing more efficient.

There's still more improvements that could be done around Bmi2.X64.MultiplyNoFlags, but nothing that makes it less efficient than the FOIL based fallback.

Copy link
Contributor

@pentp pentp Nov 6, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's an existing issue for MULX codegen quality: #11782
And a different related issue: #75594

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And a 3rd issue that is very closely related to this DecCalc code: #58263

@lilinus lilinus deleted the decimal-calc-bigmul branch November 13, 2023 13:19
@github-actions github-actions bot locked and limited conversation to collaborators Dec 14, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-System.Numerics community-contribution Indicates that the PR has been added by a community member
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants