Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement 128-bit Multiply intrinsic for x86/x64 #58263

Open
pentp opened this issue Aug 27, 2021 · 8 comments
Open

Implement 128-bit Multiply intrinsic for x86/x64 #58263

pentp opened this issue Aug 27, 2021 · 8 comments
Labels
api-suggestion Early API idea and discussion, it is NOT ready for implementation area-System.Runtime.Intrinsics needs-further-triage Issue has been initially triaged, but needs deeper consideration or reconsideration
Milestone

Comments

@pentp
Copy link
Contributor

pentp commented Aug 27, 2021

Background and motivation

Intel x86/x64 provides MUL/IMUL instructions that compute the low and high bits of a multiplication in a single instruction.
This would be very useful for Math.BigMul implementations (currently using intrinsics on ARM64 and MULX on x64, but that has bad CQ at the moment).
This would also speed up System.Decimal calculations significantly (currently can't use Math.BigMul there because it's slower than the hand-tuned existing code that composes big multiplications from smaller 32x32 multiplications).

API Proposal

namespace System.Runtime.Intrinsics.X86
{
    partial class X86Base
    {
        partial class X64
        {
            internal static (ulong Lower, ulong Upper) Multiply(ulong left, ulong right);

            internal static (ulong Lower, long Upper) Multiply(long left, long right);
        }
    }
}

Related DivMod API: #27292

API Usage

In decimal.DecCalc.VarDecFromR8 this would allow:

// Add -power factors of 10, -power <= (29 - 15) = 14.
power = -power;
if (X86.X86Base.X64.IsSupported || Arm.ArmBase.Arm64.IsSupported)
{
    ulong low64;
    ulong hi64 = Math.BigMul(mant, s_ulongPowers10[power], out low64);
    if (hi64 > uint.MaxValue)
        Number.ThrowOverflowException(TypeCode.Decimal);
    result.High = (uint)hi64;
    result.Low64 = low64;
}
else if (power < 10)
{
    uint pow10 = s_powers10[power];
    ulong low64 = UInt32x32To64((uint)mant, pow10);
    ulong hi64 = UInt32x32To64((uint)(mant >> 32), pow10);
    result.Low = (uint)low64;
    hi64 += low64 >> 32;
    result.Mid = (uint)hi64;
    hi64 >>= 32;
    result.High = (uint)hi64;
}
else
{
    UInt64x64To128(mant, s_ulongPowers10[power], ref result);
}

Another example in decimal.DecCalc.VarDecMul that would only be faster on x64 (because on ARM64 BigMul is actually two expensive instructions):

// Highest 32 bits is non-zero. Calculate 5 more partial products.
if (X86.X86Base.X64.IsSupported)
{
    ulong mid64 = tmp;
    tmp = Math.BigMul(d1.High, d2.Low64, out tmp2);
    if (mid64 > (mid64 += tmp2)) // add with carry detection
        tmp++;
    tmp += Math.BigMul(d2.High, d1.Low64, out tmp2);
    if (mid64 > (mid64 += tmp2)) // add with carry detection
        tmp++;
    bufProd.Mid64 = mid64;
}
else
{
    tmp2 = UInt32x32To64(d1.Low, d2.High);
    tmp += tmp2; // this could generate carry
    uint tmp3 = 0;
    if (tmp < tmp2) // detect carry
        tmp3 = 1;

    tmp2 = UInt32x32To64(d1.High, d2.Low);
    tmp += tmp2; // this could generate carry
    bufProd.U2 = (uint)tmp;
    if (tmp < tmp2) // detect carry
        tmp3++;
    tmp2 = ((ulong)tmp3 << 32) | (tmp >> 32);

    tmp = UInt32x32To64(d1.Mid, d2.High);
    tmp += tmp2; // this could generate carry
    tmp3 = 0;
    if (tmp < tmp2) // detect carry
        tmp3 = 1;

    tmp2 = UInt32x32To64(d1.High, d2.Mid);
    tmp += tmp2; // this could generate carry
    bufProd.U3 = (uint)tmp;
    if (tmp < tmp2) // detect carry
        tmp3++;
    tmp = ((ulong)tmp3 << 32) | (tmp >> 32);
}
bufProd.High64 = UInt32x32To64(d1.High, d2.High) + tmp;
hiProd = 5;
@pentp pentp added the api-suggestion Early API idea and discussion, it is NOT ready for implementation label Aug 27, 2021
@dotnet-issue-labeler
Copy link

I couldn't figure out the best area label to add to this issue. If you have write-permissions please help me learn by adding exactly one area label.

@dotnet-issue-labeler dotnet-issue-labeler bot added the untriaged New issue has not been triaged by the area owner label Aug 27, 2021
@ghost
Copy link

ghost commented Aug 27, 2021

Tagging subscribers to this area: @dotnet/area-system-runtime-intrinsics
See info in area-owners.md if you want to be subscribed.

Issue Details

Background and motivation

Intel x86/x64 provides MUL/IMUL instructions that compute the low and high bits of a multiplication in a single instruction.
This would be very useful for Math.BigMul implementations (currently using intrinsics on ARM64 and MULX on x64, but that has bad CQ at the moment).
This would also speed up System.Decimal calculations significantly (using Math.BigMul there isn't an option because it's faster than the current algorithm only if it's a single HW instruction).

API Proposal

namespace System.Runtime.Intrinsics.X86
{
    partial class X86Base
    {
        // the 32-bit/nint variants might not be needed actually as they don't provide much value
        public (uint Lower, uint Upper) Multiply(uint left, uint right);

        public (uint Lower, int Upper) Multiply(int left, int right);

        public (nuint Lower, nuint Upper) Multiply(nuint left, nuint right);

        public (nuint Lower, nint Upper) Multiply(nint left, nint right);

        partial class X64
        {
            public (ulong Lower, ulong Upper) Multiply(ulong left, ulong right);

            public (ulong Lower, long Upper) Multiply(long left, long right);
        }
    }
}

Related DivMod API: #27292

API Usage

In decimal.DecCalc.VarDecFromR8 this would allow:

// Add -power factors of 10, -power <= (29 - 15) = 14.
power = -power;
if (X86.X86Base.X64.IsSupported || Arm.ArmBase.Arm64.IsSupported)
{
    ulong low64;
    ulong hi64 = Math.BigMul(mant, s_ulongPowers10[power], out low64);
    if (hi64 > uint.MaxValue)
        Number.ThrowOverflowException(TypeCode.Decimal);
    result.High = (uint)hi64;
    result.Low64 = low64;
}
else if (power < 10)
{
    uint pow10 = s_powers10[power];
    ulong low64 = UInt32x32To64((uint)mant, pow10);
    ulong hi64 = UInt32x32To64((uint)(mant >> 32), pow10);
    result.Low = (uint)low64;
    hi64 += low64 >> 32;
    result.Mid = (uint)hi64;
    hi64 >>= 32;
    result.High = (uint)hi64;
}
else
{
    UInt64x64To128(mant, s_ulongPowers10[power], ref result);
}

Another example in decimal.DecCalc.VarDecMul that would directly depend on X64.Multiply (because on ARM64 BigMul would be actually slower than the current code):

// Highest 32 bits is non-zero. Calculate 5 more partial products.
if (X86.X86Base.X64.IsSupported)
{
    ulong mid64 = tmp;
    (tmp2, tmp) = X86.X86Base.X64.Multiply(d1.High, d2.Low64);
    if (mid64 > (mid64 += tmp2)) // add with carry detection
        tmp++;
    (tmp2, ulong tmphigh) = X86.X86Base.X64.Multiply(d2.High, d1.Low64, &tmp2);
    tmp += tmphigh;
    if (mid64 > (mid64 += tmp2)) // add with carry detection
        tmp++;
    bufProd.Mid64 = mid64;
}
else
{
    tmp2 = UInt32x32To64(d1.Low, d2.High);
    tmp += tmp2; // this could generate carry
    uint tmp3 = 0;
    if (tmp < tmp2) // detect carry
        tmp3 = 1;

    tmp2 = UInt32x32To64(d1.High, d2.Low);
    tmp += tmp2; // this could generate carry
    bufProd.U2 = (uint)tmp;
    if (tmp < tmp2) // detect carry
        tmp3++;
    tmp2 = ((ulong)tmp3 << 32) | (tmp >> 32);

    tmp = UInt32x32To64(d1.Mid, d2.High);
    tmp += tmp2; // this could generate carry
    tmp3 = 0;
    if (tmp < tmp2) // detect carry
        tmp3 = 1;

    tmp2 = UInt32x32To64(d1.High, d2.Mid);
    tmp += tmp2; // this could generate carry
    bufProd.U3 = (uint)tmp;
    if (tmp < tmp2) // detect carry
        tmp3++;
    tmp = ((ulong)tmp3 << 32) | (tmp >> 32);
}
bufProd.High64 = UInt32x32To64(d1.High, d2.High) + tmp;
hiProd = 5;
Author: pentp
Assignees: -
Labels:

api-suggestion, area-System.Runtime.Intrinsics, untriaged

Milestone: -

@jkotas
Copy link
Member

jkotas commented Aug 27, 2021

because on ARM64 BigMul would be actually slower than the current code

Why would it be slower?

@pentp
Copy link
Contributor Author

pentp commented Aug 27, 2021

because on ARM64 BigMul would be actually slower than the current code

Why would it be slower?

ARM64 doesn't have a single instruction for BigMul, so it uses 64-bit MUL + UMULH, which are twice as expensive as 32x32->64 UMULL (which UInt32x32To64 maps to after #57926 is merged). So in this example it could be up to 2x slower (minus the extra adds/shifts).

@pentp
Copy link
Contributor Author

pentp commented Aug 27, 2021

I suppose we could start off with the API being internal only and use it through Math.BigMul while checking for X86.X86Base.X64.IsSupported. In this case this issue isn't an actual API proposal, but a proposal to implement the intrinsics by RyuJIT.
Should I refactor the proposal to reflect that?

@pentp pentp changed the title [API Proposal]: Add Multiply intrinsic for x86/x64 Implement 128-bit Multiply intrinsic for x86/x64 Aug 27, 2021
@pentp
Copy link
Contributor Author

pentp commented Aug 27, 2021

There is an existing PR for getting MULX codegen to be actually useful: #37928
If we can resurrect and merge that, then adding MUL and IMUL to X86Base should be relatively straightforward.

@jeffhandley jeffhandley added needs-further-triage Issue has been initially triaged, but needs deeper consideration or reconsideration and removed untriaged New issue has not been triaged by the area owner labels Sep 30, 2021
@jeffhandley jeffhandley added this to the Future milestone Sep 30, 2021
@deeprobin
Copy link
Contributor

Maybe a somewhat unusual question, but wouldn't it make sense to create a new primitive instead of working with an ulong tuple? For example UInt128 (doublelong, quad, ...).

@huoyaoyuan
Copy link
Member

@tannergooding How about this, together with DivRem?
I don't know the deep difference between MUL/IMUL and MULX/IMULX, and which group we should prefer. I think the most of existing hardware should support BMI2 now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api-suggestion Early API idea and discussion, it is NOT ready for implementation area-System.Runtime.Intrinsics needs-further-triage Issue has been initially triaged, but needs deeper consideration or reconsideration
Projects
None yet
Development

No branches or pull requests

5 participants