Implement 128-bit Multiply intrinsic for x86/x64 #58263

pentp · 2021-08-27T13:23:55Z

Background and motivation

Intel x86/x64 provides MUL/IMUL instructions that compute the low and high bits of a multiplication in a single instruction.
This would be very useful for Math.BigMul implementations (currently using intrinsics on ARM64 and MULX on x64, but that has bad CQ at the moment).
This would also speed up System.Decimal calculations significantly (currently can't use Math.BigMul there because it's slower than the hand-tuned existing code that composes big multiplications from smaller 32x32 multiplications).

API Proposal

namespace System.Runtime.Intrinsics.X86
{
    partial class X86Base
    {
        partial class X64
        {
            internal static (ulong Lower, ulong Upper) Multiply(ulong left, ulong right);

            internal static (ulong Lower, long Upper) Multiply(long left, long right);
        }
    }
}

Related DivMod API: #27292

API Usage

In decimal.DecCalc.VarDecFromR8 this would allow:

// Add -power factors of 10, -power <= (29 - 15) = 14.
power = -power;
if (X86.X86Base.X64.IsSupported || Arm.ArmBase.Arm64.IsSupported)
{
    ulong low64;
    ulong hi64 = Math.BigMul(mant, s_ulongPowers10[power], out low64);
    if (hi64 > uint.MaxValue)
        Number.ThrowOverflowException(TypeCode.Decimal);
    result.High = (uint)hi64;
    result.Low64 = low64;
}
else if (power < 10)
{
    uint pow10 = s_powers10[power];
    ulong low64 = UInt32x32To64((uint)mant, pow10);
    ulong hi64 = UInt32x32To64((uint)(mant >> 32), pow10);
    result.Low = (uint)low64;
    hi64 += low64 >> 32;
    result.Mid = (uint)hi64;
    hi64 >>= 32;
    result.High = (uint)hi64;
}
else
{
    UInt64x64To128(mant, s_ulongPowers10[power], ref result);
}

Another example in decimal.DecCalc.VarDecMul that would only be faster on x64 (because on ARM64 BigMul is actually two expensive instructions):

// Highest 32 bits is non-zero. Calculate 5 more partial products.
if (X86.X86Base.X64.IsSupported)
{
    ulong mid64 = tmp;
    tmp = Math.BigMul(d1.High, d2.Low64, out tmp2);
    if (mid64 > (mid64 += tmp2)) // add with carry detection
        tmp++;
    tmp += Math.BigMul(d2.High, d1.Low64, out tmp2);
    if (mid64 > (mid64 += tmp2)) // add with carry detection
        tmp++;
    bufProd.Mid64 = mid64;
}
else
{
    tmp2 = UInt32x32To64(d1.Low, d2.High);
    tmp += tmp2; // this could generate carry
    uint tmp3 = 0;
    if (tmp < tmp2) // detect carry
        tmp3 = 1;

    tmp2 = UInt32x32To64(d1.High, d2.Low);
    tmp += tmp2; // this could generate carry
    bufProd.U2 = (uint)tmp;
    if (tmp < tmp2) // detect carry
        tmp3++;
    tmp2 = ((ulong)tmp3 << 32) | (tmp >> 32);

    tmp = UInt32x32To64(d1.Mid, d2.High);
    tmp += tmp2; // this could generate carry
    tmp3 = 0;
    if (tmp < tmp2) // detect carry
        tmp3 = 1;

    tmp2 = UInt32x32To64(d1.High, d2.Mid);
    tmp += tmp2; // this could generate carry
    bufProd.U3 = (uint)tmp;
    if (tmp < tmp2) // detect carry
        tmp3++;
    tmp = ((ulong)tmp3 << 32) | (tmp >> 32);
}
bufProd.High64 = UInt32x32To64(d1.High, d2.High) + tmp;
hiProd = 5;

The text was updated successfully, but these errors were encountered:

dotnet-issue-labeler · 2021-08-27T13:24:00Z

I couldn't figure out the best area label to add to this issue. If you have write-permissions please help me learn by adding exactly one area label.

ghost · 2021-08-27T14:50:24Z

Tagging subscribers to this area: @dotnet/area-system-runtime-intrinsics
See info in area-owners.md if you want to be subscribed.

Issue Details

Background and motivation

Intel x86/x64 provides MUL/IMUL instructions that compute the low and high bits of a multiplication in a single instruction.
This would be very useful for Math.BigMul implementations (currently using intrinsics on ARM64 and MULX on x64, but that has bad CQ at the moment).
This would also speed up System.Decimal calculations significantly (using Math.BigMul there isn't an option because it's faster than the current algorithm only if it's a single HW instruction).

API Proposal

namespace System.Runtime.Intrinsics.X86
{
    partial class X86Base
    {
        // the 32-bit/nint variants might not be needed actually as they don't provide much value
        public (uint Lower, uint Upper) Multiply(uint left, uint right);

        public (uint Lower, int Upper) Multiply(int left, int right);

        public (nuint Lower, nuint Upper) Multiply(nuint left, nuint right);

        public (nuint Lower, nint Upper) Multiply(nint left, nint right);

        partial class X64
        {
            public (ulong Lower, ulong Upper) Multiply(ulong left, ulong right);

            public (ulong Lower, long Upper) Multiply(long left, long right);
        }
    }
}

Related DivMod API: #27292

API Usage

In decimal.DecCalc.VarDecFromR8 this would allow:

// Add -power factors of 10, -power <= (29 - 15) = 14.
power = -power;
if (X86.X86Base.X64.IsSupported || Arm.ArmBase.Arm64.IsSupported)
{
    ulong low64;
    ulong hi64 = Math.BigMul(mant, s_ulongPowers10[power], out low64);
    if (hi64 > uint.MaxValue)
        Number.ThrowOverflowException(TypeCode.Decimal);
    result.High = (uint)hi64;
    result.Low64 = low64;
}
else if (power < 10)
{
    uint pow10 = s_powers10[power];
    ulong low64 = UInt32x32To64((uint)mant, pow10);
    ulong hi64 = UInt32x32To64((uint)(mant >> 32), pow10);
    result.Low = (uint)low64;
    hi64 += low64 >> 32;
    result.Mid = (uint)hi64;
    hi64 >>= 32;
    result.High = (uint)hi64;
}
else
{
    UInt64x64To128(mant, s_ulongPowers10[power], ref result);
}

Another example in decimal.DecCalc.VarDecMul that would directly depend on X64.Multiply (because on ARM64 BigMul would be actually slower than the current code):

// Highest 32 bits is non-zero. Calculate 5 more partial products.
if (X86.X86Base.X64.IsSupported)
{
    ulong mid64 = tmp;
    (tmp2, tmp) = X86.X86Base.X64.Multiply(d1.High, d2.Low64);
    if (mid64 > (mid64 += tmp2)) // add with carry detection
        tmp++;
    (tmp2, ulong tmphigh) = X86.X86Base.X64.Multiply(d2.High, d1.Low64, &tmp2);
    tmp += tmphigh;
    if (mid64 > (mid64 += tmp2)) // add with carry detection
        tmp++;
    bufProd.Mid64 = mid64;
}
else
{
    tmp2 = UInt32x32To64(d1.Low, d2.High);
    tmp += tmp2; // this could generate carry
    uint tmp3 = 0;
    if (tmp < tmp2) // detect carry
        tmp3 = 1;

    tmp2 = UInt32x32To64(d1.High, d2.Low);
    tmp += tmp2; // this could generate carry
    bufProd.U2 = (uint)tmp;
    if (tmp < tmp2) // detect carry
        tmp3++;
    tmp2 = ((ulong)tmp3 << 32) | (tmp >> 32);

    tmp = UInt32x32To64(d1.Mid, d2.High);
    tmp += tmp2; // this could generate carry
    tmp3 = 0;
    if (tmp < tmp2) // detect carry
        tmp3 = 1;

    tmp2 = UInt32x32To64(d1.High, d2.Mid);
    tmp += tmp2; // this could generate carry
    bufProd.U3 = (uint)tmp;
    if (tmp < tmp2) // detect carry
        tmp3++;
    tmp = ((ulong)tmp3 << 32) | (tmp >> 32);
}
bufProd.High64 = UInt32x32To64(d1.High, d2.High) + tmp;
hiProd = 5;

Author:	pentp
Assignees:	-
Labels:	`api-suggestion`, `area-System.Runtime.Intrinsics`, `untriaged`
Milestone:	-

jkotas · 2021-08-27T14:51:53Z

because on ARM64 BigMul would be actually slower than the current code

Why would it be slower?

pentp · 2021-08-27T15:10:22Z

because on ARM64 BigMul would be actually slower than the current code

Why would it be slower?

ARM64 doesn't have a single instruction for BigMul, so it uses 64-bit MUL + UMULH, which are twice as expensive as 32x32->64 UMULL (which UInt32x32To64 maps to after #57926 is merged). So in this example it could be up to 2x slower (minus the extra adds/shifts).

pentp · 2021-08-27T15:20:50Z

I suppose we could start off with the API being internal only and use it through Math.BigMul while checking for X86.X86Base.X64.IsSupported. In this case this issue isn't an actual API proposal, but a proposal to implement the intrinsics by RyuJIT.
Should I refactor the proposal to reflect that?

pentp · 2021-08-27T16:03:22Z

There is an existing PR for getting MULX codegen to be actually useful: #37928
If we can resurrect and merge that, then adding MUL and IMUL to X86Base should be relatively straightforward.

deeprobin · 2021-11-13T23:31:20Z

Maybe a somewhat unusual question, but wouldn't it make sense to create a new primitive instead of working with an ulong tuple? For example UInt128 (doublelong, quad, ...).

huoyaoyuan · 2022-03-17T06:44:23Z

@tannergooding How about this, together with DivRem?
I don't know the deep difference between MUL/IMUL and MULX/IMULX, and which group we should prefer. I think the most of existing hardware should support BMI2 now.

pentp added the api-suggestion Early API idea and discussion, it is NOT ready for implementation label Aug 27, 2021

dotnet-issue-labeler bot added the untriaged New issue has not been triaged by the area owner label Aug 27, 2021

pentp mentioned this issue Aug 27, 2021

Add DivMod instrinct for intel x86/x64 #27292

Closed

jkotas added the area-System.Runtime.Intrinsics label Aug 27, 2021

pentp changed the title ~~[API Proposal]: Add Multiply intrinsic for x86/x64~~ Implement 128-bit Multiply intrinsic for x86/x64 Aug 27, 2021

jeffhandley added needs-further-triage Issue has been initially triaged, but needs deeper consideration or reconsideration and removed untriaged New issue has not been triaged by the area owner labels Sep 30, 2021

jeffhandley added this to the Future milestone Sep 30, 2021

saucecontrol mentioned this issue Apr 19, 2022

[API Proposal]: Math.MulHigh #68207

Open

tannergooding mentioned this issue Sep 14, 2022

Suboptimal x64 codegen for signed Math.BigMul #75594

Open

pentp mentioned this issue Nov 6, 2023

Use BigMul for 32x32=64 in decimal #93345

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Implement 128-bit Multiply intrinsic for x86/x64 #58263

Implement 128-bit Multiply intrinsic for x86/x64 #58263

pentp commented Aug 27, 2021 •

edited

Loading

dotnet-issue-labeler bot commented Aug 27, 2021

ghost commented Aug 27, 2021

Background and motivation

API Proposal

API Usage

jkotas commented Aug 27, 2021

pentp commented Aug 27, 2021

pentp commented Aug 27, 2021 •

edited

Loading

pentp commented Aug 27, 2021

deeprobin commented Nov 13, 2021

huoyaoyuan commented Mar 17, 2022

Implement 128-bit Multiply intrinsic for x86/x64 #58263

Implement 128-bit Multiply intrinsic for x86/x64 #58263

Comments

pentp commented Aug 27, 2021 • edited Loading

Background and motivation

API Proposal

API Usage

dotnet-issue-labeler bot commented Aug 27, 2021

ghost commented Aug 27, 2021

Background and motivation

API Proposal

API Usage

jkotas commented Aug 27, 2021

pentp commented Aug 27, 2021

pentp commented Aug 27, 2021 • edited Loading

pentp commented Aug 27, 2021

deeprobin commented Nov 13, 2021

huoyaoyuan commented Mar 17, 2022

pentp commented Aug 27, 2021 •

edited

Loading

pentp commented Aug 27, 2021 •

edited

Loading