Add DivMod instrinct for intel x86/x64 #27292

Daniel-Svensson · 2018-09-02T20:23:41Z

Please add hw instrinct support for divide on intel platforms.

Rationale and Usage

The intel instruction div is a bit special since 32bit divide takes a 64 dividend and 32bit divisor and returns both quotient and remainder as result.

It would be very useful, not only for speeding up Math.DivRem (https://github.com/dotnet/coreclr/issues/757 ) but especially for several decimal operations which would make it more feasible to move decimal code to C# (and maybe several number functions).
Significant speedup using div in C++ provided significant speedups in https://github.com/dotnet/coreclr/issues/10642)

Example:

The following code perfomes division of a 96bit unsigned number by a 32bit number.
Updating the 96bit number in-pace and returning the remainder
The example is simplified from actual code and contains only the "worst case" execution path.

public uint Div96By32(uint [] num,  uint denominator)
{
  if (System.Runtime.Intrinsics.X86.X86Base.IsSupported)
  {
    uint remainder;
    num[2] = System.Runtime.Intrinsics.X86.X86Base.DivMod(0, num[2], denominator, out remainder);   
    num[1] = System.Runtime.Intrinsics.X86.X86Base.DivMod(remainder, num[1], denominator, out remainder);
    num[0] = System.Runtime.Intrinsics.X86.X86Base.DivMod(remainder, num[0], denominator, out remainder);
    return remainder;
  }
 else { ... }
}

It performs the same logic as in https://github.com/dotnet/corert/blob/d82d460a8530a57e4915060be37fb42c7a661f48/src/System.Private.CoreLib/shared/System/Decimal.DecCalc.cs#L228

That code has 2 64bit divides (much slower, especially in 32bit mode), and several multiplications (workarounds instead of using '%' which would double the executed div instructions currently).

Proposed API

The API might need some design, but I think it makes sense to keep it similar to Math.DivRem as below.

namespace System.Runtime.Intrinsics.X86
{
  public static class X86Base 
  { 
    /// Perform unsigned division of 64 bit (hi, low) by 32bit divisor
    /// returning quotient, and returning remainder as an out parameter
    public uint DivRem(uint hi, uint low, uint divisor, out uint remainder);

    /// Perform unsigned division of 64 bit (hi, low) by 32 bit divisor
    /// returning quotient, and returning remainder as an out parameter
    public int DivRem(int hi, uint low, int divisor, out int remainder);
 
    [Intrinsic]
    public static class X64
    {
        /// Perform unsigned division of 128 bit (hi, low) by 64bit divisor
        /// returning quotient, and returning remainder as an out parameter
        public ulong DivRem(ulong hi, ulong low, ulong divisor, out ulong remainder);

        /// Perform unsigned division of 128 bit (hi, low) by 64 bit divisor
        /// returning quotient, and returning remainder as an out parameter
        public long DivRem(long hi, ulong low, long divisor, out long remainder);
      }
}

Related functionality which would make sense to add.
It could use the instrinct for god performance on x86 platforms.

class System.Math
{
   // unsigned overload of  public static int DivRem (int a, int b, out int result);
   public static uint DivRem (uint a, uint b, out uint result);

   // unsigned overload of  public static int DivRem (int a, int b, out int result);
   public static ulong DivRem (ulong a, ulong b, out ulong result);
}

Details

Full description of instruction can be found in https://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-instruction-set-reference-manual-325383.pdf

Updates

2018-09-04 I've tried to add signed divide, and tried to restructure the post.
I am unsure about which versions would be useful so I am adding both that came to my mind.
2020-06-15
updated to use new X86Base https://github.com/dotnet/runtime/blob/master/src/libraries/System.Private.CoreLib/src/System/Runtime/Intrinsics/X86/X86Base.cs
Removed 32 by 32 bit, and 64 by 64 bit division since it is better to add the missing overloads to Math and recognize them as instrincts

The text was updated successfully, but these errors were encountered:

danmoseley · 2018-09-03T01:26:56Z

@tannergooding

tannergooding · 2018-09-03T17:33:41Z

CC. @CarolEidt, would appreciate your thoughts here.

This falls into the category of "core" instructions (not behind any flag), but which also provide additional functionality that is one or more of:

Platform/Architecture Specific
Not easily mappable to IL or C# language semantics
Potentially useful for certain types of algorithms

There are other instructions that also fall into this category, such as Add with Carry, for which certain native compilers do expose such instrinsic helpers:

tannergooding · 2018-09-03T17:35:47Z

@Daniel-Svensson, could you update the original post to roughly follow the layout given as an example here: https://github.com/dotnet/corefx/blob/master/Documentation/project-docs/api-review-process.md#steps

Could you also update the original post to include the signed-division overloads (which would call IDIV)?

Daniel-Svensson · 2018-09-04T20:13:52Z

@tannergooding I've tried to update the post based on the example which the documentation linked to.

I've added two versions for IDIV since I was a bit unsure about which version you would be interested in.

I've also tried to add namespaces/classes based on some arm instrinct I found.

Regarding Add with Carry and Subtract with Carry are there any issue tracking those?

It would be very useful to have them as cross platform instrincts, 64bit operations could just fall back to 2x 32bit operations. I've have some C++ improvements to decimal arithmetics with 150%+ speedups around and having those operations would be required to have any managed implementation with similar performance)

tannergooding · 2018-09-04T21:39:33Z

I've added two versions for IDIV since I was a bit unsure about which version you would be interested in.

Both the 64-bit / 32-bit, and the 32-bit / 32-bit cases are interesting.

I've also tried to add namespaces/classes based on some arm instrinct I found.

We don't want to have a x64 specific class. Currently we are just exposing them together in the same class and calling the 64-bit only overloads (on a 32-bit box) will throw a PNSE.

Regarding Add with Carry and Subtract with Carry are there any issue tracking those

No, there isn't a tracking issue today.

It would be very useful to have them as cross platform instrincts

We actually have Math.DivRem today (for signed and unsigned versions of Int32 / Int32 and Int64 / Int64). They just don't actually do a single operation today (if the intrinsics are approved, and implemented, we would likely update the existing Math APIs to use those, where possible).

pentp · 2018-09-12T00:08:55Z

I would like to extend this proposal to also include BigMul as mentioned in https://github.com/dotnet/coreclr/issues/17818#issuecomment-395365269.
128-bit DIV and MUL are often needed together in the same place (e.g., System.Decimal).

public static class Base 
{ 
  // 32 * 32 => 64 bit signed multiply
  public uint BigMul(int a, int b, out int hi);
  // 32 * 32 => 64 bit unsigned multiply
  public uint BigMul(uint a, uint b, out uint hi);
  // 64 * 64 => 128 bit signed multiply
  public ulong BigMul(long a, long b, out long hi);
  // 64 * 64 => 128 bit unsigned multiply
  public ulong BigMul(ulong a, ulong b, out ulong hi);
}

jkotas · 2018-09-19T22:25:53Z

We should just make the existing Math.DivRem methods work as intrinsic. These do not need be new HW intrinsics methods in the public surface.

tannergooding · 2018-09-19T22:48:13Z

These do not need be new HW intrinsics methods in the public surface.

I think that is (possibly) worth further discussion.

Even if we update Math.DivRem to be intrinsic in 3.0, it will essentially be "this is maybe intrinsic, if your platform/runtime support it". While x86Base.DivRem would be "this is always intrinsic, for x86 using int/uint and for x64 using int/uint or long/ulong" (the latter of which can be important for certain performance oriented scenarios).

DivRem (like the other instructions that would be put up for proposal to include in the x86Base class) is a case where x86/x64 exposes a single instruction that provides functionality not all platforms may support. In this case, it is a combining div and rem (which on ARM, to my knowledge, there is no single instruction equivalent).

pentp · 2018-09-19T22:56:44Z

There are several issues with the existing Math.DivRem/BigMul methods that are not resolved by making them intrinsic:

No unsigned overloads.
No 128-bit division/multiplication.
No IsSupported mechanism to detect if they are intrinsics or not.

And unlike IDIV, a general purpose division method has dividend size equal to quotient size, which results in long division being performed with a 128-bit IDIV on x64 with a 2-4x latency and 4-15x throughput penalty compared to 64-bit IDIV (x86 goes through a slow helper call).
A general purpose 128-bit division method would have to be a slow helper call on all platforms and pretty much useless (e.g., Decimal has special division methods already for 96/64, 96/32, etc.).

Maybe these intrinsics don't need to be public at first, but they would be useful even just to implement the existing Math functions + Decimal math. But making them internal would limit their usefulness.

saucecontrol · 2019-02-09T20:47:22Z

A couple more useful additions: BSF and BSR

public abstract class Base 
{ 
    public abstract class X64
    {
        /// <summary>
        /// Search the mask data from least significant bit (LSB) to the most significant bit (MSB) for a set bit (1).
        /// </summary>
        public static int BitScanForward(long mask);
        public static int BitScanForward(ulong mask);

        /// <summary>
        /// Search the mask data from most significant bit (MSB) to least significant bit (LSB) for a set bit (1).
        /// </summary>
        public static int BitScanReverse(long x);
        public static int BitScanReverse(ulong x);
    }

    /// <summary>
    /// Search the mask data from least significant bit (LSB) to the most significant bit (MSB) for a set bit (1).
    /// </summary>
    public static int BitScanForward(int mask);
    public static int BitScanForward(uint mask);

    /// <summary>
    /// Search the mask data from most significant bit (MSB) to least significant bit (LSB) for a set bit (1).
    /// </summary>
    public static int BitScanReverse(int x);
    public static int BitScanReverse(uint x);
}

There are several places in CoreFX that have been updated recently to use the TZCNT and LZCNT HWIntrinsics, but they must fall back to a slow path for processors not supporting the BMI1 or LZCNT ISAs, respectively, because the JIT requires the corresponding CPUID flag be set.

However, as described here, TZCNT is encoded as REP BSF, which means it can execute on downlevel processors (with different behavior for a 0 input).

Software built with an AOT compiler can take advantage of the instruction compatibility to have a single code path when the TZCNT/BSF behavior is known to be the same, but we can't do the same with the JIT. On the other hand, we could code paths for both instructions if BSF were exposed explicitly.

Likewise, in place of LZCNT, we could use 31 ^ BSR(mask) as a fallback path.

Given the number of processors not supporting the BMI1 and LZCNT ISAs, including the modern Intel Goldmont architecture, it would be worthwhile to be able to code both paths.

Also related: https://github.com/dotnet/corefx/issues/32269

Daniel-Svensson · 2019-06-13T10:51:30Z

I would like to extend this proposal to also include BigMul as mentioned in dotnet/coreclr#17818 (comment).

@pentp I think that would be a better fit to add to Math to complement the currently available signed versions (which I guess might be in a different issue) or wherever new math operations go instead of making it platform specific. It could then be treated as an instrinct by the runtime for optimal performance. I tried to search for such an issue but did not find it.

pentp · 2019-06-13T22:45:38Z

128-bit MUL does not really fit into the existing Math class - it can only be an intrinsic on some platforms, while on others (e.g. 32-bit x86) the performance falls off a cliff. It follows that an IsSupported check is required, which also does not fit into Math, as commented above.

tannergooding · 2019-06-13T23:13:42Z

A Mul128(long, long) is really no different from FusedMultuplyAdd.

For C/C++ they expose a FP_FAST_FMA macro for this.

I dont think it would be unreasonable, or out of scope, to expose similar boolean properties for a small selection of Math APIs, like FusedMultiplyAdd or Mul128

Daniel-Svensson · 2020-06-15T18:22:53Z

I've updated the description to match that there is now a X86Base instrinct class.
@tannergooding is there anything else that should/can be changed to increase the likelihood of getting this included for 5.0?

tannergooding · 2020-06-15T18:45:38Z

I can mark it ready-for-review now but I don't know if this will get on the backlog before we lock everything down.

The API review will also need to go over whether this is actually worthwhile to expose. As @jkotas indicated, this may be sufficiently covered (for the most common cases) via Math.DivRem and we should just optimize it (which I believe is dependent on some work @CarolEidt has been doing around multi-reg returns).

The remaining reason for the intrinsic is to support 128-bit / 64-bit = 64-bit and 64-bit / 32-bit = 32-bit. However, it might also be possible to just handle those via new "general-purpose" Math APIs.

Thaina · 2021-03-15T10:21:43Z

In addition to DivMod I would like to also propose Absolute and IntFraction. I don't know it exist in low level instruction but it would be convenient to have

int abs = Math.Abs(value,out int sign); // int for -1/0/1
long l = Math.IntFraction(123.456f,out float f); // 123 and 0.456f
l = Math.IntFraction(123.456,out double d); // 123 and 0.456

Is it possible?

tannergooding · 2021-03-15T15:21:33Z

Is it possible?

Those are possible but they should be distinct proposals following the API proposal issue template: https://github.com/dotnet/runtime/issues/new?assignees=&labels=api-suggestion&template=02_api_proposal.md

The APIs you mention are also different in that they aren't intrinsics supported by hardware.
IntFraction is also better known as ModF and is internally supported but not publicly exposed: https://source.dot.net/#System.Private.CoreLib/src/System/Math.CoreCLR.cs,135

bartonjs · 2021-05-27T18:42:39Z

Video

Since the original proposal, the return+out style for intrinsics has changed to named tuples (Quotient, Remainder) for these.
"hi" should be "upper", "low" should be "lower"
The canonical order in these APIs is (lower, upper), not (upper, lower).
Added n(u)int as well
The signedness of things is probably correct, but if not, it's an API review typo and should just be fixed as appropriate.

namespace System.Runtime.Intrinsics.X86
{
  partial class X86Base 
  { 
    public (uint Quotient, uint Remainder) DivRem(uint lower, uint upper, uint divisor);

    public (int Quotient, int Remainder) DivRem(uint lower, int upper, int divisor);

    public (nuint Quotient, nuint Remainder) DivRem(nuint lower, nuint upper, nuint divisor);

    public (nint Quotient, nint Remainder) DivRem(nuint lower, nint upper, nint divisor);
 
    [Intrinsic]
    public static class X64
    {
        public (ulong Quotient, ulong Remainder) DivRem(ulong lower, ulong upper, ulong divisor);

        public (long Quotient, long Remainder) DivRem(ulong lower, long upper, long divisor);
    }
}

pentp · 2021-08-27T13:31:44Z

I added the related MUL intrinsics API proposal: #58263

deeprobin · 2022-02-04T19:09:18Z

With the implementation of this issue, should the DivRem¹ methods in Math also be changed to use the intrinsic methods if they are available on the corresponding platform?

And does anything need to be changed/added to preview features (e.g. generic math)?

.NET 6 documentation - DivRem - System.Runtime.dll ↩

huoyaoyuan · 2022-03-08T22:34:45Z

I looked into this, but found it's not easy, since the only tuple-returning intrinsic was CPUID, which is not optimal in implementation (calls a function).

Ideally, the JIT should be updated to handle multi-reg returning instructions better, and the LSRA can be tuned to handle instructions with fixed register input (namely DIV/IDIV), which can also improve the codegen for BigMul.

@tannergooding This one is too hard for community contribution. Maybe some JIT expert should also be called for this.

deeprobin · 2022-03-09T05:29:45Z

@huoyaoyuan
I've seen that too, that we return a ValueTuple. And as far as I know CPUID is also an InternalCall and not an Intrinsic or?
We could try to mimic a ValueTuple, but I'm not sure if that would work so well, since ValueTuple has the StructLayout Auto and not Sequential.

Alternatively, we could create another private static intrinsic which is easier to represent (for ex. using a ptr) and then which is then exposed via the proposed intrinsic

@tannergooding, what do you think?

tannergooding · 2022-03-09T18:03:00Z

The JIT already has support for correctly handling multi-register intrinsics. It was added by Carol Eidt a while back.

This support has already been utilized for other intrinsics such as #64864 and the same would need to apply to DivRem.

deeprobin · 2022-03-09T18:55:16Z

@tannergooding
Thanks for the info.

@huoyaoyuan
Do you want to create a pull request (if Tanner agrees, of course) or should I do it?

huoyaoyuan · 2022-03-10T11:32:09Z

If such mechanism has already been used, I should be able to complete this. I have already done managed parts.

deeprobin · 2023-07-05T08:40:35Z

@stephentoub Why was this reopened? I thought it is implemented?

stephentoub · 2023-07-05T14:03:56Z

Why was this reopened? I thought it is implemented?

Because it's currently marked as:

[RequiresPreviewFeatures("DivRem is in preview.")]

#82221

pentp · 2023-07-06T14:56:47Z

The remaining work is being tracked in #82194, so this one could still be closed?

msftgits transferred this issue from dotnet/corefx Jan 31, 2020

msftgits added this to the 5.0 milestone Jan 31, 2020

maryamariyan added the untriaged New issue has not been triaged by the area owner label Feb 23, 2020

tannergooding added arch-x64 and removed untriaged New issue has not been triaged by the area owner labels Mar 25, 2020

tannergooding modified the milestones: 5.0.0, Future Jun 23, 2020

tannergooding added api-ready-for-review API is ready for review, it is NOT ready for implementation and removed api-suggestion Early API idea and discussion, it is NOT ready for implementation labels Mar 15, 2021

bartonjs added api-approved API was approved in API review, it can be implemented and removed api-ready-for-review API is ready for review, it is NOT ready for implementation labels May 27, 2021

pentp mentioned this issue Aug 27, 2021

Implement 128-bit Multiply intrinsic for x86/x64 #58263

Open

huoyaoyuan mentioned this issue Mar 13, 2022

Implement DivRem intrinsic for X86 #66551

Merged

ghost added the in-pr There is an active PR which will close this issue when it is merged label Mar 13, 2022

tannergooding mentioned this issue Sep 14, 2022

Suboptimal x64 codegen for signed Math.BigMul #75594

Open

huoyaoyuan mentioned this issue Jan 30, 2023

Need _udiv128 intrinsic as a .Net API, eg, Math.BigDivRem. #81337

Closed

kunalspathak closed this as completed in #66551 Feb 15, 2023

ghost removed the in-pr There is an active PR which will close this issue when it is merged label Feb 15, 2023

jkotas mentioned this issue Feb 16, 2023

Mark DivRem with RequiresPreviewFeatures #82221

Merged

stephentoub reopened this Feb 16, 2023

stephentoub closed this as completed Jul 6, 2023

ghost locked as resolved and limited conversation to collaborators Aug 5, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add DivMod instrinct for intel x86/x64 #27292

Add DivMod instrinct for intel x86/x64 #27292

Daniel-Svensson commented Sep 2, 2018 •

edited

Loading

danmoseley commented Sep 3, 2018

tannergooding commented Sep 3, 2018 •

edited

Loading

tannergooding commented Sep 3, 2018

Daniel-Svensson commented Sep 4, 2018 •

edited

Loading

tannergooding commented Sep 4, 2018

pentp commented Sep 12, 2018 •

edited

Loading

jkotas commented Sep 19, 2018

tannergooding commented Sep 19, 2018

pentp commented Sep 19, 2018 •

edited

Loading

saucecontrol commented Feb 9, 2019

Daniel-Svensson commented Jun 13, 2019

pentp commented Jun 13, 2019

tannergooding commented Jun 13, 2019

Daniel-Svensson commented Jun 15, 2020

tannergooding commented Jun 15, 2020

Thaina commented Mar 15, 2021

tannergooding commented Mar 15, 2021

bartonjs commented May 27, 2021 •

edited by dotnet-api-review bot

Loading

pentp commented Aug 27, 2021

deeprobin commented Feb 4, 2022

huoyaoyuan commented Mar 8, 2022

deeprobin commented Mar 9, 2022

tannergooding commented Mar 9, 2022

deeprobin commented Mar 9, 2022

huoyaoyuan commented Mar 10, 2022

deeprobin commented Jul 5, 2023 •

edited

Loading

stephentoub commented Jul 5, 2023

pentp commented Jul 6, 2023

Add DivMod instrinct for intel x86/x64 #27292

Add DivMod instrinct for intel x86/x64 #27292

Comments

Daniel-Svensson commented Sep 2, 2018 • edited Loading

Rationale and Usage

Proposed API

Details

Updates

danmoseley commented Sep 3, 2018

tannergooding commented Sep 3, 2018 • edited Loading

tannergooding commented Sep 3, 2018

Daniel-Svensson commented Sep 4, 2018 • edited Loading

tannergooding commented Sep 4, 2018

pentp commented Sep 12, 2018 • edited Loading

jkotas commented Sep 19, 2018

tannergooding commented Sep 19, 2018

pentp commented Sep 19, 2018 • edited Loading

saucecontrol commented Feb 9, 2019

Daniel-Svensson commented Jun 13, 2019

pentp commented Jun 13, 2019

tannergooding commented Jun 13, 2019

Daniel-Svensson commented Jun 15, 2020

tannergooding commented Jun 15, 2020

Thaina commented Mar 15, 2021

tannergooding commented Mar 15, 2021

bartonjs commented May 27, 2021 • edited by dotnet-api-review bot Loading

pentp commented Aug 27, 2021

deeprobin commented Feb 4, 2022

Footnotes

huoyaoyuan commented Mar 8, 2022

deeprobin commented Mar 9, 2022

tannergooding commented Mar 9, 2022

deeprobin commented Mar 9, 2022

huoyaoyuan commented Mar 10, 2022

deeprobin commented Jul 5, 2023 • edited Loading

stephentoub commented Jul 5, 2023

pentp commented Jul 6, 2023

Daniel-Svensson commented Sep 2, 2018 •

edited

Loading

tannergooding commented Sep 3, 2018 •

edited

Loading

Daniel-Svensson commented Sep 4, 2018 •

edited

Loading

pentp commented Sep 12, 2018 •

edited

Loading

pentp commented Sep 19, 2018 •

edited

Loading

bartonjs commented May 27, 2021 •

edited by dotnet-api-review bot

Loading

deeprobin commented Jul 5, 2023 •

edited

Loading