Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add DivMod instrinct for intel x86/x64 #27292

Closed
Daniel-Svensson opened this issue Sep 2, 2018 · 28 comments · Fixed by #66551
Closed

Add DivMod instrinct for intel x86/x64 #27292

Daniel-Svensson opened this issue Sep 2, 2018 · 28 comments · Fixed by #66551
Labels
api-approved API was approved in API review, it can be implemented arch-x64 area-System.Runtime.Intrinsics
Milestone

Comments

@Daniel-Svensson
Copy link
Contributor

Daniel-Svensson commented Sep 2, 2018

Please add hw instrinct support for divide on intel platforms.

Rationale and Usage

The intel instruction div is a bit special since 32bit divide takes a 64 dividend and 32bit divisor and returns both quotient and remainder as result.

It would be very useful, not only for speeding up Math.DivRem (https://github.com/dotnet/coreclr/issues/757 ) but especially for several decimal operations which would make it more feasible to move decimal code to C# (and maybe several number functions).
Significant speedup using div in C++ provided significant speedups in https://github.com/dotnet/coreclr/issues/10642)

Example:

The following code perfomes division of a 96bit unsigned number by a 32bit number.
Updating the 96bit number in-pace and returning the remainder
The example is simplified from actual code and contains only the "worst case" execution path.

public uint Div96By32(uint [] num,  uint denominator)
{
  if (System.Runtime.Intrinsics.X86.X86Base.IsSupported)
  {
    uint remainder;
    num[2] = System.Runtime.Intrinsics.X86.X86Base.DivMod(0, num[2], denominator, out remainder);   
    num[1] = System.Runtime.Intrinsics.X86.X86Base.DivMod(remainder, num[1], denominator, out remainder);
    num[0] = System.Runtime.Intrinsics.X86.X86Base.DivMod(remainder, num[0], denominator, out remainder);
    return remainder;
  }
 else { ... }
}

It performs the same logic as in https://github.com/dotnet/corert/blob/d82d460a8530a57e4915060be37fb42c7a661f48/src/System.Private.CoreLib/shared/System/Decimal.DecCalc.cs#L228

That code has 2 64bit divides (much slower, especially in 32bit mode), and several multiplications (workarounds instead of using '%' which would double the executed div instructions currently).

Proposed API

The API might need some design, but I think it makes sense to keep it similar to Math.DivRem as below.

namespace System.Runtime.Intrinsics.X86
{
  public static class X86Base 
  { 
    /// Perform unsigned division of 64 bit (hi, low) by 32bit divisor
    /// returning quotient, and returning remainder as an out parameter
    public uint DivRem(uint hi, uint low, uint divisor, out uint remainder);

    /// Perform unsigned division of 64 bit (hi, low) by 32 bit divisor
    /// returning quotient, and returning remainder as an out parameter
    public int DivRem(int hi, uint low, int divisor, out int remainder);
 
    [Intrinsic]
    public static class X64
    {
        /// Perform unsigned division of 128 bit (hi, low) by 64bit divisor
        /// returning quotient, and returning remainder as an out parameter
        public ulong DivRem(ulong hi, ulong low, ulong divisor, out ulong remainder);

        /// Perform unsigned division of 128 bit (hi, low) by 64 bit divisor
        /// returning quotient, and returning remainder as an out parameter
        public long DivRem(long hi, ulong low, long divisor, out long remainder);
      }
}

Related functionality which would make sense to add.
It could use the instrinct for god performance on x86 platforms.

class System.Math
{
   // unsigned overload of  public static int DivRem (int a, int b, out int result);
   public static uint DivRem (uint a, uint b, out uint result);

   // unsigned overload of  public static int DivRem (int a, int b, out int result);
   public static ulong DivRem (ulong a, ulong b, out ulong result);
}

Details

Full description of instruction can be found in https://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-instruction-set-reference-manual-325383.pdf

Updates

@danmoseley
Copy link
Member

@tannergooding

@tannergooding
Copy link
Member

tannergooding commented Sep 3, 2018

CC. @CarolEidt, would appreciate your thoughts here.

This falls into the category of "core" instructions (not behind any flag), but which also provide additional functionality that is one or more of:

  • Platform/Architecture Specific
  • Not easily mappable to IL or C# language semantics
  • Potentially useful for certain types of algorithms

There are other instructions that also fall into this category, such as Add with Carry, for which certain native compilers do expose such instrinsic helpers:

@tannergooding
Copy link
Member

@Daniel-Svensson, could you update the original post to roughly follow the layout given as an example here: https://github.com/dotnet/corefx/blob/master/Documentation/project-docs/api-review-process.md#steps

Could you also update the original post to include the signed-division overloads (which would call IDIV)?

@Daniel-Svensson
Copy link
Contributor Author

Daniel-Svensson commented Sep 4, 2018

@tannergooding I've tried to update the post based on the example which the documentation linked to.

I've added two versions for IDIV since I was a bit unsure about which version you would be interested in.

I've also tried to add namespaces/classes based on some arm instrinct I found.

Regarding Add with Carry and Subtract with Carry are there any issue tracking those?

It would be very useful to have them as cross platform instrincts, 64bit operations could just fall back to 2x 32bit operations. I've have some C++ improvements to decimal arithmetics with 150%+ speedups around and having those operations would be required to have any managed implementation with similar performance)

@tannergooding
Copy link
Member

I've added two versions for IDIV since I was a bit unsure about which version you would be interested in.

Both the 64-bit / 32-bit, and the 32-bit / 32-bit cases are interesting.

I've also tried to add namespaces/classes based on some arm instrinct I found.

We don't want to have a x64 specific class. Currently we are just exposing them together in the same class and calling the 64-bit only overloads (on a 32-bit box) will throw a PNSE.

Regarding Add with Carry and Subtract with Carry are there any issue tracking those

No, there isn't a tracking issue today.

It would be very useful to have them as cross platform instrincts

We actually have Math.DivRem today (for signed and unsigned versions of Int32 / Int32 and Int64 / Int64). They just don't actually do a single operation today (if the intrinsics are approved, and implemented, we would likely update the existing Math APIs to use those, where possible).

@pentp
Copy link
Contributor

pentp commented Sep 12, 2018

I would like to extend this proposal to also include BigMul as mentioned in https://github.com/dotnet/coreclr/issues/17818#issuecomment-395365269.
128-bit DIV and MUL are often needed together in the same place (e.g., System.Decimal).

public static class Base 
{ 
  // 32 * 32 => 64 bit signed multiply
  public uint BigMul(int a, int b, out int hi);
  // 32 * 32 => 64 bit unsigned multiply
  public uint BigMul(uint a, uint b, out uint hi);
  // 64 * 64 => 128 bit signed multiply
  public ulong BigMul(long a, long b, out long hi);
  // 64 * 64 => 128 bit unsigned multiply
  public ulong BigMul(ulong a, ulong b, out ulong hi);
}

@jkotas
Copy link
Member

jkotas commented Sep 19, 2018

We should just make the existing Math.DivRem methods work as intrinsic. These do not need be new HW intrinsics methods in the public surface.

@tannergooding
Copy link
Member

These do not need be new HW intrinsics methods in the public surface.

I think that is (possibly) worth further discussion.

Even if we update Math.DivRem to be intrinsic in 3.0, it will essentially be "this is maybe intrinsic, if your platform/runtime support it". While x86Base.DivRem would be "this is always intrinsic, for x86 using int/uint and for x64 using int/uint or long/ulong" (the latter of which can be important for certain performance oriented scenarios).

DivRem (like the other instructions that would be put up for proposal to include in the x86Base class) is a case where x86/x64 exposes a single instruction that provides functionality not all platforms may support. In this case, it is a combining div and rem (which on ARM, to my knowledge, there is no single instruction equivalent).

@pentp
Copy link
Contributor

pentp commented Sep 19, 2018

There are several issues with the existing Math.DivRem/BigMul methods that are not resolved by making them intrinsic:

  • No unsigned overloads.
  • No 128-bit division/multiplication.
  • No IsSupported mechanism to detect if they are intrinsics or not.

And unlike IDIV, a general purpose division method has dividend size equal to quotient size, which results in long division being performed with a 128-bit IDIV on x64 with a 2-4x latency and 4-15x throughput penalty compared to 64-bit IDIV (x86 goes through a slow helper call).
A general purpose 128-bit division method would have to be a slow helper call on all platforms and pretty much useless (e.g., Decimal has special division methods already for 96/64, 96/32, etc.).

Maybe these intrinsics don't need to be public at first, but they would be useful even just to implement the existing Math functions + Decimal math. But making them internal would limit their usefulness.

@saucecontrol
Copy link
Member

A couple more useful additions: BSF and BSR

public abstract class Base 
{ 
    public abstract class X64
    {
        /// <summary>
        /// Search the mask data from least significant bit (LSB) to the most significant bit (MSB) for a set bit (1).
        /// </summary>
        public static int BitScanForward(long mask);
        public static int BitScanForward(ulong mask);

        /// <summary>
        /// Search the mask data from most significant bit (MSB) to least significant bit (LSB) for a set bit (1).
        /// </summary>
        public static int BitScanReverse(long x);
        public static int BitScanReverse(ulong x);
    }

    /// <summary>
    /// Search the mask data from least significant bit (LSB) to the most significant bit (MSB) for a set bit (1).
    /// </summary>
    public static int BitScanForward(int mask);
    public static int BitScanForward(uint mask);

    /// <summary>
    /// Search the mask data from most significant bit (MSB) to least significant bit (LSB) for a set bit (1).
    /// </summary>
    public static int BitScanReverse(int x);
    public static int BitScanReverse(uint x);
}

There are several places in CoreFX that have been updated recently to use the TZCNT and LZCNT HWIntrinsics, but they must fall back to a slow path for processors not supporting the BMI1 or LZCNT ISAs, respectively, because the JIT requires the corresponding CPUID flag be set.

However, as described here, TZCNT is encoded as REP BSF, which means it can execute on downlevel processors (with different behavior for a 0 input).

Software built with an AOT compiler can take advantage of the instruction compatibility to have a single code path when the TZCNT/BSF behavior is known to be the same, but we can't do the same with the JIT. On the other hand, we could code paths for both instructions if BSF were exposed explicitly.

Likewise, in place of LZCNT, we could use 31 ^ BSR(mask) as a fallback path.

Given the number of processors not supporting the BMI1 and LZCNT ISAs, including the modern Intel Goldmont architecture, it would be worthwhile to be able to code both paths.

Also related: https://github.com/dotnet/corefx/issues/32269

@Daniel-Svensson
Copy link
Contributor Author

I would like to extend this proposal to also include BigMul as mentioned in dotnet/coreclr#17818 (comment).

@pentp I think that would be a better fit to add to Math to complement the currently available signed versions (which I guess might be in a different issue) or wherever new math operations go instead of making it platform specific. It could then be treated as an instrinct by the runtime for optimal performance. I tried to search for such an issue but did not find it.

@pentp
Copy link
Contributor

pentp commented Jun 13, 2019

128-bit MUL does not really fit into the existing Math class - it can only be an intrinsic on some platforms, while on others (e.g. 32-bit x86) the performance falls off a cliff. It follows that an IsSupported check is required, which also does not fit into Math, as commented above.

@tannergooding
Copy link
Member

A Mul128(long, long) is really no different from FusedMultuplyAdd.

For C/C++ they expose a FP_FAST_FMA macro for this.

I dont think it would be unreasonable, or out of scope, to expose similar boolean properties for a small selection of Math APIs, like FusedMultiplyAdd or Mul128

@msftgits msftgits transferred this issue from dotnet/corefx Jan 31, 2020
@msftgits msftgits added this to the 5.0 milestone Jan 31, 2020
@maryamariyan maryamariyan added the untriaged New issue has not been triaged by the area owner label Feb 23, 2020
@tannergooding tannergooding added arch-x64 and removed untriaged New issue has not been triaged by the area owner labels Mar 25, 2020
@Daniel-Svensson
Copy link
Contributor Author

I've updated the description to match that there is now a X86Base instrinct class.
@tannergooding is there anything else that should/can be changed to increase the likelihood of getting this included for 5.0?

@tannergooding
Copy link
Member

I can mark it ready-for-review now but I don't know if this will get on the backlog before we lock everything down.

The API review will also need to go over whether this is actually worthwhile to expose. As @jkotas indicated, this may be sufficiently covered (for the most common cases) via Math.DivRem and we should just optimize it (which I believe is dependent on some work @CarolEidt has been doing around multi-reg returns).

The remaining reason for the intrinsic is to support 128-bit / 64-bit = 64-bit and 64-bit / 32-bit = 32-bit. However, it might also be possible to just handle those via new "general-purpose" Math APIs.

@tannergooding tannergooding modified the milestones: 5.0.0, Future Jun 23, 2020
@Thaina
Copy link

Thaina commented Mar 15, 2021

In addition to DivMod I would like to also propose Absolute and IntFraction. I don't know it exist in low level instruction but it would be convenient to have

int abs = Math.Abs(value,out int sign); // int for -1/0/1
long l = Math.IntFraction(123.456f,out float f); // 123 and 0.456f
l = Math.IntFraction(123.456,out double d); // 123 and 0.456

Is it possible?

@tannergooding tannergooding added api-ready-for-review API is ready for review, it is NOT ready for implementation and removed api-suggestion Early API idea and discussion, it is NOT ready for implementation labels Mar 15, 2021
@tannergooding
Copy link
Member

Is it possible?

Those are possible but they should be distinct proposals following the API proposal issue template: https://github.com/dotnet/runtime/issues/new?assignees=&labels=api-suggestion&template=02_api_proposal.md

The APIs you mention are also different in that they aren't intrinsics supported by hardware.
IntFraction is also better known as ModF and is internally supported but not publicly exposed: https://source.dot.net/#System.Private.CoreLib/src/System/Math.CoreCLR.cs,135

@bartonjs
Copy link
Member

bartonjs commented May 27, 2021

Video

  • Since the original proposal, the return+out style for intrinsics has changed to named tuples (Quotient, Remainder) for these.
  • "hi" should be "upper", "low" should be "lower"
  • The canonical order in these APIs is (lower, upper), not (upper, lower).
  • Added n(u)int as well
  • The signedness of things is probably correct, but if not, it's an API review typo and should just be fixed as appropriate.
namespace System.Runtime.Intrinsics.X86
{
  partial class X86Base 
  { 
    public (uint Quotient, uint Remainder) DivRem(uint lower, uint upper, uint divisor);

    public (int Quotient, int Remainder) DivRem(uint lower, int upper, int divisor);

    public (nuint Quotient, nuint Remainder) DivRem(nuint lower, nuint upper, nuint divisor);

    public (nint Quotient, nint Remainder) DivRem(nuint lower, nint upper, nint divisor);
 
    [Intrinsic]
    public static class X64
    {
        public (ulong Quotient, ulong Remainder) DivRem(ulong lower, ulong upper, ulong divisor);

        public (long Quotient, long Remainder) DivRem(ulong lower, long upper, long divisor);
    }
}

@bartonjs bartonjs added api-approved API was approved in API review, it can be implemented and removed api-ready-for-review API is ready for review, it is NOT ready for implementation labels May 27, 2021
@pentp
Copy link
Contributor

pentp commented Aug 27, 2021

I added the related MUL intrinsics API proposal: #58263

@deeprobin
Copy link
Contributor

With the implementation of this issue, should the DivRem1 methods in Math also be changed to use the intrinsic methods if they are available on the corresponding platform?

And does anything need to be changed/added to preview features (e.g. generic math)?

Footnotes

  1. .NET 6 documentation - DivRem - System.Runtime.dll

@huoyaoyuan
Copy link
Member

I looked into this, but found it's not easy, since the only tuple-returning intrinsic was CPUID, which is not optimal in implementation (calls a function).

Ideally, the JIT should be updated to handle multi-reg returning instructions better, and the LSRA can be tuned to handle instructions with fixed register input (namely DIV/IDIV), which can also improve the codegen for BigMul.

@tannergooding This one is too hard for community contribution. Maybe some JIT expert should also be called for this.

@deeprobin
Copy link
Contributor

@huoyaoyuan
I've seen that too, that we return a ValueTuple. And as far as I know CPUID is also an InternalCall and not an Intrinsic or?
We could try to mimic a ValueTuple, but I'm not sure if that would work so well, since ValueTuple has the StructLayout Auto and not Sequential.

Alternatively, we could create another private static intrinsic which is easier to represent (for ex. using a ptr) and then which is then exposed via the proposed intrinsic

@tannergooding, what do you think?

@tannergooding
Copy link
Member

The JIT already has support for correctly handling multi-register intrinsics. It was added by Carol Eidt a while back.

This support has already been utilized for other intrinsics such as #64864 and the same would need to apply to DivRem.

@deeprobin
Copy link
Contributor

@tannergooding
Thanks for the info.

@huoyaoyuan
Do you want to create a pull request (if Tanner agrees, of course) or should I do it?

@huoyaoyuan
Copy link
Member

If such mechanism has already been used, I should be able to complete this. I have already done managed parts.

@ghost ghost added the in-pr There is an active PR which will close this issue when it is merged label Mar 13, 2022
@ghost ghost removed the in-pr There is an active PR which will close this issue when it is merged label Feb 15, 2023
@stephentoub stephentoub reopened this Feb 16, 2023
@deeprobin
Copy link
Contributor

deeprobin commented Jul 5, 2023

@stephentoub Why was this reopened? I thought it is implemented?

@stephentoub
Copy link
Member

Why was this reopened? I thought it is implemented?

Because it's currently marked as:

[RequiresPreviewFeatures("DivRem is in preview.")]

#82221

@pentp
Copy link
Contributor

pentp commented Jul 6, 2023

The remaining work is being tracked in #82194, so this one could still be closed?

@ghost ghost locked as resolved and limited conversation to collaborators Aug 5, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
api-approved API was approved in API review, it can be implemented arch-x64 area-System.Runtime.Intrinsics
Projects
None yet
Development

Successfully merging a pull request may close this issue.