
RyuJIT: x*2 -> x+x; x*1 -> x (fp) #33024

Merged: 15 commits into dotnet:master on May 6, 2020

Conversation

@EgorBo (Member) commented Mar 1, 2020:

Just a small peephole optimization for floats/doubles: x * 2 -> x + x. I decided to implement it in lowering, just like LLVM does (it implements this in its DAG* passes). In general, mulss/sd and addss/sd are the same in terms of latency/throughput on most CPUs, but the add doesn't need to load the 2.0 constant from memory.
UPD: also x * 1 -> x.

static float MultiplyBy2(float x) => x * 2; // or "2 * x" -> "x + x"
static float MultiplyBy1(float x) => x * 1; // or "1 * x" -> "x"

Current codegen

; Method MultiplyBy2(float):float
       C5F877               vzeroupper 
       C5FA590505000000     vmulss   xmm0, xmm0, dword ptr [reloc @RWD00]
       C3                   ret      
RWD00  dd	40000000h
; Total bytes of code: 12


; Method MultiplyBy1(float):float
       C5F877               vzeroupper 
       C5FA590505000000     vmulss   xmm0, xmm0, dword ptr [reloc @RWD00]
       C3                   ret      
RWD00  dd	3F800000h
; Total bytes of code: 12

New codegen

; Method MultiplyBy2(float):float
       C5F877               vzeroupper 
       C5FA58C0             vaddss   xmm0, xmm0, xmm0
       C3                   ret      
; Total bytes of code: 8


; Method MultiplyBy1(float):float
       C5F877               vzeroupper 
       C3                   ret      
; Total bytes of code: 4

Jit-diff (-f -pmi)

Found 5 files with textual diffs.

Summary of Code Size diffs:
(Lower is better)

Total bytes of diff: -118 (-0.00% of base)
    diff is an improvement.

Top file improvements (bytes):
         -55 : System.Private.CoreLib.dasm (-0.00% of base)
         -24 : runincontext.dasm (-0.22% of base)
         -24 : System.Runtime.Numerics.dasm (-0.03% of base)
          -8 : System.Drawing.Primitives.dasm (-0.02% of base)
          -7 : Microsoft.VisualBasic.Core.dasm (-0.00% of base)

5 total files with Code Size differences (5 improved, 0 regressed), 261 unchanged.

Top method regressions (bytes):
           1 ( 0.17% of base) : System.Private.CoreLib.dasm - System.Globalization.CalendricalCalculationsHelper:EquationOfTime(double):double

Top method improvements (bytes):
         -36 (-8.51% of base) : System.Private.CoreLib.dasm - System.Numerics.Matrix4x4:CreateFromQuaternion(System.Numerics.Quaternion):System.Numerics.Matrix4x4
         -24 (-0.85% of base) : runincontext.dasm - TestRunner:DoWorkStress():int:this
         -16 (-2.93% of base) : System.Runtime.Numerics.dasm - System.Numerics.Complex:Sqrt(System.Numerics.Complex):System.Numerics.Complex
          -8 (-12.12% of base) : System.Drawing.Primitives.dasm - System.Drawing.RectangleF:Inflate(float,float):this
          -8 (-20.51% of base) : System.Private.CoreLib.dasm - System.Diagnostics.Stopwatch:GetElapsedDateTimeTicks():long:this
          -8 (-2.84% of base) : System.Runtime.Numerics.dasm - System.Numerics.Complex:Tan(System.Numerics.Complex):System.Numerics.Complex
          -4 (-1.12% of base) : Microsoft.VisualBasic.Core.dasm - Microsoft.VisualBasic.Financial:SYD(double,double,double,double):double
          -4 (-0.84% of base) : System.Private.CoreLib.dasm - System.Numerics.Matrix4x4:CreatePerspective(float,float,float,float):System.Numerics.Matrix4x4
          -4 (-0.71% of base) : System.Private.CoreLib.dasm - System.Numerics.Matrix4x4:CreatePerspectiveOffCenter(float,float,float,float,float,float):System.Numerics.Matrix4x4
          -4 (-4.94% of base) : System.Private.CoreLib.dasm - System.Numerics.Vector2:Reflect(System.Numerics.Vector2,System.Numerics.Vector2):System.Numerics.Vector2
          -3 (-0.40% of base) : Microsoft.VisualBasic.Core.dasm - Microsoft.VisualBasic.Financial:Rate(double,double,double,double,int,double):double

Top method regressions (percentages):
           1 ( 0.17% of base) : System.Private.CoreLib.dasm - System.Globalization.CalendricalCalculationsHelper:EquationOfTime(double):double

Top method improvements (percentages):
          -8 (-20.51% of base) : System.Private.CoreLib.dasm - System.Diagnostics.Stopwatch:GetElapsedDateTimeTicks():long:this
          -8 (-12.12% of base) : System.Drawing.Primitives.dasm - System.Drawing.RectangleF:Inflate(float,float):this
         -36 (-8.51% of base) : System.Private.CoreLib.dasm - System.Numerics.Matrix4x4:CreateFromQuaternion(System.Numerics.Quaternion):System.Numerics.Matrix4x4
          -4 (-4.94% of base) : System.Private.CoreLib.dasm - System.Numerics.Vector2:Reflect(System.Numerics.Vector2,System.Numerics.Vector2):System.Numerics.Vector2
         -16 (-2.93% of base) : System.Runtime.Numerics.dasm - System.Numerics.Complex:Sqrt(System.Numerics.Complex):System.Numerics.Complex
          -8 (-2.84% of base) : System.Runtime.Numerics.dasm - System.Numerics.Complex:Tan(System.Numerics.Complex):System.Numerics.Complex
          -4 (-1.12% of base) : Microsoft.VisualBasic.Core.dasm - Microsoft.VisualBasic.Financial:SYD(double,double,double,double):double
         -24 (-0.85% of base) : runincontext.dasm - TestRunner:DoWorkStress():int:this
          -4 (-0.84% of base) : System.Private.CoreLib.dasm - System.Numerics.Matrix4x4:CreatePerspective(float,float,float,float):System.Numerics.Matrix4x4
          -4 (-0.71% of base) : System.Private.CoreLib.dasm - System.Numerics.Matrix4x4:CreatePerspectiveOffCenter(float,float,float,float,float,float):System.Numerics.Matrix4x4
          -3 (-0.40% of base) : Microsoft.VisualBasic.Core.dasm - Microsoft.VisualBasic.Financial:Rate(double,double,double,double,int,double):double

12 total methods with Code Size differences (11 improved, 1 regressed), 312220 unchanged.

Benchmark:

[Benchmark]
[ArgumentsSource(nameof(TestValues))]
public void MultiplyBy2(float[] array)
{
    for (int i = 0; i < array.Length; i++)
    {
        array[i] *= 2;
    }
}

Core i7 8700K (Coffee Lake), Windows 10 x64.

|         |       Mean |    Error |   StdDev |
|---------|-----------:|---------:|---------:|
| master  |   429.8 ns |  0.04 ns |  0.04 ns |
| PR      |   388.6 ns |  2.47 ns |  2.19 ns |

The PR version is ~10% faster (the improvement doesn't depend on array size).
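
For reproducibility, a minimal self-contained BenchmarkDotNet harness around this method might look like the sketch below; the TestValues data source and the 1000-element array length are illustrative assumptions, since only the benchmark body is quoted here.

using System.Collections.Generic;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

public class MultiplyBy2Benchmark
{
    // Assumed data source: a single 1000-element array; the measurement above
    // reportedly does not depend on the array size.
    public IEnumerable<object[]> TestValues()
    {
        yield return new object[] { new float[1000] };
    }

    [Benchmark]
    [ArgumentsSource(nameof(TestValues))]
    public void MultiplyBy2(float[] array)
    {
        for (int i = 0; i < array.Length; i++)
        {
            array[i] *= 2;
        }
    }
}

public static class Program
{
    public static void Main() => BenchmarkRunner.Run<MultiplyBy2Benchmark>();
}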

@jkotas added the area-CodeGen-coreclr label (CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI) and removed the area-CodeGen label on Mar 1, 2020.
@ghost left a comment:

Love this optimization!

@kunalspathak (Member) left a comment:

LGTM

@AndyAyersMS (Member) commented:

Why wouldn't we do this earlier, say in morph?

@EgorBo (Member, Author) commented Mar 9, 2020:

> Why wouldn't we do this earlier, say in morph?

@AndyAyersMS I noticed that these kinds of peephole optimizations are usually implemented closer to lowering in LLVM; see https://godbolt.org/z/4o9pUB.

But if you think it's better to have it in morph, no problem; it should be even easier than with LIR.

@EgorBo changed the title from "RyuJIT: x*2 -> x+x" to "RyuJIT: x*2 -> x+x; x*1 -> x (fp)" on Apr 19, 2020.
@EgorBo (Member, Author) commented Apr 19, 2020:

Moved to morph.cpp; also added the x * 1 -> x optimization.

A small regression: https://www.diffchecker.com/TMKg73nv (CSE-related?)
Total bytes of code: 591 vs. 592.

cc @sandreenko

@sandreenko (Contributor) left a comment:

LGTM

> A small regression: https://www.diffchecker.com/TMKg73nv (CSE-related?)
> Total bytes of code: 591 vs. 592.

I will check it tomorrow and merge the PR if it is not something unexpected.

Review thread on src/coreclr/src/jit/morph.cpp (outdated):
{
    if ((oper == GT_MUL) && !op1->IsCnsFltOrDbl() && op2->IsCnsFltOrDbl())
    {
        if (op2->AsDblCon()->gtDconVal == 2.0)

@tannergooding (Member) commented:

In dotnet/coreclr#24584 we introduced an optimization which replaces x / pow2 with x * (1 / pow2).

In the case where pow2 is 0.5, that transforms x / 0.5 into x * 2.0, so the question is: will this still be correctly optimized in that scenario? That is, will it become x + x?

@EgorBo (Member, Author) replied:

@tannergooding heh, yeah, I checked that 🙂

x / 0.5 is optimized to vaddss.
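
For reference, a minimal check of that interaction (DivideByHalf is an illustrative name, not something from the PR; the expected codegen follows from the reply above):

// x / 0.5 is first rewritten to x * (1 / 0.5) = x * 2.0 by the reciprocal
// optimization from dotnet/coreclr#24584, and that multiply is then rewritten
// to x + x by this PR, so the expected codegen is vaddss xmm0, xmm0, xmm0
// rather than a vmulss that loads a constant from memory.
static float DivideByHalf(float x) => x / 0.5f;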

A member commented:

Do we have a general plan, doc, or outline of how to order these optimizations and ensure they stay correctly ordered moving forward?

Ideally, we would be able to apply these transformations without needing to know that x / pow2 must come before x * 2.0 for some ordering-specific reason.

@EgorBo (Member, Author) replied:

No idea; we got lucky that these two were executed in the correct order.

I guess when you optimize something, e.g. change the operator of a node (say, uX GT_GT 0 to X GT_NE 0), and you feel that this particular node can be re-optimized under a different case, you should either:

  1. goto case GT_NE_OP (ugly), or
  2. call fgMorphSmpOp recursively (needs some recursion depth limit, just in case).

A member commented:

@AndyAyersMS, do you have any thoughts/comments here?

As we introduce more optimizations, this seems like something that will be easy to miss, and it would be good to have a consistent, recommended process for handling it...

@EgorBo (Member, Author) commented May 4, 2020:

btw, a good example of this problem is this PR: https://github.com/dotnet/coreclr/pull/25744/files

A member replied:

What's in morph is pretty messy. We don't have a good way of determining whether there are dependencies between optimizations, or whether one optimization may enable another. There are long-term ambitions of refactoring all this, but it is not something we'll get to anytime soon.

The best I can suggest for now is to make sure you run diffs widely, e.g. over the entire set of Pri1 tests, and look carefully at the diffs, or (if you are internal) run your change over some of our SPMI collections.

We want to avoid repeated subtree traversals so we don't waste time, get stuck in some oscillating mode, or blow the stack; so re-invoking morph recursively is only really viable if the node has been simplified in some way. If you are really sure that optimizing A enables optimization B, then factor B out and invoke it as a subroutine as part of finishing up A.

@sandreenko (Contributor) commented:

Sorry for keeping you waiting, I should have reviewed it earlier.

However, it looks like this optimization doesn't work well with minopts (or when compCodeOpt = SMALL_CODE), because in this case LSRA can't allocate both LCL_VAR reads to registers.
A good case with full opts:

N011 ( 40,  6) [000010] DA-XG-------              *  STORE_LCL_VAR double V02 tmp1         d:1 mm0 REG mm0
N013 (  1,  2) [000011] ------------        t11 =    LCL_VAR   double V02 tmp1         u:1 mm0 REG mm0 <l:$201, c:$200>
N015 (  1,  2) [000014] ------------        t14 =    LCL_VAR   double V02 tmp1         u:1 mm0 (last use) REG mm0 <l:$201, c:$200>
                                                  /--*  t11    double 
                                                  +--*  t14    double 
N017 ( 47, 14) [000015] ---XG-------        t15 = *  ADD       double REG mm0 <l:$1c7, c:$1c6>

a bad case with compCodeOpt = SMALL_CODE:

N012 ( 21, 15) [000009] DA-XG-------              *  STORE_LCL_VAR double V01 tmp1          NA REG NA
N014 (  3,  4) [000010] ------------        t10 =    LCL_VAR   double V01 tmp1          mm0 REG mm0
N016 (  3,  4) [000013] -c----------        t13 =    LCL_VAR   double V01 tmp1          NA REG NA <- memory, so we push more stack and do a longer read, that generates asm regressions
                                                  /--*  t10    double 
                                                  +--*  t13    double 
N018 ( 32, 27) [000014] ---XG-------        t14 = *  ADD       double REG mm0

So I think it should be protected by opts.OptimizationEnabled().

For some reason, jit-diff doesn't react to setting complus_jitminopts, and I can't get a JitDump for CalendricalCalculationsHelper:EquationOfTime(double):double to see what happens there; I will try to get it tomorrow.

@AndyAyersMS (Member) commented:

You can pass --tier0 to jit-diff diff to get a look at minopts code for most methods.

@sandreenko (Contributor) replied:

> You can pass --tier0 to jit-diff diff to get a look at minopts code for most methods.

That is probably broken currently; I am trying to debug that.

@sandreenko (Contributor) commented:

I got diffs for System.Globalization.CalendricalCalculationsHelper:EquationOfTime(double):double; it is just bad instruction ordering with the new weights. We see such regressions from time to time; it is just noise.

With the new transformation, the ADD is ordered earlier than a call, and its result has to be spilled before the call. Before, the MUL was scheduled after the call, so there was no need for a spill. @BruceForstall we discussed a similar issue today; cc @CarolEidt.

tree before:

N011 ( 98, 36) [000071] --C---------                 \--*  MUL       double
N007 ( 84, 22) [000185] --C---------                    +--*  INTRINSIC double sin <- call is scheduled before the second arg
N006 ( 48, 18) [000190] ------------                    |  \--*  DIV       double
N004 (  9, 10) [000188] ------------                    |     +--*  MUL       double
N002 (  1,  2) [000065] ------------                    |     |  +--*  LCL_VAR   double V03 loc2         
N003 (  3,  4) [000187] ------------                    |     |  \--*  CNS_DBL   double 3.1415926535897931
N005 (  3,  4) [000189] ------------                    |     \--*  CNS_DBL   double 180.00000000000000
N010 (  9, 10) [000064] ------------                    \--*  MUL       double // no need to spill mul result
N008 (  1,  2) [000063] ------------                       +--*  LCL_VAR   double V04 loc3         
N009 (  3,  4) [000062] ------------                       \--*  CNS_DBL   double 2.0000000000000000

tree after:

N010 ( 96, 34) [000071] --C---------                 \--*  MUL       double
N003 (  7,  8) [000333] ------------                    +--*  ADD       double // allocated to xmm1 and immediately spilled because of the second arg.
N001 (  1,  2) [000063] ------------                    |  +--*  LCL_VAR   double V04 loc3         
N002 (  1,  2) [000332] ------------                    |  \--*  LCL_VAR   double V04 loc3         
N009 ( 84, 22) [000185] --C---------                    \--*  INTRINSIC double sin
N008 ( 48, 18) [000190] ------------                       \--*  DIV       double
N006 (  9, 10) [000188] ------------                          +--*  MUL       double
N004 (  1,  2) [000065] ------------                          |  +--*  LCL_VAR   double V03 loc2         
N005 (  3,  4) [000187] ------------                          |  \--*  CNS_DBL   double 3.1415926535897931
N007 (  3,  4) [000189] ------------                          \--*  CNS_DBL   double 180.00000000000000
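
In C# terms, the subtree in both dumps corresponds roughly to an expression of the following shape; the method and variable names are illustrative, and only the operator structure is taken from the dumps above:

// Reconstructed from the tree dumps: sin(loc2 * pi / 180.0) * (loc3 * 2.0).
// With this PR, the (loc3 * 2.0) multiply becomes loc3 + loc3, which the JIT
// now orders before the sin call and therefore has to spill across it.
static double Shape(double loc2, double loc3)
    => Math.Sin(loc2 * 3.1415926535897931 / 180.0) * (loc3 * 2.0);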

baseGood.txt
diffBad.txt

@EgorBo please add the opts.OptimizationEnabled() guard and merge the change.

@EgorBo (Member, Author) commented May 6, 2020:

The failure is unrelated.

@sandreenko merged commit 5abde01 into dotnet:master on May 6, 2020.
@sandreenko (Contributor) commented:

Thanks @EgorBo.

This conversation was locked as resolved and limited to collaborators on Dec 10, 2020.
Labels: area-CodeGen-coreclr (CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI)
Participants: 7