
RyuJIT: x*2 -> x+x; x*1 -> x (fp) #33024

Merged: 15 commits into dotnet:master on May 6, 2020

Conversation

@EgorBo (Member) commented Mar 1, 2020:

Just a small peephole optimization for floats/doubles: x * 2 -> x + x. I decided to implement it in lowering, just like LLVM does (it implements this in its DAG* passes). In general, mulss/sd and addss/sd are the same in terms of latency/throughput on most CPUs, but the add doesn't need to load the 2.0 constant from memory.
UPD: also x * 1 -> x.

static float MultiplyBy2(float x) => x * 2; // or "2 * x" -> "x + x"
static float MultiplyBy1(float x) => x * 1; // or "1 * x" -> "x"

Current codegen

; Method MultiplyBy2(float):float
       C5F877               vzeroupper 
       C5FA590505000000     vmulss   xmm0, xmm0, dword ptr [reloc @RWD00]
       C3                   ret      
RWD00  dd	40000000h
; Total bytes of code: 12


; Method MultiplyBy1(float):float
       C5F877               vzeroupper 
       C5FA590505000000     vmulss   xmm0, xmm0, dword ptr [reloc @RWD00]
       C3                   ret      
RWD00  dd	3F800000h
; Total bytes of code: 12

New codegen

; Method MultiplyBy2(float):float
       C5F877               vzeroupper 
       C5FA58C0             vaddss   xmm0, xmm0, xmm0
       C3                   ret      
; Total bytes of code: 8


; Method MultiplyBy1(float):float
       C5F877               vzeroupper 
       C3                   ret      
; Total bytes of code: 4

Jit-diff (-f -pmi)

Found 5 files with textual diffs.

Summary of Code Size diffs:
(Lower is better)

Total bytes of diff: -118 (-0.00% of base)
    diff is an improvement.

Top file improvements (bytes):
         -55 : System.Private.CoreLib.dasm (-0.00% of base)
         -24 : runincontext.dasm (-0.22% of base)
         -24 : System.Runtime.Numerics.dasm (-0.03% of base)
          -8 : System.Drawing.Primitives.dasm (-0.02% of base)
          -7 : Microsoft.VisualBasic.Core.dasm (-0.00% of base)

5 total files with Code Size differences (5 improved, 0 regressed), 261 unchanged.

Top method regressions (bytes):
           1 ( 0.17% of base) : System.Private.CoreLib.dasm - System.Globalization.CalendricalCalculationsHelper:EquationOfTime(double):double

Top method improvements (bytes):
         -36 (-8.51% of base) : System.Private.CoreLib.dasm - System.Numerics.Matrix4x4:CreateFromQuaternion(System.Numerics.Quaternion):System.Numerics.Matrix4x4
         -24 (-0.85% of base) : runincontext.dasm - TestRunner:DoWorkStress():int:this
         -16 (-2.93% of base) : System.Runtime.Numerics.dasm - System.Numerics.Complex:Sqrt(System.Numerics.Complex):System.Numerics.Complex
          -8 (-12.12% of base) : System.Drawing.Primitives.dasm - System.Drawing.RectangleF:Inflate(float,float):this
          -8 (-20.51% of base) : System.Private.CoreLib.dasm - System.Diagnostics.Stopwatch:GetElapsedDateTimeTicks():long:this
          -8 (-2.84% of base) : System.Runtime.Numerics.dasm - System.Numerics.Complex:Tan(System.Numerics.Complex):System.Numerics.Complex
          -4 (-1.12% of base) : Microsoft.VisualBasic.Core.dasm - Microsoft.VisualBasic.Financial:SYD(double,double,double,double):double
          -4 (-0.84% of base) : System.Private.CoreLib.dasm - System.Numerics.Matrix4x4:CreatePerspective(float,float,float,float):System.Numerics.Matrix4x4
          -4 (-0.71% of base) : System.Private.CoreLib.dasm - System.Numerics.Matrix4x4:CreatePerspectiveOffCenter(float,float,float,float,float,float):System.Numerics.Matrix4x4
          -4 (-4.94% of base) : System.Private.CoreLib.dasm - System.Numerics.Vector2:Reflect(System.Numerics.Vector2,System.Numerics.Vector2):System.Numerics.Vector2
          -3 (-0.40% of base) : Microsoft.VisualBasic.Core.dasm - Microsoft.VisualBasic.Financial:Rate(double,double,double,double,int,double):double

Top method regressions (percentages):
           1 ( 0.17% of base) : System.Private.CoreLib.dasm - System.Globalization.CalendricalCalculationsHelper:EquationOfTime(double):double

Top method improvements (percentages):
          -8 (-20.51% of base) : System.Private.CoreLib.dasm - System.Diagnostics.Stopwatch:GetElapsedDateTimeTicks():long:this
          -8 (-12.12% of base) : System.Drawing.Primitives.dasm - System.Drawing.RectangleF:Inflate(float,float):this
         -36 (-8.51% of base) : System.Private.CoreLib.dasm - System.Numerics.Matrix4x4:CreateFromQuaternion(System.Numerics.Quaternion):System.Numerics.Matrix4x4
          -4 (-4.94% of base) : System.Private.CoreLib.dasm - System.Numerics.Vector2:Reflect(System.Numerics.Vector2,System.Numerics.Vector2):System.Numerics.Vector2
         -16 (-2.93% of base) : System.Runtime.Numerics.dasm - System.Numerics.Complex:Sqrt(System.Numerics.Complex):System.Numerics.Complex
          -8 (-2.84% of base) : System.Runtime.Numerics.dasm - System.Numerics.Complex:Tan(System.Numerics.Complex):System.Numerics.Complex
          -4 (-1.12% of base) : Microsoft.VisualBasic.Core.dasm - Microsoft.VisualBasic.Financial:SYD(double,double,double,double):double
         -24 (-0.85% of base) : runincontext.dasm - TestRunner:DoWorkStress():int:this
          -4 (-0.84% of base) : System.Private.CoreLib.dasm - System.Numerics.Matrix4x4:CreatePerspective(float,float,float,float):System.Numerics.Matrix4x4
          -4 (-0.71% of base) : System.Private.CoreLib.dasm - System.Numerics.Matrix4x4:CreatePerspectiveOffCenter(float,float,float,float,float,float):System.Numerics.Matrix4x4
          -3 (-0.40% of base) : Microsoft.VisualBasic.Core.dasm - Microsoft.VisualBasic.Financial:Rate(double,double,double,double,int,double):double

12 total methods with Code Size differences (11 improved, 1 regressed), 312220 unchanged.

Benchmark:

[Benchmark]
[ArgumentsSource(nameof(TestValues))]
public void MultiplyBy2(float[] array)
{
    for (int i = 0; i < array.Length; i++)
    {
        array[i] *= 2;
    }
}

Core i7 8700K (Coffee Lake), Windows 10 x64.

|         |       Mean |    Error |   StdDev |
|---------|-----------:|---------:|---------:|
| master  |   429.8 ns |  0.04 ns |  0.04 ns |
| PR      |   388.6 ns |  2.47 ns |  2.19 ns |

The PR version is ~10% faster (the improvement doesn't depend on array size).
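
For reproducibility, a minimal self-contained BenchmarkDotNet harness around this method might look like the sketch below; the TestValues data source and the 1000-element array length are illustrative assumptions, since only the benchmark body is quoted here.

using System.Collections.Generic;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

public class MultiplyBy2Benchmark
{
    // Assumed data source: a single 1000-element array; the measurement above
    // reportedly does not depend on the array size.
    public IEnumerable<object[]> TestValues()
    {
        yield return new object[] { new float[1000] };
    }

    [Benchmark]
    [ArgumentsSource(nameof(TestValues))]
    public void MultiplyBy2(float[] array)
    {
        for (int i = 0; i < array.Length; i++)
        {
            array[i] *= 2;
        }
    }
}

public static class Program
{
    public static void Main() => BenchmarkRunner.Run<MultiplyBy2Benchmark>();
}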

@jkotas added the area-CodeGen-coreclr label (CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI) and removed the area-CodeGen label on Mar 1, 2020.
@ghost left a comment:

Love this optimization!

@kunalspathak (Member) left a comment:

LGTM

@AndyAyersMS (Member) commented:

Why wouldn't we do this earlier, say in morph?

@EgorBo (Member, Author) commented Mar 9, 2020:

> Why wouldn't we do this earlier, say in morph?

@AndyAyersMS I noticed that these kinds of peephole optimizations are usually implemented closer to lowering in LLVM; see https://godbolt.org/z/4o9pUB.

But if you think it's better to have it in morph, no problem; it should be even easier than with LIR.

@EgorBo changed the title from "RyuJIT: x*2 -> x+x" to "RyuJIT: x*2 -> x+x; x*1 -> x (fp)" on Apr 19, 2020.
@EgorBo (Member, Author) commented Apr 19, 2020:

Moved to morph.cpp; also added the x * 1 -> x optimization.

A small regression: https://www.diffchecker.com/TMKg73nv (CSE-related?)
Total bytes of code: 591 vs. 592.

cc @sandreenko

@sandreenko (Contributor) left a comment:

LGTM

> A small regression: https://www.diffchecker.com/TMKg73nv (CSE-related?)
> Total bytes of code: 591 vs. 592.

I will check it tomorrow and merge the PR if it is not something unexpected.

Review thread on src/coreclr/src/jit/morph.cpp (outdated):
{
    if ((oper == GT_MUL) && !op1->IsCnsFltOrDbl() && op2->IsCnsFltOrDbl())
    {
        if (op2->AsDblCon()->gtDconVal == 2.0)

@tannergooding (Member) commented:

In dotnet/coreclr#24584 we introduced an optimization which replaces x / pow2 with x * (1 / pow2).

In the case where pow2 is 0.5, that transforms x / 0.5 into x * 2.0, so the question is: will this still be correctly optimized in that scenario? That is, will it become x + x?

@EgorBo (Member, Author) replied:

@tannergooding heh, yeah, I checked that 🙂

x / 0.5 is optimized to vaddss.
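
For reference, a minimal check of that interaction (DivideByHalf is an illustrative name, not something from the PR; the expected codegen follows from the reply above):

// x / 0.5 is first rewritten to x * (1 / 0.5) = x * 2.0 by the reciprocal
// optimization from dotnet/coreclr#24584, and that multiply is then rewritten
// to x + x by this PR, so the expected codegen is vaddss xmm0, xmm0, xmm0
// rather than a vmulss that loads a constant from memory.
static float DivideByHalf(float x) => x / 0.5f;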

A member commented:

Do we have a general plan, doc, or outline of how to order these optimizations and ensure they stay correctly ordered moving forward?

Ideally, we would be able to apply these transformations without needing to know that x / pow2 must come before x * 2.0 for some ordering-specific reason.

@EgorBo (Member, Author) replied:

No idea; we got lucky that these two were executed in the correct order.

I guess when you optimize something, e.g. change the operator of a node (say, uX GT_GT 0 to X GT_NE 0), and you feel that this particular node can be re-optimized under a different case, you should either:

  1. goto case GT_NE_OP (ugly), or
  2. call fgMorphSmpOp recursively (needs some recursion depth limit, just in case).

A member commented:

@AndyAyersMS, do you have any thoughts/comments here?

As we introduce more optimizations, this seems like something that will be easy to miss, and it would be good to have a consistent, recommended process for handling it...

@EgorBo (Member, Author) commented May 4, 2020:

btw, a good example of this problem is this PR: https://github.com/dotnet/coreclr/pull/25744/files

A member replied:

What's in morph is pretty messy. We don't have a good way of determining whether there are dependencies between optimizations, or whether one optimization may enable another. There are long-term ambitions of refactoring all this, but it is not something we'll get to anytime soon.

The best I can suggest for now is to make sure you run diffs widely, e.g. over the entire set of Pri1 tests, and look carefully at the diffs, or (if you are internal) run your change over some of our SPMI collections.

We want to avoid repeated subtree traversals so we don't waste time, get stuck in some oscillating mode, or blow the stack; so re-invoking morph recursively is only really viable if the node has been simplified in some way. If you are really sure that optimizing A enables optimization B, then factor B out and invoke it as a subroutine as part of finishing up A.

@sandreenko (Contributor) commented:

Sorry for keeping you waiting, I should have reviewed it earlier.

However, it looks like this optimization doesn't work well with minopts (or when compCodeOpt = SMALL_CODE), because in this case LSRA can't allocate both LCL_VAR reads to registers.
A good case with full opts:

N011 ( 40,  6) [000010] DA-XG-------              *  STORE_LCL_VAR double V02 tmp1         d:1 mm0 REG mm0
N013 (  1,  2) [000011] ------------        t11 =    LCL_VAR   double V02 tmp1         u:1 mm0 REG mm0 <l:$201, c:$200>
N015 (  1,  2) [000014] ------------        t14 =    LCL_VAR   double V02 tmp1         u:1 mm0 (last use) REG mm0 <l:$201, c:$200>
                                                  /--*  t11    double 
                                                  +--*  t14    double 
N017 ( 47, 14) [000015] ---XG-------        t15 = *  ADD       double REG mm0 <l:$1c7, c:$1c6>

a bad case with compCodeOpt = SMALL_CODE:

N012 ( 21, 15) [000009] DA-XG-------              *  STORE_LCL_VAR double V01 tmp1          NA REG NA
N014 (  3,  4) [000010] ------------        t10 =    LCL_VAR   double V01 tmp1          mm0 REG mm0
N016 (  3,  4) [000013] -c----------        t13 =    LCL_VAR   double V01 tmp1          NA REG NA <- memory, so we push more stack and do a longer read, that generates asm regressions
                                                  /--*  t10    double 
                                                  +--*  t13    double 
N018 ( 32, 27) [000014] ---XG-------        t14 = *  ADD       double REG mm0

So I think it should be protected by opts.OptimizationEnabled().

For some reason, jit-diff doesn't react to setting complus_jitminopts, and I can't get a JitDump for CalendricalCalculationsHelper:EquationOfTime(double):double to see what happens there; I will try to get it tomorrow.

@AndyAyersMS (Member) commented:

You can pass --tier0 to jit-diff diff to get a look at minopts code for most methods.

@sandreenko (Contributor) replied:

> You can pass --tier0 to jit-diff diff to get a look at minopts code for most methods.

That is probably broken currently; I am trying to debug that.

@sandreenko (Contributor) commented:

I got diffs for System.Globalization.CalendricalCalculationsHelper:EquationOfTime(double):double; it is just bad instruction ordering with the new weights. We see such regressions from time to time; it is just noise.

With the new transformation, the ADD is ordered earlier than a call, and its result has to be spilled before the call. Before, the MUL was scheduled after the call, so there was no need for a spill. @BruceForstall we discussed a similar issue today; cc @CarolEidt.

tree before:

N011 ( 98, 36) [000071] --C---------                 \--*  MUL       double
N007 ( 84, 22) [000185] --C---------                    +--*  INTRINSIC double sin <- call is scheduled before the second arg
N006 ( 48, 18) [000190] ------------                    |  \--*  DIV       double
N004 (  9, 10) [000188] ------------                    |     +--*  MUL       double
N002 (  1,  2) [000065] ------------                    |     |  +--*  LCL_VAR   double V03 loc2         
N003 (  3,  4) [000187] ------------                    |     |  \--*  CNS_DBL   double 3.1415926535897931
N005 (  3,  4) [000189] ------------                    |     \--*  CNS_DBL   double 180.00000000000000
N010 (  9, 10) [000064] ------------                    \--*  MUL       double // no need to spill mul result
N008 (  1,  2) [000063] ------------                       +--*  LCL_VAR   double V04 loc3         
N009 (  3,  4) [000062] ------------                       \--*  CNS_DBL   double 2.0000000000000000

tree after:

N010 ( 96, 34) [000071] --C---------                 \--*  MUL       double
N003 (  7,  8) [000333] ------------                    +--*  ADD       double // allocated to xmm1 and immediately spilled because of the second arg.
N001 (  1,  2) [000063] ------------                    |  +--*  LCL_VAR   double V04 loc3         
N002 (  1,  2) [000332] ------------                    |  \--*  LCL_VAR   double V04 loc3         
N009 ( 84, 22) [000185] --C---------                    \--*  INTRINSIC double sin
N008 ( 48, 18) [000190] ------------                       \--*  DIV       double
N006 (  9, 10) [000188] ------------                          +--*  MUL       double
N004 (  1,  2) [000065] ------------                          |  +--*  LCL_VAR   double V03 loc2         
N005 (  3,  4) [000187] ------------                          |  \--*  CNS_DBL   double 3.1415926535897931
N007 (  3,  4) [000189] ------------                          \--*  CNS_DBL   double 180.00000000000000
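
In C# terms, the subtree in both dumps corresponds roughly to an expression of the following shape; the method and variable names are illustrative, and only the operator structure is taken from the dumps above:

// Reconstructed from the tree dumps: sin(loc2 * pi / 180.0) * (loc3 * 2.0).
// With this PR, the (loc3 * 2.0) multiply becomes loc3 + loc3, which the JIT
// now orders before the sin call and therefore has to spill across it.
static double Shape(double loc2, double loc3)
    => Math.Sin(loc2 * 3.1415926535897931 / 180.0) * (loc3 * 2.0);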

baseGood.txt
diffBad.txt

@EgorBo please add the opts.OptimizationEnabled() guard and merge the change.

@EgorBo (Member, Author) commented May 6, 2020:

The failure is unrelated.

@sandreenko merged commit 5abde01 into dotnet:master on May 6, 2020.
@sandreenko (Contributor) commented:

Thanks @EgorBo.

This conversation was locked as resolved and limited to collaborators on Dec 10, 2020.
Labels: area-CodeGen-coreclr (CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI)
Participants: 7