
Reduce repeated variable add operations to a single multiply. #61663

Merged
kunalspathak merged 9 commits into dotnet:main on Dec 3, 2021

Conversation

anthonycanino
Contributor

This PR addresses #34938.

Changes

gtReduceStrength will reduce i + i + i + i to i * 4, which will be optimized to i << 2. The reduction is not limited to powers of two: i + i + i + i + i will reduce to i * 5, which will be optimized to lea reg1, [reg2+4*reg2], etc., on xarch.
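
For concreteness, a minimal C# sketch of the shape the peephole targets (method names are invented for the example):

// Before the peep: chains of adds of the same local.
static int TimesFour(int i) => i + i + i + i;     // reduced to i * 4, emitted as i << 2
static int TimesFive(int i) => i + i + i + i + i; // reduced to i * 5, emitted as
                                                  // lea reg1, [reg2+4*reg2] on xarch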

Discussion

gtReduceStrength is invoked inside gtMorphSmpOp and kept as its own function for now, per the discussion on the public issue, but it can be inlined into the preorder processing block of gtMorphSmpOp.

Regarding the regression below, especially in Benchstone.BenchI.CSieve:Test():bool:this: it is caused by a range-checking limitation, and another PR I have opened, #61662, will fix that case.

Overall, the improvements are pretty minor here, which isn't surprising, as I wouldn't expect programmers writing benchmarks or libraries to write something like i + i + i + i. However, LLVM/MSVC perform similar optimizations (https://godbolt.org/z/4ednM7WaE).

Results

benchmarks.run.windows.x64.checked.mch:


Summary of Code Size diffs:
(Lower is better)

Total bytes of base: 7121565 (overridden on cmd)
Total bytes of diff: 7121582 (overridden on cmd)
Total bytes of delta: 17 (0.00 % of base)
    diff is a regression.
    relative diff is a regression.
Detail diffs


Top file regressions (bytes):
          15 : 23803.dasm (8.29% of base)
           1 : 14314.dasm (0.06% of base)
           1 : 26462.dasm (0.08% of base)

3 total files with Code Size differences (0 improved, 3 regressed), 2 unchanged.

Top method regressions (bytes):
          15 ( 8.29% of base) : 23803.dasm - Benchstone.BenchI.CSieve:Test():bool:this
           1 ( 0.08% of base) : 26462.dasm - Benchstone.BenchF.LLoops:Init():this
           1 ( 0.06% of base) : 14314.dasm - Benchstone.MDBenchF.MDLLoops:Init():this

Top method regressions (percentages):
          15 ( 8.29% of base) : 23803.dasm - Benchstone.BenchI.CSieve:Test():bool:this
           1 ( 0.08% of base) : 26462.dasm - Benchstone.BenchF.LLoops:Init():this
           1 ( 0.06% of base) : 14314.dasm - Benchstone.MDBenchF.MDLLoops:Init():this

3 total methods with Code Size differences (0 improved, 3 regressed), 2 unchanged.


coreclr_tests.pmi.windows.x64.checked.mch:


Summary of Code Size diffs:
(Lower is better)

Total bytes of base: 124633874 (overridden on cmd)
Total bytes of diff: 124633877 (overridden on cmd)
Total bytes of delta: 3 (0.00 % of base)
    diff is a regression.
    relative diff is a regression.
Detail diffs


Top file regressions (bytes):
          15 : 82551.dasm (10.34% of base)
           1 : 239451.dasm (0.06% of base)
           1 : 239429.dasm (0.08% of base)

Top file improvements (bytes):
          -6 : 82974.dasm (-0.56% of base)
          -5 : 239565.dasm (-0.04% of base)
          -1 : 245065.dasm (-0.14% of base)
          -1 : 84102.dasm (-0.77% of base)
          -1 : 240909.dasm (-0.16% of base)

8 total files with Code Size differences (5 improved, 3 regressed), 8 unchanged.

Top method regressions (bytes):
          15 (10.34% of base) : 82551.dasm - Benchstone.BenchI.CSieve:Bench():bool
           1 ( 0.08% of base) : 239429.dasm - Benchstone.BenchF.LLoops:Init():this
           1 ( 0.06% of base) : 239451.dasm - Benchstone.MDBenchF.MDLLoops:Init():this

Top method improvements (bytes):
          -6 (-0.56% of base) : 82974.dasm - $:gg(int,jb,jb):jb
          -5 (-0.04% of base) : 239565.dasm - ILGEN_0x372a9ae6:Method_0xdc6ff1a4(byte,byte,int,long,ushort,double,long,long):int
          -1 (-0.16% of base) : 240909.dasm - AA:Static1(byref,byref,long,System.Boolean[,,]):ushort
          -1 (-0.14% of base) : 245065.dasm - BB:Static1(System.Double[,,][],byref,int,byref)
          -1 (-0.77% of base) : 84102.dasm - TailRecursionWithOsrEntryInTry:F(int,int,int,int):int

Top method regressions (percentages):
          15 (10.34% of base) : 82551.dasm - Benchstone.BenchI.CSieve:Bench():bool
           1 ( 0.08% of base) : 239429.dasm - Benchstone.BenchF.LLoops:Init():this
           1 ( 0.06% of base) : 239451.dasm - Benchstone.MDBenchF.MDLLoops:Init():this

Top method improvements (percentages):
          -1 (-0.77% of base) : 84102.dasm - TailRecursionWithOsrEntryInTry:F(int,int,int,int):int
          -6 (-0.56% of base) : 82974.dasm - $:gg(int,jb,jb):jb
          -1 (-0.16% of base) : 240909.dasm - AA:Static1(byref,byref,long,System.Boolean[,,]):ushort
          -1 (-0.14% of base) : 245065.dasm - BB:Static1(System.Double[,,][],byref,int,byref)
          -5 (-0.04% of base) : 239565.dasm - ILGEN_0x372a9ae6:Method_0xdc6ff1a4(byte,byte,int,long,ushort,double,long,long):int

8 total methods with Code Size differences (5 improved, 3 regressed), 8 unchanged.


libraries.pmi.windows.x64.checked.mch:


Summary of Code Size diffs:
(Lower is better)

Total bytes of base: 45221486 (overridden on cmd)
Total bytes of diff: 45221486 (overridden on cmd)
Total bytes of delta: 0 (0.00 % of base)
Detail diffs


0 total files with Code Size differences (0 improved, 0 regressed), 4 unchanged.

0 total methods with Code Size differences (0 improved, 0 regressed), 4 unchanged.


libraries_tests.pmi.windows.x64.checked.mch:


Summary of Code Size diffs:
(Lower is better)

Total bytes of base: 113503162 (overridden on cmd)
Total bytes of diff: 113503171 (overridden on cmd)
Total bytes of delta: 9 (0.00 % of base)
    diff is a regression.
    relative diff is a regression.
Detail diffs


Top file regressions (bytes):
           3 : 196471.dasm (0.74% of base)
           3 : 329488.dasm (0.74% of base)
           3 : 194847.dasm (0.74% of base)

3 total files with Code Size differences (0 improved, 3 regressed), 9 unchanged.

Top method regressions (bytes):
           3 ( 0.74% of base) : 196471.dasm - <>c__DisplayClass19_0:<RsaOaepMaxSize>g__Test|0(System.Security.Cryptography.RSAEncryptionPadding,int):this
           3 ( 0.74% of base) : 329488.dasm - <>c__DisplayClass19_0:<RsaOaepMaxSize>g__Test|0(System.Security.Cryptography.RSAEncryptionPadding,int):this
           3 ( 0.74% of base) : 194847.dasm - <>c__DisplayClass19_0:<RsaOaepMaxSize>g__Test|0(System.Security.Cryptography.RSAEncryptionPadding,int):this

Top method regressions (percentages):
           3 ( 0.74% of base) : 196471.dasm - <>c__DisplayClass19_0:<RsaOaepMaxSize>g__Test|0(System.Security.Cryptography.RSAEncryptionPadding,int):this
           3 ( 0.74% of base) : 329488.dasm - <>c__DisplayClass19_0:<RsaOaepMaxSize>g__Test|0(System.Security.Cryptography.RSAEncryptionPadding,int):this
           3 ( 0.74% of base) : 194847.dasm - <>c__DisplayClass19_0:<RsaOaepMaxSize>g__Test|0(System.Security.Cryptography.RSAEncryptionPadding,int):this

3 total methods with Code Size differences (0 improved, 3 regressed), 9 unchanged.


@ghost added the community-contribution label (Indicates that the PR has been added by a community member) Nov 16, 2021
@dotnet-issue-labeler bot added the area-CodeGen-coreclr label (CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI) Nov 16, 2021
@ghost

ghost commented Nov 16, 2021

Tagging subscribers to this area: @JulieLeeMSFT
See info in area-owners.md if you want to be subscribed.

@EgorBo
Member

EgorBo commented Nov 16, 2021

Nice! Can you make sure this code is correctly folded:

int Test(int x)
{
    for (int i = 0; i < Vector<int>.Count; i++)
        x++;
    return x;
}

Currently, Vector<int>.Count is a hack to unroll loops, but we might enable it for more cases in the future.

Also, this case: #46257 (comment)

UPD Ah, it's a bit different - these are separate statements.

@anthonycanino
Contributor Author

UPD Ah, it's a bit different - these are separate statements.

Yes, though I am interested in the combination of strength reduction with loop unrolling in general. However, this is definitely a peephole optimization for a single add expression.

@kasperk81
Contributor

will it also reduce x - x - x - x - x to imul eax, ecx, -3?

@anthonycanino
Contributor Author

will it also reduce x - x - x - x - x to imul eax, ecx, -3?

No, it will only address replicated addition, at least as a first pass, per the discussion on #34938.
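
For reference, the arithmetic in the question above does hold for a left-associated chain; a hypothetical C# sketch (not something this PR implements):

static int NegativeTriple(int x)
{
    // ((((x - x) - x) - x) - x) == -3 * x, so a compiler could
    // legally emit imul eax, ecx, -3 here.
    return x - x - x - x - x;
}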

@anthonycanino marked this pull request as ready for review November 17, 2021 11:40
@kasperk81
Contributor

per the discussion on #34938.

How common is the pattern discussed in #34938? "0 improved, 0 regressed" in the BCL tells me it's not worth the complexity.

@anthonycanino
Contributor Author

The pattern pops up a little in the libraries tests (please see above); for the regression, please see #61662.

@kunalspathak @EgorBo, what do you think? Is this peep worth pursuing to match the behavior of LLVM/MSVC (https://godbolt.org/z/4ednM7WaE)? It may be a starting point for further strength-reduction patterns too, e.g., @kasperk81's pattern above.

@kunalspathak
Member

It is true that there are not many patterns in the .NET libraries that will benefit from this. I would still keep it, provided we understand the regressions happening in some benchmark methods and try to eliminate them.

This happens on both x64 and arm64.

Top method regressions (bytes):
          19 (12.93 % of base) : 26383.dasm - Benchstone.BenchI.CSieve:Test():bool:this
           4 (1.79 % of base) : 29297.dasm - StringSort:strsift(System.String[],int,int)

if (vnStore != nullptr)
{
    fgValueNumberTreeConst(consTree);
    fgValueNumberTree(morphed);
Contributor

The morphed tree should have the same VN as the original: morphed->SetVNsFromNode(tree). It is not a good idea, in general, to call fgValueNumberTree outside of VN; it has not been tested to work that way.

Also, there is a helper you can use here to update the VN of the constant tree: fgUpdateConstTreeValueNumber.

Also, if you don't intend to relax the fgGlobalMorph check under which this method currently runs, you don't need to maintain VNs at all.

Contributor Author

Thanks for pointing this out. This is mostly my unfamiliarity with the code: I didn't realize the difference between the fgGlobalMorph check and the CSE check, i.e., when the optimizations are valid in which phases.

Looking at the surrounding code in fgMorphSmpOp, I see cases where the optimizations are restricted to the global morph, and cases where they are not. Is there a recommended pattern here, or do you have any tips?

Contributor
@SingleAccretion commented Nov 19, 2021

Looking at the surrounding code in fgMorphSmpOp, I see cases where the optimizations are restricted to the global morph, and cases where they are not. Is there a recommended pattern here, or do you have any tips?

It is mostly a cost/benefit question.

But first, legality:

  • Only optimizations that preserve value numbers are allowed outside of fgGlobalMorph. Allowable VN manipulation is transferring an existing VN from an existing tree, and updating VNs for constants. Anything more complex is probably better left under fgGlobalMorph. This optimization can trivially preserve value numbers (the VN of the tree stays the same, as does the VN of the local, the VN of the constant gets updated).

Allowing optimizations outside of global morph is beneficial because the optimization phases discover new facts that help peepholes, primarily via constant propagation. If your optimization is unlikely to benefit from that, it is likely better to leave it under fgGlobalMorph and not worry about VN maintenance / CSEs. Otherwise, it is up to the judgement of the implementer and reviewers what decision is "better" CQ-vs-risk-of-bugs-wise (which, I note, we discover on a somewhat regular basis...).

I personally think it is fine to leave this under fgGlobalMorph and delete the VN code. The only new fact that optimizations can discover and which can help the transformation is a local being substituted via copy propagation, but it seems like a rather rare event, and we do not remorph in copy propagation, so it would take a random CSE/assertion to force the remorph and realize the benefit.

(Heh, I should add this to my guide.)

GenTree* consTree = tree->AsOp()->gtOp1;
consTree->BashToConst(foldCount, lclVarTree->TypeGet());

/* GT_MUL operators can later be transformed into 'GT_CALL' */
Contributor

Suggested change
/* GT_MUL operators can later be transformed into 'GT_CALL' */
// GT_MUL operators can later be transformed into 'GT_CALL'.

Nit: we prefer //-style comments.

This suggests to me that this is actually a pessimization in this case. We should probably bail if we see 64-bit ADDs on a 32-bit target (the only case where we use helpers). We may miss some power-of-two cases, but that is the price of doing this in pre-order, I suppose.
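
As a hypothetical C# illustration of the case to bail on (the helper expansion described is the general 32-bit long-multiply path, not a claim about this exact tree shape):

static long TimesThree(long l)
{
    // On a 32-bit target, reducing this to l * 3 creates a TYP_LONG GT_MUL,
    // which the JIT may expand into a runtime multiply helper call - likely
    // worse than the two adds it replaces.
    return l + l + l;
}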

GenTree* op1 = tree->AsOp()->gtOp1;
GenTree* op2 = tree->AsOp()->gtOp2;

if (!op2->OperIs(GT_LCL_VAR) || !varTypeIsIntegralOrI(op2))
Contributor

Suggested change
if (!op2->OperIs(GT_LCL_VAR) || !varTypeIsIntegralOrI(op2))
if (!op2->OperIs(GT_LCL_VAR) || !varTypeIsIntegral(op2))

It is not valid to multiply byrefs/refs.

return tree;
}

genTreeOps oper = tree->OperGet();
Contributor

Unused?

GenTree* Compiler::gtReduceStrength(GenTree* tree)
{
// ADD(_, V0) starts the pattern match
if (tree->OperGet() != GT_ADD || tree->gtOverflow() || optValnumCSE_phase)
Contributor

Suggested change
if (tree->OperGet() != GT_ADD || tree->gtOverflow() || optValnumCSE_phase)
if (!tree->OperIs(GT_ADD) || tree->gtOverflow() || optValnumCSE_phase)

Nit: why not use OperIs everywhere?

The optValnumCSE_phase check is redundant with the fgGlobalMorph guard under which the method is called. I suggest removing it and instead asserting that we are in global morph, if you don't intend to relax that restriction. If you do, then optValnumCSE_phase should, obviously, stay.

// Return Value:
// reduced tree if pattern matches, original tree otherwise
//
GenTree* Compiler::gtReduceStrength(GenTree* tree)
Contributor

Nit: the method's name is very general - is this intended to be extended? If it is to stay as it is, it should probably be renamed to something more specific (there is a pattern of fgOptimizeThing in morph) and moved out of gentree.cpp into morph.cpp (not much of a difference these days, I know, but let's not make it worse 😄).

}

int foldCount = 0;
unsigned int lclNum = op2->AsLclVarCommon()->GetLclNum();
Contributor

Suggested change
unsigned int lclNum = op2->AsLclVarCommon()->GetLclNum();
unsigned lclNum = op2->AsLclVarCommon()->GetLclNum();

Nit: we prefer bare unsigned for lclNums.

@anthonycanino
Contributor Author

For the regression (at least in Benchstone.BenchI.CSieve:Test), we see it happen because the RangeCheck analysis only handles certain cases. The following code is roughly taken from the CSieve test...

bool[] flags = new bool[Size + 1];

for (int i = 2; i <= Size; i++)
{
  for (int k = i + i; k <= Size; k += i)
  {
    flags[k] = false;
  }
}

becomes

bool[] flags = new bool[Size + 1];

for (int i = 2; i <= Size; i++)
{
  for (int k = i * 2; k <= Size; k += i)  // <-- i + i becomes i * 2
  {
    flags[k] = false;
  }
}

The RangeCheck analysis does not analyze through a GT_LSH(V0, 1) or GT_MUL(V0, 2), so the subsequent flags[k] access contains an extra array bounds check where it otherwise wouldn't have one.

I created a separate PR to enhance the RangeCheck analysis for this case, at #61662.

I'll investigate the asm diffs generated through the CI superpmi and report back to double check the AMD case.

Comment on lines 18214 to 18216
// GT_MUL operators can later be transformed into 'GT_CALL'.
GenTree* morphed =
new (this, GT_CALL) GenTreeOp(GT_MUL, tree->TypeGet(), lclVarTree, consTree DEBUGARG(/*largeNode*/ true));
Contributor

No longer needed (gtNewOperNode can be used instead)?

(This will need a comment update above, where we give up on 64-bit MULs on 32-bit targets, noting that it is now for both CQ and correctness reasons.)

@@ -6388,6 +6388,9 @@ class Compiler
GenTree* fgMorphCommutative(GenTreeOp* tree);
GenTree* fgMorphCastedBitwiseOp(GenTreeOp* tree);

// Reduce successive add operations to a single multiply, i + i + i + i => i * 4
Contributor

Suggested change
// Reduce successive add operations to a single multiply, i + i + i + i => i * 4

Nit: I don't think it is worth repeating the method header. Such duplication tends to lead to stale comments (as people naturally forget to update both places).

@@ -11581,6 +11581,14 @@ GenTree* Compiler::fgMorphSmpOp(GenTree* tree, MorphAddrContext* mac)
break;
}

// Perform strength reduction prior to preorder op1, op2 operand processing
Contributor

Nit: I am not convinced adding the comment here has a lot of value; it is fairly obvious from the code what is going on.

@anthonycanino
Contributor Author

@SingleAccretion I've addressed your comments in the last commits. Should have covered everything.

@kunalspathak I spent some time digging into the regressions to better understand why they happen with this peep. The following is a detailed description of the benchmark .mch results for Windows x64, x86, and Linux x64. From the spmi artifact, it looks like arm64 and arm exhibit the same or similar behavior. I can dig further there if needed, but I think what I have will cover a lot. It's mostly the following patterns. Also note that I ran these results with JitAlignLoop=0. Please let me know your thoughts.

Results

Windows x64 Benchmarks


Summary of Code Size diffs:
(Lower is better)

Total bytes of base: 7063028 (overridden on cmd)
Total bytes of diff: 7063049 (overridden on cmd)
Total bytes of delta: 21 (0.00 % of base)
    diff is a regression.
    relative diff is a regression.
Detail diffs


Top file regressions (bytes):
          19 : 23427.dasm (11.45% of base)
           1 : 14074.dasm (0.07% of base)
           1 : 25562.dasm (0.08% of base)

3 total files with Code Size differences (0 improved, 3 regressed), 2 unchanged.

Top method regressions (bytes):
          19 (11.45% of base) : 23427.dasm - Benchstone.BenchI.CSieve:Test():bool:this
           1 ( 0.08% of base) : 25562.dasm - Benchstone.BenchF.LLoops:Init():this
           1 ( 0.07% of base) : 14074.dasm - Benchstone.MDBenchF.MDLLoops:Init():this

Top method regressions (percentages):
          19 (11.45% of base) : 23427.dasm - Benchstone.BenchI.CSieve:Test():bool:this
           1 ( 0.08% of base) : 25562.dasm - Benchstone.BenchF.LLoops:Init():this
           1 ( 0.07% of base) : 14074.dasm - Benchstone.MDBenchF.MDLLoops:Init():this

3 total methods with Code Size differences (0 improved, 3 regressed), 2 unchanged.


Windows x86 Benchmarks


Summary of Code Size diffs:
(Lower is better)

Total bytes of base: 5654101 (overridden on cmd)
Total bytes of diff: 5654095 (overridden on cmd)
Total bytes of delta: -6 (-0.00 % of base)
    diff is an improvement.
    relative diff is a regression.
Detail diffs


Top file regressions (bytes):
          16 : 6565.dasm (2.55% of base)
          14 : 21015.dasm (10.85% of base)
           1 : 13147.dasm (0.08% of base)

Top file improvements (bytes):
         -32 : 22560.dasm (-6.57% of base)
          -3 : 14050.dasm (-0.26% of base)
          -2 : 22550.dasm (-0.19% of base)

6 total files with Code Size differences (3 improved, 3 regressed), 2 unchanged.

Top method regressions (bytes):
          16 ( 2.55% of base) : 6565.dasm - BitOps:DoBitfieldIteration(System.Int32[],System.Int32[],int,byref):long
          14 (10.85% of base) : 21015.dasm - Benchstone.BenchI.CSieve:Test():bool:this
           1 ( 0.08% of base) : 13147.dasm - Benchstone.MDBenchF.MDLLoops:Init():this

Top method improvements (bytes):
         -32 (-6.57% of base) : 22560.dasm - NumericSortRect:NumSift(System.Int32[,],int,int,int)
          -3 (-0.26% of base) : 14050.dasm - System.Numerics.BigIntegerCalculator:Square(System.ReadOnlySpan`1[UInt32],System.Span`1[UInt32])
          -2 (-0.19% of base) : 22550.dasm - Benchstone.BenchF.LLoops:Init():this

Top method regressions (percentages):
          14 (10.85% of base) : 21015.dasm - Benchstone.BenchI.CSieve:Test():bool:this
          16 ( 2.55% of base) : 6565.dasm - BitOps:DoBitfieldIteration(System.Int32[],System.Int32[],int,byref):long
           1 ( 0.08% of base) : 13147.dasm - Benchstone.MDBenchF.MDLLoops:Init():this

Top method improvements (percentages):
         -32 (-6.57% of base) : 22560.dasm - NumericSortRect:NumSift(System.Int32[,],int,int,int)
          -3 (-0.26% of base) : 14050.dasm - System.Numerics.BigIntegerCalculator:Square(System.ReadOnlySpan`1[UInt32],System.Span`1[UInt32])
          -2 (-0.19% of base) : 22550.dasm - Benchstone.BenchF.LLoops:Init():this

6 total methods with Code Size differences (3 improved, 3 regressed), 2 unchanged.


Linux x64 Benchmarks


Summary of Code Size diffs:
(Lower is better)

Total bytes of base: 10242432 (overridden on cmd)
Total bytes of diff: 10242452 (overridden on cmd)
Total bytes of delta: 20 (0.00 % of base)
	diff is a regression.
	relative diff is a regression.
Detail diffs


Top file regressions (bytes):
      	14 : 27166.dasm (9.524% of base)
       	4 : 29766.dasm (1.786% of base)
       	1 : 16678.dasm (0.069% of base)
       	1 : 29522.dasm (0.079% of base)

4 total files with Code Size differences (0 improved, 4 regressed), 2 unchanged.

Top method regressions (bytes):
      	14 (9.524% of base) : 27166.dasm - Benchstone.BenchI.CSieve:Test():bool:this
       	4 (1.786% of base) : 29766.dasm - StringSort:strsift(System.String[],int,int)
       	1 (0.079% of base) : 29522.dasm - Benchstone.BenchF.LLoops:Init():this
       	1 (0.069% of base) : 16678.dasm - Benchstone.MDBenchF.MDLLoops:Init():this

Top method regressions (percentages):
      	14 (9.524% of base) : 27166.dasm - Benchstone.BenchI.CSieve:Test():bool:this
       	4 (1.786% of base) : 29766.dasm - StringSort:strsift(System.String[],int,int)
       	1 (0.079% of base) : 29522.dasm - Benchstone.BenchF.LLoops:Init():this
       	1 (0.069% of base) : 16678.dasm - Benchstone.MDBenchF.MDLLoops:Init():this

4 total methods with Code Size differences (0 improved, 4 regressed), 2 unchanged.


Explanations

Benchstone.BenchI.CSieve:Test():bool:this

Due to the introduction of a range check that was otherwise eliminated by the RangeCheck analysis. Please see #61662 for a detailed explanation. If accepted, that PR should eliminate this regression (and has done so in my local tests).


Benchstone.MDBenchF.MDLLoops:Init():this

Init code has this pattern...

_u1[j,k,l] = k;
_u2[j,k,l] = k + k;
_u3[j,k,l] = k + k + k;

which the peep will transform to (essentially)

_u1[j,k,l] = k;
_u2[j,k,l] = k * 2;
_u3[j,k,l] = k * 3;

which effectively transforms

vcvtsi2sd   xmm0, eax
lea         r9d, [rax+rax]
vxorps      xmm1, xmm1
vcvtsi2sd   xmm1, r9d
add         r9d, eax
vxorps      xmm2, xmm2
vcvtsi2sd   xmm2, r9d

to

vcvtsi2sd   xmm0, eax
lea         r9d, [rax+rax]
vxorps      xmm1, xmm1
vcvtsi2sd   xmm1, r9d
lea         r9d, [rax+2*rax]    // <-- HERE
vxorps      xmm2, xmm2
vcvtsi2sd   xmm2, r9d

so in this case, the strength reduction interferes with CSE.
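
Schematically (a C#-like sketch of what CSE could reuse before and after; the temp is invented for illustration):

// Before the peep: k + k is a subexpression of k + k + k,
// so its value can be reused.
int t = k + k;   // lea r9d, [rax+rax]
u2 = t;
u3 = t + k;      // add r9d, eax  (reuses t)

// After the peep: k * 3 shares nothing with k * 2,
// so it is computed fresh.
u2 = k * 2;      // lea r9d, [rax+rax]
u3 = k * 3;      // lea r9d, [rax+2*rax]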


Benchstone.BenchF.LLoops:Init():this

Similar to MDLLoops, with the relevant code being

_u1[j][k][l] = k;
_u2[j][k][l] = k + k;
_u3[j][k][l] = k + k + k;

except that, with the way code is generated on x86, an extra lea gets eliminated, so it's an improvement in that case, i.e.,

lea         ebx, [eax+eax]
vxorps      xmm1, xmm1
vcvtsi2sd   xmm1, ebx
lea         ebx, [eax+eax]
add         ebx, eax

to

lea         edi, [eax+eax]
vxorps      xmm1, xmm1
vcvtsi2sd   xmm1, edi
lea         edi, [eax+2*eax]

StringSort:strsift(System.String[],int,int)

The generated code is effectively the same, even when the i + i is transformed into i << 1, from the following code:

private static void strsift(string[] array, int i, int j)
{
    int k;
    string temp;

    while ((i + i) <= j)
    {
        k = i + i;
        // ...

The reason for the change in bytes is that i + i => i << 1 reduces the number of uses of i in the function, and I believe this change resulted in different register allocation for arg0 and arg1 (array and i): r14 and rbx respectively before the peep, rbx and r14 after. This shifted which instructions needed an extra REX prefix, and the net change is +4 bytes.


BitOps:DoBitfieldIteration(System.Int32[],System.Int32[],int,byref):long

The relevant code is i + i and i + i + 1 inside an index expression, e.g.,

for (i = 0; i < bitoparraysize; i++)
{
    switch (i % 3)
    {
        case 0: /* Set run of bits */
            ToggleBitRun(bitarraybase, bitoparraybase[i + i], bitoparraybase[i + i + 1], 1);
            break;
            // ...

where the transformed code, arr[i << 1] and arr[(i << 1) + 1], changes how gtMarkAddrMode reads the index expression. When the index expression is an LSH, it gets marked with GTF_ADDRMODE_NO_CSE, which changes some code from

lea      eax, [edx+edx]
cmp      eax, dword ptr [edi+4]
jae      G_M4310_IG24
mov      esi, dword ptr [edi+4*eax+8]
inc      eax
cmp      eax, dword ptr [edi+4]

to

lea      ecx, [edx+edx]
cmp      ecx, dword ptr [edi+4]
jae      G_M4310_IG24
mov      eax, dword ptr [edi+8*edx+8]
lea      ecx, [2*edx+1]
cmp      ecx, dword ptr [edi+4]

Note that if the index expressions are written as arr[i * 2] and arr[i * 2 + 1], the same code is generated as the peep optimization generates (both transform to the LSH), so I think this is independent of the peep.
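
A hypothetical C# illustration of that last point (method and parameter names invented):

static void Touch(int[] arr, int i)
{
    // Written with explicit multiplies, the JIT produces the same
    // LSH-based indexing as the peep's output shown above.
    int a = arr[i * 2];
    int b = arr[i * 2 + 1];
}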

@anthonycanino
Contributor Author

@kunalspathak can I get an update on this? I'm happy to discuss and make further changes.

@kunalspathak
Member

Sorry @anthonycanino for not getting to this. I will go through it in 1-2 days and let you know my feedback. Thanks for your patience.

Member
@kunalspathak left a comment

Minor change suggested.

Also, I went through the asmdiffs and they look as expected. There is scope to improve the arm64 code we generate for i * 3 (not in this PR, though). Today we generate:

            mov     w7, #3
            mul     w7, w1, w7

but it can be converted to a single instruction:

add         w0,w8,w8,lsl #1


// V0 + V0 ... + V0 becomes V0 * foldCount, where postorder transform will optimize
// accordingly
GenTree* lclVarTree = tree->AsOp()->gtOp2;
Member

Nit: Can you move these definitions before the while loop and then set op1, op2 from them?

GenTree* lclVarTree = tree->AsOp()->gtOp2;
GenTree* consTree   = tree->AsOp()->gtOp1;

GenTree *op1 = consTree, *op2 = lclVarTree;

while (...)  // the pattern-matching loop, as before
{
}

@ghost added the needs-author-action label (An issue or pull request that requires more info or actions from the author.) Dec 1, 2021
`gtReduceStrength` will reduce `i + i + i + i` to `i * 4`, which
will be optimized to `i << 2`. The reduction is not limited to
powers of two, as `i + i + i + i + i` will reduce to `i * 5`, which
will be optimized to `lea reg1, [reg2+4*reg2]` etc. on xarch.
A `GT_MUL` with `TYP_LONG` must go through the GT_MUL preorder
processing on non-64-bit arch. The change forces that to happen by
creating a new GenTree node if GT_ADD can be reduced to GT_MUL
and then running fgMorphTree on the new node.
Also cleaned up the code per suggestions regarding structure (moved
to morph.cpp) and the fgGlobalMorph check.
@ghost removed the needs-author-action label Dec 2, 2021
@anthonycanino
Contributor Author

@kunalspathak thanks for the feedback. I've made the change as requested.

Member
@kunalspathak left a comment

LGTM

@kunalspathak merged commit b84b62c into dotnet:main Dec 3, 2021
@ghost locked as resolved and limited conversation to collaborators Jan 2, 2022