
Reduce repeated variable add operations to a single multiply. #61663

Merged
kunalspathak merged 9 commits into dotnet:main on Dec 3, 2021

Conversation

anthonycanino
Contributor

This PR addresses #34938.

Changes

gtReduceStrength will reduce i + i + i + i to i * 4, which will be optimized to i << 2. The reduction is not limited to powers of two: i + i + i + i + i will reduce to i * 5, which will be optimized to lea reg1, [reg2+4*reg2], etc., on xarch.
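
For concreteness, a minimal C# sketch of the shape the peephole targets (method names are invented for the example):

// Before the peep: chains of adds of the same local.
static int TimesFour(int i) => i + i + i + i;     // reduced to i * 4, emitted as i << 2
static int TimesFive(int i) => i + i + i + i + i; // reduced to i * 5, emitted as
                                                  // lea reg1, [reg2+4*reg2] on xarch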

Discussion

gtReduceStrength is invoked inside gtMorphSmpOp and kept as its own function for now, per the discussion on the public issue, but it can be inlined into the preorder processing block of gtMorphSmpOp.

Regarding the regression below, especially in Benchstone.BenchI.CSieve:Test():bool:this: it is caused by a range-checking limitation, and another PR I have opened, #61662, will fix that case.

Overall, the improvements are pretty minor here, which isn't surprising, as I wouldn't expect programmers writing benchmarks or libraries to write something like i + i + i + i. However, LLVM/MSVC perform similar optimizations (https://godbolt.org/z/4ednM7WaE).

Results

benchmarks.run.windows.x64.checked.mch:


Summary of Code Size diffs:
(Lower is better)

Total bytes of base: 7121565 (overridden on cmd)
Total bytes of diff: 7121582 (overridden on cmd)
Total bytes of delta: 17 (0.00 % of base)
    diff is a regression.
    relative diff is a regression.
Detail diffs


Top file regressions (bytes):
          15 : 23803.dasm (8.29% of base)
           1 : 14314.dasm (0.06% of base)
           1 : 26462.dasm (0.08% of base)

3 total files with Code Size differences (0 improved, 3 regressed), 2 unchanged.

Top method regressions (bytes):
          15 ( 8.29% of base) : 23803.dasm - Benchstone.BenchI.CSieve:Test():bool:this
           1 ( 0.08% of base) : 26462.dasm - Benchstone.BenchF.LLoops:Init():this
           1 ( 0.06% of base) : 14314.dasm - Benchstone.MDBenchF.MDLLoops:Init():this

Top method regressions (percentages):
          15 ( 8.29% of base) : 23803.dasm - Benchstone.BenchI.CSieve:Test():bool:this
           1 ( 0.08% of base) : 26462.dasm - Benchstone.BenchF.LLoops:Init():this
           1 ( 0.06% of base) : 14314.dasm - Benchstone.MDBenchF.MDLLoops:Init():this

3 total methods with Code Size differences (0 improved, 3 regressed), 2 unchanged.


coreclr_tests.pmi.windows.x64.checked.mch:


Summary of Code Size diffs:
(Lower is better)

Total bytes of base: 124633874 (overridden on cmd)
Total bytes of diff: 124633877 (overridden on cmd)
Total bytes of delta: 3 (0.00 % of base)
    diff is a regression.
    relative diff is a regression.
Detail diffs


Top file regressions (bytes):
          15 : 82551.dasm (10.34% of base)
           1 : 239451.dasm (0.06% of base)
           1 : 239429.dasm (0.08% of base)

Top file improvements (bytes):
          -6 : 82974.dasm (-0.56% of base)
          -5 : 239565.dasm (-0.04% of base)
          -1 : 245065.dasm (-0.14% of base)
          -1 : 84102.dasm (-0.77% of base)
          -1 : 240909.dasm (-0.16% of base)

8 total files with Code Size differences (5 improved, 3 regressed), 8 unchanged.

Top method regressions (bytes):
          15 (10.34% of base) : 82551.dasm - Benchstone.BenchI.CSieve:Bench():bool
           1 ( 0.08% of base) : 239429.dasm - Benchstone.BenchF.LLoops:Init():this
           1 ( 0.06% of base) : 239451.dasm - Benchstone.MDBenchF.MDLLoops:Init():this

Top method improvements (bytes):
          -6 (-0.56% of base) : 82974.dasm - $:gg(int,jb,jb):jb
          -5 (-0.04% of base) : 239565.dasm - ILGEN_0x372a9ae6:Method_0xdc6ff1a4(byte,byte,int,long,ushort,double,long,long):int
          -1 (-0.16% of base) : 240909.dasm - AA:Static1(byref,byref,long,System.Boolean[,,]):ushort
          -1 (-0.14% of base) : 245065.dasm - BB:Static1(System.Double[,,][],byref,int,byref)
          -1 (-0.77% of base) : 84102.dasm - TailRecursionWithOsrEntryInTry:F(int,int,int,int):int

Top method regressions (percentages):
          15 (10.34% of base) : 82551.dasm - Benchstone.BenchI.CSieve:Bench():bool
           1 ( 0.08% of base) : 239429.dasm - Benchstone.BenchF.LLoops:Init():this
           1 ( 0.06% of base) : 239451.dasm - Benchstone.MDBenchF.MDLLoops:Init():this

Top method improvements (percentages):
          -1 (-0.77% of base) : 84102.dasm - TailRecursionWithOsrEntryInTry:F(int,int,int,int):int
          -6 (-0.56% of base) : 82974.dasm - $:gg(int,jb,jb):jb
          -1 (-0.16% of base) : 240909.dasm - AA:Static1(byref,byref,long,System.Boolean[,,]):ushort
          -1 (-0.14% of base) : 245065.dasm - BB:Static1(System.Double[,,][],byref,int,byref)
          -5 (-0.04% of base) : 239565.dasm - ILGEN_0x372a9ae6:Method_0xdc6ff1a4(byte,byte,int,long,ushort,double,long,long):int

8 total methods with Code Size differences (5 improved, 3 regressed), 8 unchanged.


libraries.pmi.windows.x64.checked.mch:


Summary of Code Size diffs:
(Lower is better)

Total bytes of base: 45221486 (overridden on cmd)
Total bytes of diff: 45221486 (overridden on cmd)
Total bytes of delta: 0 (0.00 % of base)
Detail diffs


0 total files with Code Size differences (0 improved, 0 regressed), 4 unchanged.

0 total methods with Code Size differences (0 improved, 0 regressed), 4 unchanged.


libraries_tests.pmi.windows.x64.checked.mch:


Summary of Code Size diffs:
(Lower is better)

Total bytes of base: 113503162 (overridden on cmd)
Total bytes of diff: 113503171 (overridden on cmd)
Total bytes of delta: 9 (0.00 % of base)
    diff is a regression.
    relative diff is a regression.
Detail diffs


Top file regressions (bytes):
           3 : 196471.dasm (0.74% of base)
           3 : 329488.dasm (0.74% of base)
           3 : 194847.dasm (0.74% of base)

3 total files with Code Size differences (0 improved, 3 regressed), 9 unchanged.

Top method regressions (bytes):
           3 ( 0.74% of base) : 196471.dasm - <>c__DisplayClass19_0:<RsaOaepMaxSize>g__Test|0(System.Security.Cryptography.RSAEncryptionPadding,int):this
           3 ( 0.74% of base) : 329488.dasm - <>c__DisplayClass19_0:<RsaOaepMaxSize>g__Test|0(System.Security.Cryptography.RSAEncryptionPadding,int):this
           3 ( 0.74% of base) : 194847.dasm - <>c__DisplayClass19_0:<RsaOaepMaxSize>g__Test|0(System.Security.Cryptography.RSAEncryptionPadding,int):this

Top method regressions (percentages):
           3 ( 0.74% of base) : 196471.dasm - <>c__DisplayClass19_0:<RsaOaepMaxSize>g__Test|0(System.Security.Cryptography.RSAEncryptionPadding,int):this
           3 ( 0.74% of base) : 329488.dasm - <>c__DisplayClass19_0:<RsaOaepMaxSize>g__Test|0(System.Security.Cryptography.RSAEncryptionPadding,int):this
           3 ( 0.74% of base) : 194847.dasm - <>c__DisplayClass19_0:<RsaOaepMaxSize>g__Test|0(System.Security.Cryptography.RSAEncryptionPadding,int):this

3 total methods with Code Size differences (0 improved, 3 regressed), 9 unchanged.


@ghost added the community-contribution label (Indicates that the PR has been added by a community member) Nov 16, 2021
@dotnet-issue-labeler bot added the area-CodeGen-coreclr label (CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI) Nov 16, 2021
@ghost

ghost commented Nov 16, 2021

Tagging subscribers to this area: @JulieLeeMSFT
See info in area-owners.md if you want to be subscribed.

@EgorBo
Member

EgorBo commented Nov 16, 2021

Nice! Can you make sure this code is correctly folded:

int Test(int x)
{
    for (int i = 0; i < Vector<int>.Count; i++)
        x++;
    return x;
}

Currently, Vector<int>.Count is a hack to unroll loops, but we might enable it for more cases in the future.

Also, this case: #46257 (comment)

UPD Ah, it's a bit different - these are separate statements.

@anthonycanino
Contributor Author

UPD Ah, it's a bit different - these are separate statements.

Yes, though I am interested in the combination of strength reduction with loop unrolling in general. However, this is definitely a peephole optimization for a single add expression.

@kasperk81
Contributor

will it also reduce x - x - x - x - x to imul eax, ecx, -3?

@anthonycanino
Contributor Author

will it also reduce x - x - x - x - x to imul eax, ecx, -3?

No, it will only address replicated addition, at least as a first pass, per the discussion on #34938.
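
For reference, the arithmetic in the question above does hold for a left-associated chain; a hypothetical C# sketch (not something this PR implements):

static int NegativeTriple(int x)
{
    // ((((x - x) - x) - x) - x) == -3 * x, so a compiler could
    // legally emit imul eax, ecx, -3 here.
    return x - x - x - x - x;
}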

@anthonycanino marked this pull request as ready for review November 17, 2021 11:40
@kasperk81
Contributor

per the discussion on #34938.

How common is the pattern discussed in #34938? "0 improved, 0 regressed" in the BCL tells me it's not worth the complexity.

@anthonycanino
Contributor Author

The pattern pops up a little in the libraries tests (please see above); for the regression, please see #61662.

@kunalspathak @EgorBo, what do you think? Is this peep worth pursuing to match the behavior of LLVM/MSVC (https://godbolt.org/z/4ednM7WaE)? It may be a starting point for further strength-reduction patterns too, e.g., @kasperk81's pattern above.

@kunalspathak
Member

It is true that there are not many patterns in the .NET libraries that will benefit from this. I would still keep it, provided we understand the regressions happening in some benchmark methods and try to eliminate them.

This happens on both x64 and arm64.

Top method regressions (bytes):
          19 (12.93 % of base) : 26383.dasm - Benchstone.BenchI.CSieve:Test():bool:this
           4 (1.79 % of base) : 29297.dasm - StringSort:strsift(System.String[],int,int)

if (vnStore != nullptr)
{
    fgValueNumberTreeConst(consTree);
    fgValueNumberTree(morphed);
Contributor

The morphed tree should have the same VN as the original: morphed->SetVNsFromNode(tree). It is not a good idea, in general, to call fgValueNumberTree outside of VN; it has not been tested to work that way.

Also, there is a helper you can use here to update the VN of the constant tree: fgUpdateConstTreeValueNumber.

Also, if you don't intend to relax the fgGlobalMorph check under which this method currently runs, you don't need to maintain VNs at all.

Contributor Author

Thanks for pointing this out. This is mostly my unfamiliarity with the code: I didn't realize the difference between the fgGlobalMorph check and the CSE check, i.e., when the optimizations are valid in which phases.

Looking at the surrounding code in fgMorphSmpOp, I see cases where the optimizations are restricted to the global morph, and cases where they are not. Is there a recommended pattern here, or do you have any tips?

Contributor
@SingleAccretion commented Nov 19, 2021

Looking at the surrounding code in fgMorphSmpOp, I see cases where the optimizations are restricted to the global morph, and cases where they are not. Is there a recommended pattern here, or do you have any tips?

It is mostly a cost/benefit question.

But first, legality:

  • Only optimizations that preserve value numbers are allowed outside of fgGlobalMorph. Allowable VN manipulation is transferring an existing VN from an existing tree, and updating VNs for constants. Anything more complex is probably better left under fgGlobalMorph. This optimization can trivially preserve value numbers (the VN of the tree stays the same, as does the VN of the local, the VN of the constant gets updated).

Allowing optimizations outside of global morph is beneficial because the optimization phases discover new facts that help peepholes, primarily via constant propagation. If your optimization is unlikely to benefit from that, it is likely better to leave it under fgGlobalMorph and not worry about VN maintenance / CSEs. Otherwise, it is up to the judgement of the implementer and reviewers what decision is "better" CQ-vs-risk-of-bugs-wise (which, I note, we discover on a somewhat regular basis...).

I personally think it is fine to leave this under fgGlobalMorph and delete the VN code. The only new fact that optimizations can discover and which can help the transformation is a local being substituted via copy propagation, but it seems like a rather rare event, and we do not remorph in copy propagation, so it would take a random CSE/assertion to force the remorph and realize the benefit.

(Heh, I should add this to my guide.)

GenTree* consTree = tree->AsOp()->gtOp1;
consTree->BashToConst(foldCount, lclVarTree->TypeGet());

/* GT_MUL operators can later be transformed into 'GT_CALL' */
Contributor

Suggested change
/* GT_MUL operators can later be transformed into 'GT_CALL' */
// GT_MUL operators can later be transformed into 'GT_CALL'.

Nit: we prefer //-style comments.

This suggests to me that this is actually a pessimization in this case. We should probably bail if we see 64-bit ADDs on a 32-bit target (the only case where we use helpers). We may miss some power-of-two cases, but that is the price of doing this in pre-order, I suppose.
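
As a hypothetical C# illustration of the case to bail on (the helper expansion described is the general 32-bit long-multiply path, not a claim about this exact tree shape):

static long TimesThree(long l)
{
    // On a 32-bit target, reducing this to l * 3 creates a TYP_LONG GT_MUL,
    // which the JIT may expand into a runtime multiply helper call - likely
    // worse than the two adds it replaces.
    return l + l + l;
}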

GenTree* op1 = tree->AsOp()->gtOp1;
GenTree* op2 = tree->AsOp()->gtOp2;

if (!op2->OperIs(GT_LCL_VAR) || !varTypeIsIntegralOrI(op2))
Contributor

Suggested change
if (!op2->OperIs(GT_LCL_VAR) || !varTypeIsIntegralOrI(op2))
if (!op2->OperIs(GT_LCL_VAR) || !varTypeIsIntegral(op2))

It is not valid to multiply byrefs/refs.

return tree;
}

genTreeOps oper = tree->OperGet();
Contributor

Unused?

GenTree* Compiler::gtReduceStrength(GenTree* tree)
{
// ADD(_, V0) starts the pattern match
if (tree->OperGet() != GT_ADD || tree->gtOverflow() || optValnumCSE_phase)
Contributor

Suggested change
if (tree->OperGet() != GT_ADD || tree->gtOverflow() || optValnumCSE_phase)
if (!tree->OperIs(GT_ADD) || tree->gtOverflow() || optValnumCSE_phase)

Nit: why not use OperIs everywhere?

The optValnumCSE_phase check is redundant with the fgGlobalMorph guard under which the method is called. I suggest removing it and instead asserting that we are in global morph, if you don't intend to relax that restriction. If you do, then optValnumCSE_phase should, obviously, stay.

// Return Value:
// reduced tree if pattern matches, original tree otherwise
//
GenTree* Compiler::gtReduceStrength(GenTree* tree)
Contributor

Nit: the method's name is very general - is this intended to be extended? If it is to stay as it is, it should probably be renamed to something more specific (there is a pattern of fgOptimizeThing in morph) and moved out of gentree.cpp into morph.cpp (not much of a difference these days, I know, but let's not make it worse 😄).

}

int foldCount = 0;
unsigned int lclNum = op2->AsLclVarCommon()->GetLclNum();
Contributor

Suggested change
unsigned int lclNum = op2->AsLclVarCommon()->GetLclNum();
unsigned lclNum = op2->AsLclVarCommon()->GetLclNum();

Nit: we prefer bare unsigned for lclNums.

@anthonycanino
Contributor Author

For the regression (at least in Benchstone.BenchI.CSieve:Test), we see it happen because the RangeCheck analysis only handles certain cases. The following code is roughly taken from the CSieve test...

bool[] flags = new bool[Size + 1];

for (int i = 2; i <= Size; i++)
{
  for (int k = i + i; k <= Size; k += i)
  {
    flags[k] = false;
  }
}

becomes

bool[] flags = new bool[Size + 1];

for (int i = 2; i <= Size; i++)
{
  for (int k = i * 2; k <= Size; k += i)  // <-- i + i becomes i * 2
  {
    flags[k] = false;
  }
}

The RangeCheck analysis does not analyze through a GT_LSH(V0, 1) or GT_MUL(V0, 2), so the subsequent flags[k] access contains an extra array bounds check where it otherwise wouldn't have one.

I created a separate PR to enhance the RangeCheck analysis for this case, at #61662.

I'll investigate the asm diffs generated through the CI superpmi and report back to double check the AMD case.

Comment on lines 18214 to 18216
// GT_MUL operators can later be transformed into 'GT_CALL'.
GenTree* morphed =
new (this, GT_CALL) GenTreeOp(GT_MUL, tree->TypeGet(), lclVarTree, consTree DEBUGARG(/*largeNode*/ true));
Contributor

No longer needed (gtNewOperNode can be used instead)?

(This will need a comment update above, where we give up on 64-bit MULs on 32-bit targets, noting that it is now for both CQ and correctness reasons.)

@@ -6388,6 +6388,9 @@ class Compiler
GenTree* fgMorphCommutative(GenTreeOp* tree);
GenTree* fgMorphCastedBitwiseOp(GenTreeOp* tree);

// Reduce successive add operations to a single multiply, i + i + i + i => i * 4
Contributor

Suggested change
// Reduce successive add operations to a single multiply, i + i + i + i => i * 4

Nit: I don't think it is worth repeating the method header. Such duplication tends to lead to stale comments (as people naturally forget to update both places).

@@ -11581,6 +11581,14 @@ GenTree* Compiler::fgMorphSmpOp(GenTree* tree, MorphAddrContext* mac)
break;
}

// Perform strength reduction prior to preorder op1, op2 operand processing
Contributor

Nit: I am not convinced adding the comment here has a lot of value; it is fairly obvious from the code what is going on.

@anthonycanino
Contributor Author

@SingleAccretion I've addressed your comments in the last commits. Should have covered everything.

@kunalspathak I spent some time digging into the regressions to better understand why they happen with this peep. The following is a detailed description of the benchmark .mch results for Windows x64, x86, and Linux x64. From the spmi artifact, it looks like arm64 and arm exhibit the same or similar behavior. I can dig further there if needed, but I think what I have will cover a lot. It's mostly the following patterns. Also note that I ran these results with JitAlignLoop=0. Please let me know your thoughts.

Results

Windows x64 Benchmarks


Summary of Code Size diffs:
(Lower is better)

Total bytes of base: 7063028 (overridden on cmd)
Total bytes of diff: 7063049 (overridden on cmd)
Total bytes of delta: 21 (0.00 % of base)
    diff is a regression.
    relative diff is a regression.
Detail diffs


Top file regressions (bytes):
          19 : 23427.dasm (11.45% of base)
           1 : 14074.dasm (0.07% of base)
           1 : 25562.dasm (0.08% of base)

3 total files with Code Size differences (0 improved, 3 regressed), 2 unchanged.

Top method regressions (bytes):
          19 (11.45% of base) : 23427.dasm - Benchstone.BenchI.CSieve:Test():bool:this
           1 ( 0.08% of base) : 25562.dasm - Benchstone.BenchF.LLoops:Init():this
           1 ( 0.07% of base) : 14074.dasm - Benchstone.MDBenchF.MDLLoops:Init():this

Top method regressions (percentages):
          19 (11.45% of base) : 23427.dasm - Benchstone.BenchI.CSieve:Test():bool:this
           1 ( 0.08% of base) : 25562.dasm - Benchstone.BenchF.LLoops:Init():this
           1 ( 0.07% of base) : 14074.dasm - Benchstone.MDBenchF.MDLLoops:Init():this

3 total methods with Code Size differences (0 improved, 3 regressed), 2 unchanged.


Windows x86 Benchmarks


Summary of Code Size diffs:
(Lower is better)

Total bytes of base: 5654101 (overridden on cmd)
Total bytes of diff: 5654095 (overridden on cmd)
Total bytes of delta: -6 (-0.00 % of base)
    diff is an improvement.
    relative diff is a regression.
Detail diffs


Top file regressions (bytes):
          16 : 6565.dasm (2.55% of base)
          14 : 21015.dasm (10.85% of base)
           1 : 13147.dasm (0.08% of base)

Top file improvements (bytes):
         -32 : 22560.dasm (-6.57% of base)
          -3 : 14050.dasm (-0.26% of base)
          -2 : 22550.dasm (-0.19% of base)

6 total files with Code Size differences (3 improved, 3 regressed), 2 unchanged.

Top method regressions (bytes):
          16 ( 2.55% of base) : 6565.dasm - BitOps:DoBitfieldIteration(System.Int32[],System.Int32[],int,byref):long
          14 (10.85% of base) : 21015.dasm - Benchstone.BenchI.CSieve:Test():bool:this
           1 ( 0.08% of base) : 13147.dasm - Benchstone.MDBenchF.MDLLoops:Init():this

Top method improvements (bytes):
         -32 (-6.57% of base) : 22560.dasm - NumericSortRect:NumSift(System.Int32[,],int,int,int)
          -3 (-0.26% of base) : 14050.dasm - System.Numerics.BigIntegerCalculator:Square(System.ReadOnlySpan`1[UInt32],System.Span`1[UInt32])
          -2 (-0.19% of base) : 22550.dasm - Benchstone.BenchF.LLoops:Init():this

Top method regressions (percentages):
          14 (10.85% of base) : 21015.dasm - Benchstone.BenchI.CSieve:Test():bool:this
          16 ( 2.55% of base) : 6565.dasm - BitOps:DoBitfieldIteration(System.Int32[],System.Int32[],int,byref):long
           1 ( 0.08% of base) : 13147.dasm - Benchstone.MDBenchF.MDLLoops:Init():this

Top method improvements (percentages):
         -32 (-6.57% of base) : 22560.dasm - NumericSortRect:NumSift(System.Int32[,],int,int,int)
          -3 (-0.26% of base) : 14050.dasm - System.Numerics.BigIntegerCalculator:Square(System.ReadOnlySpan`1[UInt32],System.Span`1[UInt32])
          -2 (-0.19% of base) : 22550.dasm - Benchstone.BenchF.LLoops:Init():this

6 total methods with Code Size differences (3 improved, 3 regressed), 2 unchanged.


Linux x64 Benchmarks


Summary of Code Size diffs:
(Lower is better)

Total bytes of base: 10242432 (overridden on cmd)
Total bytes of diff: 10242452 (overridden on cmd)
Total bytes of delta: 20 (0.00 % of base)
	diff is a regression.
	relative diff is a regression.
Detail diffs


Top file regressions (bytes):
      	14 : 27166.dasm (9.524% of base)
       	4 : 29766.dasm (1.786% of base)
       	1 : 16678.dasm (0.069% of base)
       	1 : 29522.dasm (0.079% of base)

4 total files with Code Size differences (0 improved, 4 regressed), 2 unchanged.

Top method regressions (bytes):
      	14 (9.524% of base) : 27166.dasm - Benchstone.BenchI.CSieve:Test():bool:this
       	4 (1.786% of base) : 29766.dasm - StringSort:strsift(System.String[],int,int)
       	1 (0.079% of base) : 29522.dasm - Benchstone.BenchF.LLoops:Init():this
       	1 (0.069% of base) : 16678.dasm - Benchstone.MDBenchF.MDLLoops:Init():this

Top method regressions (percentages):
      	14 (9.524% of base) : 27166.dasm - Benchstone.BenchI.CSieve:Test():bool:this
       	4 (1.786% of base) : 29766.dasm - StringSort:strsift(System.String[],int,int)
       	1 (0.079% of base) : 29522.dasm - Benchstone.BenchF.LLoops:Init():this
       	1 (0.069% of base) : 16678.dasm - Benchstone.MDBenchF.MDLLoops:Init():this

4 total methods with Code Size differences (0 improved, 4 regressed), 2 unchanged.


Explanations

Benchstone.BenchI.CSieve:Test():bool:this

Due to the introduction of a range check that was otherwise eliminated by the RangeCheck analysis. Please see #61662 for a detailed explanation. If accepted, that PR should eliminate this regression (and has done so in my local tests).


Benchstone.MDBenchF.MDLLoops:Init():this

Init code has this pattern...

_u1[j,k,l] = k;
_u2[j,k,l] = k + k;
_u3[j,k,l] = k + k + k;

which the peep will transform to (essentially)

_u1[j,k,l] = k;
_u2[j,k,l] = k * 2;
_u3[j,k,l] = k * 3;

which effectively transforms

vcvtsi2sd   xmm0, eax
lea         r9d, [rax+rax]
vxorps      xmm1, xmm1
vcvtsi2sd   xmm1, r9d
add         r9d, eax
vxorps      xmm2, xmm2
vcvtsi2sd   xmm2, r9d

to

vcvtsi2sd   xmm0, eax
lea         r9d, [rax+rax]
vxorps      xmm1, xmm1
vcvtsi2sd   xmm1, r9d
lea         r9d, [rax+2*rax]    // <-- HERE
vxorps      xmm2, xmm2
vcvtsi2sd   xmm2, r9d

so in this case, the strength reduction interferes with CSE.
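
Schematically (a C#-like sketch of what CSE could reuse before and after; the temp is invented for illustration):

// Before the peep: k + k is a subexpression of k + k + k,
// so its value can be reused.
int t = k + k;   // lea r9d, [rax+rax]
u2 = t;
u3 = t + k;      // add r9d, eax  (reuses t)

// After the peep: k * 3 shares nothing with k * 2,
// so it is computed fresh.
u2 = k * 2;      // lea r9d, [rax+rax]
u3 = k * 3;      // lea r9d, [rax+2*rax]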


Benchstone.BenchF.LLoops:Init():this

Similar to MDLLoops, with the relevant code being

_u1[j][k][l] = k;
_u2[j][k][l] = k + k;
_u3[j][k][l] = k + k + k;

except that, with the way code is generated on x86, an extra lea gets eliminated, so it's an improvement in that case, i.e.,

lea         ebx, [eax+eax]
vxorps      xmm1, xmm1
vcvtsi2sd   xmm1, ebx
lea         ebx, [eax+eax]
add         ebx, eax

to

lea         edi, [eax+eax]
vxorps      xmm1, xmm1
vcvtsi2sd   xmm1, edi
lea         edi, [eax+2*eax]

StringSort:strsift(System.String[],int,int)

The generated code is effectively the same, even when the i + i is transformed into i << 1, from the following code:

private static void strsift(string[] array, int i, int j)
{
    int k;
    string temp;

    while ((i + i) <= j)
    {
        k = i + i;
        // ...

The reason for the change in bytes is that i + i => i << 1 reduces the number of uses of i in the function, and I believe this change resulted in different register allocation for arg0 and arg1 (array and i): r14 and rbx respectively before the peep, rbx and r14 after. This shifted which instructions needed an extra REX prefix, and the net change is +4 bytes.


BitOps:DoBitfieldIteration(System.Int32[],System.Int32[],int,byref):long

The relevant code is i + i and i + i + 1 inside an index expression, e.g.,

for (i = 0; i < bitoparraysize; i++)
{
    switch (i % 3)
    {
        case 0: /* Set run of bits */
            ToggleBitRun(bitarraybase, bitoparraybase[i + i], bitoparraybase[i + i + 1], 1);
            break;
            // ...

where the transformed code, arr[i << 1] and arr[(i << 1) + 1], changes how gtMarkAddrMode reads the index expression. When the index expression is an LSH, it gets marked with GTF_ADDRMODE_NO_CSE, which changes some code from

lea      eax, [edx+edx]
cmp      eax, dword ptr [edi+4]
jae      G_M4310_IG24
mov      esi, dword ptr [edi+4*eax+8]
inc      eax
cmp      eax, dword ptr [edi+4]

to

lea      ecx, [edx+edx]
cmp      ecx, dword ptr [edi+4]
jae      G_M4310_IG24
mov      eax, dword ptr [edi+8*edx+8]
lea      ecx, [2*edx+1]
cmp      ecx, dword ptr [edi+4]

Note that if the index expressions are written as arr[i * 2] and arr[i * 2 + 1], the same code is generated as the peep optimization generates (both transform to the LSH), so I think this is independent of the peep.
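
A hypothetical C# illustration of that last point (method and parameter names invented):

static void Touch(int[] arr, int i)
{
    // Written with explicit multiplies, the JIT produces the same
    // LSH-based indexing as the peep's output shown above.
    int a = arr[i * 2];
    int b = arr[i * 2 + 1];
}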

@anthonycanino
Contributor Author

@kunalspathak can I get an update on this? I'm happy to discuss and make further changes.

@kunalspathak
Member

Sorry @anthonycanino for not getting to this. I will go through it in 1-2 days and let you know my feedback. Thanks for your patience.

Member
@kunalspathak left a comment

Minor change suggested.

Also, I went through the asmdiffs and they look as expected. There is scope to improve the arm64 code we generate for i * 3 (not in this PR, though). Today we generate:

            mov     w7, #3
            mul     w7, w1, w7

but it can be converted to a single instruction:

add         w0,w8,w8,lsl #1


// V0 + V0 ... + V0 becomes V0 * foldCount, where postorder transform will optimize
// accordingly
GenTree* lclVarTree = tree->AsOp()->gtOp2;
Member

Nit: Can you move these definitions before the while loop and then set op1, op2 from them?

GenTree* lclVarTree = tree->AsOp()->gtOp2;
GenTree* consTree   = tree->AsOp()->gtOp1;

GenTree *op1 = consTree, *op2 = lclVarTree;

while (...)  // the pattern-matching loop, as before
{
}

@ghost added the needs-author-action label (An issue or pull request that requires more info or actions from the author.) Dec 1, 2021
`gtReduceStrength` will reduce `i + i + i + i` to `i * 4`, which
will be optimized to `i << 2`. The reduction is not limited to
powers of two, as `i + i + i + i + i` will reduce to `i * 5`, which
will be optimized to `lea reg1, [reg2+4*reg2]` etc. on xarch.
A `GT_MUL` with `TYP_LONG` must go through the GT_MUL preorder
processing on non-64-bit arch. The change forces that to happen by
creating a new GenTree node if GT_ADD can be reduced to GT_MUL
and then running fgMorphTree on the new node.
Also cleaned up the code per suggestions regarding structure (moved
to morph.cpp) and the fgGlobalMorph check.
@ghost removed the needs-author-action label Dec 2, 2021
@anthonycanino
Contributor Author

@kunalspathak thanks for the feedback. I've made the change as requested.

Member
@kunalspathak left a comment

LGTM

@kunalspathak merged commit b84b62c into dotnet:main Dec 3, 2021
@ghost locked as resolved and limited conversation to collaborators Jan 2, 2022