Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix ARM64 unsigned div by const perf regression #57372

Closed
wants to merge 3 commits into from

Conversation

pentp
Copy link
Contributor

@pentp pentp commented Aug 13, 2021

This fixes #54166. I added a new GT_MULWIDE node for this, but then later realized that we should actually try to fix the GT_MUL arm64 codegen for such cases where both operands are 32-bit as a general solution that would benefit for example System.Decimal also (it does a lot of such multiplications). What do you think @AndyAyersMS?

Diff: https://www.diffchecker.com/9mkVf4se
Diff with baseline (no #52893): https://www.diffchecker.com/x2SRZJvW

@dotnet-issue-labeler dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Aug 13, 2021
@ghost ghost added the community-contribution Indicates that the PR has been added by a community member label Aug 13, 2021
@ghost
Copy link

ghost commented Aug 13, 2021

Tagging subscribers to this area: @JulieLeeMSFT
See info in area-owners.md if you want to be subscribed.

Issue Details

This fixes #54166. I added a new GT_MULWIDE node for this, but then later realized that we should actually try to fix the GT_MUL arm64 codegen for such cases where both operands are 32-bit as a general solution that would benefit for example System.Decimal also (it does a lot of such multiplications). What do you think @AndyAyersMS?

Diff: https://www.diffchecker.com/9mkVf4se
Diff with baseline (no #52893): https://www.diffchecker.com/x2SRZJvW

Author: pentp
Assignees: -
Labels:

area-CodeGen-coreclr

Milestone: -

Comment on lines +1820 to +1821
// Generate code to get the 64-bits of a 32*32 bit multiplication result
void CodeGen::genCodeForMulWide(GenTreeOp* treeNode)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't this a GT_MUL_LONG? We only use for 32 bit targets now, but it should be straightforward to use it on ARM64 too.

@AndyAyersMS
Copy link
Member

cc @dotnet/jit-contrib.

Can somebody more familiar with arm64 codegen PTAL. It would be nice to verify this fixes the regression, so likely one of us will need to be on tap to run the benchmarks. Anyone want to sign up to shepherd this fix?

we should actually try to fix the GT_MUL arm64 codegen...

I like the sound of this. Is it something you can work up reasonably quickly so we can assess how complicated it would be and whether it looks like a more attractive alternative?

@JulieLeeMSFT JulieLeeMSFT added this to the 6.0.0 milestone Aug 14, 2021
@JulieLeeMSFT
Copy link
Member

@echesakovMSFT PTAL.

@echesakov
Copy link
Contributor

I agree that having GT_MUL generating:

  • smull for s32 x s32 -> s64
  • umull for u32 x u32 -> u64
  • mul for i32 x i32 -> i32 and i64 x i64 -> i64

would be better that introducing an IR node that only gonna be used for a few cases like this one.
In fact, I opened issue #47490 that targets such work.

I will run the affected benchmarks on Arm64 device on Monday and post results here.

@EgorBo
Copy link
Member

EgorBo commented Aug 14, 2021

Heh, I had a prototype for #47490 where I also just removed casts and changed types like this:

MUL LONG
  CAST INT -> LONG
    X INT  
  CAST INT -> LONG
    Y INT  

to:

MUL LONG
  X INT  
  Y INT  

and recognized umull/smull in the codegen for it

@SingleAccretion
Copy link
Contributor

SingleAccretion commented Aug 14, 2021

We already have all the necessary logic to recognize such trees in morph, but it may need some refactoring to be usable in LIR (I do not think we want it in morph for ARM64, I think we want it in lowering). However, if you want the easy way out, there is a "dumbed down" debug-only (but still correct) version that will work on LIR as-is here (modulo the actual flag check - it can just be inlined for the 32 bit-only assert in morph). And we already have an oper that represents 32x32 -> 64 multiplication - GT_MUL_LONG. It would seem to me that the only missing piece is codegen (and LSRA, I suppose) support, which this PR already adds.

@pentp
Copy link
Contributor Author

pentp commented Aug 14, 2021

Closing this one as #57400 is better.

It includes the codegen change for umull/smull, but I don't know what's the correct place for removing casts (optNarrowTree in optimizer.cpp?), so I'll leave it to you @EgorBo as you already have a prototype.

@pentp pentp closed this Aug 14, 2021
@pentp pentp deleted the fix-arm64-udiv branch August 18, 2021 14:51
@ghost ghost locked as resolved and limited conversation to collaborators Sep 17, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI community-contribution Indicates that the PR has been added by a community member
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Perf] Regressions in Integer Formatting
6 participants