Fix ARM64 unsigned div by const perf regression #57372

pentp · 2021-08-13T20:28:05Z

This fixes #54166. I added a new GT_MULWIDE node for this, but then later realized that we should actually try to fix the GT_MUL arm64 codegen for such cases where both operands are 32-bit as a general solution that would benefit for example System.Decimal also (it does a lot of such multiplications). What do you think @AndyAyersMS?

Diff: https://www.diffchecker.com/9mkVf4se
Diff with baseline (no #52893): https://www.diffchecker.com/x2SRZJvW

ghost · 2021-08-13T20:28:12Z

Tagging subscribers to this area: @JulieLeeMSFT
See info in area-owners.md if you want to be subscribed.

Issue Details

This fixes #54166. I added a new GT_MULWIDE node for this, but then later realized that we should actually try to fix the GT_MUL arm64 codegen for such cases where both operands are 32-bit as a general solution that would benefit for example System.Decimal also (it does a lot of such multiplications). What do you think @AndyAyersMS?

Diff: https://www.diffchecker.com/9mkVf4se
Diff with baseline (no #52893): https://www.diffchecker.com/x2SRZJvW

Author:	pentp
Assignees:	-
Labels:	`area-CodeGen-coreclr`
Milestone:	-

SingleAccretion · 2021-08-13T20:36:48Z

src/coreclr/jit/codegenarm64.cpp

+// Generate code to get the 64-bits of a 32*32 bit multiplication result
+void CodeGen::genCodeForMulWide(GenTreeOp* treeNode)


Isn't this a GT_MUL_LONG? We only use for 32 bit targets now, but it should be straightforward to use it on ARM64 too.

AndyAyersMS · 2021-08-13T20:51:34Z

cc @dotnet/jit-contrib.

Can somebody more familiar with arm64 codegen PTAL. It would be nice to verify this fixes the regression, so likely one of us will need to be on tap to run the benchmarks. Anyone want to sign up to shepherd this fix?

we should actually try to fix the GT_MUL arm64 codegen...

I like the sound of this. Is it something you can work up reasonably quickly so we can assess how complicated it would be and whether it looks like a more attractive alternative?

JulieLeeMSFT · 2021-08-14T00:36:41Z

@echesakovMSFT PTAL.

echesakov · 2021-08-14T01:56:09Z

I agree that having GT_MUL generating:

smull for s32 x s32 -> s64
umull for u32 x u32 -> u64
mul for i32 x i32 -> i32 and i64 x i64 -> i64

would be better that introducing an IR node that only gonna be used for a few cases like this one.
In fact, I opened issue #47490 that targets such work.

I will run the affected benchmarks on Arm64 device on Monday and post results here.

EgorBo · 2021-08-14T09:07:03Z

Heh, I had a prototype for #47490 where I also just removed casts and changed types like this:

MUL LONG
  CAST INT -> LONG
    X INT  
  CAST INT -> LONG
    Y INT

to:

MUL LONG
  X INT  
  Y INT

and recognized umull/smull in the codegen for it

SingleAccretion · 2021-08-14T11:30:02Z

We already have all the necessary logic to recognize such trees in morph, but it may need some refactoring to be usable in LIR (I do not think we want it in morph for ARM64, I think we want it in lowering). However, if you want the easy way out, there is a "dumbed down" debug-only (but still correct) version that will work on LIR as-is here (modulo the actual flag check - it can just be inlined for the 32 bit-only assert in morph). And we already have an oper that represents 32x32 -> 64 multiplication - GT_MUL_LONG. It would seem to me that the only missing piece is codegen (and LSRA, I suppose) support, which this PR already adds.

pentp · 2021-08-14T15:07:43Z

Closing this one as #57400 is better.

It includes the codegen change for umull/smull, but I don't know what's the correct place for removing casts (optNarrowTree in optimizer.cpp?), so I'll leave it to you @EgorBo as you already have a prototype.

Fix ARM64 unsigned div by const perf regression

976bf2a

dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Aug 13, 2021

ghost added the community-contribution Indicates that the PR has been added by a community member label Aug 13, 2021

SingleAccretion reviewed Aug 13, 2021

View reviewed changes

Try fix formatting

6dba123

JulieLeeMSFT requested a review from echesakov August 14, 2021 00:36

JulieLeeMSFT added this to the 6.0.0 milestone Aug 14, 2021

Try to make formatting happy

9a9d09d

pentp mentioned this pull request Aug 14, 2021

Fix ARM64 unsigned div by const perf regression #57400

Merged

pentp closed this Aug 14, 2021

pentp deleted the fix-arm64-udiv branch August 18, 2021 14:51

SingleAccretion mentioned this pull request Aug 23, 2021

Implement GT_MUL_LONG for ARM64 #57926

Merged

ghost locked as resolved and limited conversation to collaborators Sep 17, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix ARM64 unsigned div by const perf regression #57372

Fix ARM64 unsigned div by const perf regression #57372

pentp commented Aug 13, 2021

ghost commented Aug 13, 2021

SingleAccretion Aug 13, 2021

AndyAyersMS commented Aug 13, 2021

JulieLeeMSFT commented Aug 14, 2021

echesakov commented Aug 14, 2021

EgorBo commented Aug 14, 2021 •

edited

Loading

SingleAccretion commented Aug 14, 2021 •

edited

Loading

pentp commented Aug 14, 2021

		// Generate code to get the 64-bits of a 32*32 bit multiplication result
		void CodeGen::genCodeForMulWide(GenTreeOp* treeNode)

Fix ARM64 unsigned div by const perf regression #57372

Fix ARM64 unsigned div by const perf regression #57372

Conversation

pentp commented Aug 13, 2021

ghost commented Aug 13, 2021

SingleAccretion Aug 13, 2021

Choose a reason for hiding this comment

AndyAyersMS commented Aug 13, 2021

JulieLeeMSFT commented Aug 14, 2021

echesakov commented Aug 14, 2021

EgorBo commented Aug 14, 2021 • edited Loading

SingleAccretion commented Aug 14, 2021 • edited Loading

pentp commented Aug 14, 2021

EgorBo commented Aug 14, 2021 •

edited

Loading

SingleAccretion commented Aug 14, 2021 •

edited

Loading