Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Substitute GT_RET_EXPR in inline candidate arguments #69117

Merged

Conversation

jakobbotsch
Copy link
Member

These GT_RET_EXPR nodes would previously make it into the inlinees as part of the inline argument table. This was confusing to me, and it additionally means that we call gtRetExprVal() in a few places to compensate (while for non-toplevel cases it would essentially cause the argument node containing the GT_RET_EXPR to become "opaque" to the JIT). With this change we only ever expect to substitute GT_RET_EXPR inside fgUpdateInlineReturnExpressionPlaceHolder called from the main fgInline loop, so I have inlined gtRetExprVal() here and removed that function.

@ghost ghost assigned jakobbotsch May 10, 2022
@dotnet-issue-labeler dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label May 10, 2022
@ghost
Copy link

ghost commented May 10, 2022

Tagging subscribers to this area: @JulieLeeMSFT
See info in area-owners.md if you want to be subscribed.

Issue Details

These GT_RET_EXPR nodes would previously make it into the inlinees as part of the inline argument table. This was confusing to me, and it additionally means that we call gtRetExprVal() in a few places to compensate (while for non-toplevel cases it would essentially cause the argument node containing the GT_RET_EXPR to become "opaque" to the JIT). With this change we only ever expect to substitute GT_RET_EXPR inside fgUpdateInlineReturnExpressionPlaceHolder called from the main fgInline loop, so I have inlined gtRetExprVal() here and removed that function.

Author: jakobbotsch
Assignees: jakobbotsch
Labels:

area-CodeGen-coreclr

Milestone: -

Comment on lines 141 to 147
// Yes. We may still have GT_RET_EXPR in arguments,
// substitute those here so the inlinee does not need to
// deal with them.
// Note that we leave late devirts in arguments for when we
// handle inlined statements as fgLateDevirtualization
// cannot handle being called twice on a late devirt
// candidate.
Copy link
Member Author

@jakobbotsch jakobbotsch May 10, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not ideal, in a case where we could late-devirt an argument containing a call you could imagine doing so expose valuable information to the inlinee for whole-program optimization. We might consider looking into how to fix this if we decide to invest in interprocedural analysis.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Decided to fix this, didn't turn out to be too difficult.

@jakobbotsch
Copy link
Member Author

/azp run Fuzzlyn, runtime-coreclr libraries-jitstress

@azure-pipelines
Copy link

Azure Pipelines successfully started running 2 pipeline(s).

@jakobbotsch
Copy link
Member Author

/azp run Fuzzlyn, runtime-coreclr libraries-jitstress

@azure-pipelines
Copy link

Azure Pipelines successfully started running 2 pipeline(s).

@jakobbotsch
Copy link
Member Author

Fuzzlyn failures are preexisting. libraries-jitstress failures are #69154 and #68803.

Diffs. Generally we see small improvements in the non-test code, and some regressions in the test code. Presumably these are because we now have fewer "opaque" nodes during inlining. I will spot check these when I have some more time, but I don't think we need to block the review on them.

There are some TP regressions, but they are mostly paid for by #69120 that I noticed while working on this. I think there are additional improvements to be made in the future; in particular, we currently walk the argument nodes once when we see the inline candidate, and then another time after they are added as part of inlining (because fgInlinePrependStatements adds them after the inline candidate call). So I think we can afford this given that it makes follow-up improvements simpler and generally simplifies where GT_RET_EXPR appears.

cc @dotnet/jit-contrib PTAL @AndyAyersMS

@jakobbotsch jakobbotsch requested a review from AndyAyersMS May 10, 2022 19:45
@AndyAyersMS
Copy link
Member

I am curious about what's behind some of the larger diffs here, so please do take a look

         172 (15.45 % of base) : 21274.dasm - RequestUtilities:AddHeader(HttpRequestMessage,String,StringValues)

@jakobbotsch
Copy link
Member Author

I am curious about what's behind some of the larger diffs here, so please do take a look

Hmm yes. And after looking at the diff summary again I can see that

Generally we see small improvements in the non-test code, and some regressions in the test code.

is no longer true. That's interesting because it was true before my latest commit 41e0af2, that changes late devirt to happen on the call arguments as well. These were the diffs before that.

Copy link
Member

@AndyAyersMS AndyAyersMS left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changes look good.

I wonder if the diffs here are because this change uncovers more devirt + inlines?

@@ -21299,6 +21300,7 @@ void Compiler::impDevirtualizeCall(GenTreeCall* call,
// it's a union field used for other things by virtual
// stubs)
call->gtInlineCandidateInfo = nullptr;
call->gtCallMoreFlags &= ~GTF_CALL_M_LATE_DEVIRT;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we should rename this flag? GTF_CALL_M_HAS_LATE_DEVIRT_INFO perhaps? I introduced it but the name could better describe what it means.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@jakobbotsch
Copy link
Member Author

jakobbotsch commented May 10, 2022

I am curious about what's behind some of the larger diffs here, so please do take a look

         172 (15.45 % of base) : 21274.dasm - RequestUtilities:AddHeader(HttpRequestMessage,String,StringValues)

The difference here is that we do another inline because of:

-Inline candidate callsite is boring.  Multiplier increased to 2.3.
+Inline candidate callsite is warm.  Multiplier increased to 3.

This happens because we now do the following substitution:

+Replacing the return expression placeholder [000164] with [000812]
+               [000164] --C--------                         ▌  RET_EXPR  ref   (inl return expr [000812])
+
+Inserting the inline return expression
+               [000812] ---XG------                         ▌  FIELD     ref    _headers
+               [000811] -----------                         └──▌  LCL_VAR   ref    V13 tmp6         
+
 Expanding INLINE_CANDIDATE in statement STMT00050 in BB105:
 STMT00050 ( ??? ... ??? )
                [000167] I-C-G------                         ▌  CALL nullcheck int    HttpHeaders.TryAddWithoutValidation (exactContextHnd=0x00007FFC32571731)
-               [000164] --C-------- this                    ├──▌  RET_EXPR  ref   (inl return expr [000812])
+               [000812] ---XG------ this                    ├──▌  FIELD     ref    _headers
+               [000811] -----------                         │  └──▌  LCL_VAR   ref    V13 tmp6         
                [000165] ----------- arg1                    ├──▌  LCL_VAR   ref    V01 arg1         
                [000166] ----------- arg2                    └──▌  LCL_VAR   ref    V03 loc0         

And that RET_EXPR node has BBF_PROF_WEIGHT which we propagate in fgUpdateInlineReturnExpressionPlaceHolder.
If I understand correctly, the same thing would have happened if the RET_EXPR was in the statement before the inline candidate, so this doesn't seem bad to me.
I'll spot check some more diffs.

@jakobbotsch
Copy link
Member Author

The above may also explain parts of the TP diff, since if we do more inlines we expect to spend more time.

@jakobbotsch
Copy link
Member Author

jakobbotsch commented May 10, 2022

In asm.benchmarks.run.windows.x64.checked 8113 we have

-; 0 inlinees with PGO data; 59 single block inlinees; 39 inlinees without PGO data
+; 0 inlinees with PGO data; 69 single block inlinees; 48 inlinees without PGO data

The cause is that we create a couple fewer locals, which means in the new version we do one less lva table grow (which has a *= 2 multiplier). Then we end up with a bunch of extra inlines due to space in the lva table:

-Caller has 148 locals.  Multiplier decreased to 9.6668.
+Caller has 98 locals.  Multiplier decreased to 10.2186.
 calleeNativeSizeEstimate=1122
 callsiteNativeSizeEstimate=115
-benefit multiplier=9.6668
+benefit multiplier=10.2186
-threshold=1111
+threshold=1175
-Native estimate for function size exceeds threshold for inlining 112.2 > 111.1 (multiplier = 9.6668)
+Native estimate for function size is within threshold for inlining 112.2 <= 117.5 (multiplier = 10.2186)

Seems strange that we are using the capacity here:

if (m_RootCompiler->lvaTableCnt > 64)
{
// E.g. MaxLocalsToTrack = 1024 and lvaTableCnt = 512 -> multiplier *= 0.5;
const double lclFullness = min(1.0, (double)m_RootCompiler->lvaTableCnt / JitConfig.JitMaxLocalsToTrack());
multiplier *= (1.0 - lclFullness);
JITDUMP("\nCaller has %d locals. Multiplier decreased to %g.", m_RootCompiler->lvaTableCnt, multiplier);
}

Should this be using lvaCount instead?

@AndyAyersMS
Copy link
Member

Should this be using lvaCount instead?

I think this is something @EgorBo added... yeah, it seems odd to check capacity and not how much is actually in use.

@jakobbotsch
Copy link
Member Author

In aspnet.benchmarks.run.windows.x64.checked 8840 we also have

-; 0 inlinees with PGO data; 243 single block inlinees; 63 inlinees without PGO data
+; 0 inlinees with PGO data; 243 single block inlinees; 67 inlinees without PGO data

Cause is the same as above, we end up with fewer locals and decide to inline more because of it.

In 10502 (System.Text.Json.JsonSerializer:ReadFromSpan) we end up with one fewer statement after an inline because we fold it early and realize it's a constant:

 INLINER: during 'fgInline' result 'success' reason 'aggressive inline attribute' for 'System.Text.Json.JsonSerializer:ReadFromSpan(System.ReadOnlySpan`1[Char],System.Text.Json.Serialization.Metadata.JsonTypeInfo):MicroBenchmarks.Serializers.SimpleStructWithProperties' calling 'System.Runtime.CompilerServices.Unsafe:SizeOf():int'
 INLINER: during 'fgInline' result 'success' reason 'aggressive inline attribute'
+
+Replacing the return expression placeholder [000347] with [000391]
+               [000347] --C--------                         ▌  RET_EXPR  int   (inl return expr [000391])
+
+Inserting the inline return expression
+               [000391] -----------                         ▌  CNS_INT   int    1
+
+
+Folding long operator with constant nodes into a constant:
+               [000348] --C--------                         ▌  CAST      long <- int
+               [000391] -----------                         └──▌  CNS_INT   int    1
+Bashed to long constant:
+               [000348] -----------                         ▌  CNS_INT   long   1
+
+Folding binary operator with a constant operand:
+               [000349] --C--------                         ▌  MUL       long  
+               [000346] -----------                         ├──▌  LCL_VAR   long   V27 tmp20        
+               [000348] -----------                         └──▌  CNS_INT   long   1
+Transformed into:
+               [000346] -----------                         ▌  LCL_VAR   long   V27 tmp20        
 Expanding INLINE_CANDIDATE in statement STMT00072 in BB50:
 STMT00072 ( INL17 @ ??? ... ??? ) <- INLRT @ 0x06A[E-]
                [000350] I-C-G------                         ▌  CALL      void   System.SpanHelpers.ClearWithoutReferences (exactContextHnd=0x00007FFF1E6F83B1)
                [000343] ----------- arg0                    ├──▌  LCL_VAR   byref  V26 tmp19        
-               [000349] --C-------- arg1                    └──▌  MUL       long  
-               [000346] -----------                            ├──▌  LCL_VAR   long   V27 tmp20        
-               [000348] --C--------                            └──▌  CAST      long <- int
-               [000347] --C--------                               └──▌  RET_EXPR  int   (inl return expr [000391])
+               [000346] ----------- arg1                    └──▌  LCL_VAR   long   V27 tmp20        

This leads to one fewer statements added:

------------ Statements (and blocks) added due to the inlining of call [000350] -----------
-
-Arguments setup:
-STMT00084 ( INL17 @ ??? ... ??? ) <- INLRT @ 0x06A[E-]
-               [000410] -AC--------                         ▌  ASG       long  
-               [000409] D------N---                         ├──▌  LCL_VAR   long   V32 tmp25        
-               [000349] --C--------                         └──▌  MUL       long  
-               [000346] -----------                            ├──▌  LCL_VAR   long   V27 tmp20        
-               [000348] --C--------                            └──▌  CAST      long <- int
-               [000347] --C--------                               └──▌  RET_EXPR  int   (inl return expr [000391])

which finally means we clone a finally that brings the extra code:

-Finally in EH#0 has 16 statements, limit is 15; skipping.
+EH#0 is a candidate for finally cloning: 13 blocks, 15 statements

8810 is another case of more inlines:

-; 0 inlinees with PGO data; 33 single block inlinees; 30 inlinees without PGO data
+; 0 inlinees with PGO data; 42 single block inlinees; 34 inlinees without PGO data

In this case we manage to fold some arguments to constants, which results in us folding some branches during import and overall using less budget, so we run out of budget later than before.

I think from what I've seen so far that the diffs are good so will stop checking more cases.

@jakobbotsch
Copy link
Member Author

Timeouts are known. Can you please take a look at the above diffs and approve if it sounds fine @AndyAyersMS?

@jakobbotsch
Copy link
Member Author

Ping @AndyAyersMS

@AndyAyersMS
Copy link
Member

Sorry, got caught up in other things. Looking now.

Copy link
Member

@AndyAyersMS AndyAyersMS left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks for looking at diffs.

@jakobbotsch jakobbotsch merged commit 4ecd060 into dotnet:main May 13, 2022
@jakobbotsch jakobbotsch deleted the substitute-ret-exprs-in-inline-candidates branch May 13, 2022 07:38
@ghost ghost locked as resolved and limited conversation to collaborators Jun 12, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants