JIT: Visit blocks in RPO during LSRA #107927
Conversation
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
This reverts commit 4a9e0ff.
/azp run runtime-coreclr outerloop, Fuzzlyn
Azure Pipelines successfully started running 2 pipeline(s).
cc @dotnet/jit-contrib, @AndyAyersMS @kunalspathak PTAL. Fuzzlyn failures are known or NaN false positives. Note that LSRA previously had three ordering strategies: the default preds-first one, lexical order, and randomized order (which was never implemented). RPO looks like a viable replacement for the first one, and lexical order doesn't make much sense if we plan to move block layout later, so I removed the functionality for specifying LSRA's block order. Is it ok to remove this for now, and add it back in if/when we decide to implement a randomized order?
Diffs are large, though a net size improvement. Looking at the instructions retired per collection, the larger MinOpts TP regressions are concentrated in collections with relatively few MinOpts methods, so I think the TP impact isn't as bad as it looks.
The Fuzzlyn linux/arm failures seem to expose some more issues introduced by this change. We should address all of them before merging this PR.
The assertion
/azp run Fuzzlyn
Azure Pipelines successfully started running 1 pipeline(s).
@kunalspathak the Linux arm failure didn't repro in the last Fuzzlyn run, so if it is a bug, it doesn't readily repro. Aside from inner/outerloop tests, are there any other suites you'd like me to run? JitStress doesn't seem to do anything interesting for block ordering, though if it's still worthwhile to run LSRA stress modes, I can do that -- thanks!
Yes, since Fuzzlyn is randomized, it might not necessarily repro on every run, but we should take the failure that we saw in the previous run, see why it showed up with this change, and go from there. I would usually run
/azp run runtime-coreclr jitstressregs, runtime-coreclr jitstressregs-x86, runtime-coreclr jitstress2-jitstressregs
Azure Pipelines successfully started running 3 pipeline(s).
Note that the diffs from the latest run look quite different because the collections got a bit messed up on x64: we're missing
I haven't been able to repro the Linux arm assert(s) using a crossjit, and based on the inconsistent error message in Fuzzlyn, I suspect the repro was wrong to begin with.
Did you try downloading the superpmi collection from the prior run and trying to repro on your pre-merge commit? If not, I will find some time to repro it with your changes.
You could also try a temporary change as part of your PR to increase the duration of Fuzzlyn to 4 hours and see if it catches anything.
Thanks for recommending this. I was able to hit both of the above asserts when replaying with a crossjit, though I can repro them with a mainline JIT, too. I've opened #108000 to track these asserts.
@kunalspathak were there any other pipelines you wanted me to run?
I think that should be it. If the failures are in
Compared to the RPO-based traversal, the only things the current traversal was not doing were:
- Sequencing the blocks based on profile data
- Relying on block weight / bbNum as a tie-breaker
Both of these can have a significant impact on the final code we produce, as we see in the diffs. While the TP improvement is great (around 3~5% in MinOpts on some platforms), the perfscore is half and half: I see the same number of methods improved vs. regressed.
To get some confidence, I am tempted to look at the history of benchmarks that get improved vs. regressed, to make sure that this change is the way to go forward. @AndyAyersMS any thoughts?
Remove lexical and random order layout
I am ok with that.
```cpp
// If the DFS didn't visit any blocks, add them to the end of blockSequence
if (dfsTree->GetPostOrderCount() < compiler->fgBBcount)
{
    // Unvisited blocks are more likely to be at the back of the list, so iterate backwards
```
What are the examples where there will be left over unvisited blocks?
Any blocks that don't have `FlowEdge`s into them won't be visited by the RPO traversal. For every example I looked at, this happens when we have throw blocks without any explicit predecessors. For example:
```
BBnum BBid ref try hnd preds weight [IL range] [jump] [EH region] [flags]
---------------------------------------------------------------------------------------------------------------------------------------------------------------------
BB01 [0000] 1 1 [000..00D)-> BB06(0.5),BB02(0.5) ( cond ) i LIR hascall idxlen
BB02 [0008] 1 BB01 0.25 [00D..???)-> BB07(0.01),BB03(0.99) ( cond ) LIR internal idxlen
BB03 [0010] 2 BB02,BB05 3.96 [00D..01C)-> BB05(0.5),BB04(0.5) ( cond ) i LIR loophead idxlen bwd
BB04 [0006] 1 BB03 0.99 [???..???)-> BB05(1) (always) i LIR idxlen
BB05 [0004] 2 BB03,BB04 3.96 [01C..026)-> BB03(0.5),BB06(0.5) ( cond ) i LIR hascall idxlen bwd
BB06 [0019] 3 BB01,BB05,BB09 1 [026..031) (return) i LIR gcsafe newobj
BB07 [0011] 2 BB02,BB09 0.04 [00D..01C)-> BB09(0.5),BB08(0.5) ( cond ) i LIR loophead idxlen bwd
BB08 [0013] 1 BB07 0.01 [???..???)-> BB09(1) (always) i LIR idxlen
BB09 [0014] 2 BB07,BB08 0.04 [01C..026)-> BB07(0.5),BB06(0.5) ( cond ) i LIR hascall idxlen bwd
BB10 [0020] 0 0 [???..???) (throw ) i LIR rare keep internal
---------------------------------------------------------------------------------------------------------------------------------------------------------------------
```
`BB10` would've been removed in previous phases if it weren't for the `BBF_DONT_REMOVE` flag set on it. `StackLevelSetter`, which runs right before LSRA, has some logic for removing throw blocks that are no longer needed, even if they're flagged with `BBF_DONT_REMOVE`. So if we still see these blocks by the time we get to LSRA, I guess they're needed?
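To make the fallback concrete, here is a minimal sketch of the idea, using plain ints for blocks instead of the JIT's `BasicBlock` structures (all names here are hypothetical, not the real LSRA code): after the DFS, any block the traversal never reached is appended to the end of the sequence, scanning backwards since stragglers like throw helpers tend to sit near the back of the block list.

```cpp
#include <cassert>
#include <functional>
#include <vector>

// Hypothetical mini-CFG: block i's successors are succ[i]; indices stand in
// for BasicBlock*. This is an illustration, not the real JIT data structures.
std::vector<int> BuildBlockSequence(const std::vector<std::vector<int>>& succ)
{
    const int numBlocks = static_cast<int>(succ.size());
    std::vector<bool> visited(numBlocks, false);
    std::vector<int>  postOrder;

    std::function<void(int)> dfs = [&](int b) {
        visited[b] = true;
        for (int s : succ[b])
            if (!visited[s])
                dfs(s);
        postOrder.push_back(b);
    };
    dfs(0); // block 0 is the entry

    // Reverse post-order: predecessors generally precede their successors.
    std::vector<int> sequence(postOrder.rbegin(), postOrder.rend());

    // If the DFS didn't visit every block (e.g. throw helpers with no flow
    // edges into them), append the leftovers. They tend to be near the back
    // of the block list, so iterate backwards and stop early once done.
    int remaining = numBlocks - static_cast<int>(postOrder.size());
    for (int b = numBlocks - 1; b >= 0 && remaining > 0; b--)
    {
        if (!visited[b])
        {
            sequence.push_back(b);
            remaining--;
        }
    }
    return sequence;
}
```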
Edges to the throw helpers are added during codegen. So they will appear unreached but are needed.
Since such blocks will be cold (I assume), I am wondering if we should add logic to LSRA to quickly assign registers in such blocks, similar to the way we do in MinOpts, to save some TP. I am not sure whether this decision would add resolution to the hot blocks, or whether there is a way to add the resolution in the cold blocks instead.
You mean setting up a lab experiment?
Yes.
Since this change is contained to one PR, would it be easier to merge this, and then wait for the perf triage? Since JIT churn has been pretty low lately, it should be easy to spot regressions from this, and if we aren't happy with the results, we can just revert this PR.
Let's merge it as-is and monitor the perf issues.
@kunalspathak thanks for the review!
@AndyAyersMS is it alright if we go with this?
Yes
I think this is another place where the loop-aware "RPO" would be useful.
What is loop-aware "RPO"? |
When visiting a block during an RPO traversal, we check if the block is a loop header, and if so, we visit the rest of the loop's body before visiting anything else -- value numbering currently does this, if you want to see what the implementation looks like. This visit ordering has the nice property of keeping loop bodies compact. |
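A rough sketch of that reordering idea (not the actual value-numbering implementation; the loop membership map is assumed to come from precomputed loop info, and nested loops would need recursive handling): when a loop header is emitted, the rest of its body follows immediately, keeping the loop contiguous.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <vector>

// Reorder an existing RPO so each loop body is contiguous. `loops` maps a
// loop header to its body blocks (hypothetical precomputed loop info); blocks
// are plain ints standing in for BasicBlock*. Nested loops are not handled.
std::vector<int> LoopAwareOrder(const std::vector<int>& rpo,
                                const std::map<int, std::set<int>>& loops)
{
    std::vector<int> order;
    std::set<int>    emitted;
    for (int b : rpo)
    {
        if (emitted.count(b))
            continue;
        order.push_back(b);
        emitted.insert(b);

        auto it = loops.find(b);
        if (it != loops.end())
        {
            // This block is a loop header: emit the rest of its body next,
            // keeping RPO order within the body so the loop stays compact.
            for (int m : rpo)
            {
                if (it->second.count(m) && !emitted.count(m))
                {
                    order.push_back(m);
                    emitted.insert(m);
                }
            }
        }
    }
    return order;
}
```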
Part of #107749. LSRA currently does a lexical pass over the blocklist to build a visitation order. Since we intend to run block layout after LSRA with #107634, LSRA ideally shouldn't be sensitive to lexical ordering, and since the current logic tries to visit a block's predecessors before the block itself, it seems easier and faster to just use an RPO traversal.
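To illustrate the property the description relies on, here is a small, self-contained sketch (plain ints instead of `BasicBlock*`, iterative rather than recursive; not the JIT's actual DFS code): in a reverse post-order, every block appears after all of its predecessors, except along loop back edges.

```cpp
#include <algorithm>
#include <cassert>
#include <utility>
#include <vector>

// Iterative post-order DFS over an adjacency-list CFG rooted at block 0;
// reversing the post-order yields an RPO.
std::vector<int> ReversePostOrder(const std::vector<std::vector<int>>& succ)
{
    std::vector<int>  post;
    std::vector<bool> seen(succ.size(), false);
    // Each stack entry is (block, index of next successor to explore).
    std::vector<std::pair<int, size_t>> stack = {{0, 0}};
    seen[0] = true;
    while (!stack.empty())
    {
        int    b = stack.back().first;
        size_t i = stack.back().second;
        if (i < succ[b].size())
        {
            stack.back().second++; // advance before pushing (push may reallocate)
            int s = succ[b][i];
            if (!seen[s])
            {
                seen[s] = true;
                stack.push_back({s, 0});
            }
        }
        else
        {
            post.push_back(b); // all successors done: emit in post-order
            stack.pop_back();
        }
    }
    std::reverse(post.begin(), post.end());
    return post;
}
```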