
[Perf][x86] Investigate performance regression in FannkuchRedux_2 and FannkuchRedux_5 #9836

Closed
nategraf opened this issue Feb 27, 2018 · 11 comments
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI tenet-performance-benchmarks Issue from performance benchmark

@nategraf
Contributor

There appears to be a 5% and 3% performance regression, respectively, in release/2.1 as compared to commit 7f143364a8ed1ebc51356195fc90bbe972c2c038.

A possible cause for this regression has yet to be found.

category:cq
theme:benchmarks
skill-level:intermediate
cost:small

@AndyAyersMS AndyAyersMS self-assigned this Mar 7, 2018
@AndyAyersMS
Member

I'll take a look.

@AndyAyersMS
Member

Hmm, for the -2 version I actually see a pretty large perf difference running locally: ~1150 in 2.0 vs ~1675 for master. The test is single-threaded and perf is basically all in one method, fannkuch.

The lab seems to have similar results: 2.0 is ~1190, 2.1 is ~1800, master is ~1800. @nategraf did you mean to write 50% regression above?

Codegen for fannkuch in master is much larger (1036 vs 777), and code layout is quite different, so it may take a while to figure out what has happened here.
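For anyone reading along without the benchmark handy, the hot method has roughly the shape below. This is a hedged sketch of the classic fannkuch-redux kernel in C#, not the actual FannkuchRedux_2 source; the point is just that essentially all the time goes into one method with nested flip/permutation loops over small int arrays.

```csharp
using System;

public static class FannkuchSketch
{
    // Classic fannkuch-redux kernel shape (illustrative only): find the
    // maximum number of prefix reversals ("flips") over all permutations of n.
    public static int Fannkuch(int n)
    {
        int[] perm  = new int[n];
        int[] perm1 = new int[n];
        int[] count = new int[n];
        int maxFlips = 0;
        int r = n;

        for (int i = 0; i < n; i++) perm1[i] = i;   // identity permutation

        while (true)
        {
            while (r != 1) { count[r - 1] = r; r--; }

            // Count flips for the current permutation.
            for (int i = 0; i < n; i++) perm[i] = perm1[i];
            int flips = 0;
            int k;
            while ((k = perm[0]) != 0)
            {
                int k2 = (k + 1) >> 1;
                for (int i = 0; i < k2; i++)
                {
                    int t = perm[i]; perm[i] = perm[k - i]; perm[k - i] = t;
                }
                flips++;
            }
            if (flips > maxFlips) maxFlips = flips;

            // Generate the next permutation via incremental rotations.
            while (true)
            {
                if (r == n) return maxFlips;
                int first = perm1[0];
                for (int i = 0; i < r; i++) perm1[i] = perm1[i + 1];
                perm1[r] = first;
                if (--count[r] > 0) break;
                r++;
            }
        }
    }

    public static void Main() => Console.WriteLine(Fannkuch(7));   // expected: 16
}
```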

@AndyAyersMS
Member

Master perf has been stable for quite a while:
[image: chart of FannkuchRedux_2 perf history in master]

@jorive here's another example where I really need to look back over longer time ranges. 999 builds is not enough to let me see the entire history of 2.1; I probably need at least 1500, and maybe 2000.

If there are some limits that are hard to work around, I'd also be ok with seeing just a subset, say every 10th build instead of every build.

@adamsitnik when you zoom in here you see evidence of the longer-term stability issues I've been mentioning (#11657). This is why I'd like somebody to do a long time span series of runs with BDN, to see whether whatever magic is in there can compensate for this or whether it sees something similar. I would guess (but have not verified) that jit codegen for this test did not vary much over this time span.

[image: zoomed-in perf history chart showing run-to-run variation]
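(For context on the BDN suggestion: a long-horizon run could be wrapped roughly as below. This is a minimal sketch that reuses the FannkuchSketch kernel from the earlier comment; it is not the actual lab harness that collects the numbers shown in these charts.)

```csharp
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

// Minimal BenchmarkDotNet wrapper sketch around the illustrative kernel above.
// Running this repeatedly across nightly builds would give BDN's statistics a
// chance to smooth out (or expose) the run-to-run variation discussed here.
public class FannkuchBench
{
    [Params(10)]
    public int N;

    [Benchmark]
    public int Fannkuch() => FannkuchSketch.Fannkuch(N);
}

public static class BenchProgram
{
    public static void Main(string[] args) => BenchmarkRunner.Run<FannkuchBench>();
}
```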

@jorive
Member

jorive commented Mar 8, 2018

@AndyAyersMS The way the data is requested is poorly implemented. I will get to it within the next few weeks. We can talk offline about how you can get the data in different chunks.

@AndyAyersMS
Member

AndyAyersMS commented Mar 9, 2018

As best I can tell so far, the big code size increase in master/2.1 for fannkuch-2 comes from cloning the entire for (int i = 2; i < n; i++) loop. This then seems to increase overall register pressure and lead to a lot of disconnected spill and restore blocks.

Note these disconnected RA spill/restore blocks often tend to break up the otherwise compact loop layouts. It would be worth revisiting how these blocks get placed; ideally they would end up at the end of the associated loop and not in some random spot farther down.

At any rate this test is a poster child for loop cloning abuse (which we are tracking via #8558.)

I wonder if the cloning heuristics are simply too aggressive for x86.

Whether it is the extra cloning itself that hurts perf (as it requires a bit of extra up-front checking), or the downstream effects of cloning on RA, I am not quite sure yet. Because of all the code duplication I will need to patch together some instruction-level profiles to get a clearer picture.

I will also start diffing the jit dumps to try to see what it was that led to the extra cloning.
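For readers unfamiliar with the transformation: loop cloning duplicates a loop into a "fast" copy guarded by up-front checks (so the bounds checks inside it can be dropped) and a "slow" copy that keeps them. The C# below is only a source-level sketch of that shape; the JIT does this in its IR, and the guard shown is illustrative, not the exact condition it emits.

```csharp
public static class LoopCloningSketch
{
    // Original loop: each a[i] / a[i - 1] access carries a bounds check.
    public static void Original(int[] a, int n)
    {
        for (int i = 2; i < n; i++)
            a[i] = a[i] + a[i - 1];
    }

    // Conceptual result of cloning: a guard, a check-free fast copy, and a
    // slow copy with checks intact. Net effect is roughly 2x the loop code
    // plus the up-front test, which is where the code size growth and extra
    // register pressure discussed above come from.
    public static void AfterCloning(int[] a, int n)
    {
        if (a != null && n <= a.Length)
        {
            // "fast" clone: the guard proves the bounds checks redundant
            for (int i = 2; i < n; i++)
                a[i] = a[i] + a[i - 1];
        }
        else
        {
            // "slow" clone: original loop, bounds checks kept
            for (int i = 2; i < n; i++)
                a[i] = a[i] + a[i - 1];
        }
    }
}
```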

@AndyAyersMS
Member

Dump comparison shows that the for (int i = 2 ..) loop was not recognized in 2.0 but it is now in 2.1. So that's what enables cloning. And I may be confused, but I don't see a loop-body-size-based profitability heuristic for cloning. At any rate it is willing to clone a large loop with an inner loop (one that is itself also a clone candidate; I'm not sure why cloning is not running bottom up, but I digress).

With cloning disabled, 2.0 perf decreases to around 1320 and 2.1 perf improves to around 1360. The code is similar, save that in 2.1 some of the non-loop code is moved out of the loop bodies. This seems to hurt perf a bit.

x64 perf is comparable between 2.0 and 2.1 so I would hazard that it is the extra spilling induced by the extra cloning that is the root cause for the x86 difference.

So the likely cause of this regression is dotnet/coreclr#13314, which cleaned up loop issues with non-loop blocks. Note that change came in before we added the test.

The real message here is that we need to take a critical look at loop cloning.

@AndyAyersMS
Member

For the -5 variant, I indeed see around 4-5% slowdown:

| Fannkuch Redux 5 | Metric | Iterations | Average | Stdev.s | Min | Max |
|------------------|----------|-----------:|---------:|--------:|---------:|---------:|
| 2.0, no affinity | Duration | 10 | 1001.929 | 13.039 | 982.086 | 1024.412 |
| 2.1, no affinity | Duration | 10 | 1047.383 | 20.018 | 1012.757 | 1081.417 |
| 2.0, affinity    | Duration | 3  | 4345.710 | 10.133 | 4335.408 | 4355.666 |
| 2.1, affinity    | Duration | 3  | 4498.468 | 12.937 | 4488.478 | 4513.081 |

The lab has ~4800 for 2.1 and ~4600 for 2.0; the history of 2.1 as viewed in master shows that, aside from level shifting, perf has been relatively stable at or a bit above 4800. As with the -2 variant, looking back further (while in principle helpful) probably won't show us when things diverged, as this test was added several months after we branched for 2.0.

[image: chart of FannkuchRedux_5 perf history]

Time to look at some profiles and diffs...

@AndyAyersMS
Member

In -5 all the time is spent in run. Because of inlining this is a fairly large method. Codegen is similar, but the 2.1 version has more "backedge" spill/restore traffic (and again these blocks are dislocated from their loops). I will hazard a guess that this spill/restore overhead is responsible for the relatively small perf difference.

As a simple example, the first for loop in FirstPermutation ends up (in both versions) as:

       33C0         xor      eax, eax
       8B5598       mov      edx, gword ptr [ebp-68H]
       8B7A04       mov      edi, dword ptr [edx+4]
       85FF         test     edi, edi
       0F8E46010000 jle      G_M43568_IG17

G_M43568_IG03:
       895598       mov      gword ptr [ebp-68H], edx
       89448208     mov      dword ptr [edx+4*eax+8], eax
       40           inc      eax
       3BF8         cmp      edi, eax
       0F8F2E010000 jg       G_M43568_IG16
... 
G_M43568_IG16:
       8B5598       mov      edx, gword ptr [ebp-68H]
       E9BAFEFFFF   jmp      G_M43568_IG03

Here edx is not modified in the loop. So just looking at asm it is hard to understand why an in-loop spill and restore was created. We'd be much better off with:

       33C0         xor      eax, eax
       8B5598       mov      edx, gword ptr [ebp-68H]
       8B7A04       mov      edi, dword ptr [edx+4]
       85FF         test     edi, edi
       0F8E46010000 jle      G_M43568_IG17

G_M43568_IG03:
       89448208     mov      dword ptr [edx+4*eax+8], eax
       40           inc      eax
       3BF8         cmp      edi, eax
       0F8F2E010000 jg       G_M43568_IG03

It's not clear to me whether the disjoint placement of IG16 is a symptom or a cause here. I can imagine that if the block placement comes first, then the unrelated code in between will interfere, and so placement could be the cause.

@CarolEidt is there some issue open already regarding spilling and spill placement that I can link to?

The bounds check here up above the loop is not needed either, likely for similar reasons to the ones we saw in -2. But at least we don't go crazy with loop cloning.
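(For reference, the loop above presumably comes from source of roughly the following shape; this is an illustrative C# guess, not the actual FirstPermutation code.)

```csharp
public static class PermutationSketch
{
    // Illustrative guess at the source behind the asm above: a plain
    // array-initialization loop. The array reference (edx in the asm) is
    // loop-invariant, so no in-loop spill/restore of it should be needed.
    public static void FirstPermutation(int[] p)
    {
        for (int i = 0; i < p.Length; i++)
            p[i] = i;
    }
}
```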

@CarolEidt
Contributor

is there some issue open already regarding spilling and spill placement that I can link to?

There's a general "spill placement" issue - #6825, and also #6806 is related.
However, I've been thinking for some time now about ways to eliminate splitting of backedges in resolution, and I'm hoping to get to that once I'm done with the current (last) stage of #6690. There isn't really an issue open to address that (that I could find), so I opened dotnet/coreclr#16857.

@AndyAyersMS
Member

AndyAyersMS commented Mar 9, 2018

Thanks. I am going to move this to Future, as the likely sources of the regression are now tracked by other issues that are too complex to address for 2.1.

@msftgits msftgits transferred this issue from dotnet/coreclr Jan 31, 2020
@msftgits msftgits added this to the Future milestone Jan 31, 2020
@AndyAyersMS
Member

Issues seen here are addressed by other open issues. So closing.

@ghost ghost locked as resolved and limited conversation to collaborators Dec 18, 2020