
[Perf][x86] Investigate performance regression in FannkuchRedux_2 and FannkuchRedux_5 #9836

Closed
nategraf opened this issue Feb 27, 2018 · 11 comments
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI tenet-performance-benchmarks Issue from performance benchmark

@nategraf
Contributor

There appears to be a 5% and 3% performance regression, respectively, in release/2.1 as compared to commit 7f143364a8ed1ebc51356195fc90bbe972c2c038.

A possible cause for this regression has yet to be found.

category:cq
theme:benchmarks
skill-level:intermediate
cost:small

@AndyAyersMS AndyAyersMS self-assigned this Mar 7, 2018
@AndyAyersMS
Member

I'll take a look.

@AndyAyersMS
Member

Hmm, for the -2 version I actually see a pretty large perf difference running locally: ~1150 in 2.0 vs ~1675 for master. The test is single-threaded and perf is basically all in one method, fannkuch.

The lab seems to have similar results: 2.0 is ~1190, 2.1 is ~1800, master is ~1800. @nategraf did you mean to write 50% regression above?

Codegen for fannkuch in master is much larger (1036 vs 777), and code layout is quite different, so it may take a while to figure out what has happened here.
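For anyone reading along without the benchmark handy, the hot method has roughly the shape below. This is a hedged sketch of the classic fannkuch-redux kernel in C#, not the actual FannkuchRedux_2 source; the point is just that essentially all the time goes into one method with nested flip/permutation loops over small int arrays.

```csharp
using System;

public static class FannkuchSketch
{
    // Classic fannkuch-redux kernel shape (illustrative only): find the
    // maximum number of prefix reversals ("flips") over all permutations of n.
    public static int Fannkuch(int n)
    {
        int[] perm  = new int[n];
        int[] perm1 = new int[n];
        int[] count = new int[n];
        int maxFlips = 0;
        int r = n;

        for (int i = 0; i < n; i++) perm1[i] = i;   // identity permutation

        while (true)
        {
            while (r != 1) { count[r - 1] = r; r--; }

            // Count flips for the current permutation.
            for (int i = 0; i < n; i++) perm[i] = perm1[i];
            int flips = 0;
            int k;
            while ((k = perm[0]) != 0)
            {
                int k2 = (k + 1) >> 1;
                for (int i = 0; i < k2; i++)
                {
                    int t = perm[i]; perm[i] = perm[k - i]; perm[k - i] = t;
                }
                flips++;
            }
            if (flips > maxFlips) maxFlips = flips;

            // Generate the next permutation via incremental rotations.
            while (true)
            {
                if (r == n) return maxFlips;
                int first = perm1[0];
                for (int i = 0; i < r; i++) perm1[i] = perm1[i + 1];
                perm1[r] = first;
                if (--count[r] > 0) break;
                r++;
            }
        }
    }

    public static void Main() => Console.WriteLine(Fannkuch(7));   // expected: 16
}
```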

@AndyAyersMS
Member

Master perf has been stable for quite a while:
[image: chart of FannkuchRedux_2 perf history in master]

@jorive here's another example where I really need to look back over longer time ranges. 999 builds is not enough to let me see the entire history of 2.1; I probably need at least 1500, and maybe 2000.

If there are some limits that are hard to work around, I'd also be ok with seeing just a subset, say every 10th build instead of every build.

@adamsitnik when you zoom in here you see evidence of the longer-term stability issues I've been mentioning (#11657). This is why I'd like somebody to do a long time span series of runs with BDN, to see whether whatever magic is in there can compensate for this or whether it sees something similar. I would guess (but have not verified) that jit codegen for this test did not vary much over this time span.

[image: zoomed-in perf history chart showing run-to-run variation]
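(For context on the BDN suggestion: a long-horizon run could be wrapped roughly as below. This is a minimal sketch that reuses the FannkuchSketch kernel from the earlier comment; it is not the actual lab harness that collects the numbers shown in these charts.)

```csharp
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

// Minimal BenchmarkDotNet wrapper sketch around the illustrative kernel above.
// Running this repeatedly across nightly builds would give BDN's statistics a
// chance to smooth out (or expose) the run-to-run variation discussed here.
public class FannkuchBench
{
    [Params(10)]
    public int N;

    [Benchmark]
    public int Fannkuch() => FannkuchSketch.Fannkuch(N);
}

public static class BenchProgram
{
    public static void Main(string[] args) => BenchmarkRunner.Run<FannkuchBench>();
}
```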

@jorive
Member

jorive commented Mar 8, 2018

@AndyAyersMS The way the data is requested is poorly implemented. I will get to it within the next few weeks. We can talk offline about how you can get the data in different chunks.

@AndyAyersMS
Member

AndyAyersMS commented Mar 9, 2018

As best I can tell so far, the big code size increase in master/2.1 for fannkuch-2 comes from cloning the entire for (int i = 2; i < n; i++) loop. This then seems to increase overall register pressure and lead to a lot of disconnected spill and restore blocks.

Note these disconnected RA spill/restore blocks often tend to break up the otherwise compact loop layouts. It would be worth revisiting how these blocks get placed; ideally they would end up at the end of the associated loop and not in some random spot farther down.

At any rate this test is a poster child for loop cloning abuse (which we are tracking via #8558.)

I wonder if the cloning heuristics are simply too aggressive for x86.

Whether it is the extra cloning itself that hurts perf (as it requires a bit of extra up-front checking), or the downstream effects of cloning on RA, I am not quite sure yet. Because of all the code duplication I will need to patch together some instruction-level profiles to get a clearer picture.

I will also start diffing the jit dumps to try to see what it was that led to the extra cloning.
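For readers unfamiliar with the transformation: loop cloning duplicates a loop into a "fast" copy guarded by up-front checks (so the bounds checks inside it can be dropped) and a "slow" copy that keeps them. The C# below is only a source-level sketch of that shape; the JIT does this in its IR, and the guard shown is illustrative, not the exact condition it emits.

```csharp
public static class LoopCloningSketch
{
    // Original loop: each a[i] / a[i - 1] access carries a bounds check.
    public static void Original(int[] a, int n)
    {
        for (int i = 2; i < n; i++)
            a[i] = a[i] + a[i - 1];
    }

    // Conceptual result of cloning: a guard, a check-free fast copy, and a
    // slow copy with checks intact. Net effect is roughly 2x the loop code
    // plus the up-front test, which is where the code size growth and extra
    // register pressure discussed above come from.
    public static void AfterCloning(int[] a, int n)
    {
        if (a != null && n <= a.Length)
        {
            // "fast" clone: the guard proves the bounds checks redundant
            for (int i = 2; i < n; i++)
                a[i] = a[i] + a[i - 1];
        }
        else
        {
            // "slow" clone: original loop, bounds checks kept
            for (int i = 2; i < n; i++)
                a[i] = a[i] + a[i - 1];
        }
    }
}
```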

@AndyAyersMS
Member

Dump comparison shows that the for (int i = 2 ..) loop was not recognized in 2.0 but it is now in 2.1. So that's what enables cloning. And I may be confused, but I don't see a loop-body-size-based profitability heuristic for cloning. At any rate it is willing to clone a large loop with an inner loop (one that is itself also a clone candidate; I'm not sure why cloning is not running bottom up, but I digress).

With cloning disabled, 2.0 perf decreases to around 1320 and 2.1 perf improves to around 1360. The code is similar, save that in 2.1 some of the non-loop code is moved out of the loop bodies. This seems to hurt perf a bit.

x64 perf is comparable between 2.0 and 2.1 so I would hazard that it is the extra spilling induced by the extra cloning that is the root cause for the x86 difference.

So the likely cause of this regression is dotnet/coreclr#13314, which cleaned up loop issues with non-loop blocks. Note that change came in before we added the test.

The real message here is that we need to take a critical look at loop cloning.

@AndyAyersMS
Member

For the -5 variant, I indeed see around 4-5% slowdown:

| Fannkuch Redux 5 | Metric | Iterations | Average | Stdev.s | Min | Max |
|------------------|----------|-----------:|---------:|--------:|---------:|---------:|
| 2.0, no affinity | Duration | 10 | 1001.929 | 13.039 | 982.086 | 1024.412 |
| 2.1, no affinity | Duration | 10 | 1047.383 | 20.018 | 1012.757 | 1081.417 |
| 2.0, affinity    | Duration | 3  | 4345.710 | 10.133 | 4335.408 | 4355.666 |
| 2.1, affinity    | Duration | 3  | 4498.468 | 12.937 | 4488.478 | 4513.081 |

The lab has ~4800 for 2.1 and ~4600 for 2.0; the history of 2.1 as viewed in master shows that, aside from level shifting, perf has been relatively stable at or a bit above 4800. As with the -2 variant, looking back further (while in principle helpful) probably won't show us when things diverged, as this test was added several months after we branched for 2.0.

[image: chart of FannkuchRedux_5 perf history]

Time to look at some profiles and diffs...

@AndyAyersMS
Member

In -5 all the time is spent in run. Because of inlining this is a fairly large method. Codegen is similar, but the 2.1 version has more "backedge" spill/restore traffic (and again these blocks are dislocated from their loops). I will hazard a guess that this spill/restore overhead is responsible for the relatively small perf difference.

As a simple example, the first for loop in FirstPermutation ends up (in both versions) as:

       33C0         xor      eax, eax
       8B5598       mov      edx, gword ptr [ebp-68H]
       8B7A04       mov      edi, dword ptr [edx+4]
       85FF         test     edi, edi
       0F8E46010000 jle      G_M43568_IG17

G_M43568_IG03:
       895598       mov      gword ptr [ebp-68H], edx
       89448208     mov      dword ptr [edx+4*eax+8], eax
       40           inc      eax
       3BF8         cmp      edi, eax
       0F8F2E010000 jg       G_M43568_IG16
... 
G_M43568_IG16:
       8B5598       mov      edx, gword ptr [ebp-68H]
       E9BAFEFFFF   jmp      G_M43568_IG03

Here edx is not modified in the loop. So just looking at asm it is hard to understand why an in-loop spill and restore was created. We'd be much better off with:

       33C0         xor      eax, eax
       8B5598       mov      edx, gword ptr [ebp-68H]
       8B7A04       mov      edi, dword ptr [edx+4]
       85FF         test     edi, edi
       0F8E46010000 jle      G_M43568_IG17

G_M43568_IG03:
       89448208     mov      dword ptr [edx+4*eax+8], eax
       40           inc      eax
       3BF8         cmp      edi, eax
       0F8F2E010000 jg       G_M43568_IG03

It's not clear to me whether the disjoint placement of IG16 is a symptom or a cause here. I can imagine that if the block placement comes first, then the unrelated code in between will interfere, and so placement could be the cause.

@CarolEidt is there some issue open already regarding spilling and spill placement that I can link to?

The bounds check here up above the loop is not needed either, likely for similar reasons to the ones we saw in -2. But at least we don't go crazy with loop cloning.
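(For reference, the loop above presumably comes from source of roughly the following shape; this is an illustrative C# guess, not the actual FirstPermutation code.)

```csharp
public static class PermutationSketch
{
    // Illustrative guess at the source behind the asm above: a plain
    // array-initialization loop. The array reference (edx in the asm) is
    // loop-invariant, so no in-loop spill/restore of it should be needed.
    public static void FirstPermutation(int[] p)
    {
        for (int i = 0; i < p.Length; i++)
            p[i] = i;
    }
}
```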

@CarolEidt
Contributor

is there some issue open already regarding spilling and spill placement that I can link to?

There's a general "spill placement" issue - #6825, and also #6806 is related.
However, I've been thinking for some time now about ways to eliminate splitting of backedges in resolution, and I'm hoping to get to that once I'm done with the current (last) stage of #6690. There isn't really an issue open to address that (that I could find), so I opened dotnet/coreclr#16857.

@AndyAyersMS
Member

AndyAyersMS commented Mar 9, 2018

Thanks. I am going to move this to Future, as the likely sources of the regression are now tracked by other issues that are too complex to address for 2.1.

@msftgits msftgits transferred this issue from dotnet/coreclr Jan 31, 2020
@msftgits msftgits added this to the Future milestone Jan 31, 2020
@AndyAyersMS
Member

Issues seen here are addressed by other open issues. So closing.

@ghost ghost locked as resolved and limited conversation to collaborators Dec 18, 2020