[Perf][x86] Investigate performance regression in FannkuchRedux_2 and FannkuchRedux_5 #9836
Comments
I'll take a look.
Hmm, for the -2 version I actually see a pretty large perf difference running locally: ~1150 in 2.0 vs ~1675 for master. The test is single threaded and perf is basically all in one method, fannkuch. The lab seems to have similar results: 2.0 is ~1190, 2.1 is ~1800, master is ~1800. @nategraf did you mean to write 50% regression above? Codegen for fannkuch in master is much larger (1036 vs 777), and code layout is quite different, so it may take a while to figure out what has happened here.
Master perf has been stable for quite a while. @jorive here's another example where I really need to look back over longer time ranges. 999 builds is not enough to let me see the entire history of 2.1; I probably need at least 1500 and maybe 2000. If there are limits that are hard to work around, I'd also be ok with seeing just a subset, say every 10th build instead of every build. @adamsitnik when you zoom in here you see evidence of the longer-term stability issues I've been mentioning (#11657). This is why I'd like somebody to do a long time span series of runs with BDN, to see whether whatever magic is in there can compensate for this or whether it sees something similar. I would guess (but have not verified) that jit codegen for this test did not vary much over this time span.
@AndyAyersMS It's a bad implementation of how the data is requested. I will get to it within the next few weeks. We can talk offline about how you can get the data in different chunks.
As best I can tell so far, the big code size increase in master/2.1 for fannkuch-2 comes from loop cloning. Note that the disconnected RA spill/restore blocks this produces often tend to blow up the compact loop layouts. It would be worth revisiting how these blocks get placed; ideally they end up at the end of the associated loop and not at some random spot farther down. At any rate this test is a poster child for loop cloning abuse (which we are tracking via #8558). I wonder if the cloning heuristics are simply too aggressive for x86. Whether it is the extra cloning itself that does in perf (as it requires a bit of extra up-front checking) or the problem is the downstream effect of cloning on RA, I am not sure quite yet. Because of all the code duplication I will need to patch together some instruction-level profiles to get a clearer picture. I will also start diffing the jit dumps to try to see what led to the extra cloning.
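For background, loop cloning duplicates a loop so that one copy can run without bounds checks when an up-front guard holds. Below is a minimal hand-written sketch of that shape; the method and names are hypothetical, not the fannkuch code.

```csharp
// Hypothetical illustration of loop cloning, not the actual fannkuch source.
// The JIT emits two copies of the loop: a "fast" copy guarded by an up-front
// range check (so per-iteration bounds checks can be dropped) and a "slow"
// copy that keeps the checks. The duplication is what inflates code size.
static int SumPrefix(int[] a, int n)
{
    int sum = 0;
    if (a != null && (uint)n <= (uint)a.Length)
    {
        // Fast path: the guard proves every i in [0, n) is in range,
        // so the body compiles without per-iteration bounds checks.
        for (int i = 0; i < n; i++)
            sum += a[i];
    }
    else
    {
        // Slow path: same source loop, bounds checks retained.
        for (int i = 0; i < n; i++)
            sum += a[i];
    }
    return sum;
}
```

When the cloned loop is large or nested, the duplicated bodies inflate code size and add register pressure, which is especially costly on x86 where few registers are available.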
Dump comparison backs this up: with cloning disabled, the 2.0 perf decreases to around 1320 and the 2.1 perf improves to around 1360. The code is similar, save that in 2.1 some of the non-loop code is moved out of the loop bodies; this seems to hurt perf a bit. x64 perf is comparable between 2.0 and 2.1, so I would hazard that it is the extra spilling induced by the extra cloning that is the root cause of the x86 difference. So the likely cause of this regression is dotnet/coreclr#13314, which cleaned up loop issues with non-loop blocks. Note that change came before we added the test. The real message here is that we need to take a critical look at loop cloning.
For the -5 variant, I indeed see around a 4-5% slowdown. The lab has ~4800 for 2.1 and ~4600 for 2.0; the history of 2.1 as viewed in master shows that, aside from level shifting, perf has been relatively stable at or a bit above 4800. As with the -2 variant, looking back further (while in principle helpful) probably won't show us when things diverged, as this test was added several months after we branched for 2.0. Time to look at some profiles and diffs...
In -5 all the time is again spent in one hot method. As a simple example, here is the first loop in the 2.1 (master) codegen:

```
33C0         xor  eax, eax
8B5598       mov  edx, gword ptr [ebp-68H]
8B7A04       mov  edi, dword ptr [edx+4]
85FF         test edi, edi
0F8E46010000 jle  G_M43568_IG17

G_M43568_IG03:
895598       mov  gword ptr [ebp-68H], edx
89448208     mov  dword ptr [edx+4*eax+8], eax
40           inc  eax
3BF8         cmp  edi, eax
0F8F2E010000 jg   G_M43568_IG16

...

G_M43568_IG16:
8B5598       mov  edx, gword ptr [ebp-68H]
E9BAFEFFFF   jmp  G_M43568_IG03
```

Here the array reference in edx is stored back to its stack slot on every iteration, and the loop's backedge runs through the out-of-line block G_M43568_IG16, which reloads edx before jumping back. In 2.0 the same loop is more compact:

```
33C0         xor  eax, eax
8B5598       mov  edx, gword ptr [ebp-68H]
8B7A04       mov  edi, dword ptr [edx+4]
85FF         test edi, edi
0F8E46010000 jle  G_M43568_IG17

G_M43568_IG03:
89448208     mov  dword ptr [edx+4*eax+8], eax
40           inc  eax
3BF8         cmp  edi, eax
0F8F2E010000 jg   G_M43568_IG03
```

Not clear to me how much the disjoint placement of G_M43568_IG16 hurts here. @CarolEidt is there some issue open already regarding spilling and spill placement that I can link to? The bounds check up above the loop is not needed either, likely for reasons similar to the ones we saw in -2. But at least we don't go crazy with loop cloning here.
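For reference, the loop in both listings corresponds to a simple array-initialization pattern of roughly this shape, reconstructed from the disassembly; the names are guesses, not the actual benchmark source.

```csharp
// Rough C# shape of the loop in the listings above, inferred from the
// disassembly; "perm" and the method name are placeholders, not the
// benchmark's source. edx holds the array reference, edi its length,
// and eax the index i.
static void InitPermutation(int[] perm)
{
    for (int i = 0; i < perm.Length; i++)
        perm[i] = i;
}
```

Nothing in the loop changes the array reference, which is why the per-iteration store of edx back to its stack slot in the first listing looks like pure overhead.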
There's a general "spill placement" issue - #6825, and also #6806 is related.
Thanks. I am going to move this to future, as the likely sources of the regression are now tracked by other issues that are too complex to address for 2.1.
The issues seen here are covered by other open issues, so closing.
There appear to be 5% and 3% performance regressions in release/2.1 as compared to 7f143364a8ed1ebc51356195fc90bbe972c2c038.
A possible cause for this regression has yet to be found.
category:cq
theme:benchmarks
skill-level:intermediate
cost:small