-
Notifications
You must be signed in to change notification settings - Fork 4.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Certain loops do not get recorded in optLoopTable #43713
Comments
@dotnet/jit-contrib |
There are two sibling inner loops. The first is not recognized because the immediately preceding HEAD block doesn't fall or branch into the loop. The second is not recognized for a different reason. The HEAD block, which points to ENTRY, must be the ONLY outside-loop predecessor of ENTRY (there is some exception for HEAD dominating another predecessor). The second loop ENTRY has two outside-loop predecessors, corresponding to the two failure cases of the preceding loop condition. The loop conditions are interesting because both the inner loops have multiple-clause loop conditions that necessitate multiple blocks, each branching directly to the entry of the next loop. |
As in Andy's discussion we want loops to be transformed into a |
I have also noticed that we do not make an entry of any cloned loops inside |
I hit this problem again when investigating a regression in one of the benchmark because of loop alignment. The problem was in the generated code for Lines 19 to 37 in 9f921f4
Here, we clone both the loops shown above. While cloning, we also copy the fact that these loops need alignment. However none of the cloned loop is recorded in In below assembly code, Assembly code of IndexOf(); Assembly listing for method System.Collections.Generic.GenericEqualityComparer`1[__Canon][System.__Canon]:IndexOf(System.__Canon[],System.__Canon,int,int):int:this
; Emitting BLENDED_CODE for X64 CPU with AVX - Windows
; Tier-1 compilation
; optimized code
; rsp based frame
; fully interruptible
; Final local variable assignments
;
; V00 this [V00,T08] ( 4, 6 ) ref -> rbx this class-hnd
; V01 arg1 [V01,T02] ( 15, 34 ) ref -> rsi class-hnd
; V02 arg2 [V02,T07] ( 5, 7 ) ref -> rdi class-hnd
; V03 arg3 [V03,T09] ( 7, 5 ) int -> r9
; V04 arg4 [V04,T12] ( 1, 1 ) int -> [rsp+0x80]
; V05 loc0 [V05,T04] ( 11, 20 ) int -> rbp
; V06 loc1 [V06,T01] ( 12, 37.50) int -> rax
; V07 loc2 [V07,T00] ( 15, 43.50) int -> r14
; V08 OutArgs [V08 ] ( 1, 1 ) lclBlk (32) [rsp+0x00] "OutgoingArgSpace"
; V09 tmp1 [V09,T05] ( 6, 20 ) long -> rcx "impRuntimeLookup slot"
; V10 tmp2 [V10,T06] ( 4, 16 ) byref -> r15 "impAppendStmt"
; V11 tmp3 [V11,T03] ( 8, 24 ) long -> r11 "spilling Runtime Lookup tree"
;* V12 tmp4 [V12 ] ( 0, 0 ) long -> zero-ref "VirtualCall with runtime lookup"
; V13 cse0 [V13,T10] ( 3, 5 ) long -> r11 "CSE - aggressive"
; V14 cse1 [V14,T11] ( 3, 5 ) long -> r11 "CSE - aggressive"
;
; Lcl frame size = 40
G_M30624_IG01: ;; offset=0000H
4157 push r15
4156 push r14
57 push rdi
56 push rsi
55 push rbp
53 push rbx
4883EC28 sub rsp, 40
48894C2420 mov qword ptr [rsp+20H], rcx
488BD9 mov rbx, rcx
488BF2 mov rsi, rdx
498BF8 mov rdi, r8
;; bbWeight=1 PerfScore 8.00
G_M30624_IG02: ;; offset=001AH
418BE9 mov ebp, r9d
03AC2480000000 add ebp, dword ptr [rsp+80H]
4885FF test rdi, rdi
0F857F000000 jne G_M30624_IG12
;; bbWeight=1 PerfScore 2.50
G_M30624_IG03: ;; offset=002DH
418BC1 mov eax, r9d
443BCD cmp r9d, ebp
0F8D5D010000 jge G_M30624_IG27
4885F6 test rsi, rsi
7442 je SHORT G_M30624_IG08
;; bbWeight=0.50 PerfScore 1.38
G_M30624_IG04: ;; offset=003EH
448BC8 mov r9d, eax
41F7D1 not r9d
41C1E91F shr r9d, 31
8BCD mov ecx, ebp
F7D1 not ecx
C1E91F shr ecx, 31
4123C9 and ecx, r9d
396E08 cmp dword ptr [rsi+8], ebp
0F9DC2 setge dl
0FB6D2 movzx rdx, dl
85CA test ecx, edx
7421 je SHORT G_M30624_IG08
90 align
;; bbWeight=0.50 PerfScore 5.00
G_M30624_IG05: ;; offset=0060H
4863C8 movsxd rcx, eax
48837CCE1000 cmp gword ptr [rsi+8*rcx+16], 0
7434 je SHORT G_M30624_IG11
;; bbWeight=4 PerfScore 13.00
G_M30624_IG06: ;; offset=006BH
FFC0 inc eax
3BC5 cmp eax, ebp
7CEF jl SHORT G_M30624_IG05
;; bbWeight=4 PerfScore 6.00
G_M30624_IG07: ;; offset=0071H
E920010000 jmp G_M30624_IG27
66660F1F840000000000 align
;; bbWeight=0.50 PerfScore 1.12
G_M30624_IG08: ;; offset=0080H
3B4608 cmp eax, dword ptr [rsi+8]
0F832F010000 jae G_M30624_IG31
4863C8 movsxd rcx, eax
48837CCE1000 cmp gword ptr [rsi+8*rcx+16], 0
740B je SHORT G_M30624_IG11
;; bbWeight=4 PerfScore 25.00
G_M30624_IG09: ;; offset=0094H
FFC0 inc eax
3BC5 cmp eax, ebp
7CE6 jl SHORT G_M30624_IG08
;; bbWeight=4 PerfScore 6.00
G_M30624_IG10: ;; offset=009AH
E9F7000000 jmp G_M30624_IG27
;; bbWeight=0.50 PerfScore 1.00
G_M30624_IG11: ;; offset=009FH
4883C428 add rsp, 40
5B pop rbx
5D pop rbp
5E pop rsi
5F pop rdi
415E pop r14
415F pop r15
C3 ret
;; bbWeight=0.50 PerfScore 2.12
G_M30624_IG12: ;; offset=00ACH
458BF1 mov r14d, r9d
443BCD cmp r9d, ebp
0F8DDE000000 jge G_M30624_IG27
4885F6 test rsi, rsi
0F847F000000 je G_M30624_IG21
;; bbWeight=0.50 PerfScore 1.38
G_M30624_IG13: ;; offset=00C1H
418BCE mov ecx, r14d
F7D1 not ecx
C1E91F shr ecx, 31
8BD5 mov edx, ebp
F7D2 not edx
C1EA1F shr edx, 31
23CA and ecx, edx
396E08 cmp dword ptr [rsi+8], ebp
0F9DC2 setge dl
0FB6D2 movzx rdx, dl
85CA test ecx, edx
7461 je SHORT G_M30624_IG21
;; bbWeight=0.50 PerfScore 4.88
G_M30624_IG14: ;; offset=00DFH
4963CE movsxd rcx, r14d
48837CCE1000 cmp gword ptr [rsi+8*rcx+16], 0
744B je SHORT G_M30624_IG19
;; bbWeight=4 PerfScore 13.00
G_M30624_IG15: ;; offset=00EAH
443B7608 cmp r14d, dword ptr [rsi+8]
0F83C4000000 jae G_M30624_IG31
4963CE movsxd rcx, r14d
4C8D7CCE10 lea r15, bword ptr [rsi+8*rcx+16]
488B0B mov rcx, qword ptr [rbx]
488B5138 mov rdx, qword ptr [rcx+56]
488B5208 mov rdx, qword ptr [rdx+8]
4C8B5A18 mov r11, qword ptr [rdx+24]
4D85DB test r11, r11
7402 je SHORT G_M30624_IG17
;; bbWeight=2 PerfScore 27.00
G_M30624_IG16: ;; offset=0110H
EB12 jmp SHORT G_M30624_IG18
;; bbWeight=1 PerfScore 2.00
G_M30624_IG17: ;; offset=0112H
48BAE8D2AD76FA7F0000 mov rdx, 0x7FFA76ADD2E8
E8BF6E065F call CORINFO_HELP_RUNTIMEHANDLE_CLASS
4C8BD8 mov r11, rax
;; bbWeight=1 PerfScore 1.50
G_M30624_IG18: ;; offset=0124H
498B0F mov rcx, gword ptr [r15]
488BD7 mov rdx, rdi
41FF13 call qword ptr [r11]
85C0 test eax, eax
7577 jne SHORT G_M30624_IG29
;; NOP compensation instructions of 4 bytes.
;; bbWeight=2 PerfScore 13.00
G_M30624_IG19: ;; offset=0135H
41FFC6 inc r14d
443BF5 cmp r14d, ebp
7CA2 jl SHORT G_M30624_IG14
;; bbWeight=4 PerfScore 6.00
G_M30624_IG20: ;; offset=013DH
EB57 jmp SHORT G_M30624_IG27
90 align
;; bbWeight=0.50 PerfScore 1.12
G_M30624_IG21: ;; offset=0140H
443B7608 cmp r14d, dword ptr [rsi+8]
7372 jae SHORT G_M30624_IG31
4963CE movsxd rcx, r14d
48837CCE1000 cmp gword ptr [rsi+8*rcx+16], 0
743D je SHORT G_M30624_IG26
;; bbWeight=4 PerfScore 25.00
G_M30624_IG22: ;; offset=0151H
4963CE movsxd rcx, r14d
4C8D7CCE10 lea r15, bword ptr [rsi+8*rcx+16]
488B0B mov rcx, qword ptr [rbx]
488B5138 mov rdx, qword ptr [rcx+56]
488B5208 mov rdx, qword ptr [rdx+8]
4C8B5A18 mov r11, qword ptr [rdx+24]
4D85DB test r11, r11
7402 je SHORT G_M30624_IG24
;; bbWeight=2 PerfScore 21.00
G_M30624_IG23: ;; offset=016DH
EB12 jmp SHORT G_M30624_IG25
;; bbWeight=1 PerfScore 2.00
G_M30624_IG24: ;; offset=016FH
48BAE8D2AD76FA7F0000 mov rdx, 0x7FFA76ADD2E8
E8626E065F call CORINFO_HELP_RUNTIMEHANDLE_CLASS
4C8BD8 mov r11, rax
;; bbWeight=1 PerfScore 1.50
G_M30624_IG25: ;; offset=0181H
498B0F mov rcx, gword ptr [r15]
488BD7 mov rdx, rdi
41FF13 call qword ptr [r11]
85C0 test eax, eax
751A jne SHORT G_M30624_IG29
;; bbWeight=2 PerfScore 13.00
G_M30624_IG26: ;; offset=018EH
41FFC6 inc r14d
443BF5 cmp r14d, ebp
7CAA jl SHORT G_M30624_IG21
;; bbWeight=4 PerfScore 6.00
G_M30624_IG27: ;; offset=0196H
B8FFFFFFFF mov eax, -1
;; bbWeight=0.50 PerfScore 0.12
G_M30624_IG28: ;; offset=019BH
4883C428 add rsp, 40
5B pop rbx
5D pop rbp
5E pop rsi
5F pop rdi
415E pop r14
415F pop r15
C3 ret
;; bbWeight=0.50 PerfScore 2.12
G_M30624_IG29: ;; offset=01A8H
418BC6 mov eax, r14d
;; bbWeight=0.50 PerfScore 0.12
G_M30624_IG30: ;; offset=01ABH
4883C428 add rsp, 40
5B pop rbx
5D pop rbp
5E pop rsi
5F pop rdi
415E pop r14
415F pop r15
C3 ret
;; bbWeight=0.50 PerfScore 2.12
G_M30624_IG31: ;; offset=01B8H
E843D1065F call CORINFO_HELP_RNGCHKFAIL
CC int3
;; bbWeight=0 PerfScore 0.00
; Total bytes of code 446, prolog size 26, PerfScore 258.60, instruction count 147, allocated bytes for code 446 (MethodHash=c3fb885f) for method System.Collections.Generic.GenericEqualityComparer`1[__Canon][System.__Canon]:IndexOf(System.__Canon[],System.__Canon,int,int):int:this
; ============================================================ Impacted benchmarks:
Here are the list of methods that rely directly or indirectly on
|
I just noticed this comment where we have a runtime/src/coreclr/jit/optimizer.cpp Lines 5322 to 5324 in a0e8d8d
|
As part of loop normalization we should refactor flow so all loop headers have exactly one non-in-loop predecessor (for reducible loops, anyways). I expect that C# always gives us IL with reducible loops (except in methods that use goto). |
Just for completeness' sake, the "standard" flow normalization around reducible loops are:
Today the jit selectively does rotation and preheader formation; as noted here, it would be interesting to try and broaden the set of loops we treat this way. |
Here's another case of unrecognized loops: #58941. It appears that F# presents IL that doesn't match the form coming from C# that the JIT is optimized for. |
From looking at the code a bit here, it looks like that in order to better generalize the loop normalization done in
but if some of the logic were flipped, i.e.,
we could more easily transform the loops in Am I following the high-level logic correctly? |
The "general loop logic" only marks blocks, it doesn't rearrange block order, so I don't think hoisting that in the phase order will have any effect. fyi, this code is a little easier to follow with #62560 |
Thanks @BruceForstall . When I wrote "general loop logic", I meant, "marking blocks that are loops", which I thought it might be easier to then apply more transforms once we know which blocks are loops. It seemed to me the loop identification logic in |
I verified that is fixed with the new loop representation for the |
Fixed by PRs listed in #93144 (comment) |
Today, we fail to capture certain loops inside
optLoopTable
.optLoopTable
is a data structure that should ideally contain all the loops so we can make further loop optimizations like loop unrolling, loop cloning or hoist loop code.Below is the code taken from QuickSortSpan benchmark.
In the assembly code, the 2 inner while loops have this kind of representation and hence do not make it to
optLoopTable
.Assembly code
Here, none of the 3 loops (outer
for
and 2 innerwhile
) loops are present inoptLoopTable
. On investigation, @BruceForstall and I found out that it happens because the entry to the loop is below the bottom of the loop and from the entry it jumps up to the top block, however the bottom's backedge jumps to the first block of the loop.Here is the pictorial representation. Ideally, we favor below loops. The diagram is taken from optimizer.cpp.
However, in this case, the loop is little different and hence we reject it in
FindNaturalLoops()
because the code FindEntry returnsnullptr
and hence we don't recognize it as a loop.We should fix it to identify such loops or at least make an entry to
optLoopTable
to take advantage of other loop optimizations.This should be one of the part of #43549.
category:cq
theme:loop-opt
skill-level:expert
cost:large
impact:medium
The text was updated successfully, but these errors were encountered: