-
Notifications
You must be signed in to change notification settings - Fork 4.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
JIT: examples where loop cloning is not useful #8558
Comments
Similar example, with a local variable instead of a constant: static int Test(int[] a)
{
int s = 0;
int l = a.Length;
for (int i = 0; i < l; i++)
s += a[i];
return s;
} Picked up from actual corelib code: |
The bytemark Loop cloning looks like it runs in lexical order of loops, which generally means preorder over the loop nesting tree. In
because of top-down cloning we end up with 2 copies of each top level loop body, 3 copies of each nested loop, and 4 copies of each doubly-nested loop:
In this case there is a lot of imperfect nesting (outer loops have computation as well as invoking inner loops) and so a ton of code duplication for what is likely even in best case minimal gains. And in this case actual losses. So potential code growth from cloning is apparently quadratic in nesting depth. It might be amusing to create a very deep nest just to see how insane this becomes. If we think loop cloning is a good idea it should run bottom up as the inner loops are almost certainly hotter than the outer ones, and there should be some size/speed estimation in the model that would eventually halt cloning as it works its way up the loop nesting tree before it gets to this crazy level of duplication. |
Another example with some odd behavior. I'd expect F1 and F2 to produce similar code, but F1 has cloning and F2 doesn't (likewise for F4 and F5).
Also thinking if we really can split the two versions as "won't throw / must throw" then the latter should be moved to the cold part of the method. If it's "won't throw / may throw" then perhaps not... |
@AndyAyersMS Proposal: lower limit for constant-size loops: EgorBo@0efefa1 public static int SumOfFirstThreeElements(int[] arr)
{
int sum = 0;
for (int i = 0; i < 3; i++)
sum += arr[i];
return sum;
} With loop-cloning: ; Method Program:SumOfFirstThreeElements(System.Int32[]):int
G_M22200_IG01:
4883EC28 sub rsp, 40
;; bbWeight=1 PerfScore 0.25
G_M22200_IG02:
33C0 xor eax, eax
33D2 xor edx, edx
4885C9 test rcx, rcx
7417 je SHORT G_M22200_IG06
;; bbWeight=1 PerfScore 1.75
G_M22200_IG03:
83790803 cmp dword ptr [rcx+8], 3
7C11 jl SHORT G_M22200_IG06
;; bbWeight=1 PerfScore 3.00
G_M22200_IG04:
4C63C2 movsxd r8, edx
4203448110 add eax, dword ptr [rcx+4*r8+16]
FFC2 inc edx
83FA03 cmp edx, 3
7CF1 jl SHORT G_M22200_IG04
;; bbWeight=4 PerfScore 15.00
G_M22200_IG05:
EB14 jmp SHORT G_M22200_IG07
;; bbWeight=1 PerfScore 2.00
G_M22200_IG06:
3B5108 cmp edx, dword ptr [rcx+8]
7314 jae SHORT G_M22200_IG08
4C63C2 movsxd r8, edx
4203448110 add eax, dword ptr [rcx+4*r8+16]
FFC2 inc edx
83FA03 cmp edx, 3
7CEC jl SHORT G_M22200_IG06
;; bbWeight=4 PerfScore 27.00
G_M22200_IG07:
4883C428 add rsp, 40
C3 ret
;; bbWeight=1 PerfScore 1.25
G_M22200_IG08:
E81E78755F call CORINFO_HELP_RNGCHKFAIL
CC int3
;; bbWeight=0 PerfScore 0.00
; Total bytes of code: 67 Without loop-cloning for small constant-size loops (by small I mean amount of iterations) ; Method Program:SumOfFirstThreeElements(System.Int32[]):int
G_M22200_IG01:
4883EC28 sub rsp, 40
;; bbWeight=1 PerfScore 0.25
G_M22200_IG02:
33C0 xor eax, eax
33D2 xor edx, edx
448B4108 mov r8d, dword ptr [rcx+8]
;; bbWeight=1 PerfScore 2.50
G_M22200_IG03:
413BD0 cmp edx, r8d
7314 jae SHORT G_M22200_IG05
4C63CA movsxd r9, edx
4203448910 add eax, dword ptr [rcx+4*r9+16]
FFC2 inc edx
83FA03 cmp edx, 3
7CEC jl SHORT G_M22200_IG03
;; bbWeight=4 PerfScore 20.00
G_M22200_IG04:
4883C428 add rsp, 40
C3 ret
;; bbWeight=1 PerfScore 1.25
G_M22200_IG05:
E8B679765F call CORINFO_HELP_RNGCHKFAIL
CC int3
;; bbWeight=0 PerfScore 0.00
; Total bytes of code: 43 |
Suggestion: don't clone loops with unmanaged calls inside - it leads to a huge size increase, e.g. [DllImport("Kernel32")]
//[SuppressGCTransition]
static extern int GetTickCount();
static void Test(int[] array)
{
for (int i = 0; i < 100; i++)
DoWork(array[i] + GetTickCount());
}
[MethodImpl(MethodImplOptions.NoInlining)]
static void DoWork(int t) {} ; Method Program:Test(System.Int32[])
G_M38108_IG01:
55 push rbp
4157 push r15
4156 push r14
4155 push r13
4154 push r12
57 push rdi
56 push rsi
53 push rbx
4883EC68 sub rsp, 104
488DAC24A0000000 lea rbp, [rsp+A0H]
488BF1 mov rsi, rcx
;; bbWeight=1 PerfScore 9.00
G_M38108_IG02:
488D4D88 lea rcx, [rbp-78H]
498BD2 mov rdx, r10
E8B9B8765F call CORINFO_HELP_INIT_PINVOKE_FRAME
488BF8 mov rdi, rax
488BC4 mov rax, rsp
488945A8 mov qword ptr [rbp-58H], rax
488BC5 mov rax, rbp
488945B8 mov qword ptr [rbp-48H], rax
33DB xor ebx, ebx
4885F6 test rsi, rsi
746E je SHORT G_M38108_IG09
;; bbWeight=1 PerfScore 6.00
G_M38108_IG03:
837E0864 cmp dword ptr [rsi+8], 100
7C68 jl SHORT G_M38108_IG09
;; bbWeight=1 PerfScore 3.00
G_M38108_IG04:
4863C3 movsxd rax, ebx
48897510 mov gword ptr [rbp+10H], rsi
448B748610 mov r14d, dword ptr [rsi+4*rax+16]
48B818545785FC7F0000 mov rax, 0x7FFC85575418
48894598 mov qword ptr [rbp-68H], rax
488D0516000000 lea rax, G_M38108_IG06
488945B0 mov qword ptr [rbp-50H], rax
488D4588 lea rax, bword ptr [rbp-78H]
48894710 mov qword ptr [rdi+16], rax
C6470C00 mov byte ptr [rdi+12], 0
;; bbWeight=4 PerfScore 36.00
G_M38108_IG05:
FF1504802400 call [Program:GetTickCount():int]
;; bbWeight=4 PerfScore 12.00
G_M38108_IG06:
C6470C01 mov byte ptr [rdi+12], 1
833D8998F75F00 cmp dword ptr [(reloc)], 0
7406 je SHORT G_M38108_IG07
FF1591DBF55F call [CORINFO_HELP_STOP_FOR_GC]
;; bbWeight=4 PerfScore 28.00
G_M38108_IG07:
488B4D90 mov rcx, bword ptr [rbp-70H]
48894F10 mov qword ptr [rdi+16], rcx
428D0C30 lea ecx, [rax+r14]
E8C881FEFF call Program:DoWork(int)
FFC3 inc ebx
83FB64 cmp ebx, 100
488B7510 mov rsi, gword ptr [rbp+10H]
7C9A jl SHORT G_M38108_IG04
;; bbWeight=4 PerfScore 24.00
G_M38108_IG08:
EB6B jmp SHORT G_M38108_IG13
;; bbWeight=1 PerfScore 2.00
G_M38108_IG09:
3B5E08 cmp ebx, dword ptr [rsi+8]
7377 jae SHORT G_M38108_IG14
4863C3 movsxd rax, ebx
48897510 mov gword ptr [rbp+10H], rsi
448B748610 mov r14d, dword ptr [rsi+4*rax+16]
48B818545785FC7F0000 mov rax, 0x7FFC85575418
48894598 mov qword ptr [rbp-68H], rax
488D0516000000 lea rax, G_M38108_IG11
488945B0 mov qword ptr [rbp-50H], rax
488D4588 lea rax, bword ptr [rbp-78H]
48894710 mov qword ptr [rdi+16], rax
C6470C00 mov byte ptr [rdi+12], 0
;; bbWeight=4 PerfScore 48.00
G_M38108_IG10:
FF15977F2400 call [Program:GetTickCount():int]
;; bbWeight=4 PerfScore 12.00
G_M38108_IG11:
C6470C01 mov byte ptr [rdi+12], 1
833D1C98F75F00 cmp dword ptr [(reloc)], 0
7406 je SHORT G_M38108_IG12
FF1524DBF55F call [CORINFO_HELP_STOP_FOR_GC]
;; bbWeight=4 PerfScore 28.00
G_M38108_IG12:
488B4D90 mov rcx, bword ptr [rbp-70H]
48894F10 mov qword ptr [rdi+16], rcx
428D0C30 lea ecx, [rax+r14]
E85B81FEFF call Program:DoWork(int)
FFC3 inc ebx
83FB64 cmp ebx, 100
488B7510 mov rsi, gword ptr [rbp+10H]
7C95 jl SHORT G_M38108_IG09
;; bbWeight=4 PerfScore 24.00
G_M38108_IG13:
488D65C8 lea rsp, [rbp-38H]
5B pop rbx
5E pop rsi
5F pop rdi
415C pop r12
415D pop r13
415E pop r14
415F pop r15
5D pop rbp
C3 ret
;; bbWeight=1 PerfScore 5.50
G_M38108_IG14:
E83281765F call CORINFO_HELP_RNGCHKFAIL
CC int3
;; bbWeight=0 PerfScore 0.00
; Total bytes of code: 303 |
Another small trip count example (from this stack overflow issue) using System;
public class C {
static void M1(int[] left, int[] right)
{
for (int i = 0; i < 5; i++)
{
left[i] = 1;
right[i] = 1;
}
}
static void M2(int[] left, int[] right)
{
for (int i = 0; i < 10; i+=2)
{
left[i] = 1;
right[i] = 1;
}
}
} We only clone the loop in
Not sure why we materialize all those bools and then test them, seems like we should perhaps just branch after each test. |
#96851 (comment) has another example in regex of a questionable cloned loop. |
In some cases loops are cloned with no performance benefit, and it would be better to not clone the loop at all, as the jit would run faster and produce smaller code.
For instance in benchi\iniarray, the two cloned copies are identical (in fact the array length is known and you can see there are no bounds checks):
category:cq
theme:loop-opt
skill-level:expert
cost:medium
The text was updated successfully, but these errors were encountered: