JIT optimization: loop unrolling #8107

JosephTremoulet · 2017-05-15T15:53:23Z

(I'm creating tracking issues for some optimizations that RyuJit doesn't perform, so we'll have a place to reference/note when we see the lack of them affecting particular benchmarks)

There's code in the Jit today that goes by the name "loop unrolling", but it's only doing full unrolls, and only for SIMD loops. General loop unrolling (to balance ALU ops vs branching) is not performed by the jit at all today.
category:cq
theme:loop-opt
skill-level:expert
cost:extra-large

JosephTremoulet · 2017-05-15T16:03:44Z

This is also an important source of code quality disparity vs Jit64 in BenchI-MulMtx and BenchI-IniArray

AndyAyersMS · 2018-05-04T23:00:28Z

See also #4248

ArtBlnd · 2018-05-09T05:21:42Z

I have some question

does it means current ryuJIT doesn't perform unrolling, rerolling loop for loop caching?
only specalized for full-unrolling, simd-vectorizing?

Thanks.

[ADDED] Can I get some testcases that should be unrolled?

AndyAyersMS · 2018-05-09T16:13:38Z

Yes, that's basically what it means. Logic for determining which loops to unroll is in optUnrollLoops(). Current heuristic is that the loop bounds test must be a SIMD element count.

An example where unrolling kicks in: SeekUnroll.cs

ArtBlnd · 2018-05-10T00:00:03Z

I meant can I have that should be unrolled and shouldn't be unrolled?

because the coreclr optimization theory is kinda different from LLVM, I saw some docs on this. and it seems DOES NOT optimizes non-natural loops. thats why I need case that shouldn't be unrolled.
if the loop doesn't unroll because its predicated as non-natural loop, it should be extended.
I need the bottom line that should be optimized or not.

this will may help me to contribute on this.

also, is that SIMD check does with LPFLG_DONT_UNROLL flag which is on here?
https://github.com/dotnet/coreclr/blob/dd089b270c5d4833e615f2a926ea0a82d070e5de/src/jit/optimizer.cpp#L3636-L3639

Thanks :>

AndyAyersMS · 2018-05-10T00:49:14Z

Broadly speaking most loops should be unrollable. After all, it is a fairly mechanical operation.

In practice there are cases where the JIT can't properly represent the results of unrolling, and those loops are flagged by various critiera, eg:

LPFLG_DONT_UNROLL -- bail if loop was created by loop cloning or has an unclonable statement
LPFLG_DO_WHILE -- loop must be bottom tested
LPFLG_CONST -- loop iteration count is a known constant
Other bail-out checks done in optUnrollLoops, for instance, the iteration variable is address exposed or a struct field, or too many returns (this one should probably never be hit anymore since the JIT now moves non-loop blocks out of loops). There are probably more.

A second set of criteria tries to determine when loop unrolling is beneficial. Unrolling is quite likely going to increase code size. So the JIT will try and estimate that size increase and have limits on how much growth it will allow. But the JIT does not have a good model for the benefit of unrolling a loop. To avoid being overly aggressive but still unroll in some cases where there is likely a benefit, it will currently insist the loop iteration count be a SIMD vector length.

If you want to contribute, there are a few broad areas:

First, find or create a set of examples to use to evaluate the unrolling heuristics. Ideally we would have a good number of cases where unrolling is beneficial and a number of other cases where it is not beneficial.
- In the latter group you should distinguish between cases where unrolling ought to be beneficial but isn't because of other problems in the JIT (and make sure to open issues for those), and cases where unrolling is just going to increase code size with no performance benefit.
- Since you can unroll by hand you can use source level transformations to ballpark the potential benefits from unrolling (again subject to JIT limitations).
- Make sure to also include cases where the JIT can't unroll because of its various internal limitations.
Second, remove correctness constraints and limitations that block unrolling so more kinds of loops can be unrolled.
Third, add some realistic benefit heuristics and tune on your test suite plus our benchmarks and realistic applications.
Fourth, support unrolling in more scenarios:
- high trip count loops (partial unrolls)
- fixed but unknown trip count loops
- search loops

If you're interested in developing better loop unrolling benefit heuristics, you might also look upstream at loop cloning, which also is lacking benefit heuristics (and has a number of other issues too).

It would be ideal if the utilities we develop for evaluating loop performance can be leveraged for other kinds of loop optimizations.

Also at some point, likely not too far into this program, you will discover that other parts of the JIT's loop optimization infrastructure will need upgrades.

ArtBlnd · 2018-05-10T00:58:55Z

Thanks :> Its helped a lot. I will try to fix it ASAP

ArtBlnd · 2018-05-10T23:40:56Z

I think we have to start with a very simple cases
here is most general case that should be unrolled.

int[] arr = { 1, 2, 3, 4 };
int total = 0;

for(int i = 0; i < 4; ++i)
{
    total += arr[i];
}

the loop backedge-count is constant. also there is realistic banefits on it.

here is current JIT generated assemblies here.

    11:             int[] val_1 = { 1, 2, 3, 4 };
00007FFDF2F204B0  sub         rsp,28h  
00007FFDF2F204B4  vzeroupper  
00007FFDF2F204B7  mov         rcx,7FFE299931BAh  
00007FFDF2F204C1  mov         edx,4  
00007FFDF2F204C6  call        00007FFE52A24400  
00007FFDF2F204CB  mov         rdx,1B49308280Ch  
00007FFDF2F204D5  mov         ecx,dword ptr [rax+8]  
00007FFDF2F204D8  lea         r8,[rax+10h]  
00007FFDF2F204DC  vmovdqu     xmm0,xmmword ptr [rdx]  
00007FFDF2F204E1  vmovdqu     xmmword ptr [r8],xmm0  
    12:             int total = 0;
00007FFDF2F204E6  xor         edx,edx  
    13: 
    14:             for(int i = 0; i < 4; ++i)
00007FFDF2F204E8  xor         r8d,r8d  
00007FFDF2F204EB  cmp         ecx,4  
00007FFDF2F204EE  jl          00007FFDF2F20502  
    15:             {
    16:                 total += val_1[i];
00007FFDF2F204F0  movsxd      rcx,r8d  
00007FFDF2F204F3  add         edx,dword ptr [rax+rcx*4+10h]  
    14:             for(int i = 0; i < 4; ++i)
00007FFDF2F204F7  inc         r8d  
00007FFDF2F204FA  cmp         r8d,4  
00007FFDF2F204FE  jl          00007FFDF2F204F0  
00007FFDF2F20500  jmp         00007FFDF2F20512  
00007FFDF2F20502  movsxd      rcx,r8d  
00007FFDF2F20505  add         edx,dword ptr [rax+rcx*4+10h]  
00007FFDF2F20509  inc         r8d  
00007FFDF2F2050C  cmp         r8d,4  
00007FFDF2F20510  jl          00007FFDF2F20502  
    17:             }

loop didn't removed. also DIDN'T unrolled
current progress in.

ArtBlnd · 2018-05-15T01:41:41Z

Current progress in for partial loop unrolling.

for(unsigned int i = 0; i < 256; ++i)
{
    total += array[i];
}

to

for(unsigned int i = 0; i < 64; ++i)
{
    total += array[i];
    total += array[i + 1];
    total += array[i + 2];
    total += array[i + 3];
}

the partial unroll limit threshold is maximum to 64 bytes, normally 32 bytes.

btw, does vectorizing processes after loop optimizing?
Thanks :>

mikedn · 2018-05-15T04:12:19Z

does vectorizing processes after loop optimizing?

The JIT does not do autovectorization.

ArtBlnd · 2018-05-15T05:40:05Z

Thanks @mikedn, is there any auto-vectorization plan exists?

mikedn · 2018-05-15T05:53:47Z

is there any auto-vectorization plan exists?

None that I've heard of.

ArtBlnd · 2018-05-16T23:52:10Z

Can I ask you something? @mikedn

Is IR simpify processes before the loop optimizations? such as this.

int total = 0;
for(unsigned int i = 0; i < 4; ++i)
{
    total += 2;
    total -= 1;
}

to

int total = 0;
for(unsigned int i = 0; i < 4; ++i)
{
    total += 1;
}

Also, how can I extract loop's body from LoopDsc.

Thanks.

mikedn · 2018-05-17T05:23:28Z

Is IR simpify processes before the loop optimizations?

Some happens during global morph phase that runs before loop optimizations. Some happens after loop optimizations as part of SSA/VN based optimizations. That said, I don't think the JIT simplifies that particular code you have in your example.

Also, how can I extract loop's body from LoopDsc.

See the comments associated with LoopDsc and LoopSearch classes, most notably the parts about loop body being a contiguous list of BBs, for example:

We (currently) require the body of a loop to be a contiguous (in bbNext order) sequence of basic blocks

ArtBlnd · 2018-05-18T21:36:42Z

Can I ask what ASG gt node does? there is no comments on gtlist.h
btw, almost done!! stay turned!! :>

mikedn · 2018-05-18T21:58:29Z

Heh, you can ask but that's a tough question. The JIT team may have a different opinion but my personal opinion is that ASG does many things - complicates things, slow down the JIT, stands in the way and in general being obnoxious.

Basically it models an a = b assignment. Except that there's no need for it since you could model that by something like store(a, load(b)). And in fact the JIT does just in later phases, after rationalization.

ArtBlnd · 2018-05-19T14:39:18Z

feels like phi node. thanks

mikedn · 2018-05-19T17:20:33Z

feels like phi node

Hmm, not really. A PHI is a "function" that produces a value but ASG is more like an "operation". A better analogy is the mov instruction that exists in various assembly languages, including x86/64 and ARM.

ArtBlnd · 2018-05-20T10:10:57Z

there is no ASG node on LLVM, I just understood in my way. so nevermind :>

ArtBlnd · 2018-05-22T05:30:58Z

How can I check is there any range-checks on block? should I iterate all of GenTrees?

ArtBlnd · 2018-05-22T05:33:50Z

I think this might late a lot... there was major issues with range-checks. sorry :<
this instruction weight check is quite hard. can I have more times?

mikedn · 2018-05-22T07:39:13Z

How can I check is there any range-checks on block? should I iterate all of GenTrees?

Hmm, there's a BasicBlock flag that indicates that a block contains range checks - BBF_HAS_IDX_LEN. However, I'm not sure how reliable it is and it is also set when a block contains array length references. So yes, you'd need to visit all nodes to figure out if there are range-checks or not. But it really depends on why do you need to know if range checks are present. They're not the only nodes that may impact the benefit analysis.

can I have more times?

You can have as much time as you need. Unless this is some kind of school project, then you'll need to ask your teacher 😄

mikedn · 2018-05-22T07:55:53Z

this instruction weight check is quite hard

Have you done any measurements to evaluate the impact of loop unrolling on current hardware? Loop unrolling is fortunately an optimization that can be implemented manually, without any compiler support. This means that you can measure the potential improvement provided by unrolling without writing any JIT code.
It may turn out that trivial loop unrolling such as

for (int i = 0; i < 64; ++i)
{
    total += array[i];
    total += array[i + 1];
    total += array[i + 2];
    total += array[i + 3];
}

is not very effective. It may turn out that splitting the loop-carried dependency chain in 4 is far more useful:

for (int i = 0; i < 64; ++i)
{
    total1 += array[i];
    total2 += array[i + 1];
    total3 += array[i + 2];
    total4 += array[i + 3];
}
total = total1 + total2 + total3 + total4;

In any case, you need to measure. And it's best to do that before attempting to implement such an optimization.

ArtBlnd · 2018-05-22T10:09:38Z

Have you done any measurements to evaluate the impact of loop unrolling on current hardware?

Yes (In LLVM, I was working on SCEV, loop-deletion), and I've read all of intel optimization manual like a 3 times with a year.
I mean not a real instruction weight. on IL instruction. checking is that effective with partial unrolling is quite hard.

ArtBlnd · 2018-05-24T01:35:02Z

Is there are any custom defined ADT (abstract data types) in coreclr? such as list, vector. or should I use std?
cuz I want to make a GenTree list without copying it.

[ADDED] nvm, there was jitstd instead of it.

ArtBlnd · 2018-05-28T06:16:48Z

Is there are any way to clone block without changing? thanks :>

mikedn · 2018-05-28T18:41:57Z

What do you mean by "without changing"?

ArtBlnd · 2018-05-28T23:28:48Z

I mean, CloneBlockState does change variant as invariant. I need copy without changing anything.

AndyAyersMS · 2018-05-29T07:19:40Z

CloneBlockState won't change things unless you pass in the optional arguments.

ArtBlnd · 2018-08-20T00:48:22Z

I'm currently working on new loop unrolling implements based on inner unrolled count.
with new unroll count theory. (base is same. but for partial unrolling)
not to implement partial unrolling and full unrolling separately. :>

ArtBlnd · 2018-08-20T00:53:35Z

I want to know that every single noway_assert check is needed when its initializing basic loop informations.

here is example.
https://github.com/dotnet/coreclr/blob/b2031a14a3b110f8e712abdb0cd392c54cd2e839/src/jit/optimizer.cpp#L3617-L3627

almost of all seems duplicated. should I try to remove it?

tannergooding · 2019-06-06T23:06:16Z

The SIMD support was also extended from just Vector<T>.Count to also cover Vector64<T>.Count, Vector128<T>.Count, and Vector256<T>.Count in dotnet/coreclr#24991.

However, that support is really only when using Count as the upper bound of the loop. It would be beneficial to extend the support to cover simple auto-vectorization (https://github.com/dotnet/coreclr/issues/20486) and to also cover unrolling more complex loops (such as those containing already vectorized code).

ArtBlnd · 2019-06-07T09:18:04Z

@tannergooding What kind of complex loops? just for example.

mikedn · 2019-06-07T10:45:30Z

However, that support is really only when using Count as the upper bound of the loop.

That was done as a special case, for code that was trying to access individual vector elements in a loop (and that's likely driven by SIMD API limitations such as the lack of shuffles). It's not clear why do you think that loops having stride Vector<T>.Count are somewhat special in terms of unrolling compared to other loops or why is this related to auto-vectorization.

msftgits transferred this issue from dotnet/coreclr Jan 31, 2020

msftgits added this to the Future milestone Jan 31, 2020

abelbraaksma mentioned this issue Jun 19, 2020

String mapi ~2.5x perf improvement (v2) dotnet/fsharp#9482

Merged

AndyAyersMS mentioned this issue Aug 20, 2020

Loop Unrolling is not Enabled in Release Build #41063

Closed

BruceForstall mentioned this issue Oct 17, 2020

Improve JIT loop optimizations (.NET 6) #43549

Closed

25 tasks

BruceForstall added the JitUntriaged CLR JIT issues needing additional triage label Oct 28, 2020

Rosuavio mentioned this issue Nov 16, 2020

Performance improvement in Count<T> extension CommunityToolkit/WindowsCommunityToolkit#3548

Merged

7 tasks

BruceForstall removed the JitUntriaged CLR JIT issues needing additional triage label Jan 15, 2021

BruceForstall mentioned this issue Jul 6, 2021

Improve JIT loop optimizations (.NET 7) #55235

Closed

5 tasks

BruceForstall mentioned this issue Feb 15, 2022

Improve JIT loop optimizations #65342

Open

20 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

JIT optimization: loop unrolling #8107

JIT optimization: loop unrolling #8107

JosephTremoulet commented May 15, 2017

JosephTremoulet commented May 15, 2017

AndyAyersMS commented May 4, 2018

ArtBlnd commented May 9, 2018 •

edited

Loading

AndyAyersMS commented May 9, 2018

ArtBlnd commented May 10, 2018 •

edited

Loading

AndyAyersMS commented May 10, 2018

ArtBlnd commented May 10, 2018 •

edited

Loading

ArtBlnd commented May 10, 2018

ArtBlnd commented May 15, 2018 •

edited

Loading

mikedn commented May 15, 2018

ArtBlnd commented May 15, 2018 •

edited

Loading

mikedn commented May 15, 2018

ArtBlnd commented May 16, 2018 •

edited

Loading

mikedn commented May 17, 2018

ArtBlnd commented May 18, 2018 •

edited

Loading

mikedn commented May 18, 2018

ArtBlnd commented May 19, 2018 •

edited

Loading

mikedn commented May 19, 2018

ArtBlnd commented May 20, 2018

ArtBlnd commented May 22, 2018

ArtBlnd commented May 22, 2018 •

edited

Loading

mikedn commented May 22, 2018

mikedn commented May 22, 2018

ArtBlnd commented May 22, 2018 •

edited

Loading

ArtBlnd commented May 24, 2018 •

edited

Loading

ArtBlnd commented May 28, 2018

mikedn commented May 28, 2018

ArtBlnd commented May 28, 2018

AndyAyersMS commented May 29, 2018

ArtBlnd commented Aug 20, 2018 •

edited

Loading

ArtBlnd commented Aug 20, 2018 •

edited

Loading

tannergooding commented Jun 6, 2019

ArtBlnd commented Jun 7, 2019

mikedn commented Jun 7, 2019

JIT optimization: loop unrolling #8107

JIT optimization: loop unrolling #8107

Comments

JosephTremoulet commented May 15, 2017

JosephTremoulet commented May 15, 2017

AndyAyersMS commented May 4, 2018

ArtBlnd commented May 9, 2018 • edited Loading

AndyAyersMS commented May 9, 2018

ArtBlnd commented May 10, 2018 • edited Loading

AndyAyersMS commented May 10, 2018

ArtBlnd commented May 10, 2018 • edited Loading

ArtBlnd commented May 10, 2018

ArtBlnd commented May 15, 2018 • edited Loading

mikedn commented May 15, 2018

ArtBlnd commented May 15, 2018 • edited Loading

mikedn commented May 15, 2018

ArtBlnd commented May 16, 2018 • edited Loading

mikedn commented May 17, 2018

ArtBlnd commented May 18, 2018 • edited Loading

mikedn commented May 18, 2018

ArtBlnd commented May 19, 2018 • edited Loading

mikedn commented May 19, 2018

ArtBlnd commented May 20, 2018

ArtBlnd commented May 22, 2018

ArtBlnd commented May 22, 2018 • edited Loading

mikedn commented May 22, 2018

mikedn commented May 22, 2018

ArtBlnd commented May 22, 2018 • edited Loading

ArtBlnd commented May 24, 2018 • edited Loading

ArtBlnd commented May 28, 2018

mikedn commented May 28, 2018

ArtBlnd commented May 28, 2018

AndyAyersMS commented May 29, 2018

ArtBlnd commented Aug 20, 2018 • edited Loading

ArtBlnd commented Aug 20, 2018 • edited Loading

tannergooding commented Jun 6, 2019

ArtBlnd commented Jun 7, 2019

mikedn commented Jun 7, 2019

ArtBlnd commented May 9, 2018 •

edited

Loading

ArtBlnd commented May 10, 2018 •

edited

Loading

ArtBlnd commented May 10, 2018 •

edited

Loading

ArtBlnd commented May 15, 2018 •

edited

Loading

ArtBlnd commented May 15, 2018 •

edited

Loading

ArtBlnd commented May 16, 2018 •

edited

Loading

ArtBlnd commented May 18, 2018 •

edited

Loading

ArtBlnd commented May 19, 2018 •

edited

Loading

ArtBlnd commented May 22, 2018 •

edited

Loading

ArtBlnd commented May 22, 2018 •

edited

Loading

ArtBlnd commented May 24, 2018 •

edited

Loading

ArtBlnd commented Aug 20, 2018 •

edited

Loading

ArtBlnd commented Aug 20, 2018 •

edited

Loading