Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

JIT optimization: loop unrolling #8107

Open
JosephTremoulet opened this issue May 15, 2017 · 34 comments
Open

JIT optimization: loop unrolling #8107

JosephTremoulet opened this issue May 15, 2017 · 34 comments
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI enhancement Product code improvement that does NOT require public API changes/additions optimization tenet-performance Performance related issue
Milestone

Comments

@JosephTremoulet
Copy link
Contributor

(I'm creating tracking issues for some optimizations that RyuJit doesn't perform, so we'll have a place to reference/note when we see the lack of them affecting particular benchmarks)

There's code in the Jit today that goes by the name "loop unrolling", but it's only doing full unrolls, and only for SIMD loops. General loop unrolling (to balance ALU ops vs branching) is not performed by the jit at all today.
category:cq
theme:loop-opt
skill-level:expert
cost:extra-large

@JosephTremoulet
Copy link
Contributor Author

This is also an important source of code quality disparity vs Jit64 in BenchI-MulMtx and BenchI-IniArray

@AndyAyersMS
Copy link
Member

See also #4248

@ArtBlnd
Copy link

ArtBlnd commented May 9, 2018

I have some question

does it means current ryuJIT doesn't perform unrolling, rerolling loop for loop caching?
only specalized for full-unrolling, simd-vectorizing?

Thanks.

[ADDED] Can I get some testcases that should be unrolled?

@AndyAyersMS
Copy link
Member

Yes, that's basically what it means. Logic for determining which loops to unroll is in optUnrollLoops(). Current heuristic is that the loop bounds test must be a SIMD element count.

An example where unrolling kicks in: SeekUnroll.cs

@ArtBlnd
Copy link

ArtBlnd commented May 10, 2018

I meant can I have that should be unrolled and shouldn't be unrolled?

because the coreclr optimization theory is kinda different from LLVM, I saw some docs on this. and it seems DOES NOT optimizes non-natural loops. thats why I need case that shouldn't be unrolled.
if the loop doesn't unroll because its predicated as non-natural loop, it should be extended.
I need the bottom line that should be optimized or not.

this will may help me to contribute on this.

also, is that SIMD check does with LPFLG_DONT_UNROLL flag which is on here?
https://github.com/dotnet/coreclr/blob/dd089b270c5d4833e615f2a926ea0a82d070e5de/src/jit/optimizer.cpp#L3636-L3639

Thanks :>

@AndyAyersMS
Copy link
Member

Broadly speaking most loops should be unrollable. After all, it is a fairly mechanical operation.

In practice there are cases where the JIT can't properly represent the results of unrolling, and those loops are flagged by various critiera, eg:

  • LPFLG_DONT_UNROLL -- bail if loop was created by loop cloning or has an unclonable statement
  • LPFLG_DO_WHILE -- loop must be bottom tested
  • LPFLG_CONST -- loop iteration count is a known constant
  • Other bail-out checks done in optUnrollLoops, for instance, the iteration variable is address exposed or a struct field, or too many returns (this one should probably never be hit anymore since the JIT now moves non-loop blocks out of loops). There are probably more.

A second set of criteria tries to determine when loop unrolling is beneficial. Unrolling is quite likely going to increase code size. So the JIT will try and estimate that size increase and have limits on how much growth it will allow. But the JIT does not have a good model for the benefit of unrolling a loop. To avoid being overly aggressive but still unroll in some cases where there is likely a benefit, it will currently insist the loop iteration count be a SIMD vector length.

If you want to contribute, there are a few broad areas:

  • First, find or create a set of examples to use to evaluate the unrolling heuristics. Ideally we would have a good number of cases where unrolling is beneficial and a number of other cases where it is not beneficial.
    • In the latter group you should distinguish between cases where unrolling ought to be beneficial but isn't because of other problems in the JIT (and make sure to open issues for those), and cases where unrolling is just going to increase code size with no performance benefit.
    • Since you can unroll by hand you can use source level transformations to ballpark the potential benefits from unrolling (again subject to JIT limitations).
    • Make sure to also include cases where the JIT can't unroll because of its various internal limitations.
  • Second, remove correctness constraints and limitations that block unrolling so more kinds of loops can be unrolled.
  • Third, add some realistic benefit heuristics and tune on your test suite plus our benchmarks and realistic applications.
  • Fourth, support unrolling in more scenarios:
    • high trip count loops (partial unrolls)
    • fixed but unknown trip count loops
    • search loops

If you're interested in developing better loop unrolling benefit heuristics, you might also look upstream at loop cloning, which also is lacking benefit heuristics (and has a number of other issues too).

It would be ideal if the utilities we develop for evaluating loop performance can be leveraged for other kinds of loop optimizations.

Also at some point, likely not too far into this program, you will discover that other parts of the JIT's loop optimization infrastructure will need upgrades.

@ArtBlnd
Copy link

ArtBlnd commented May 10, 2018

Thanks :> Its helped a lot. I will try to fix it ASAP

@ArtBlnd
Copy link

ArtBlnd commented May 10, 2018

I think we have to start with a very simple cases
here is most general case that should be unrolled.

int[] arr = { 1, 2, 3, 4 };
int total = 0;

for(int i = 0; i < 4; ++i)
{
    total += arr[i];
}

the loop backedge-count is constant. also there is realistic banefits on it.

here is current JIT generated assemblies here.

    11:             int[] val_1 = { 1, 2, 3, 4 };
00007FFDF2F204B0  sub         rsp,28h  
00007FFDF2F204B4  vzeroupper  
00007FFDF2F204B7  mov         rcx,7FFE299931BAh  
00007FFDF2F204C1  mov         edx,4  
00007FFDF2F204C6  call        00007FFE52A24400  
00007FFDF2F204CB  mov         rdx,1B49308280Ch  
00007FFDF2F204D5  mov         ecx,dword ptr [rax+8]  
00007FFDF2F204D8  lea         r8,[rax+10h]  
00007FFDF2F204DC  vmovdqu     xmm0,xmmword ptr [rdx]  
00007FFDF2F204E1  vmovdqu     xmmword ptr [r8],xmm0  
    12:             int total = 0;
00007FFDF2F204E6  xor         edx,edx  
    13: 
    14:             for(int i = 0; i < 4; ++i)
00007FFDF2F204E8  xor         r8d,r8d  
00007FFDF2F204EB  cmp         ecx,4  
00007FFDF2F204EE  jl          00007FFDF2F20502  
    15:             {
    16:                 total += val_1[i];
00007FFDF2F204F0  movsxd      rcx,r8d  
00007FFDF2F204F3  add         edx,dword ptr [rax+rcx*4+10h]  
    14:             for(int i = 0; i < 4; ++i)
00007FFDF2F204F7  inc         r8d  
00007FFDF2F204FA  cmp         r8d,4  
00007FFDF2F204FE  jl          00007FFDF2F204F0  
00007FFDF2F20500  jmp         00007FFDF2F20512  
00007FFDF2F20502  movsxd      rcx,r8d  
00007FFDF2F20505  add         edx,dword ptr [rax+rcx*4+10h]  
00007FFDF2F20509  inc         r8d  
00007FFDF2F2050C  cmp         r8d,4  
00007FFDF2F20510  jl          00007FFDF2F20502  
    17:             }

loop didn't removed. also DIDN'T unrolled
current progress in.

@ArtBlnd
Copy link

ArtBlnd commented May 15, 2018

Current progress in for partial loop unrolling.

for(unsigned int i = 0; i < 256; ++i)
{
    total += array[i];
}

to

for(unsigned int i = 0; i < 64; ++i)
{
    total += array[i];
    total += array[i + 1];
    total += array[i + 2];
    total += array[i + 3];
}

the partial unroll limit threshold is maximum to 64 bytes, normally 32 bytes.

btw, does vectorizing processes after loop optimizing?
Thanks :>

@mikedn
Copy link
Contributor

mikedn commented May 15, 2018

does vectorizing processes after loop optimizing?

The JIT does not do autovectorization.

@ArtBlnd
Copy link

ArtBlnd commented May 15, 2018

Thanks @mikedn, is there any auto-vectorization plan exists?

@mikedn
Copy link
Contributor

mikedn commented May 15, 2018

is there any auto-vectorization plan exists?

None that I've heard of.

@ArtBlnd
Copy link

ArtBlnd commented May 16, 2018

Can I ask you something? @mikedn

Is IR simpify processes before the loop optimizations? such as this.

int total = 0;
for(unsigned int i = 0; i < 4; ++i)
{
    total += 2;
    total -= 1;
}

to

int total = 0;
for(unsigned int i = 0; i < 4; ++i)
{
    total += 1;
}

Also, how can I extract loop's body from LoopDsc.

Thanks.

@mikedn
Copy link
Contributor

mikedn commented May 17, 2018

Is IR simpify processes before the loop optimizations?

Some happens during global morph phase that runs before loop optimizations. Some happens after loop optimizations as part of SSA/VN based optimizations. That said, I don't think the JIT simplifies that particular code you have in your example.

Also, how can I extract loop's body from LoopDsc.

See the comments associated with LoopDsc and LoopSearch classes, most notably the parts about loop body being a contiguous list of BBs, for example:

We (currently) require the body of a loop to be a contiguous (in bbNext order) sequence of basic blocks

@ArtBlnd
Copy link

ArtBlnd commented May 18, 2018

Can I ask what ASG gt node does? there is no comments on gtlist.h
btw, almost done!! stay turned!! :>

@mikedn
Copy link
Contributor

mikedn commented May 18, 2018

Heh, you can ask but that's a tough question. The JIT team may have a different opinion but my personal opinion is that ASG does many things - complicates things, slow down the JIT, stands in the way and in general being obnoxious.

Basically it models an a = b assignment. Except that there's no need for it since you could model that by something like store(a, load(b)). And in fact the JIT does just in later phases, after rationalization.

@ArtBlnd
Copy link

ArtBlnd commented May 19, 2018

feels like phi node. thanks

@mikedn
Copy link
Contributor

mikedn commented May 19, 2018

feels like phi node

Hmm, not really. A PHI is a "function" that produces a value but ASG is more like an "operation". A better analogy is the mov instruction that exists in various assembly languages, including x86/64 and ARM.

@ArtBlnd
Copy link

ArtBlnd commented May 20, 2018

there is no ASG node on LLVM, I just understood in my way. so nevermind :>

@ArtBlnd
Copy link

ArtBlnd commented May 22, 2018

How can I check is there any range-checks on block? should I iterate all of GenTrees?

@ArtBlnd
Copy link

ArtBlnd commented May 22, 2018

I think this might late a lot... there was major issues with range-checks. sorry :<
this instruction weight check is quite hard. can I have more times?

@mikedn
Copy link
Contributor

mikedn commented May 22, 2018

How can I check is there any range-checks on block? should I iterate all of GenTrees?

Hmm, there's a BasicBlock flag that indicates that a block contains range checks - BBF_HAS_IDX_LEN. However, I'm not sure how reliable it is and it is also set when a block contains array length references. So yes, you'd need to visit all nodes to figure out if there are range-checks or not. But it really depends on why do you need to know if range checks are present. They're not the only nodes that may impact the benefit analysis.

can I have more times?

You can have as much time as you need. Unless this is some kind of school project, then you'll need to ask your teacher 😄

@mikedn
Copy link
Contributor

mikedn commented May 22, 2018

this instruction weight check is quite hard

Have you done any measurements to evaluate the impact of loop unrolling on current hardware? Loop unrolling is fortunately an optimization that can be implemented manually, without any compiler support. This means that you can measure the potential improvement provided by unrolling without writing any JIT code.
It may turn out that trivial loop unrolling such as

for (int i = 0; i < 64; ++i)
{
    total += array[i];
    total += array[i + 1];
    total += array[i + 2];
    total += array[i + 3];
}

is not very effective. It may turn out that splitting the loop-carried dependency chain in 4 is far more useful:

for (int i = 0; i < 64; ++i)
{
    total1 += array[i];
    total2 += array[i + 1];
    total3 += array[i + 2];
    total4 += array[i + 3];
}
total = total1 + total2 + total3 + total4;

In any case, you need to measure. And it's best to do that before attempting to implement such an optimization.

@ArtBlnd
Copy link

ArtBlnd commented May 22, 2018

Have you done any measurements to evaluate the impact of loop unrolling on current hardware?

Yes (In LLVM, I was working on SCEV, loop-deletion), and I've read all of intel optimization manual like a 3 times with a year.
I mean not a real instruction weight. on IL instruction. checking is that effective with partial unrolling is quite hard.

@ArtBlnd
Copy link

ArtBlnd commented May 24, 2018

Is there are any custom defined ADT (abstract data types) in coreclr? such as list, vector. or should I use std?
cuz I want to make a GenTree list without copying it.

[ADDED] nvm, there was jitstd instead of it.

@ArtBlnd
Copy link

ArtBlnd commented May 28, 2018

Is there are any way to clone block without changing? thanks :>

@mikedn
Copy link
Contributor

mikedn commented May 28, 2018

What do you mean by "without changing"?

@ArtBlnd
Copy link

ArtBlnd commented May 28, 2018

I mean, CloneBlockState does change variant as invariant. I need copy without changing anything.

@AndyAyersMS
Copy link
Member

CloneBlockState won't change things unless you pass in the optional arguments.

@ArtBlnd
Copy link

ArtBlnd commented Aug 20, 2018

I'm currently working on new loop unrolling implements based on inner unrolled count.
with new unroll count theory. (base is same. but for partial unrolling)
not to implement partial unrolling and full unrolling separately. :>

@ArtBlnd
Copy link

ArtBlnd commented Aug 20, 2018

I want to know that every single noway_assert check is needed when its initializing basic loop informations.

here is example.
https://github.com/dotnet/coreclr/blob/b2031a14a3b110f8e712abdb0cd392c54cd2e839/src/jit/optimizer.cpp#L3617-L3627

almost of all seems duplicated. should I try to remove it?

@tannergooding
Copy link
Member

The SIMD support was also extended from just Vector<T>.Count to also cover Vector64<T>.Count, Vector128<T>.Count, and Vector256<T>.Count in dotnet/coreclr#24991.

However, that support is really only when using Count as the upper bound of the loop. It would be beneficial to extend the support to cover simple auto-vectorization (https://github.com/dotnet/coreclr/issues/20486) and to also cover unrolling more complex loops (such as those containing already vectorized code).

@ArtBlnd
Copy link

ArtBlnd commented Jun 7, 2019

@tannergooding What kind of complex loops? just for example.

@mikedn
Copy link
Contributor

mikedn commented Jun 7, 2019

However, that support is really only when using Count as the upper bound of the loop.

That was done as a special case, for code that was trying to access individual vector elements in a loop (and that's likely driven by SIMD API limitations such as the lack of shuffles). It's not clear why do you think that loops having stride Vector<T>.Count are somewhat special in terms of unrolling compared to other loops or why is this related to auto-vectorization.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI enhancement Product code improvement that does NOT require public API changes/additions optimization tenet-performance Performance related issue
Projects
None yet
Development

No branches or pull requests

7 participants