JIT optimization: loop unrolling #8107
This is also an important source of code quality disparity vs Jit64 in |
See also #4248 |
I have a question: does this mean that RyuJIT currently doesn't perform loop unrolling or rerolling (for loop caching)? Thanks. [ADDED] Can I get some test cases that should be unrolled? |
Yes, that's basically what it means. Logic for determining which loops to unroll is in optUnrollLoops(). Current heuristic is that the loop bounds test must be a SIMD element count. An example where unrolling kicks in: SeekUnroll.cs |
I meant: can I have examples of loops that should be unrolled and loops that shouldn't be? The coreclr optimization approach is rather different from LLVM's. I saw some docs on this, and it seems it DOES NOT optimize non-natural loops. That's why I need a case that shouldn't be unrolled. This may help me contribute here. Also, is that SIMD check done with the LPFLG_DONT_UNROLL flag, which is set here? Thanks :> |
Broadly speaking most loops should be unrollable. After all, it is a fairly mechanical operation. In practice there are cases where the JIT can't properly represent the results of unrolling, and those loops are flagged by various criteria, e.g.:
A second set of criteria tries to determine when loop unrolling is beneficial. Unrolling is quite likely going to increase code size. So the JIT will try and estimate that size increase and have limits on how much growth it will allow. But the JIT does not have a good model for the benefit of unrolling a loop. To avoid being overly aggressive but still unroll in some cases where there is likely a benefit, it will currently insist the loop iteration count be a SIMD vector length. If you want to contribute, there are a few broad areas:
If you're interested in developing better loop unrolling benefit heuristics, you might also look upstream at loop cloning, which also is lacking benefit heuristics (and has a number of other issues too). It would be ideal if the utilities we develop for evaluating loop performance can be leveraged for other kinds of loop optimizations. Also at some point, likely not too far into this program, you will discover that other parts of the JIT's loop optimization infrastructure will need upgrades. |
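The two-stage gating described above (first a legality check, then a code-size budget combined with the "trip count equals the SIMD vector length" rule) can be sketched roughly as follows. This is a hypothetical illustration in C, not the logic in optUnrollLoops(): the `LoopInfo` struct, the `should_fully_unroll` function, the size units, and the budget parameter are all assumptions made up for this sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical loop summary; these fields do not mirror real JIT data structures. */
typedef struct {
    bool     can_represent_unrolled; /* false if the loop was flagged as not unrollable */
    unsigned trip_count;             /* constant iteration count, 0 if unknown */
    unsigned body_size;              /* estimated code size of one iteration (arbitrary units) */
} LoopInfo;

/* Decide whether to fully unroll a loop.
 * simd_len:   number of elements in a SIMD vector (assumed known to the caller)
 * max_growth: largest code-size increase the caller is willing to accept */
static bool should_fully_unroll(const LoopInfo *loop, unsigned simd_len, unsigned max_growth)
{
    assert(loop != NULL);

    /* Stage 1: legality. The unrolled form must be representable and the trip count known. */
    if (!loop->can_represent_unrolled || loop->trip_count == 0)
        return false;

    /* Stage 2a: stand-in for a missing benefit model. Only unroll when the
     * iteration count matches the SIMD vector length. */
    if (loop->trip_count != simd_len)
        return false;

    /* Stage 2b: size budget. Full unrolling adds (trip_count - 1) extra copies of the body. */
    unsigned growth = loop->body_size * (loop->trip_count - 1);
    return growth <= max_growth;
}
```

A real benefit model would replace stage 2a with an estimate of the cycles saved, which is the part the comment above says the JIT currently lacks.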
Thanks :> It helped a lot. I will try to fix it ASAP |
I think we have to start with very simple cases
The loop back-edge count is constant, and there are realistic benefits to unrolling it. Here is the assembly the JIT currently generates:
The loop wasn't removed, and it also WASN'T unrolled. |
Current progress on partial loop unrolling:
to
The partial unroll limit threshold is at most 64 bytes, normally 32 bytes. By the way, does vectorization run after loop optimization? |
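For readers unfamiliar with the term, partial unrolling replicates the loop body a fixed number of times and keeps a remainder loop for leftover iterations. The before/after snippets from the comment above were not preserved, so the following is only a generic, hand-written C sketch of the transformation; the function names and the unroll factor of 4 are assumptions, not what the prototype generates.

```c
#include <assert.h>
#include <stddef.h>

/* Baseline: one element per iteration, one branch per element. */
static long sum_rolled(const int *a, size_t n)
{
    long total = 0;
    for (size_t i = 0; i < n; ++i)
        total += a[i];
    return total;
}

/* Partially unrolled by 4: one branch per four elements,
 * followed by a remainder loop for the last n % 4 elements. */
static long sum_unrolled_by_4(const int *a, size_t n)
{
    assert(a != NULL || n == 0);

    long total = 0;
    size_t i = 0;
    size_t limit = n - (n % 4); /* largest multiple of 4 that is <= n */

    for (; i < limit; i += 4) {
        total += a[i];
        total += a[i + 1];
        total += a[i + 2];
        total += a[i + 3];
    }
    for (; i < n; ++i) /* remainder iterations */
        total += a[i];
    return total;
}
```

The body grows by roughly a factor of four, which is why a byte-size threshold such as the one mentioned above is needed to bound the growth.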
The JIT does not do autovectorization. |
Thanks @mikedn. Are there any plans for auto-vectorization? |
None that I've heard of. |
Can I ask you something, @mikedn? Does IR simplification run before the loop optimizations? For example, from this:
to
Also, how can I extract a loop's body from LoopDsc? Thanks. |
Some of it happens during the global morph phase, which runs before loop optimizations. Some happens after loop optimizations, as part of SSA/VN-based optimizations. That said, I don't think the JIT simplifies the particular code in your example.
See the comments associated with
|
Can I ask what |
Heh, you can ask but that's a tough question. The JIT team may have a different opinion but my personal opinion is that Basically it models an |
Feels like a phi node. Thanks |
Hmm, not really. A PHI is a "function" that produces a value but ASG is more like an "operation". A better analogy is the |
There is no ASG node in LLVM; I just understood it in my own way, so never mind :> |
How can I check whether a block contains any range checks? Should I iterate over all of its GenTrees? |
I think this might be delayed a lot... there were major issues with range checks. Sorry :< |
Hmm, there's a BasicBlock flag that indicates that a block contains range checks -
You can have as much time as you need. Unless this is some kind of school project, then you'll need to ask your teacher 😄 |
Have you done any measurements to evaluate the impact of loop unrolling on current hardware? Loop unrolling is fortunately an optimization that can be implemented manually, without any compiler support. This means that you can measure the potential improvement provided by unrolling without writing any JIT code. For example, it may be that a straightforward unrolling such as

    for (int i = 0; i < 64; i += 4)
    {
        total += array[i];
        total += array[i + 1];
        total += array[i + 2];
        total += array[i + 3];
    }

is not very effective. It may turn out that splitting the loop-carried dependency chain in 4 is far more useful:

    for (int i = 0; i < 64; i += 4)
    {
        total1 += array[i];
        total2 += array[i + 1];
        total3 += array[i + 2];
        total4 += array[i + 3];
    }
    total = total1 + total2 + total3 + total4;

In any case, you need to measure. And it's best to do that before attempting to implement such an optimization. |
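As a starting point for that kind of measurement, here is a minimal, self-contained C sketch of the two variants plus a crude timing helper. It is an illustration under assumptions: the function names are invented, `clock()` gives only coarse CPU-time resolution, and a C compiler may unroll or vectorize these loops on its own, so the numbers say nothing about RyuJIT. The same experiment would need to be repeated in C# to evaluate the JIT.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

typedef long (*sum_fn)(const int *a, size_t n);

/* Unrolled by 4, but every addition still feeds the same accumulator,
 * so the loop-carried dependency chain is as long as before. */
static long sum_single_chain(const int *a, size_t n)
{
    assert(n % 4 == 0); /* sketch only handles lengths that are multiples of 4 */
    long total = 0;
    for (size_t i = 0; i < n; i += 4) {
        total += a[i];
        total += a[i + 1];
        total += a[i + 2];
        total += a[i + 3];
    }
    return total;
}

/* Unrolled by 4 with four independent accumulators, which lets the
 * hardware overlap the additions; the partial sums are combined at the end. */
static long sum_four_chains(const int *a, size_t n)
{
    assert(n % 4 == 0);
    long total1 = 0, total2 = 0, total3 = 0, total4 = 0;
    for (size_t i = 0; i < n; i += 4) {
        total1 += a[i];
        total2 += a[i + 1];
        total3 += a[i + 2];
        total4 += a[i + 3];
    }
    return total1 + total2 + total3 + total4;
}

/* Crude timing helper: runs fn `reps` times and returns elapsed CPU seconds.
 * The volatile sink keeps the calls from being optimized away. */
static double cpu_seconds(sum_fn fn, const int *a, size_t n, int reps)
{
    volatile long sink = 0;
    clock_t start = clock();
    for (int r = 0; r < reps; ++r)
        sink += fn(a, n);
    clock_t end = clock();
    (void)sink;
    return (double)(end - start) / CLOCKS_PER_SEC;
}
```

Comparing `cpu_seconds(sum_single_chain, ...)` against `cpu_seconds(sum_four_chains, ...)` on a large array with many repetitions gives a first impression of whether splitting the dependency chain matters on a given machine.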
Yes (in LLVM, I worked on SCEV and loop deletion), and I've read the whole Intel optimization manual about 3 times within a year. |
Are there any custom-defined ADTs (abstract data types) in coreclr, such as list or vector? Or should I use std? [ADDED] Never mind, there is jitstd for that. |
Is there any way to clone a block without changing it? Thanks :> |
What do you mean by "without changing"? |
I mean, CloneBlockState changes a variant into an invariant. I need a copy that doesn't change anything. |
|
I'm currently working on a new loop unrolling implementation based on the inner unrolled count. |
I want to know that every single here is example. Almost all of it seems duplicated; should I try to remove it? |
The SIMD support was also extended from just However, that support is really only when using |
@tannergooding What kind of complex loops? Just an example would help. |
That was done as a special case, for code that was trying to access individual vector elements in a loop (and that's likely driven by SIMD API limitations such as the lack of shuffles). It's not clear why you think that loops having stride |
(I'm creating tracking issues for some optimizations that RyuJit doesn't perform, so we'll have a place to reference/note when we see the lack of them affecting particular benchmarks)
There's code in the Jit today that goes by the name "loop unrolling", but it's only doing full unrolls, and only for SIMD loops. General loop unrolling (to balance ALU ops vs branching) is not performed by the jit at all today.
category:cq
theme:loop-opt
skill-level:expert
cost:extra-large