Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

JIT: PGO-based block reordering interferes with loop recognition #67318

Closed
AndyAyersMS opened this issue Mar 30, 2022 · 5 comments · Fixed by #69878
Closed

JIT: PGO-based block reordering interferes with loop recognition #67318

AndyAyersMS opened this issue Mar 30, 2022 · 5 comments · Fixed by #69878
Assignees
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI
Milestone

Comments

@AndyAyersMS
Copy link
Member

We currently rearrange the flow graph based on profile weights before we perform loop recognition. Since this recognition is lexical, rearrangement can impair loop recognition.

Here's a simple example:

;; COMPlus_TieredPGO=1
;; COMPlus_TC_QuickJitForLoops=0
;; (optionally use release runtime, as checked runtimes now change tiering parameters to cause tiering up sooner)

using System;
using System.Runtime.CompilerServices;
using System.Threading;

interface I
{
    public int F(int x, int y);
}

class Add : I 
{
    int I.F(int x, int y) => x + y;
}

class Mul : I
{
    int I.F(int x, int y) => x * y;
}

class X
{
    [MethodImpl(MethodImplOptions.NoInlining)]
    public static int MapF(I m, int[] xs, int[] ys, int from, int to)
    {
        int r = 0;
        for (int i = from; i < to; i++)
        {
            r += m.F(xs[i], ys[i]);
        }
        return r;
    }

    public static int Main()
    {
        int[] xs = new int[] { 1, 2, 3, 4 };
        int[] ys = new int[] { 4, 3, 2, 1 };
        I m = new Add();
        I mm = new Mul();

        int r = 0;

        // comment these two out to create a heavily biased GDV
        r += MapF(mm, xs, ys, 0, 3);
        r -= MapF(mm, xs, ys, 0, 3);

        for (int i = 0; i < 30; i++)
        {
            r += MapF(m, xs, ys, 0, 3);
            Thread.Sleep(15);
        }

        Thread.Sleep(50);

        for (int i = 0; i < 70; i++)
        {
            r += MapF(m, xs, ys, 0, 3);
        }

        return r / 20;
    }
}

As is we produce the flow graph on the left; with the two MapF(mm, ...) calls commented out the graph on the right:

Because of this reordering, loop recognition arrives at very different results, even though the flow graphs are isomorphic:

;;;;;;;;;;;;;;;;; left example

*************** Starting PHASE Find loops
*************** In optFindLoops()
*************** In optMarkLoopHeads()
1 loop heads marked
*************** In optFindNaturalLoops()
New loop epoch 1
L00: setting LPFLG_ITER
Recorded loop L00, from BB02 to BB05 (Head=BB01, Entry=BB02, Exit=BB05) [over V06 (ADD 1) LT V04]

***************  Natural loop table
L00, from BB02 to BB05 (Head=BB01, Entry=BB02, Exit=BB05) [over V06 (ADD 1) LT V04]

*************** In optFindAndScaleGeneralLoopBlocks()

Marking a loop from BB02 to BB05
    BB02(wt=300); unchanged: has profile weight
    BB03(wt=150); unchanged: has profile weight
    BB04(wt=150); unchanged: has profile weight
    BB05(wt=300); unchanged: has profile weight
Found a total of 1 general loops.
Skip alignment for L00 that starts at BB02 weight=300.
*************** In fgDebugCheckLoopTable

*************** Finishing PHASE Find loops

;;;;;;;;;;;;;;;;; right example

*************** Starting PHASE Find loops
*************** In optFindLoops()
*************** In optMarkLoopHeads()
2 loop heads marked
*************** In optFindNaturalLoops()
*************** In optFindAndScaleGeneralLoopBlocks()

Marking a loop from BB02 to BB04
    BB02(wt=300); unchanged: has profile weight
    BB03(wt=300); unchanged: has profile weight
    BB04(wt=300); unchanged: has profile weight
Marking a loop from BB04 to BB06
    BB04(wt=300); unchanged: has profile weight
    BB05(wt=100); unchanged: has profile weight
    BB06(wt=0); unchanged: has profile weight
Found a total of 2 general loops.
*************** In fgDebugCheckLoopTable

*************** Finishing PHASE Find loops

We actually see a high volume of monomorphic call sites with PGO, meaning we will quite often be seeing profiles like the one at right where the GDV test is highly biased and the residual indirect call site is cold.

Seems like we should look into deferring block reordering until after optimization.

cc @BruceForstall @EgorBo

@dotnet-issue-labeler dotnet-issue-labeler bot added area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI untriaged New issue has not been triaged by the area owner labels Mar 30, 2022
@ghost
Copy link

ghost commented Mar 30, 2022

Tagging subscribers to this area: @JulieLeeMSFT
See info in area-owners.md if you want to be subscribed.

Issue Details

We currently rearrange the flow graph based on profile weights before we perform loop recognition. Since this recognition is lexical, rearrangement can impair loop recognition.

Here's a simple example:

;; COMPlus_TieredPGO=1
;; COMPlus_TC_QuickJitForLoops=0
;; (optionally use release runtime, as checked runtimes now change tiering parameters to cause tiering up sooner)

using System;
using System.Runtime.CompilerServices;
using System.Threading;

interface I
{
    public int F(int x, int y);
}

class Add : I 
{
    int I.F(int x, int y) => x + y;
}

class Mul : I
{
    int I.F(int x, int y) => x * y;
}

class X
{
    [MethodImpl(MethodImplOptions.NoInlining)]
    public static int MapF(I m, int[] xs, int[] ys, int from, int to)
    {
        int r = 0;
        for (int i = from; i < to; i++)
        {
            r += m.F(xs[i], ys[i]);
        }
        return r;
    }

    public static int Main()
    {
        int[] xs = new int[] { 1, 2, 3, 4 };
        int[] ys = new int[] { 4, 3, 2, 1 };
        I m = new Add();
        I mm = new Mul();

        int r = 0;

        // comment these two out to create a heavily biased GDV
        r += MapF(mm, xs, ys, 0, 3);
        r -= MapF(mm, xs, ys, 0, 3);

        for (int i = 0; i < 30; i++)
        {
            r += MapF(m, xs, ys, 0, 3);
            Thread.Sleep(15);
        }

        Thread.Sleep(50);

        for (int i = 0; i < 70; i++)
        {
            r += MapF(m, xs, ys, 0, 3);
        }

        return r / 20;
    }
}

As is we produce the flow graph on the left; with the two MapF(mm, ...) calls commented out the graph on the right:

Because of this reordering, loop recognition arrives at very different results, even though the flow graphs are isomorphic:

;;;;;;;;;;;;;;;;; left example

*************** Starting PHASE Find loops
*************** In optFindLoops()
*************** In optMarkLoopHeads()
1 loop heads marked
*************** In optFindNaturalLoops()
New loop epoch 1
L00: setting LPFLG_ITER
Recorded loop L00, from BB02 to BB05 (Head=BB01, Entry=BB02, Exit=BB05) [over V06 (ADD 1) LT V04]

***************  Natural loop table
L00, from BB02 to BB05 (Head=BB01, Entry=BB02, Exit=BB05) [over V06 (ADD 1) LT V04]

*************** In optFindAndScaleGeneralLoopBlocks()

Marking a loop from BB02 to BB05
    BB02(wt=300); unchanged: has profile weight
    BB03(wt=150); unchanged: has profile weight
    BB04(wt=150); unchanged: has profile weight
    BB05(wt=300); unchanged: has profile weight
Found a total of 1 general loops.
Skip alignment for L00 that starts at BB02 weight=300.
*************** In fgDebugCheckLoopTable

*************** Finishing PHASE Find loops

;;;;;;;;;;;;;;;;; right example

*************** Starting PHASE Find loops
*************** In optFindLoops()
*************** In optMarkLoopHeads()
2 loop heads marked
*************** In optFindNaturalLoops()
*************** In optFindAndScaleGeneralLoopBlocks()

Marking a loop from BB02 to BB04
    BB02(wt=300); unchanged: has profile weight
    BB03(wt=300); unchanged: has profile weight
    BB04(wt=300); unchanged: has profile weight
Marking a loop from BB04 to BB06
    BB04(wt=300); unchanged: has profile weight
    BB05(wt=100); unchanged: has profile weight
    BB06(wt=0); unchanged: has profile weight
Found a total of 2 general loops.
*************** In fgDebugCheckLoopTable

*************** Finishing PHASE Find loops

We actually see a high volume of monomorphic call sites with PGO, meaning we will quite often be seeing profiles like the one at right where the GDV test is highly biased and the residual indirect call site is cold.

Seems like we should look into deferring block reordering until after optimization.

cc @BruceForstall @EgorBo

Author: AndyAyersMS
Assignees: -
Labels:

area-CodeGen-coreclr, untriaged

Milestone: -

@AndyAyersMS
Copy link
Member Author

In principle all we have to do here is move optOptimizeLayout later in the phase ordering. Experimenting with this I am seeing various asserts as later phases either have implicit or (quasi) explicit dependences on derived graph info, like

  • implicitly assuming that block numbers represent sequential block order
  • dominators have been computed

So, it's going to take a bit of work to tease apart what is an essential constrant or to make the implicit more explicit. Initial results from a CQ/Code Size standpoint look very promising, so this is worth pushing on.

I am wondering if we can get out of the strong dependence on bbNum by relating some of the derived data to bbId instead, so that something like renumbering blocks does not invalidate dominators. Possibly more radical than it sounds.

@AndyAyersMS
Copy link
Member Author

Also, some subset of #66998 is likely explained by the above.

@JulieLeeMSFT JulieLeeMSFT added this to the Future milestone Apr 4, 2022
@JulieLeeMSFT JulieLeeMSFT removed the untriaged New issue has not been triaged by the area owner label Apr 4, 2022
@AndyAyersMS
Copy link
Member Author

Note we can't (currently) update loop recognition to handle non-contiguous loop bodies because it uses the block extent in various ways for correctness, eg

bool Compiler::optIsVarAssigned(BasicBlock* beg, BasicBlock* end, GenTree* skip, unsigned var)
{
bool result;
isVarAssgDsc desc;
desc.ivaSkip = skip;
#ifdef DEBUG
desc.ivaSelf = &desc;
#endif
desc.ivaVar = var;
desc.ivaMaskCall = CALLINT_NONE;
AllVarSetOps::AssignNoCopy(this, desc.ivaMaskVal, AllVarSetOps::MakeEmpty(this));
for (;;)
{
noway_assert(beg != nullptr);
for (Statement* const stmt : beg->Statements())
{
if (fgWalkTreePre(stmt->GetRootNodePointer(), optIsVarAssgCB, &desc) != WALK_CONTINUE)
{
result = true;
goto DONE;
}
}
if (beg == end)
{
break;
}
beg = beg->bbNext;
}
result = false;
DONE:
return result;
}

@AndyAyersMS
Copy link
Member Author

Also see #59087 (comment).

I am going to retitle this issue since it's more broadly about block reordering because of PGO interfering with loop optimization, and not restricted to GDV.

@AndyAyersMS AndyAyersMS changed the title Poor interaction between loop recognition and GDV caused by early block reordering JIT: PGO-based block reordering interferes with loop recognition May 2, 2022
@AndyAyersMS AndyAyersMS modified the milestones: Future, 7.0.0 May 2, 2022
@AndyAyersMS AndyAyersMS self-assigned this May 2, 2022
AndyAyersMS added a commit to AndyAyersMS/runtime that referenced this issue May 26, 2022
The JIT currently will aggressively reorder the flow graph before running its
loop recognition phases. When there is PGO data this sometimes perturbs the
block order so that loops are no longer recognized, and we miss out on some
loop optimizations.

This change defers most block reordering until after the JIT has gone through
the optimization phases. There is still a limited form of flow cleanup done
early on.

There is also a compensating change in loop recognition in one place where it was
relying on adjacent blocks being combined.

Fixes dotnet#67318.
@ghost ghost added the in-pr There is an active PR which will close this issue when it is merged label May 26, 2022
AndyAyersMS added a commit that referenced this issue May 30, 2022
…69878)

The JIT currently will aggressively reorder the flow graph before running its
loop recognition phases. When there is PGO data this sometimes perturbs the
block order so that loops are no longer recognized, and we miss out on some
loop optimizations.

This change defers most block reordering until after the JIT has gone through
the optimization phases. There is still a limited form of flow cleanup done
early on.

There is also a compensating change in loop recognition in one place where it was
relying on adjacent blocks being combined.

Fixes #67318.
@ghost ghost removed the in-pr There is an active PR which will close this issue when it is merged label May 30, 2022
@ghost ghost locked as resolved and limited conversation to collaborators Jun 29, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants