Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Segfault inside JIT in libraries-jitstress #68756

Closed
jakobbotsch opened this issue May 2, 2022 · 5 comments · Fixed by #68803
Closed

Segfault inside JIT in libraries-jitstress #68756

jakobbotsch opened this issue May 2, 2022 · 5 comments · Fixed by #68803
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI needs-further-triage Issue has been initially triaged, but needs deeper consideration or reconsideration

Comments

@jakobbotsch
Copy link
Member

The libraries-jitstress job is hitting AVs/segfaults in a few tests, e.g. https://dev.azure.com/dnceng/public/_build/results?buildId=1746194&view=ms.vss-test-web.build-test-results-tab&runId=47150832&resultId=193742&paneView=debug.

It seems curBlk is null here:

if (curBlk->bbNum > endBlk->bbNum)

which happens because endBlk is lexically before begBlk.

From bisection it was exposed by #68588.
cc @dotnet/jit-contrib

@dotnet-issue-labeler dotnet-issue-labeler bot added area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI untriaged New issue has not been triaged by the area owner labels May 2, 2022
@ghost
Copy link

ghost commented May 2, 2022

Tagging subscribers to this area: @JulieLeeMSFT
See info in area-owners.md if you want to be subscribed.

Issue Details

The libraries-jitstress job is hitting AVs/segfaults in a few tests, e.g. https://dev.azure.com/dnceng/public/_build/results?buildId=1746194&view=ms.vss-test-web.build-test-results-tab&runId=47150832&resultId=193742&paneView=debug.

It seems curBlk is null here:

if (curBlk->bbNum > endBlk->bbNum)

which happens because endBlk is lexically before begBlk.

From bisection it was exposed by #68588.
cc @dotnet/jit-contrib

Author: jakobbotsch
Assignees: -
Labels:

area-CodeGen-coreclr, untriaged

Milestone: -

@jakobbotsch jakobbotsch added needs-further-triage Issue has been initially triaged, but needs deeper consideration or reconsideration and removed untriaged New issue has not been triaged by the area owner labels May 2, 2022
@jakobbotsch
Copy link
Member Author

Hoisting decides to hoist a null-check out of a loop and creates a preheader for it:

optHoistThisLoop for loop L00 <BB03..BB07>:
  Loop body contains a call
  Loop has multiple exits

  USEDEF  (7)={V10 V03 V11 V04 V05 V01 V02}
  INOUT   (5)={V03 V11 V05 V01 V02}
  LOOPVARS(5)={V03 V11 V05 V01 V02}
  only considering hoisting in entry block BB04

    optHoistLoopBlocks BB04 (weight= 17.78) of loop L00 <BB03..BB07>
      [000213] not hoistable: unset
      [000215] not hoistable: unset
*************** In fgCreateLoopPreHeader for L00
   existing head BB02 isn't unique non-loop predecessor of loop entry
New Basic Block BB13 [0020] created.

Created PreHeader (BB13) for loop L00 (BB03 - BB07, entry BB04), with weight = 0   
Setting edge weights for BB02 -> BB13 to [0 .. 3.402823e+38]
Setting edge weights for BB12 -> BB13 to [0 .. 3.402823e+38]
Setting edge weights for BB02 -> BB13 to [0 .. 0]
Setting edge weights for BB13 -> BB04 to [0 .. 3.402823e+38]
Setting edge weights for BB13 -> BB04 to [0 .. 0]
*************** In fgDebugCheckBBlist
*************** In fgDebugCheckLoopTable
*************** After fgCreateLoopPreHeader for L00

-----------------------------------------------------------------------------------------------------------------------------------------
BBnum BBid ref try hnd preds           weight    IBC  lp [IL range]     [jump]      [EH region]         [flags]
-----------------------------------------------------------------------------------------------------------------------------------------
BB01 [0000]  1                             1       9    [000..009)-> BB11 ( cond )                     i idxlen IBC 
BB02 [0002]  1       BB01                  0       0    [014..01D)-> BB13 (always)                     i rare hascall gcsafe newobj nullcheck IBC 
BB13 [0020]  2       BB02,BB12             0            [01D..???)-> BB04 (always)                     internal rare LoopPH 
BB03 [0003]  2       BB05,BB07            16     144  0 [01D..03C)-> BB12 ( cond )                     i Loop hascall gcsafe idxlen nullcheck bwd bwd-target IBC 
BB04 [0005]  2       BB03,BB13            17.78  160  0 [044..04C)-> BB07 ( cond )                     i Loop hascall gcsafe nullcheck bwd bwd-src IBC 
BB05 [0014]  1       BB04                933.33 8400  0 [044..045)-> BB03 ( cond )                     i gcsafe bwd IBC 
BB06 [0019]  1       BB05                933.33       0 [???..???)-> BB09 (always)                     internal gcsafe 
BB07 [0015]  1       BB04                 17.78  160  0 [044..045)-> BB03 ( cond )                     i gcsafe nullcheck bwd IBC 
BB08 [0018]  1       BB07                  7.33         [???..???)-> BB10 (always)                     internal gcsafe 
BB09 [0017]  1       BB06                933.33 8400    [???..???)                                     internal gcsafe IBC 
BB10 [0006]  2       BB08,BB09             7.33   66    [04C..04D)        (return)                     i gcsafe IBC 
BB11 [0001]  1       BB01                  0       0    [009..014)        (throw )                     i rare hascall gcsafe newobj IBC 
BB12 [0004]  1       BB03                  0       0    [03C..044)-> BB13 (always)                     i rare hascall gcsafe bwd IBC 
-----------------------------------------------------------------------------------------------------------------------------------------

During assertion prop, we notice that the null check is useless and remove it, so that the loop pre-header becomes empty:

VN based non-null prop in BB13:
N002 (  2,  2) [000265] ---X---H---                           NULLCHECK byte   $VN.Void

removing useless STMT00033 ( ??? ... ??? )
N004 (  2,  2) [000268] -----O-----                           COMMA     void  
N002 (  2,  2) [000265] -----O-H---                         ├──▌  NULLCHECK byte   $VN.Void
N001 (  1,  1) [000266] -------H---                           └──▌  LCL_VAR   ref    V03 loc0         u:2 $24f
N003 (  0,  0) [000267] -----------                         └──▌  NOP       void  
 from BB13

BB13 becomes empty

We then call fgUpdateFlowGraph that tries to remove the block, and in doing so has to unmark (part of?) the loop, which hits the AV:

*************** In fgUpdateFlowGraph()
Before updating the flow graph:

-----------------------------------------------------------------------------------------------------------------------------------------
BBnum BBid ref try hnd preds           weight    IBC  lp [IL range]     [jump]      [EH region]         [flags]
-----------------------------------------------------------------------------------------------------------------------------------------
BB01 [0000]  1                             1       9    [000..009)-> BB11 ( cond )                     i idxlen IBC 
BB02 [0002]  1       BB01                  0       0    [014..01D)-> BB13 (always)                     i rare hascall gcsafe newobj nullcheck IBC 
BB13 [0020]  2       BB02,BB12             0            [01D..???)-> BB04 (always)                     internal rare nullcheck LoopPH 
BB03 [0003]  1       BB07                 16     144  0 [01D..03C)-> BB12 ( cond )                     i Loop hascall gcsafe idxlen nullcheck bwd bwd-target IBC 
BB04 [0005]  2       BB03,BB13            17.78  160  0 [044..04C)-> BB07 ( cond )                     i Loop hascall gcsafe nullcheck bwd bwd-src IBC 
BB05 [0014]  1       BB04                933.33 8400  0 [044..045)                                     i gcsafe bwd IBC 
BB06 [0019]  1       BB05                933.33       0 [???..???)-> BB09 (always)                     internal gcsafe 
BB07 [0015]  1       BB04                 17.78  160  0 [044..045)-> BB03 (always)                     i gcsafe nullcheck bwd IBC 
BB08 [0018]  0                             7.33         [???..???)-> BB10 (always)                     internal gcsafe 
BB09 [0017]  1       BB06                933.33 8400    [???..???)                                     internal gcsafe IBC 
BB10 [0006]  2       BB08,BB09             7.33   66    [04C..04D)        (return)                     i gcsafe IBC 
BB11 [0001]  1       BB01                  0       0    [009..014)        (throw )                     i rare hascall gcsafe newobj IBC 
BB12 [0004]  1       BB03                  0       0    [03C..044)-> BB13 (always)                     i rare hascall gcsafe bwd IBC 
-----------------------------------------------------------------------------------------------------------------------------------------


Removing unconditional jump to next block (BB02 -> BB13) (converted BB02 to fall-through)
fgRemoveBlock BB13, unreachable=false
Removing empty BB13

Unmarking a loop from BB04 to BB13

Note that the validity check for that is relying on bbNum to specify actual order of basic blocks:

if ((skipUnmarkLoop == false) && //
block->KindIs(BBJ_ALWAYS, BBJ_COND) && //
block->bbJumpDest->isLoopHead() && //
(block->bbJumpDest->bbNum <= block->bbNum) && //
fgDomsComputed && //
(fgCurBBEpochSize == fgDomBBcount + 1) && //
fgReachable(block->bbJumpDest, block))
{
optUnmarkLoopBlocks(block->bbJumpDest, block);
}

@BruceForstall are we missing a call to fgRenumberBlocks somewhere? The check above seems a little strange since it is happening from fgUpdateFlowGraph which should already assume that the flow graph can have changed in arbitrary ways.

@kunalspathak
Copy link
Member

I too noticed it recently that we rely on bbNum being accurate. I will talk to @BruceForstall on right way to handle this.

bool lpContains(BasicBlock* blk) const
{
return lpTop->bbNum <= blk->bbNum && blk->bbNum <= lpBottom->bbNum;
}

@ghost ghost added the in-pr There is an active PR which will close this issue when it is merged label May 3, 2022
@kunalspathak
Copy link
Member

Multiple things going on here:

  1. Since the block in question is BB13, a preheader, that always jumps to the loop entry BB04, we think that we should remove this loop. But later, we iterate through all the other blocks to see if there were any other predecessors to the loop and if yes, we do not remove the loop and set removeLoop = false. Turns out in this case, the other predecessor we find out is the same BB13 block. BB13 block is not considered to be part of a loop using lpContains() definition (it strictly matches the bbnum and preheader's bbNum might be off). We never check if the block in question is same as auxBlock, but that's a separate question. Next, before we call optUnmarkLoopBlocks(), we never check if removeLoop == true. If we add that check, the bug is fixed.
  2. However, in the method, there is a comment that talks about special casing loop head and not removing the loop. But the check loop.lpHead == block happens way below.
    // Special case: the block was the head of a loop - or pointing to a loop entry.

I think, we should check if the block is preheader and if yes, just remove the block without unmarking the loop. The only thing I am not sure in that case is what should be the updated lpHead for such loop? Is it safe to have block->bbPrev as the loop head as it is done currently?

/* The loop has a new head - Just update the loop table */
loop.lpHead = block->bbPrev;

We can just have a minimal fix of what I discussed in 1. or have little more surgical fix (given my assumptions are correct). Here is my attempt -#68803. @BruceForstall - let me know if it makes sense.

@kunalspathak
Copy link
Member

With the broader fix, before doing loop.lpHead = block->bbPrev, I added an assert assert(block->bbPrev->KindIs(BBJ_NONE, BBJ_ALWAYS)); to see scenarios where bbPrev can be different and is it appropriate to set them. I am seeing cases where bbPrev can be a throw block. For example:

BB26 [0104]  0  1                          0       [???..???)        (throw ) T1                  keep i internal rare hascall 
BB27 [0040]  1  1    BB14                  0.50    [033..034)                 T1                  i gcsafe 
BB28 [0041]  1  1    BB27                  0.50    [033..034)                 T1                  i gcsafe 
BB29 [0042]  1  1    BB28                  0.50    [033..034)-> BB39 (always) T1                  i gcsafe 
BB30 [0043]  0  1                          0.50    [033..034)-> BB32 (always) T1                  i gcsafe 
BB31 [0044]  0  1                          0.50    [033..034)                 T1                  i hascall gcsafe 
BB32 [0045]  2  1    BB30,BB31             0.50    [033..034)-> BB20 (always) T1                  i gcsafe 
BB33 [0055]  1  1    BB15                  0.50    [033..034)                 T1                  i gcsafe idxlen LoopPH 
BB34 [0056]  2  1    BB33,BB36             4     1 [033..034)-> BB36 ( cond ) T1                  i Loop hascall gcsafe idxlen bwd bwd-target bwd-src 
BB35 [0086]  1  1    BB34                  2     1 [033..034)                 T1                  i hascall gcsafe bwd 
BB36 [0087]  2  1    BB34,BB35             4     1 [033..034)-> BB34 ( cond ) T1                  i gcsafe idxlen bwd 

Here, we are trying to delete the preheader BB33 and its bbPrev is BB26 and I am not sure if it is correct to set BB26 as lpHead of the loop. Thinking little more, I feel that just adding removeLoop condition should be sufficient since it takes care of preheader blocks.

@ghost ghost removed the in-pr There is an active PR which will close this issue when it is merged label May 10, 2022
@ghost ghost locked as resolved and limited conversation to collaborators Jun 9, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI needs-further-triage Issue has been initially triaged, but needs deeper consideration or reconsideration
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants