segv in JIT when ingesting dotnet/runtime into dotnet/aspnetcore #101695

Closed
wtgodbe opened this issue Apr 29, 2024 · 12 comments · Fixed by #101714

@wtgodbe
Member

wtgodbe commented Apr 29, 2024

This break is blocking the active Preview4 build.

We're seeing failures in dotnet/aspnetcore#55372 that look like the following:

[createdump] Gathering state for process 29703
[createdump] Crashing thread 14ce6 signal 11 (000b)
[createdump] Writing minidump with heap to file /Users/runner/work/1/s/dotnet-29703.1714409405.core
[createdump] Written 467104216 bytes (114039 pages) to core file
[createdump] Target process is alive
[createdump] Dump successfully written in 4503ms

The change causing this was introduced somewhere in this commit range: 5111fdc...64f7eca

Looking at the dump in windbg, we see the following, which points to JIT code:

(199c.1a22): Signal SIGSEGV (Segmentation fault) code SEGV_MAPERR (Address not mapped to object) at 0x94

*** WARNING: Unable to verify timestamp for libc-2.31.so

*** WARNING: Unable to verify timestamp for libcoreclr.so

libc_2_31!__wait4+0x5f:

00007f1d`fdc5cc7f 483d00f0ffff    cmp     rax,0FFFFFFFFFFFFF000h

0:022> k

 # Child-SP          RetAddr               Call Site
00 00007f1d`f6166cd0 00007f1d`fd9f8fa5     libc_2_31!__wait4+0x5f [/build/glibc-e2p3jK/glibc-2.31/posix/../sysdeps/unix/sysv/linux/wait4.c @ 27]
01 00007f1d`f6166d00 00007f1d`fd9fa42a     libcoreclr!PROCCreateCrashDump+0x275 [/__w/1/s/src/coreclr/pal/src/thread/process.cpp @ 2309]
02 00007f1d`f6166d60 00007f1d`fd9cc7be     libcoreclr!PROCCreateCrashDumpIfEnabled+0xc6a [/__w/1/s/src/coreclr/pal/src/thread/process.cpp @ 15732480]
03 00007f1d`f6166df0 00007f1d`fd9cbd75     libcoreclr!invoke_previous_action+0x10e [/__w/1/s/src/coreclr/pal/src/exception/signal.cpp @ 397]
04 00007f1d`f6166e30 00007f1d`fe0d2420     libcoreclr!sigsegv_handler+0x1d5 [/__w/1/s/src/coreclr/pal/src/exception/signal.cpp @ 631]
05 00007f1d`f6167ac0 00007f1d`f6029201     libpthread_2_31!_restore_rt
06 (Inline Function) --------`--------     libclrjit!BasicBlock::Next+0x4 [/__w/1/s/src/coreclr/jit/block.h @ 763]
07 00007f1d`f71651b0 00007f1d`f602806b     libclrjit!Compiler::optCompactLoop+0x1f1 [/__w/1/s/src/coreclr/jit/jit.h @ 2825]
08 (Inline Function) --------`--------     libclrjit!Compiler::optCompactLoops+0x187 [/__w/1/s/src/coreclr/jit/optimizer.cpp @ 2783]
09 00007f1d`f7165280 00007f1d`f6027e96     libclrjit!Compiler::optFindLoops+0x1ab [/__w/1/s/src/coreclr/jit/jit.h @ 2711]
0a 00007f1d`f71652c0 00007f1d`f5e007b6     libclrjit!Compiler::optFindLoopsPhase+0x16 [/__w/1/s/src/coreclr/jit/optimizer.cpp @ 2699]
0b (Inline Function) --------`--------     libclrjit!Phase::Run+0x17 [/__w/1/s/src/coreclr/jit/phase.cpp @ 61]
0c (Inline Function) --------`--------     libclrjit!DoPhase+0x5d [/__w/1/s/src/coreclr/jit/inline.h @ 143]
0d (Inline Function) --------`--------     libclrjit!Compiler::compCompile+0x3588 [/__w/1/s/src/coreclr/jit/compiler.cpp @ 4951]
0e (Inline Function) --------`--------     libclrjit!Compiler::compCompileHelper+0x5235 [/__w/1/s/src/coreclr/jit/compiler.cpp @ 7364]
0f (Inline Function) --------`--------     libclrjit!Compiler::compCompile::$_0::operator()+0x5235 [/__w/1/s/src/coreclr/jit/compiler.cpp @ 6501]
10 (Inline Function) --------`--------     libclrjit!Compiler::compCompile+0x53f2 [/__w/1/s/src/coreclr/jit/compiler.cpp @ 6520]
11 (Inline Function) --------`--------     libclrjit!jitNativeCode::$_0::operator()::{lambda(jitNativeCode(CORINFO_METHOD_STRUCT_ *, CORINFO_MODULE_STRUCT_ *, ICorJitInfo *, CORINFO_METHOD_INFO *, void **, unsigned int *, JitFlags *, void *)::$_0::operator()(jitNativeCode(CORINFO_METHOD_STRUCT_ *, CORINFO_MODULE_STRUCT_ *, ICorJitInfo *, CORINFO_METHOD_INFO *, void **, unsigned int *, JitFlags *, void *)::__JITParam *)::__JITParam *)#1}::operator()+0x5934 [/__w/1/s/src/coreclr/jit/compiler.cpp @ 8004]
12 (Inline Function) --------`--------     libclrjit!jitNativeCode::$_0::operator()+0x5950 [/__w/1/s/src/coreclr/jit/compiler.cpp @ 8028]
13 00007f1d`f71652e0 00007f1d`f5dfac34     libclrjit!jitNativeCode+0x5b46 [/__w/1/s/src/coreclr/jit/compiler.cpp @ 8030]
14 00007f1d`f7167220 00007f1d`fd612e1b     libclrjit!CILJit::compileMethod+0x84 [/__w/1/s/src/coreclr/jit/ee_il_dll.cpp @ 291]
15 00007f1d`f71672b0 00007f1d`fd613022     libcoreclr!invokeCompileMethodHelper+0xdb [/__w/1/s/src/coreclr/vm/jitinterface.cpp @ 12565]
16 00007f1d`f7167320 00007f1d`fd613be7     libcoreclr!invokeCompileMethod+0xb2 [/__w/1/s/src/coreclr/vm/jitinterface.cpp @ 15732480]
17 00007f1d`f71673a0 00007f1d`fd64ef2a     libcoreclr!UnsafeJitFunction+0x927 [/__w/1/s/src/coreclr/vm/jitinterface.cpp @ 15732480]
18 00007f1d`f7167760 00007f1d`fd64e80b     libcoreclr!MethodDesc::JitCompileCodeLocked+0xfa [/__w/1/s/src/coreclr/vm/prestub.cpp @ 15732480]
19 00007f1d`f7167830 00007f1d`fd64df70     libcoreclr!MethodDesc::JitCompileCodeLockedEventWrapper+0x38b [/__w/1/s/src/coreclr/vm/prestub.cpp @ 820]
1a 00007f1d`f7167920 00007f1d`fd64d95d     libcoreclr!MethodDesc::JitCompileCode+0x220 [/__w/1/s/src/coreclr/vm/prestub.cpp @ 15732480]
1b 00007f1d`f71679e0 00007f1d`fd67fbe2     libcoreclr!MethodDesc::PrepareILBasedCode+0x2ad [/__w/1/s/src/coreclr/vm/prestub.cpp @ 441]
1c 00007f1d`f7167a70 00007f1d`fd67f0d4     libcoreclr!TieredCompilationManager::CompileCodeVersion+0x102 [/__w/1/s/src/coreclr/vm/tieredcompilation.cpp @ 15732480]
1d (Inline Function) --------`--------     libcoreclr!TieredCompilationManager::OptimizeMethod+0x11 [/__w/1/s/src/coreclr/vm/tieredcompilation.cpp @ 935]
1e 00007f1d`f7167b60 00007f1d`fd67e815     libcoreclr!TieredCompilationManager::DoBackgroundWork+0x244 [/__w/1/s/src/coreclr/inc/check.h @ 820]
1f 00007f1d`f7167c60 00007f1d`fd67e67e     libcoreclr!TieredCompilationManager::BackgroundWorkerStart+0xf5 [/__w/1/s/src/coreclr/vm/tieredcompilation.cpp @ 533]
20 00007f1d`f7167cc0 00007f1d`fd67b4c5     libcoreclr!TieredCompilationManager::BackgroundWorkerBootstrapper1+0x6e [/__w/1/s/src/coreclr/inc/check.h @ 483]
21 (Inline Function) --------`--------     libcoreclr!ManagedThreadBase_DispatchInner+0x2 [/__w/1/s/src/coreclr/vm/threads.cpp @ 7259]
22 (Inline Function) --------`--------     libcoreclr!ManagedThreadBase_DispatchMiddle+0x3d [/__w/1/s/src/coreclr/inc/check.h @ 7303]
23 (Inline Function) --------`--------     libcoreclr!<unnamed-class>::operator()+0x3d [/__w/1/s/src/coreclr/inc/check.h @ 7461]
24 (Inline Function) --------`--------     libcoreclr!<unnamed-class>::operator()+0xa9 [/__w/1/s/src/coreclr/inc/check.h @ 7463]
25 00007f1d`f7167cf0 00007f1d`fd67ba7d     libcoreclr!ManagedThreadBase_DispatchOuter+0x135 [/__w/1/s/src/coreclr/inc/check.h @ 7487]
26 (Inline Function) --------`--------     libcoreclr!ManagedThreadBase_FullTransition+0x18 [/__w/1/s/src/coreclr/vm/threads.cpp @ 7507]
27 00007f1d`f7167e00 00007f1d`fd67e590     libcoreclr!ManagedThreadBase::KickOff+0x2d [/__w/1/s/src/coreclr/vm/threads.cpp @ 7543]
28 00007f1d`f7167e30 00007f1d`fd9fbb7e     libcoreclr!TieredCompilationManager::BackgroundWorkerBootstrapper0+0x20 [/__w/1/s/src/coreclr/inc/check.h @ 465]
29 00007f1d`f7167e50 00007f1d`fe0c6609     libcoreclr!CorUnix::CPalThread::ThreadEntry+0x1fe [/__w/1/s/src/coreclr/pal/inc/pal.h @ 1763]
2a 00007f1d`f7167f00 00007f1d`fdc99353     libpthread_2_31!start_thread+0xd9 [/build/glibc-e2p3jK/glibc-2.31/nptl/pthread_create.c @ 478]
2b 00007f1d`f7167fc0 ffffffff`ffffffff     libc_2_31!_GI___clone+0x43 [/build/glibc-e2p3jK/glibc-2.31/misc/../sysdeps/unix/sysv/linux/x86_64/clone.S @ 97]
2c 00007f1d`f7167fc8 00000000`00000000     0xffffffff`ffffffff

It seems we have an invariant that isn't holding:

        BasicBlock* lastNonLoopBlock = cur;
        while (true)
        {
            // Should always have a "bottom" block of the loop where we stop.
            assert(lastNonLoopBlock->Next() != nullptr);
            if (loop->ContainsBlock(lastNonLoopBlock->Next()))

A dump can be obtained here, or here

@dotnet-issue-labeler dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Apr 29, 2024
@dotnet-policy-service dotnet-policy-service bot added the untriaged New issue has not been triaged by the area owner label Apr 29, 2024
Contributor

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

@AndyAyersMS
Member

Is this crash just on Linux, or do you see it on Windows too?

@wtgodbe
Member Author

wtgodbe commented Apr 29, 2024

I've seen it on Linux and Mac - the same tests seem to run & pass on Windows.

@AndyAyersMS
Member

Can you outline plausible repro steps? I am trying to reconstruct these from CI... The JIT is getting into what it thinks is an impossible situation, and I would like to capture all the interaction between the JIT and the runtime up until that point if possible.

Or would I be able to use runfo work to fetch the assets used by CI?

@wtgodbe
Member Author

wtgodbe commented Apr 29, 2024

You only need to run build.sh & activate.sh once

@wtgodbe
Member Author

wtgodbe commented Apr 29, 2024

> Or would I be able to use runfo work to fetch the assets used by CI?

I'm not sure about this, someone from @dotnet/dnceng might know more

@JulieLeeMSFT JulieLeeMSFT added this to the 9.0.0 milestone Apr 30, 2024
@JulieLeeMSFT JulieLeeMSFT removed the untriaged New issue has not been triaged by the area owner label Apr 30, 2024
@AndyAyersMS
Member

Seems to crash in

System.Text.Json.JsonHelpers:TraverseGraphWithTopologicalSort[System.__Canon](System.__Canon,System.Func`2[System.__Canon,System.__Canon],System.Collections.Generic.IEqualityComparer`1[System.__Canon]):System.__Canon[]
Hash 0xDDFD2475

@AndyAyersMS
Member

AndyAyersMS commented Apr 30, 2024

Jitdump here: https://gist.github.com/AndyAyersMS/5c161009b027e4d146edb5032a526487

This is from a runfo payload, overlaid with a locally built checked JIT (built against the preview 4 branch). I can't get SPMI to record this. Not sure why.

 runfo get-helix-payload -j a39e51c7-3039-40b4-b15d-3f9c1f0411cc -w Microsoft.AspNetCore.SignalR.Client.FunctionalTests--net9.0

@AndyAyersMS
Member

Seems like the profile data is awfully thin, only two counts. I know checked runtimes have very aggressive tiering behavior, but this ought to be a release build. Odd. I cannot get the runfo repro to crash under SPMI.

I have a full local build of aspnetcore that does crash, but I can't figure out whether it is crashing the same way, and I can't unwrap the layers of forking done by dotnet test to get at this in a debugger in a useful way.

@jakobbotsch
Member

jakobbotsch commented Apr 30, 2024

The dump definitely looks like we aren't properly accounting for the fact that compaction of previous loops can change which block is the lexical top block of subsequent loops:

***************  Natural loop graph
L00 header: BB07
  Members (33): [BB07..BB35];[BB39..BB42]
  Entry: BB06 -> BB07
  Exit: BB07 -> BB47; BB42 -> BB43
  Back: BB42 -> BB07
L01 header: BB32 parent: L00
  Members (11): [BB22..BB32]
  Entry: BB21 -> BB32
  Exit: BB32 -> BB33
  Back: BB31 -> BB32
L02 header: BB44
  Members (18): [BB36..BB38];[BB44..BB46];[BB48..BB58];BB60
  Entry: BB43 -> BB44
  Exit: BB44 -> BB47; BB46 -> BB47; BB58 -> BB59
  Back: BB60 -> BB44
L03 header: BB46 parent: L02
  Members (14): [BB36..BB38];BB46;[BB48..BB57]
  Entry: BB45 -> BB46
  Exit: BB46 -> BB47; BB57 -> BB58
  Back: BB57 -> BB46

Relocated blocks [BB36..BB38] inserted after BB44
Relocated block [BB47..BB47] inserted after BB60
Relocated block [BB59..BB59] inserted after BB47

Notice that the compaction done for L00

Relocated blocks [BB36..BB38] inserted after BB44

modifies the lexically topmost block of L02 to be BB44. Since we don't account for that and instead start from BB36, the walk never visits BB44, so we run off the end of the block list before seeing all of the loop's blocks.
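To make the failure mode concrete, here is a minimal, self-contained sketch. It is not JIT code: the block list is modeled as a plain vector of block numbers, the helper names (`walkSeesAllMembers`, `relocateAfter`, `makeOriginalOrder`, `makeL02Members`) are hypothetical, and it assumes the method has blocks BB01..BB60 laid out in numeric order before compaction. Only the L02 member set and the first relocation are taken from the dump above.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <set>
#include <vector>

// Hypothetical stand-in for the JIT's block list: block numbers in lexical order.
using BlockList = std::vector<int>;

// Walks forward from `start` and reports whether every loop member is seen
// before the end of the list is reached.
bool walkSeesAllMembers(const BlockList& order, const std::set<int>& members, int start)
{
    auto it = std::find(order.begin(), order.end(), start);
    std::size_t seen = 0;
    for (; it != order.end() && seen < members.size(); ++it)
    {
        if (members.count(*it) != 0)
        {
            ++seen;
        }
    }
    return seen == members.size();
}

// Moves the contiguous range [first..last] so that it directly follows `after`.
// `after` must lie outside the moved range.
BlockList relocateAfter(BlockList order, int first, int last, int after)
{
    auto b = std::find(order.begin(), order.end(), first);
    auto l = std::find(order.begin(), order.end(), last);
    assert(b != order.end() && l != order.end());
    BlockList moved(b, l + 1);
    order.erase(b, l + 1);
    auto pos = std::find(order.begin(), order.end(), after);
    assert(pos != order.end());
    order.insert(pos + 1, moved.begin(), moved.end());
    return order;
}

// Assumption: blocks BB01..BB60 in numeric order before any compaction.
BlockList makeOriginalOrder()
{
    BlockList order;
    for (int i = 1; i <= 60; ++i) order.push_back(i);
    return order;
}

// Members of L02 as listed in the dump: [BB36..BB38];[BB44..BB46];[BB48..BB58];BB60
std::set<int> makeL02Members()
{
    std::set<int> members;
    for (int i = 36; i <= 38; ++i) members.insert(i);
    for (int i = 44; i <= 46; ++i) members.insert(i);
    for (int i = 48; i <= 58; ++i) members.insert(i);
    members.insert(60);
    return members;
}
```

With this model, a walk from BB36 sees all 18 members before the relocation. After `relocateAfter(order, 36, 38, 44)`, the same walk reaches the end of the list one member short, while a walk from BB44 succeeds.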

I'll see if I can repro it locally and work on a fix.

@premun
Member

premun commented Apr 30, 2024

> > Or would I be able to use runfo work to fetch the assets used by CI?
>
> I'm not sure about this, someone from @dotnet/dnceng might know more

dnceng does not manage runfo, but if you're sending some of these to Helix, you could get the payloads.

@jakobbotsch
Member

Managed to get an SPMI collection on 64f7eca: repro-40858.zip

jakobbotsch added a commit to jakobbotsch/runtime that referenced this issue Apr 30, 2024
Switch `FlowGraphNaturalLoop::GetLexicallyTopMostBlock` and
`FlowGraphNaturalLoop::GetLexicallyBottomMostBlock` to more robust
implementations that scan the basic block list forwards (and backwards)
to find the boundary blocks.

Fix dotnet#101695
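
As a rough illustration of the approach the commit message describes, the sketch below finds a loop's boundary blocks by scanning the current block order forwards and backwards, rather than relying on a top block computed before earlier relocations. This is a simplified model and not the actual `FlowGraphNaturalLoop` code: the `LexicalBounds` struct and `findLexicalBounds` function are hypothetical names, and blocks are represented as plain integers.

```cpp
#include <algorithm>
#include <cassert>
#include <set>
#include <vector>

// Hypothetical result type: block numbers of the lexical boundary blocks.
struct LexicalBounds
{
    int top;
    int bottom;
};

// Scans `order` (the current lexical block order) forwards for the first loop
// member and backwards for the last one. Returns false if no member is present.
bool findLexicalBounds(const std::vector<int>& order,
                       const std::set<int>&    members,
                       LexicalBounds*          out)
{
    assert(out != nullptr);
    auto isMember = [&members](int block) { return members.count(block) != 0; };

    auto top = std::find_if(order.begin(), order.end(), isMember);
    if (top == order.end())
    {
        return false;
    }

    // A member exists, so the backward scan is guaranteed to find one too.
    auto bottom = std::find_if(order.rbegin(), order.rend(), isMember);
    out->top    = *top;
    out->bottom = *bottom;
    return true;
}
```

Because the scan always reads the list as it stands now, the result stays correct even if a previous loop's compaction moved blocks around. The cost is a linear pass over the block list per query.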
jakobbotsch added a commit that referenced this issue Apr 30, 2024
github-actions bot pushed a commit that referenced this issue Apr 30, 2024
mmitche pushed a commit that referenced this issue Apr 30, 2024

Co-authored-by: Jakob Botsch Nielsen <jakob.botsch.nielsen@gmail.com>
michaelgsharp pushed a commit to michaelgsharp/runtime that referenced this issue May 9, 2024
…01714)
Ruihan-Yin pushed a commit to Ruihan-Yin/runtime that referenced this issue May 30, 2024
…01714)
@github-actions github-actions bot locked and limited conversation to collaborators May 31, 2024