GC root enumerating crashing with BULK_WRITEBARRIER helper on the stack #101890

Closed
jkotas opened this issue May 5, 2024 · 4 comments
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI blocking-clean-ci-optional Blocking optional rolling runs

@jkotas
Member

jkotas commented May 5, 2024

Crash dumps:

https://dev.azure.com/dnceng-public/public/_build/results?buildId=666172&view=ms.vss-test-web.build-test-results-tab&runId=16523620&paneView=dotnet-dnceng.dnceng-build-release-tasks.helix-test-information-tab&resultId=141696

https://dev.azure.com/dnceng-public/public/_build/results?buildId=666172&view=ms.vss-test-web.build-test-results-tab&runId=16523620&paneView=dotnet-dnceng.dnceng-build-release-tasks.helix-test-information-tab&resultId=141697

Both of these are crashes while enumerating GC roots:

* thread #1, name = 'System.Memory.T', stop reason = signal SIGSEGV
  * frame #0: 0x0169219c System.Memory.Tests`WKS::gc_heap::mark_object_simple(unsigned char**) [inlined] MethodTable::HasComponentSize(this=0x00000004) at MethodTable.h:226:25 [opt]
    frame #1: 0x0169219c System.Memory.Tests`WKS::gc_heap::mark_object_simple(unsigned char**) [inlined] WKS::my_get_size(ob=0xeeb38cf8) at gc.cpp:11491 [opt]
    frame #2: 0x01692196 System.Memory.Tests`WKS::gc_heap::mark_object_simple(po=<unavailable>) at gc.cpp:27782 [opt]
    frame #3: 0x01693a28 System.Memory.Tests`WKS::GCHeap::Promote(ppObject=0xf14fec70, sc=<unavailable>, flags=<unavailable>) at gc.cpp:49248:5 [opt]
    frame #4: 0x016b0064 System.Memory.Tests`GcInfoDecoder::ReportUntrackedSlots(GcSlotDecoder&, REGDISPLAY*, unsigned int, void (*)(void*, void**, unsigned int), void*) [inlined] GcInfoDecoder::ReportSlotToGC(this=0xf0afd838, slotDecoder=0xf0afd4e0, slotIndex=10, pRD=0xf0afd948, reportScratchSlots=true, pCallBack=(System.Memory.Tests`EnumGcRefsCallback(void*, void**, unsigned int) + 1 at GcEnum.cpp:119), hCallBack=0xf0afd8c0)(void*, void**, unsigned int), void*) at gcinfodecoder.cpp:0 [opt]
    frame #5: 0x016b001e System.Memory.Tests`GcInfoDecoder::ReportUntrackedSlots(this=0xf0afd838, slotDecoder=0xf0afd4e0, pRD=0xf0afd948, inputFlags=<unavailable>, pCallBack=(System.Memory.Tests`EnumGcRefsCallback(void*, void**, unsigned int) + 1 at GcEnum.cpp:119), hCallBack=0xf0afd8c0)(void*, void**, unsigned int), void*) at gcinfodecoder.cpp:1100 [opt]
    frame #6: 0x016af0d8 System.Memory.Tests`GcInfoDecoder::EnumerateLiveSlots(this=<unavailable>, pRD=0xf0afd948, reportScratchSlots=false, inputFlags=<unavailable>, pCallBack=(System.Memory.Tests`EnumGcRefsCallback(void*, void**, unsigned int) + 1 at GcEnum.cpp:119), hCallBack=0xf0afd8c0)(void*, void**, unsigned int), void*) at gcinfodecoder.cpp:1049:9 [opt]
    frame #7: 0x016b0700 System.Memory.Tests`UnixNativeCodeManager::EnumGcRefs(this=<unavailable>, pMethodInfo=0xf0afd9cc, safePointAddress=<unavailable>, pRegisterSet=<unavailable>, hCallback=0xf0afd8c0, isActiveStackFrame=<unavailable>) at UnixNativeCodeManager.cpp:239:18 [opt]
    frame #8: 0x01679cb4 System.Memory.Tests`EnumGcRefs(pCodeManager=<unavailable>, pMethodInfo=<unavailable>, safePointAddress=<unavailable>, pRegisterSet=<unavailable>, pfnEnumCallback=(System.Memory.Tests`WKS::GCHeap::Promote(Object**, ScanContext*, unsigned int) + 1 at gc.cpp:49182), pvCallbackData=0xf0afdaf0, isActiveStackFrame=<unavailable>)(Object**, ScanContext*, unsigned int), ScanContext*, bool) at GcEnum.cpp:139:19 [opt]
...

The stack trace of the target thread:

  * frame #0: 0xf7caa674 libpthread.so.0`__libc_do_syscall at libc-do-syscall.S:46
    frame #1: 0xf7ca5124 libpthread.so.0`__pthread_cond_wait at futex-internal.h:186:13
    frame #2: 0xf7ca510c libpthread.so.0`__pthread_cond_wait at pthread_cond_wait.c:508
    frame #3: 0xf7ca4fba libpthread.so.0`__pthread_cond_wait(cond=0x049c0910, mutex=0x049c0940) at pthread_cond_wait.c:638
    frame #4: 0x016ab9c6 System.Memory.Tests`GCEvent::Impl::Wait(this=0x049c0910, milliseconds=<unavailable>, alertable=<unavailable>) at events.cpp:149:22 [opt]
    frame #5: 0x0167d768 System.Memory.Tests`Thread::InlineSuspend(UNIX_CONTEXT*) [inlined] Thread::WaitForGC(this=0xf14ff8a0, pTransitionFrame=<unavailable>) at thread.cpp:80:39 [opt]
    frame #6: 0x0167d73a System.Memory.Tests`Thread::InlineSuspend(this=0xf14ff8a0, interruptedContext=<unavailable>) at thread.cpp:884 [opt]
    frame #7: 0x016aa07e System.Memory.Tests`ActivationHandler(code=34, siginfo=0xf14fe898, context=0xf14fe918) at PalRedhawkUnix.cpp:1004:9 [opt]
    frame #8: 0xf7bc2840 libc.so.6 at sigrestorer.S:77
    frame #9: 0x019ece3a System.Memory.Tests`System.Buffer__BulkMoveWithWriteBarrier(destination=0xf14fec1c, source=0xeeb19a5c, byteCount=100) at Buffer.cs:185
    frame #10: 0x01afb91a System.Memory.Tests`System.Reflection.Runtime.TypeInfos.NativeFormat.NativeFormatRuntimeNamedTypeInfo__get_Name(this=0xeeb19a3c) at NativeFormatRuntimeNamedTypeInfo.cs:189
    frame #11: 0x01afafca System.Memory.Tests`System.Reflection.Runtime.TypeInfos.RuntimeNamedTypeInfo__get_FullName(this=0xeeb19a3c) at RuntimeNamedTypeInfo.cs:96
    frame #12: 0x0268774e System.Memory.Tests`System_Linq_System_Linq_Enumerable_ArraySelectIterator_2<System___Canon__System___Canon>__MoveNext(this=0xeeb389b0) at Select.cs:179

Target method:

System.Memory.Tests`System.Reflection.Runtime.TypeInfos.NativeFormat.NativeFormatRuntimeNamedTypeInfo__get_Name:
    0x1afb8f0 <+0>:  push.w {r4, r11, lr}
    0x1afb8f4 <+3>:  sub    sp, #0x74
    0x1afb8f6 <+5>:  add.w  r11, sp, #0x78
    0x1afb8fa <+9>:  movs   r1, #0x0
    0x1afb8fc <+11>: str    r1, [sp]
    0x1afb8fe <+13>: str    r1, [sp, #0x4]
    0x1afb900 <+15>: mov    r4, r0
    0x1afb902 <+17>: add.w  r1, r4, #0x20
    0x1afb906 <+21>: ldrsb.w r0, [r1]
    0x1afb90a <+25>: movw   r3, #0x151b
    0x1afb90e <+29>: movt   r3, #0xffef
    0x1afb912 <+33>: add    r3, pc
    0x1afb914 <+35>: add    r0, sp, #0xc
    0x1afb916 <+37>: movs   r2, #0x64
    0x1afb918 <+39>: blx    r3 <- CORINFO_HELP_BULK_WRITEBARRIER
    0x1afb91a <+41>: ldr    r0, [r4, #0x14] <---- we crash enumerating GC roots here
@dotnet-issue-labeler dotnet-issue-labeler bot added the needs-area-label An area label is needed to ensure this gets routed to the appropriate area owners label May 5, 2024
@jkotas jkotas added area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI and removed needs-area-label An area label is needed to ensure this gets routed to the appropriate area owners labels May 5, 2024
@dotnet-policy-service dotnet-policy-service bot added the untriaged New issue has not been triaged by the area owner label May 5, 2024
@jkotas jkotas added the blocking-clean-ci-optional Blocking optional rolling runs label May 5, 2024
@jkotas
Member Author

jkotas commented May 5, 2024

@EgorBo It looks like the GC reporting is messed up around the new bulk write barrier helper. Could you please take a look?

So far, I have seen it only on native AOT linux-arm. We currently seem to have a higher number of intermittent crashes than usual, with multiple different root causes, so it is hard to tell whether this specific crash is limited to linux-arm.

@EgorBo
Member

EgorBo commented May 5, 2024


@SingleAccretion made an interesting guess that it might be related to #99410 (comment). It's hard to tell from the asm you attached whether this is in a tail call argument setup region or not.

@EgorBo
Member

EgorBo commented May 5, 2024

Ah, very unlikely here. I don't have an arm32 device to test on, but on 64-bit we don't emit any tail calls in that function, so it seems unlikely.

@EgorBo
Member

EgorBo commented Jul 12, 2024

Seems like it's not failing anymore. It was very likely fixed by #103301, which moved such helpers out of nogc blocks, and potentially also by #102580.

@EgorBo EgorBo closed this as completed Jul 12, 2024
@github-actions github-actions bot locked and limited conversation to collaborators Aug 12, 2024