Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

AddRemoveFromDifferentThreads<string>.ConcurrentStack benchmark hangs on ARM64 #64980

Closed
adamsitnik opened this issue Feb 8, 2022 · 58 comments · Fixed by #65300
Closed

AddRemoveFromDifferentThreads<string>.ConcurrentStack benchmark hangs on ARM64 #64980

adamsitnik opened this issue Feb 8, 2022 · 58 comments · Fixed by #65300
Assignees
Labels
arch-arm64 area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI
Milestone

Comments

@adamsitnik
Copy link
Member

The AddRemoveFromDifferentThreads<string>.ConcurrentStack benchmark hangs on Windows ARM64 (.NET 7 Preview 1).

image

Repro:

git clone https://github.com/dotnet/performance.git
py .\performance\scripts\benchmarks_ci.py --architecture arm64 -f net7.0 --dotnet-versions 7.0.100-preview.1.22077.12  --filter *AddRemoveFromDifferentThreads*string*ConcurrentStack*

I was not able to attach a VS debugger to the live process as VS seems to not support it on ARM64 yet.

I've captured a dump and uploaded it here.

PerfView shows me that the process is stuck in the following loop:

https://github.com/dotnet/performance/blob/3edf4c3149f5c903777f13af9b08b562b1672302/src/benchmarks/micro/libraries/System.Collections/Concurrent/AddRemoveFromDifferentThreads.cs#L94-L100

Since AddRemoveFromDifferentThreads<int>.ConcurrentStack and AddRemoveFromDifferentThreads<string>.ConcurrentBag work fine I guess it's a codegen issue.

Hardware: Surface Pro X, Win 10, arm64

Please let me know if there is any way I could help.

cc @kunalspathak @AndyAyersMS @kouvel

@adamsitnik adamsitnik added arch-arm64 os-windows area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI labels Feb 8, 2022
@dotnet-issue-labeler dotnet-issue-labeler bot added the untriaged New issue has not been triaged by the area owner label Feb 8, 2022
@ghost
Copy link

ghost commented Feb 8, 2022

Tagging subscribers to this area: @JulieLeeMSFT
See info in area-owners.md if you want to be subscribed.

Issue Details

The AddRemoveFromDifferentThreads<string>.ConcurrentStack benchmark hangs on Windows ARM64 (.NET 7 Preview 1).

image

Repro:

git clone https://github.com/dotnet/performance.git
py .\performance\scripts\benchmarks_ci.py --architecture arm64 -f net7.0 --dotnet-versions 7.0.100-preview.1.22077.12  --filter *AddRemoveFromDifferentThreads*string*ConcurrentStack*

I was not able to attach a VS debugger to the live process as VS seems to not support it on ARM64 yet.

I've captured a dump and uploaded it here.

PerfView shows me that the process is stuck in the following loop:

https://github.com/dotnet/performance/blob/3edf4c3149f5c903777f13af9b08b562b1672302/src/benchmarks/micro/libraries/System.Collections/Concurrent/AddRemoveFromDifferentThreads.cs#L94-L100

Since AddRemoveFromDifferentThreads<int>.ConcurrentStack and AddRemoveFromDifferentThreads<string>.ConcurrentBag work fine I guess it's a codegen issue.

Hardware: Surface Pro X, Win 10, arm64

Please let me know if there is any way I could help.

cc @kunalspathak @AndyAyersMS @kouvel

Author: adamsitnik
Assignees: -
Labels:

arch-arm64, os-windows, area-CodeGen-coreclr

Milestone: -

@kunalspathak
Copy link
Member

I will take a look at the code gen.

@kunalspathak kunalspathak self-assigned this Feb 8, 2022
@adamsitnik
Copy link
Member Author

@janvorli has hit this issue on Linux, I am going to rename the issue and change the labels

@adamsitnik adamsitnik changed the title AddRemoveFromDifferentThreads<string>.ConcurrentStack benchmark hangs on Windows ARM64 AddRemoveFromDifferentThreads<string>.ConcurrentStack benchmark hangs on ARM64 Feb 9, 2022
@janvorli
Copy link
Member

janvorli commented Feb 9, 2022

What I can see on my Arm64 linux is that at the time of the hang, the stack is empty (the _head member of the stack is null), but we have read only 73033 elements (based on the count) while the Size is 200000.
It sounds like a concurrency issue between the ConcurrentStack.Push and ConcurrentStack.TryPop, might be due to some arm64 specific JIT optimization or something. Lock free code is always prone to various subtle details.

@EgorBo
Copy link
Member

EgorBo commented Feb 9, 2022

Does it reproduce every run? We can try to revert all the recent VOLATILE-related changes which I believe were correct but slightly changed things.

@janvorli
Copy link
Member

janvorli commented Feb 9, 2022

I don't know, I just hit that while running the whole benchmark suite and I have attached lldb to the process to see what's wrong.
@adamsitnik might know details on the reproducibility.

@janvorli
Copy link
Member

janvorli commented Feb 9, 2022

I've attempted to run the test (with the filter set as @adamsitnik described in the repro) five times and it hung in all of the cases. So it sounds like it reproduces with at least high probability.

@EgorBo
Copy link
Member

EgorBo commented Feb 9, 2022

Thanks! I'll check a couple of Memory-model related commits

@EgorBo
Copy link
Member

EgorBo commented Feb 9, 2022

Ah, I didn't notice @kunalspathak already self-assigned it so leaving it for him. I'd check these:
#64354
#62895
#62224
#62043
#60219
#63720 (JIC)

@kunalspathak
Copy link
Member

Yes, I was looking into this yesterday. Unfortunately, the windbg on the machine is not working as expected and dotnet/diagnostics#2858 is in progress to fix it. Here are the threads that are hanging. I will debug it little more.

image

@JulieLeeMSFT JulieLeeMSFT removed the untriaged New issue has not been triaged by the area owner label Feb 9, 2022
@JulieLeeMSFT JulieLeeMSFT added this to the 7.0.0 milestone Feb 9, 2022
@kunalspathak
Copy link
Member

I have verified that this issue doesn't reproduce when I set COmPlus_ReadyToRun=0 and most likely seems to be related to codegen. However, I see that this benchmark runs without problem in our perf lab and most likely because we don't use R2R images for those runs? Is that correct assumption @adamsitnik or @DrewScoggins ? Should we use R2R images because at the end of release, when we do the comparison of .NET X vs. .NET Y, we might see surprises that were not caught throught out the release.

@adamsitnik
Copy link
Member Author

@kunalspathak afaik the perf lab is currently using .NET SDK to build the micro benchmarks project, but it uses CoreRun from local build of dotnet/runtime to execute it. So if R2R is not enabled by default in dotnet/runtime build output then it's not used.

IIRC In the past we had pure SDK runs from dotnet/performance but SDK wasn't updated as regularly as we wanted (they were often few days long delays because the process runtime->SDK->installer changes propagation wasn't very smooth). For a while we did both, but due to machine availability it was decided to run the benchmarks only for dotnet/runtime builds. @DrewScoggins please correct me if I am wrong.

The dotnet/performance CI validation uses SDK to run all the benchmarks just once, but it currently has only x86 and x64 CI legs. Perhaps we should re-add ARM64 leg?

@janvorli
Copy link
Member

I am hitting the same hang with ready to run disabled as well.

@adamsitnik
Copy link
Member Author

@janvorli how have you configured the env var?

The python script from the perf repo has one "hidden feature" : it by default dumps all the DOTNET* and ComPlus* variables:

[2022/02/04 17:19:03][INFO] ------------------------------------------
[2022/02/04 17:19:03][INFO] Running .NET micro benchmarks for 'net7.0'
[2022/02/04 17:19:03][INFO] ------------------------------------------
[2022/02/04 17:19:03][INFO] --------------------------------------------------
[2022/02/04 17:19:03][INFO] Dumping COMPlus/DOTNET environment:
[2022/02/04 17:19:03][INFO]   "DOTNET_CLI_TELEMETRY_OPTOUT=1"
[2022/02/04 17:19:03][INFO]   "DOTNET_MULTILEVEL_LOOKUP=0"
[2022/02/04 17:19:03][INFO]   "DOTNET_ROOT=D:\projects\performance\tools\dotnet\x86"
[2022/02/04 17:19:03][INFO] --------------------------------------------------

So if you want the script to actually use given env var, you need to do it in an explicit way:

--bdn-arguments "--envVars key:value"

@janvorli
Copy link
Member

Ah, I've thought it would just use the ambient environment. Let me try again.

@janvorli
Copy link
Member

I can confirm that the issue doesn't occur with R2R disabled.

@kunalspathak
Copy link
Member

The dotnet/performance CI validation uses SDK to run all the benchmarks just once, but it currently has only x86 and x64 CI legs. Perhaps we should re-add ARM64 leg?

I am more inclined to just do the daily runs based on SDK because we ship R2R code and as you see, things change between R2R and JIT. Running with CoreRun, we are not measuring the performance of bits that we are shipping. @danmoseley - any thoughts?

@AndyAyersMS
Copy link
Member

We generally don't expect BDN to be measuring R2R code unless we've somehow failed to tier up appropriately. So to first order it should not matter a whole lot whether the assemblies have R2R or not.

However Tier1 codegen could shift a bit because with R2R disabled I think we also lose any embedded PGO data.

@kunalspathak
Copy link
Member

I confirm that the issue doesn't reproduce if I revert the changes in #62895. For that, I need the clrjit.dll as well as crossgened binary System.Collections.Concurrent.dll produced from the reverted clrjit changes in order for the hang to disappear. I will keep investigating.

@kunalspathak
Copy link
Member

kunalspathak commented Feb 11, 2022

I confirm that the issue doesn't reproduce if I revert the changes in #62895.

Looks like I was wrong. The benchmark is very sensitive and slightest change affects the behavior. Basically, if the methods in System.Collections.Concurrent.dll gets JITted in-time, we don't see the hang. That makes it harder to pin down to the offending commit. Moreover, to understand which commit might have affected it, I tried to replace coreclr.dll, clrjit.dll, SPC.dll and SCC.dll but sometimes the methods are JITted and other times not.

The observation of @janvorli is correct that Pop thread (Consumer thread) waits in this while loop. I didn't notice until sometime back that the Push thread (Producer thread) is stuck while trying to allocate an object. Below is the call stack of the Producer thread where we can see that when it tries to allocate an object from inside Push, the GC kicks off and the thread is put for suspension, however that call never returns. The producer thread seems to wait for the suspension to complete forever and thus cannot add anything to the stack.

I could verify the same observation in the dump that @adamsitnik has provided.

00 ntdll!ZwWaitForSingleObject
01 KERNELBASE!WaitForSingleObjectEx
02 coreclr!CLREventWaitHelper2
03 coreclr!CLREventWaitHelper
04 coreclr!CLREventBase::WaitEx
05 coreclr!CLREventBase::Wait
06 coreclr!ThreadSuspend::SuspendRuntime
07 coreclr!ThreadSuspend::SuspendEE
08 coreclr!GCToEEInterface::SuspendEE
09 coreclr!WKS::GCHeap::GarbageCollectGeneration
0a coreclr!WKS::gc_heap::trigger_gc_for_alloc
0b coreclr!WKS::gc_heap::try_allocate_more_space
0c coreclr!WKS::gc_heap::allocate_more_space
0d coreclr!WKS::gc_heap::allocate
0e coreclr!WKS::GCHeap::Alloc
0f coreclr!Alloc
10 coreclr!AllocateObject
11 coreclr!JIT_New
12 System_Collections_Concurrent
13 0x0
14 0x0
15 System_Private_CoreLib
16 System_Private_CoreLib
17 coreclr!CallDescrWorkerInternal
18 coreclr!CallDescrWorkerWithHandler
19 coreclr!DispatchCallSimple
1a coreclr!ThreadNative::KickOffThread_Worker
1b coreclr!ManagedThreadBase_DispatchInner
1c coreclr!ManagedThreadBase_DispatchMiddle
1d coreclr!ManagedThreadBase_DispatchOuter
1e coreclr!ManagedThreadBase_FullTransition
1f coreclr!ManagedThreadBase::KickOff
20 coreclr!ThreadNative::KickOffThread
21 KERNEL32!BaseThreadInitThunk
22 ntdll!RtlUserThreadStart

When the Push method is waiting for the allocation to complete, it is still R2R code and isn't JITTed.

image

I tried looking for !locks in the dump but nothing stands out. At the same time, it is not straightforward to reason about why this repros only on R2R images and not JIT code. It may be something with memory barrier changes, but I am unsure about it until we understand for what lock is GC waiting.

@mangod9 or @Maoni0 - any pointers on how to move forward?

@Maoni0
Copy link
Member

Maoni0 commented Feb 11, 2022

GC is simply waiting for suspension to finish and it's not finishing. this means some thread is not getting suspended. what's the output for !threads? it should tell you which thread(s) are still in coop mode.

@EgorBo
Copy link
Member

EgorBo commented Feb 11, 2022

Here is the difference in the disasm. Left is with tail call (where issue repros) and right is no tail call.

Is there a comparison with what JIT produces here?

TryPop tier1 jit codegen
; Assembly listing for method System.Collections.Concurrent.ConcurrentStack`1[__Canon][System.__Canon]:TryPop(byref):bool:this
; Emitting BLENDED_CODE for generic ARM64 CPU - MacOS
; Tier-1 compilation
; optimized code
; fp based frame
; fully interruptible
; No PGO data
; 0 inlinees with PGO data; 1 single block inlinees; 0 inlinees without PGO data
; Final local variable assignments
;
;  V00 this         [V00,T00] (  5,  4   )     ref  ->  x19         this class-hnd single-def
;  V01 arg1         [V01,T01] (  5,  3.50)   byref  ->  x20         single-def
;  V02 loc0         [V02,T02] (  6,  4   )     ref  ->  x21         class-hnd single-def
;# V03 OutArgs      [V03    ] (  1,  1   )  lclBlk ( 0) [sp+00H]   "OutgoingArgSpace"
;* V04 tmp1         [V04,T06] (  0,  0   )    long  ->  zero-ref    "impRuntimeLookup slot"
;  V05 tmp2         [V05,T04] (  2,  2   )   byref  ->   x0         single-def "impAppendStmt"
;  V06 tmp3         [V06,T05] (  2,  2   )     ref  ->   x1         class-hnd single-def "impAppendStmt"
;* V07 tmp4         [V07    ] (  0,  0   )    long  ->  zero-ref    "spilling Runtime Lookup tree"
;  V08 cse0         [V08,T03] (  3,  2.50)   byref  ->   x0         "CSE - aggressive"
;
; Lcl frame size = 8

G_M57257_IG01:              ;; offset=0000H
        00000000          stp     fp, lr, [sp,#-48]!
        00000000          stp     x19, x20, [sp,#24]
        00000000          str     x21, [sp,#40]
        00000000          mov     fp, sp
        00000000          mov     x19, x0
        00000000          mov     x20, x1
      ;; bbWeight=1    PerfScore 4.50
G_M57257_IG02:              ;; offset=0018H
        00000000          add     x0, x19, #8
        00000000          ldar    x21, [x0]
        00000000          cbnz    x21, G_M57257_IG05
      ;; bbWeight=1    PerfScore 4.50
G_M57257_IG03:              ;; offset=0024H
        00000000          str     xzr, [x20]
        00000000          mov     w0, wzr
      ;; bbWeight=0.50 PerfScore 0.75
G_M57257_IG04:              ;; offset=002CH
        00000000          ldr     x21, [sp,#40]
        00000000          ldp     x19, x20, [sp,#24]
        00000000          ldp     fp, lr, [sp],#48
        00000000          ret     lr
      ;; bbWeight=0.50 PerfScore 2.50
G_M57257_IG05:              ;; offset=003CH
        00000000          ldrsb   wzr, [x19]
        00000000          ldr     x1, [x21,#16]
        00000000          mov     x2, x21
        00000000          bl      System.Threading.Interlocked:CompareExchange(byref,System.Object,System.Object):System.Object
        00000000          cmp     x0, x21
        00000000          bne     G_M57257_IG07
        00000000          ldr     x15, [x21,#8]
        00000000          mov     x14, x20
        00000000          bl      CORINFO_HELP_CHECKED_ASSIGN_REF
        00000000          mov     w0, #1
      ;; bbWeight=0.50 PerfScore 7.00
G_M57257_IG06:              ;; offset=0064H
        00000000          ldr     x21, [sp,#40]
        00000000          ldp     x19, x20, [sp,#24]
        00000000          ldp     fp, lr, [sp],#48
        00000000          ret     lr
      ;; bbWeight=0.50 PerfScore 2.50
G_M57257_IG07:              ;; offset=0074H
        00000000          mov     x0, x19
        00000000          mov     x1, x20
      ;; bbWeight=0.50 PerfScore 0.50
G_M57257_IG08:              ;; offset=007CH
        00000000          ldr     x21, [sp,#40]
        00000000          ldp     x19, x20, [sp,#24]
        00000000          ldp     fp, lr, [sp],#48
        00000000          b       System.Collections.Concurrent.ConcurrentStack`1[__Canon][System.__Canon]:TryPopCore(byref):bool:this
      ;; bbWeight=0.50 PerfScore 2.50

tail calls TryPopCore

@janvorli
Copy link
Member

I think that rather than the TryPop code, the caller's code is more relevant to see why it repros with R2R enabled and doesn't with R2R disabled.

@jakobbotsch
Copy link
Member

jakobbotsch commented Feb 11, 2022

We do try to ensure we do not form uninterruptible loops with tailcalls:

// Gc-interruptability in the following case:
// foo(a, b, c, d, e) { bar(a, b, c, d, e); }
// bar(a, b, c, d, e) { foo(a, b, d, d, e); }
//
// Since the instruction group starting from the instruction that sets up first
// stack arg to the end of the tail call is marked as non-gc interruptible,
// this will form a non-interruptible tight loop causing gc-starvation. To fix
// this we insert GT_NO_OP as embedded stmt before GT_START_NONGC, if the method
// has a single basic block and is not a GC-safe point. The presence of a single
// nop outside non-gc interruptible region will prevent gc starvation.
if ((comp->fgBBcount == 1) && !(comp->compCurBB->bbFlags & BBF_GC_SAFE_POINT))
{
assert(comp->fgFirstBB == comp->compCurBB);
GenTree* noOp = new (comp, GT_NO_OP) GenTree(GT_NO_OP, TYP_VOID);
BlockRange().InsertBefore(startNonGCNode, noOp);
}

This only runs in some cases so perhaps this needs to be expanded. I am not at my PC but I will take a closer look when I am.

@kunalspathak
Copy link
Member

he caller's code is more relevant to see why it repros with R2R enabled and doesn't with R2R disabled.

Its same with or without R2R.

@EgorBo
Copy link
Member

EgorBo commented Feb 11, 2022

I think that rather than the TryPop code, the caller's code is more relevant to see why it repros with R2R enabled and doesn't with R2R disabled.

G_M42430_IG01:              ;; offset=0000H
        00000000          stp     fp, lr, [sp,#-48]!
        00000000          stp     x19, x20, [sp,#32]
        00000000          mov     fp, sp
        00000000          str     xzr, [fp,#24] // [V02 loc1]
        00000000          mov     x19, x0
      ;; bbWeight=1    PerfScore 4.00
G_M42430_IG02:              ;; offset=0014H
        00000000          ldr     x0, [x19,#8]
        00000000          ldr     x0, [x0,#8]
        00000000          ldrsb   wzr, [x0]
        00000000          mov     x2, xzr
        00000000          movn    w1, #0
        00000000          bl      System.Threading.Barrier:SignalAndWait(int,System.Threading.CancellationToken):bool:this
        00000000          ldr     x0, [x19,#8]
        00000000          ldr     x0, [x0,#8]
        00000000          ldrsb   wzr, [x0]
        00000000          mov     x2, xzr
        00000000          movn    w1, #0
        00000000          bl      System.Threading.Barrier:SignalAndWait(int,System.Threading.CancellationToken):bool:this
        00000000          mov     w20, wzr
        00000000          ldr     x0, [x19,#8]
        00000000          ldr     w0, [x0,#32]
        00000000          cmp     w0, #0
        00000000          ble     G_M42430_IG06
      ;; bbWeight=1    PerfScore 30.00
G_M42430_IG03:              ;; offset=0058H
        00000000          ldr     x0, [x19,#16]
        00000000          add     x1, fp, #24 // [V02 loc1]
        00000000          ldr     wzr, [x0]
        00000000          bl      System.Collections.Concurrent.ConcurrentStack`1[__Canon][System.__Canon]:TryPop(byref):bool:this
        00000000          cbz     w0, G_M42430_IG05
      ;; bbWeight=4    PerfScore 34.00
G_M42430_IG04:              ;; offset=006CH
        00000000          add     w20, w20, #1
      ;; bbWeight=2    PerfScore 1.00
G_M42430_IG05:              ;; offset=0070H
        00000000          ldr     x0, [x19,#8]
        00000000          ldr     w0, [x0,#32]
        00000000          cmp     w20, w0
        00000000          blt     G_M42430_IG03
      ;; bbWeight=4    PerfScore 30.00
G_M42430_IG06:              ;; offset=0080H
        00000000          ldp     x19, x20, [sp,#32]
        00000000          ldp     fp, lr, [sp],#48
        00000000          ret     lr
      ;; bbWeight=1    PerfScore 3.00

; Total bytes of code 140, prolog size 16, PerfScore 116.00, instruction count 35, allocated bytes for code 140 (MethodHash=b47d5a41) for method <>c__DisplayClass6_0[__Canon][System.__Canon]:<SetupConcurrentStackIteration>b__1():this
; ============================================================

Codegen for that lambda

@EgorBo
Copy link
Member

EgorBo commented Feb 11, 2022

Its same with or without R2R.

Yeah I guess the only difference is the PRECODE for TryPop

@janvorli
Copy link
Member

Ok, that makes sense

@danmoseley
Copy link
Member

I am more inclined to just do the daily runs based on SDK because we ship R2R code and as you see, things change between R2R and JIT. Running with CoreRun, we are not measuring the performance of bits that we are shipping. @danmoseley - any thoughts?

@kunalspathak this has been discussed before, unless perhaps I'm thinking of the discussion about running unit tests on R2R bits. I suggest perhaps opening an issue to discuss in the perf repo.

@DrewScoggins
Copy link
Member

@kunalspathak afaik the perf lab is currently using .NET SDK to build the micro benchmarks project, but it uses CoreRun from local build of dotnet/runtime to execute it. So if R2R is not enabled by default in dotnet/runtime build output then it's not used.

This is correct as far as I know.

IIRC In the past we had pure SDK runs from dotnet/performance but SDK wasn't updated as regularly as we wanted (they were often few days long delays because the process runtime->SDK->installer changes propagation wasn't very smooth). For a while we did both, but due to machine availability it was decided to run the benchmarks only for dotnet/runtime builds. @DrewScoggins please correct me if I am wrong.

No, this is the exact reason that we decided against running the microbenchmarks on the published SDK. We would find regressions and then have to sort through hundreds of possible commits between the two builds from the runtime repo to try and find the source of the regression. It was also extremely tedious to try and trace back to the correct hashes from the runtime repo.

The dotnet/performance CI validation uses SDK to run all the benchmarks just once, but it currently has only x86 and x64 CI legs. Perhaps we should re-add ARM64 leg?

Do we have some kind of Arm64 VMs that we could do this validation on, or would we need to use our current Arm64 performance hardware? We have very limited Arm64 hardware, and moving to doing this level of validation there would tax that limited infrastructure too greatly.

@kunalspathak
Copy link
Member

This only runs in some cases so perhaps this needs to be expanded. I am not at my PC but I will take a closer look when I am.

Assigning to you @jakobbotsch

@AndyAyersMS
Copy link
Member

My understanding is that the when the caller sees a definite (every iteration) call to a managed method in a loop, it's free to assume that the loop has a gc safepoint and so no polling is needed in the loop. There's no safe way for the caller to know that the callee can't actually be hijacked.

So seems like any fix here has to happen in the callee. In the somewhat similar #57219 there was just a tiny window where the callee could be suspended. Perhaps e we need to look at forcing small methods (or perhaps methods with very short paths from entry to exit) to be fully interruptible so we don't need to hijack them?

@jakobbotsch
Copy link
Member

jakobbotsch commented Feb 12, 2022

I remembered it right: dotnet/coreclr#16039

@janvorli If return address hijacking is completely disabled for any function containing tailcalls then surely all such methods have to be forced to be fully interruptible by the JIT, or it is very easy to create mutually tailcalling methods that can never be suspended. We can also have "irrelevant" tailcalls in these functions that are never executed, but then have the side effect of disabling return address hijacking globally for the method.

Is there no alternative to completely disabling return address hijacking in this case? For example, implementing support for hijacking by modifying lr, which would also help the JIT to optimize leaf methods?

EDIT: Hmm, we should already need to mark such functions fully interruptible anyway, so not sure if this can actually get us into trouble.

@jakobbotsch
Copy link
Member

TryPop is only marked partially interruptible when crossgenned while it is marked fully interruptible in JIT, so that explains the difference between those. However, TryPop should have been marked fully interruptible by the following code, so there must be a JIT bug here.

if (block->endsWithTailCallOrJmp(this, true) && optReachWithoutCall(fgFirstBB, block))
{
// This tail call might combine with other tail calls to form a
// loop. Thus we need to either add a poll, or make the method
// fully interruptible. I chose the later because that's what
// JIT64 does.
SetInterruptible(true);
}

@jakobbotsch
Copy link
Member

The JIT believes that System.Interlocked.CompareExchange results in a GC safe point when being crossgenned while it does not believe so when jitting. The reason is that crossgen2 does not return this flag to the JIT:

if (mflags & CORINFO_FLG_NOGCCHECK)
{
call->AsCall()->gtCallMoreFlags |= GTF_CALL_M_NOGCCHECK;
}

So looks like crossgen2 needs to report this flag back.

@jkotas
Copy link
Member

jkotas commented Feb 12, 2022

implementing support for hijacking by modifying lr, which would also help the JIT to optimize leaf methods?

It would be useful to implement hijacking for lr. It is non-trivial. The problem is actually with unhijacking. The current implementation assumes that it is possible to unhijack the thread by just writing into memory without having the context that is not possible when the return address is in register.

So looks like crossgen2 needs to report this flag back.

It is easy to fix crossgen2 to report this flag back for FCalls. It violates the R2R versioning principles a bit. We do not consider changing method implementation from FCall to managed and vice versa to be a breaking change. Once crossgen2 starts returning this flag back for FCalls, changing an implementation from managed to FCall will become potential R2R breaking change.

@jakobbotsch
Copy link
Member

jakobbotsch commented Feb 12, 2022

It is easy to fix crossgen2 to report this flag back for FCalls. It violates the R2R versioning principles a bit. We do not consider changing method implementation from FCall to managed and vice versa to be a breaking change. Once crossgen2 starts returning this flag back for FCalls, changing an implementation from managed to FCall will become potential R2R breaking change.

Is there any alternative, other than marking practically all methods fully interruptible in R2R codegen?

From what I understand, even today changing a managed function to a fcall will cause GC starvation if that fcall is not hijackable. Changing crossgen to check for the attribute would only make us better off in this regard.

The current case is with a tailcall, but the same problem exists for any function that has a loop with a call in it that may change to a fcall then.

@jkotas
Copy link
Member

jkotas commented Feb 12, 2022

I think it is ok to change crossgen to check for this flag as an imperfect workaround. As you have pointed out, the story for FCalls and GC suspension is far from perfect, even for the JIT case.

jakobbotsch added a commit to jakobbotsch/runtime that referenced this issue Feb 14, 2022
When there are loops or tailcalls in a function the JIT will check for
a dominating call that is a GC safepoint to figure out if the function
needs to be marked fully interruptible or not. Methods with the
InternalCall flag may turn into fcalls that are not hijackable and thus
not GC safepoints, so in this case JIT should still mark the function
fully interruptible, but was not doing so because crossgen2 was not
reporting this flag back.

Fix dotnet#64980
@ghost ghost added the in-pr There is an active PR which will close this issue when it is merged label Feb 14, 2022
jkotas pushed a commit that referenced this issue Feb 14, 2022
When there are loops or tailcalls in a function the JIT will check for
a dominating call that is a GC safepoint to figure out if the function
needs to be marked fully interruptible or not. Methods with the
InternalCall flag may turn into fcalls that are not hijackable and thus
not GC safepoints, so in this case JIT should still mark the function
fully interruptible, but was not doing so because crossgen2 was not
reporting this flag back.

Fix #64980
@ghost ghost removed the in-pr There is an active PR which will close this issue when it is merged label Feb 14, 2022
adamsitnik added a commit to adamsitnik/performance that referenced this issue Mar 1, 2022
adamsitnik added a commit to dotnet/performance that referenced this issue Mar 1, 2022
* define NET7_0_PREVIEW2_OR_GREATER const for .NET 7 Preview2+ builds

* don't use Regex.Count API for older versions of .NET 7 (used as baseline for monthly perf runs)

* don't try to run AddRemoveFromDifferentThreads.ConcurrentStack benchmark on ARM64 as it hangs

details: dotnet/runtime#64980

* don't try to run Json_FromString.Jil_ benchmark on ARM64 as it throws AccessViolationException

details: dotnet/runtime#64657
@ghost ghost locked as resolved and limited conversation to collaborators Mar 17, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
arch-arm64 area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI
Projects
None yet
Development

Successfully merging a pull request may close this issue.