AddRemoveFromDifferentThreads<string>.ConcurrentStack benchmark hangs on ARM64 #64980
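Tagging subscribers to this area: @JulieLeeMSFT
Issue Details
The AddRemoveFromDifferentThreads<string>.ConcurrentStack benchmark hangs on Windows ARM64 (.NET 7 Preview 1).
Repro:
git clone https://github.com/dotnet/performance.git
py .\performance\scripts\benchmarks_ci.py --architecture arm64 -f net7.0 --dotnet-versions 7.0.100-preview.1.22077.12 --filter *AddRemoveFromDifferentThreads*string*ConcurrentStack*
I was not able to attach a VS debugger to the live process as VS seems to not support it on ARM64 yet. I've captured a dump and uploaded it here.
PerfView shows me that the process is stuck in the following loop: https://github.com/dotnet/performance/blob/3edf4c3149f5c903777f13af9b08b562b1672302/src/benchmarks/micro/libraries/System.Collections/Concurrent/AddRemoveFromDifferentThreads.cs#L94-L100
Since AddRemoveFromDifferentThreads<int>.ConcurrentStack and AddRemoveFromDifferentThreads<string>.ConcurrentBag work fine, I guess it's a codegen issue.
Hardware: Surface Pro X, Win 10, arm64
Please let me know if there is any way I could help.
cc @kunalspathak @AndyAyersMS @kouvel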
|
I will take a look at the code gen. |
@janvorli has hit this issue on Linux, I am going to rename the issue and change the labels |
What I can see on my Arm64 linux is that at the time of the hang, the stack is empty (the _head member of the stack is null), but we have read only 73033 elements (based on the count) while the Size is 200000. |
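The observation above (an empty stack with 73033 of 200000 elements read) can be illustrated with a small model of the consumer loop. This is only a sketch under stated assumptions: `ToyStack`, `drain`, and the `max_attempts` budget are hypothetical names introduced for illustration and are not part of the benchmark or of .NET. The point it shows is that only successful pops advance the counter, so if the stack stays empty the loop never reaches the target on its own.

```cpp
#include <cstddef>
#include <optional>
#include <vector>

// Hypothetical single-threaded stand-in for the stack used by the benchmark.
struct ToyStack {
    std::vector<int> items;

    // Returns the top element, or std::nullopt when the stack is empty.
    std::optional<int> try_pop() {
        if (items.empty()) return std::nullopt;
        int value = items.back();
        items.pop_back();
        return value;
    }
};

struct DrainResult {
    std::size_t popped;  // number of successful pops
    bool finished;       // true if the target count was reached
};

// Models the consumer loop: the counter advances only on a successful pop.
// max_attempts is an illustration-only budget so that this sketch terminates;
// the real loop has no such budget and would keep spinning.
DrainResult drain(ToyStack& stack, std::size_t size, std::size_t max_attempts) {
    std::size_t count = 0;
    for (std::size_t attempt = 0; attempt < max_attempts && count < size; ++attempt) {
        if (stack.try_pop()) ++count;
    }
    return {count, count == size};
}
```

With 73033 items available and a target of 200000, `drain` reports `finished == false` no matter how large the budget is, which matches the shape of the hang described in the comment.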
Does it reproduce every run? We can try to revert all the recent VOLATILE-related changes which I believe were correct but slightly changed things. |
I don't know, I just hit that while running the whole benchmark suite and I have attached lldb to the process to see what's wrong. |
I've attempted to run the test (with the filter set as @adamsitnik described in the repro) five times and it hung in all of the cases. So it sounds like it reproduces with at least high probability. |
Thanks! I'll check a couple of Memory-model related commits |
Yes, I was looking into this yesterday. Unfortunately, the windbg on the machine is not working as expected and dotnet/diagnostics#2858 is in progress to fix it. Here are the threads that are hanging. I will debug it a little more. |
I have verified that this issue doesn't reproduce when I set |
@kunalspathak afaik the perf lab currently uses the .NET SDK to build the micro benchmarks project, but it uses CoreRun from a local build of dotnet/runtime to execute it. So if R2R is not enabled by default in the dotnet/runtime build output, it's not used. IIRC, in the past we had pure SDK runs from dotnet/performance, but the SDK wasn't updated as regularly as we wanted (there were often delays of a few days because the runtime->SDK->installer propagation of changes wasn't very smooth). For a while we did both, but due to machine availability it was decided to run the benchmarks only for dotnet/runtime builds. @DrewScoggins please correct me if I am wrong. The dotnet/performance CI validation uses the SDK to run all the benchmarks just once, but it currently has only x86 and x64 CI legs. Perhaps we should re-add an ARM64 leg? |
I am hitting the same hang with ready to run disabled as well. |
@janvorli how have you configured the env var? The python script from the perf repo has one "hidden feature" : it by default dumps all the
So if you want the script to actually use given env var, you need to do it in an explicit way: --bdn-arguments "--envVars key:value" |
Ah, I thought it would just use the ambient environment. Let me try again. |
I can confirm that the issue doesn't occur with R2R disabled. |
I am more inclined to just do the daily runs based on SDK because we ship R2R code and as you see, things change between R2R and JIT. Running with CoreRun, we are not measuring the performance of bits that we are shipping. @danmoseley - any thoughts? |
We generally don't expect BDN to be measuring R2R code unless we've somehow failed to tier up appropriately. So to first order it should not matter a whole lot whether the assemblies have R2R or not. However Tier1 codegen could shift a bit because with R2R disabled I think we also lose any embedded PGO data. |
I confirm that the issue doesn't reproduce if I revert the changes in #62895. For that, I need the |
Looks like I was wrong. The benchmark is very sensitive and slightest change affects the behavior. Basically, if the methods in The observation of @janvorli is correct that I could verify the same observation in the dump that @adamsitnik has provided.
When the I tried looking for |
GC is simply waiting for suspension to finish and it's not finishing. this means some thread is not getting suspended. what's the output for !threads? it should tell you which thread(s) are still in coop mode. |
TryPop tier1 jit codegen
; Assembly listing for method System.Collections.Concurrent.ConcurrentStack`1[__Canon][System.__Canon]:TryPop(byref):bool:this
; Emitting BLENDED_CODE for generic ARM64 CPU - MacOS
; Tier-1 compilation
; optimized code
; fp based frame
; fully interruptible
; No PGO data
; 0 inlinees with PGO data; 1 single block inlinees; 0 inlinees without PGO data
; Final local variable assignments
;
; V00 this [V00,T00] ( 5, 4 ) ref -> x19 this class-hnd single-def
; V01 arg1 [V01,T01] ( 5, 3.50) byref -> x20 single-def
; V02 loc0 [V02,T02] ( 6, 4 ) ref -> x21 class-hnd single-def
;# V03 OutArgs [V03 ] ( 1, 1 ) lclBlk ( 0) [sp+00H] "OutgoingArgSpace"
;* V04 tmp1 [V04,T06] ( 0, 0 ) long -> zero-ref "impRuntimeLookup slot"
; V05 tmp2 [V05,T04] ( 2, 2 ) byref -> x0 single-def "impAppendStmt"
; V06 tmp3 [V06,T05] ( 2, 2 ) ref -> x1 class-hnd single-def "impAppendStmt"
;* V07 tmp4 [V07 ] ( 0, 0 ) long -> zero-ref "spilling Runtime Lookup tree"
; V08 cse0 [V08,T03] ( 3, 2.50) byref -> x0 "CSE - aggressive"
;
; Lcl frame size = 8
G_M57257_IG01: ;; offset=0000H
00000000 stp fp, lr, [sp,#-48]!
00000000 stp x19, x20, [sp,#24]
00000000 str x21, [sp,#40]
00000000 mov fp, sp
00000000 mov x19, x0
00000000 mov x20, x1
;; bbWeight=1 PerfScore 4.50
G_M57257_IG02: ;; offset=0018H
00000000 add x0, x19, #8
00000000 ldar x21, [x0]
00000000 cbnz x21, G_M57257_IG05
;; bbWeight=1 PerfScore 4.50
G_M57257_IG03: ;; offset=0024H
00000000 str xzr, [x20]
00000000 mov w0, wzr
;; bbWeight=0.50 PerfScore 0.75
G_M57257_IG04: ;; offset=002CH
00000000 ldr x21, [sp,#40]
00000000 ldp x19, x20, [sp,#24]
00000000 ldp fp, lr, [sp],#48
00000000 ret lr
;; bbWeight=0.50 PerfScore 2.50
G_M57257_IG05: ;; offset=003CH
00000000 ldrsb wzr, [x19]
00000000 ldr x1, [x21,#16]
00000000 mov x2, x21
00000000 bl System.Threading.Interlocked:CompareExchange(byref,System.Object,System.Object):System.Object
00000000 cmp x0, x21
00000000 bne G_M57257_IG07
00000000 ldr x15, [x21,#8]
00000000 mov x14, x20
00000000 bl CORINFO_HELP_CHECKED_ASSIGN_REF
00000000 mov w0, #1
;; bbWeight=0.50 PerfScore 7.00
G_M57257_IG06: ;; offset=0064H
00000000 ldr x21, [sp,#40]
00000000 ldp x19, x20, [sp,#24]
00000000 ldp fp, lr, [sp],#48
00000000 ret lr
;; bbWeight=0.50 PerfScore 2.50
G_M57257_IG07: ;; offset=0074H
00000000 mov x0, x19
00000000 mov x1, x20
;; bbWeight=0.50 PerfScore 0.50
G_M57257_IG08: ;; offset=007CH
00000000 ldr x21, [sp,#40]
00000000 ldp x19, x20, [sp,#24]
00000000 ldp fp, lr, [sp],#48
00000000 b System.Collections.Concurrent.ConcurrentStack`1[__Canon][System.__Canon]:TryPopCore(byref):bool:this
;; bbWeight=0.50 PerfScore 2.50
tail calls TryPopCore |
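For readers less used to ARM64 listings, the control flow of the TryPop listing above can be sketched in C++. This is an approximation and not the .NET source: `Node`, `ToyConcurrentStack`, `try_pop`, and `try_pop_core` are hypothetical names, and memory reclamation and ABA concerns are ignored (in .NET the GC takes care of node lifetime). It shows the shape visible in the listing: an acquire load of the head, an early return when the head is null, a single compare-exchange on the fast path, and a fall-through to a retrying slow path when the compare-exchange loses a race.

```cpp
#include <atomic>

struct Node {
    int value;
    Node* next;
};

// Hypothetical lock-free stack sketch; nodes are owned by the caller.
struct ToyConcurrentStack {
    std::atomic<Node*> head{nullptr};

    void push(Node* node) {
        node->next = head.load(std::memory_order_relaxed);
        // On failure, node->next is refreshed with the current head; retry.
        while (!head.compare_exchange_weak(node->next, node,
                                           std::memory_order_release,
                                           std::memory_order_relaxed)) {
        }
    }

    // Slow path: keep retrying until the pop succeeds or the stack is empty.
    bool try_pop_core(int& result) {
        Node* current = head.load(std::memory_order_acquire);
        while (current != nullptr) {
            // On failure, current is refreshed with the latest head.
            if (head.compare_exchange_weak(current, current->next,
                                           std::memory_order_acq_rel,
                                           std::memory_order_acquire)) {
                result = current->value;
                return true;
            }
        }
        result = 0;
        return false;
    }

    // Fast path: one acquire load and one compare-exchange attempt.
    bool try_pop(int& result) {
        Node* current = head.load(std::memory_order_acquire);  // like the ldar
        if (current == nullptr) {
            result = 0;  // stands in for default(T)
            return false;
        }
        Node* expected = current;
        if (head.compare_exchange_strong(expected, current->next,
                                         std::memory_order_acq_rel,
                                         std::memory_order_acquire)) {
            result = current->value;
            return true;
        }
        return try_pop_core(result);  // the listing ends with a tail call here
    }
};
```

Note that on the empty-stack path the function returns without calling anything, which is the path the spinning consumer keeps taking.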
I think that rather than the |
We do try to ensure we do not form uninterruptible loops with tailcalls: runtime/src/coreclr/jit/lower.cpp Lines 1951 to 1966 in e04d750
This only runs in some cases so perhaps this needs to be expanded. I am not at my PC but I will take a closer look when I am. |
It's the same with or without R2R. |
G_M42430_IG01: ;; offset=0000H
00000000 stp fp, lr, [sp,#-48]!
00000000 stp x19, x20, [sp,#32]
00000000 mov fp, sp
00000000 str xzr, [fp,#24] // [V02 loc1]
00000000 mov x19, x0
;; bbWeight=1 PerfScore 4.00
G_M42430_IG02: ;; offset=0014H
00000000 ldr x0, [x19,#8]
00000000 ldr x0, [x0,#8]
00000000 ldrsb wzr, [x0]
00000000 mov x2, xzr
00000000 movn w1, #0
00000000 bl System.Threading.Barrier:SignalAndWait(int,System.Threading.CancellationToken):bool:this
00000000 ldr x0, [x19,#8]
00000000 ldr x0, [x0,#8]
00000000 ldrsb wzr, [x0]
00000000 mov x2, xzr
00000000 movn w1, #0
00000000 bl System.Threading.Barrier:SignalAndWait(int,System.Threading.CancellationToken):bool:this
00000000 mov w20, wzr
00000000 ldr x0, [x19,#8]
00000000 ldr w0, [x0,#32]
00000000 cmp w0, #0
00000000 ble G_M42430_IG06
;; bbWeight=1 PerfScore 30.00
G_M42430_IG03: ;; offset=0058H
00000000 ldr x0, [x19,#16]
00000000 add x1, fp, #24 // [V02 loc1]
00000000 ldr wzr, [x0]
00000000 bl System.Collections.Concurrent.ConcurrentStack`1[__Canon][System.__Canon]:TryPop(byref):bool:this
00000000 cbz w0, G_M42430_IG05
;; bbWeight=4 PerfScore 34.00
G_M42430_IG04: ;; offset=006CH
00000000 add w20, w20, #1
;; bbWeight=2 PerfScore 1.00
G_M42430_IG05: ;; offset=0070H
00000000 ldr x0, [x19,#8]
00000000 ldr w0, [x0,#32]
00000000 cmp w20, w0
00000000 blt G_M42430_IG03
;; bbWeight=4 PerfScore 30.00
G_M42430_IG06: ;; offset=0080H
00000000 ldp x19, x20, [sp,#32]
00000000 ldp fp, lr, [sp],#48
00000000 ret lr
;; bbWeight=1 PerfScore 3.00
; Total bytes of code 140, prolog size 16, PerfScore 116.00, instruction count 35, allocated bytes for code 140 (MethodHash=b47d5a41) for method <>c__DisplayClass6_0[__Canon][System.__Canon]:<SetupConcurrentStackIteration>b__1():this
; ============================================================
Codegen for that lambda |
Yeah I guess the only difference is the PRECODE for TryPop |
Ok, that makes sense |
@kunalspathak this has been discussed before, unless perhaps I'm thinking of the discussion about running unit tests on R2R bits. I suggest perhaps opening an issue to discuss in the perf repo. |
This is correct as far as I know.
No, this is the exact reason that we decided against running the microbenchmarks on the published SDK. We would find regressions and then have to sort through hundreds of possible commits between the two builds from the runtime repo to try and find the source of the regression. It was also extremely tedious to try and trace back to the correct hashes from the runtime repo.
Do we have some kind of Arm64 VMs that we could do this validation on, or would we need to use our current Arm64 performance hardware? We have very limited Arm64 hardware, and moving to doing this level of validation there would tax that limited infrastructure too greatly. |
Assigning to you @jakobbotsch |
My understanding is that when the caller sees a definite (every iteration) call to a managed method in a loop, it's free to assume that the loop has a gc safepoint and so no polling is needed in the loop. There's no safe way for the caller to know that the callee can't actually be hijacked. So it seems like any fix here has to happen in the callee. In the somewhat similar #57219 there was just a tiny window where the callee could be suspended. Perhaps we need to look at forcing small methods (or perhaps methods with very short paths from entry to exit) to be fully interruptible so we don't need to hijack them? |
@janvorli If return address hijacking is completely disabled for any function containing tailcalls then surely all such methods have to be forced to be fully interruptible by the JIT, or it is very easy to create mutually tailcalling methods that can never be suspended. We can also have "irrelevant" tailcalls in these functions that are never executed, but then have the side effect of disabling return address hijacking globally for the method. Is there no alternative to completely disabling return address hijacking in this case? For example, implementing support for hijacking by modifying EDIT: Hmm, we should already need to mark such functions fully interruptible anyway, so not sure if this can actually get us into trouble. |
runtime/src/coreclr/jit/flowgraph.cpp Lines 4225 to 4232 in 55884ce
|
The JIT believes that runtime/src/coreclr/jit/importer.cpp Lines 9047 to 9050 in 55884ce
So looks like crossgen2 needs to report this flag back. |
It would be useful to implement hijacking for lr. It is non-trivial. The problem is actually with unhijacking. The current implementation assumes that it is possible to unhijack the thread by just writing into memory, without having the thread context; that is not possible when the return address is in a register.
It is easy to fix crossgen2 to report this flag back for FCalls. It violates the R2R versioning principles a bit. We do not consider changing method implementation from FCall to managed and vice versa to be a breaking change. Once crossgen2 starts returning this flag back for FCalls, changing an implementation from managed to FCall will become a potential R2R breaking change. |
Is there any alternative, other than marking practically all methods fully interruptible in R2R codegen? From what I understand, even today changing a managed function to a fcall will cause GC starvation if that fcall is not hijackable. Changing crossgen to check for the attribute would only make us better off in this regard. The current case is with a tailcall, but the same problem exists for any function that has a loop with a call in it that may change to a fcall then. |
I think it is ok to change crossgen to check for this flag as an imperfect workaround. As you have pointed out, the story for FCalls and GC suspension is far from perfect, even for the JIT case. |
When there are loops or tailcalls in a function the JIT will check for a dominating call that is a GC safepoint to figure out if the function needs to be marked fully interruptible or not. Methods with the InternalCall flag may turn into fcalls that are not hijackable and thus not GC safepoints, so in this case JIT should still mark the function fully interruptible, but was not doing so because crossgen2 was not reporting this flag back. Fix #64980
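The rule described in this fix can be sketched as a small decision function. This is a simplified model and not the JIT's actual code: `CallSite`, `is_gc_safepoint`, and `needs_fully_interruptible` are hypothetical names, and the real check works on the flow graph rather than on a flat list of calls. It assumes, as the description says, that a call to a method with the InternalCall flag must not be counted as a GC safepoint because it may turn into a non-hijackable fcall.

```cpp
#include <vector>

// Hypothetical, simplified view of a call site; not a real JIT data structure.
struct CallSite {
    bool is_managed_call;   // the call targets a managed method
    bool is_internal_call;  // the target has the InternalCall flag (may become an fcall)
};

// In this model a call is a GC safepoint only if it is a managed call that
// cannot turn into a non-hijackable fcall.
bool is_gc_safepoint(const CallSite& call) {
    return call.is_managed_call && !call.is_internal_call;
}

// A method with a loop or tailcall needs fully interruptible code unless
// some dominating call is a GC safepoint.
bool needs_fully_interruptible(bool has_loop_or_tailcall,
                               const std::vector<CallSite>& dominating_calls) {
    if (!has_loop_or_tailcall) return false;
    for (const CallSite& call : dominating_calls) {
        if (is_gc_safepoint(call)) return false;
    }
    return true;
}
```

In this model, if the InternalCall flag is not reported (the field is wrongly `false`), the dominating call looks like a safepoint and the method is left partially interruptible, which is the behavior the fix is meant to prevent.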
* define NET7_0_PREVIEW2_OR_GREATER const for .NET 7 Preview2+ builds
* don't use Regex.Count API for older versions of .NET 7 (used as baseline for monthly perf runs)
* don't try to run AddRemoveFromDifferentThreads.ConcurrentStack benchmark on ARM64 as it hangs; details: dotnet/runtime#64980
* don't try to run Json_FromString.Jil_ benchmark on ARM64 as it throws AccessViolationException; details: dotnet/runtime#64657
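The AddRemoveFromDifferentThreads<string>.ConcurrentStack benchmark hangs on Windows ARM64 (.NET 7 Preview 1).
Repro:
git clone https://github.com/dotnet/performance.git
py .\performance\scripts\benchmarks_ci.py --architecture arm64 -f net7.0 --dotnet-versions 7.0.100-preview.1.22077.12 --filter *AddRemoveFromDifferentThreads*string*ConcurrentStack*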
I was not able to attach a VS debugger to the live process as VS seems to not support it on ARM64 yet.
I've captured a dump and uploaded it here.
PerfView shows me that the process is stuck in the following loop:
https://github.com/dotnet/performance/blob/3edf4c3149f5c903777f13af9b08b562b1672302/src/benchmarks/micro/libraries/System.Collections/Concurrent/AddRemoveFromDifferentThreads.cs#L94-L100
Since AddRemoveFromDifferentThreads<int>.ConcurrentStack and AddRemoveFromDifferentThreads<string>.ConcurrentBag work fine, I guess it's a codegen issue.
Hardware: Surface Pro X, Win 10, arm64
Please let me know if there is any way I could help.
cc @kunalspathak @AndyAyersMS @kouvel