ARM32/ARM64 gcstress-extra failures #69657

Closed
jakobbotsch opened this issue May 22, 2022 · 5 comments · Fixed by #70053
Labels: arch-arm32, arch-arm64, area-CodeGen-coreclr (CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI), blocking-clean-ci-optional (Blocking optional rolling runs), GCStress

Comments

@jakobbotsch
Member

There are many arm32/arm64 failures in recent gcstress-extra runs:
https://dev.azure.com/dnceng/public/_build/results?buildId=1783259&view=results

There are a few x86/x64 failures too, but they look more consistent and are probably not GC holes.

@ghost ghost added the untriaged New issue has not been triaged by the area owner label May 22, 2022
@dotnet-issue-labeler

I couldn't figure out the best area label to add to this issue. If you have write-permissions please help me learn by adding exactly one area label.

@jakobbotsch jakobbotsch added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label May 22, 2022
@ghost

ghost commented May 22, 2022

Tagging subscribers to this area: @JulieLeeMSFT
See info in area-owners.md if you want to be subscribed.

@jakobbotsch jakobbotsch added arch-arm32 arch-arm64 and removed untriaged New issue has not been triaged by the area owner labels May 22, 2022
@jakobbotsch jakobbotsch added this to the 7.0.0 milestone May 22, 2022
@SingleAccretion
Contributor

SingleAccretion commented May 22, 2022

This may also be the cause of multiple failures documented in #68986.

I looked at a very simple & nice Linux ARM64 dump from here.

In the test, we have a few block copies between on-stack structs:

Added IP mapping: 0x0028 STACK_EMPTY (G_M8673_IG04,ins#15,ofs#60)
Generating: N058 (???,???) [000205] -----------                            IL_OFFSET void   INLRT @ 0x028[E-] REG NA
Generating: N060 (  3,  2) [000025] -c---------                   t25 =    LCL_VAR   struct<BigCopy, 32>(AX) V01 loc0          NA REG NA
Generating: N062 (???,???) [000235] Dc-----N---                  t235 =    LCL_VAR_ADDR byref  V02 loc1          NA REG NA
                                                                        /--*  t235   byref
                                                                        +--*  t25    struct
Generating: N064 (  7,  5) [000028] sA---------                         *  STORE_BLK struct<BigCopy, 32> (copy) (Unroll) REG NA
IN0016:                           ldp     x0, x1, [fp,#248]
IN0017:                           stp     x0, x1, [fp,#216]
IN0018:                           ldp     x0, x1, [fp,#264]
IN0019:                           stp     x0, x1, [fp,#232]
Added IP mapping: 0x002A STACK_EMPTY (G_M8673_IG04,ins#19,ofs#76)
Generating: N066 (???,???) [000206] -----------                            IL_OFFSET void   INLRT @ 0x02A[E-] REG NA
Generating: N068 (  3,  2) [000029] -c-----N---                   t29 =    LCL_VAR   struct<BigCopy, 32>(AX) V02 loc1          NA REG NA
Generating: N070 (???,???) [000236] Dc-----N---                  t236 =    LCL_VAR_ADDR byref  V10 tmp4          NA REG NA
                                                                        /--*  t236   byref
                                                                        +--*  t29    struct
Generating: N072 (  7,  5) [000176] sA---------                         *  STORE_BLK struct<BigCopy, 32> (copy) (Unroll) REG NA
IN001a:                           ldp     x0, x1, [fp,#216]
IN001b:                           stp     x0, x1, [fp,#72]
IN001c:                           ldp     x0, x1, [fp,#232]
IN001d:                           stp     x0, x1, [fp,#88]

Where BigCopy is this:

struct BigCopy
{
    public long l1, l2, l3;
    public object gc;
}

We crash just after these copies have been completed:

IN0016: 000090  A94F87A0          ldp     x0, x1, [fp,#248] // Copy #1
IN0017: 000094  A90D87A0          stp     x0, x1, [fp,#216]
IN0018: 000098  A95087A0          ldp     x0, x1, [fp,#264]
IN0019: 00009C  A90E87A0          stp     x0, x1, [fp,#232]
IN001a: 0000A0  A94D87A0          ldp     x0, x1, [fp,#216] // Copy #2
IN001b: 0000A4  A90487A0          stp     x0, x1, [fp,#72]
IN001c: 0000A8  A94E87A0          ldp     x0, x1, [fp,#232]
IN001d: 0000AC  A90587A0          stp     x0, x1, [fp,#88]
IN001e: 0000B0  910123A0          add     x0, fp, #72
                             ; byrRegs +[x0]
IN001f: 0000B4  910243A8          add     x8, fp, #144
                             ; byrRegs +[x8]
IN0020: 0000B8  52800061          mov     w1, #3
IN0021: 0000BC  93407C21          sxtw    x1, w1
<-------------------- Point of the crash -------------------->
IN0022: 0000C0  D2971102          movz    x2, #0xb888
IN0023: 0000C4  F2A5A4A2          movk    x2, #0x2d25 LSL #16
IN0024: 0000C8  F2CFFF02          movk    x2, #0x7ff8 LSL #32
IN0025: 0000CC  F9400042          ldr     x2, [x2]
IN0026: 0000D0  D63F0040          blr     x2

With an assert that tells us the object reference at [fp,#72] is invalid.

I think what's happening here is that we fail to report the registers used for copying (x0 and x1) to the GC.
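To make the suspected hole concrete, here is a toy model of Copy #2 with a relocating GC. It is only a sketch under stated assumptions, not CoreCLR code: `ToyMachine`, `gc`, and `copyKeepsReferenceValid` are hypothetical names, the addresses are made up, and it assumes the object reference occupies the first slot of the copied struct (which is consistent with the assert pointing at [fp,#72]).

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <set>

// Toy model only: a frame, two scratch registers, and a GC that relocates
// one object and fixes up only the locations that were reported to it.
struct ToyMachine {
    std::map<int, std::uint64_t> stack;   // frame offset -> 8-byte slot
    std::uint64_t regs[2] = {0, 0};       // stand-ins for x0 and x1
    std::set<int> reportedGcSlots;        // stack slots reported as object refs
    std::uint64_t liveObject = 0;         // current address of the only object

    void gc(std::uint64_t newAddress) {
        for (int slot : reportedGcSlots) {
            if (stack[slot] == liveObject) {
                stack[slot] = newAddress; // reported references get updated
            }
        }
        // regs[] is deliberately untouched: the copy temporaries are not
        // reported, which is the suspected hole.
        liveObject = newAddress;
    }
};

// Returns true if both the source and destination reference are still valid
// after the copy and one GC. gcAllowedMidCopy models an interruptible copy.
bool copyKeepsReferenceValid(bool gcAllowedMidCopy) {
    ToyMachine m;
    m.liveObject = 0x1000;                // hypothetical object address
    m.stack[216] = m.liveObject;          // source: object reference (assumed first slot)
    m.stack[224] = 42;                    // source: one of the long fields
    m.stack[72] = 0;                      // destination, not yet written
    m.stack[80] = 0;
    m.reportedGcSlots = {216, 72};

    // ldp x0, x1, [fp,#216]
    m.regs[0] = m.stack[216];
    m.regs[1] = m.stack[224];

    if (gcAllowedMidCopy) {
        m.gc(0x2000);                     // GC between the load and the store
    }

    // stp x0, x1, [fp,#72]
    m.stack[72] = m.regs[0];
    m.stack[80] = m.regs[1];

    if (!gcAllowedMidCopy) {
        m.gc(0x2000);                     // GC only after the copy completes
    }
    return m.stack[72] == m.liveObject && m.stack[216] == m.liveObject;
}
```

In this model `copyKeepsReferenceValid(true)` returns false, because the stale value in the x0 stand-in is stored to [fp,#72] after the object has moved, while `copyKeepsReferenceValid(false)` returns true. The second case corresponds to treating the copy as a nogc region.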

In the "baseline" (actually, my fork), we mark these copies as non-interruptible:

                                                ;; size=60 bbWeight=1    PerfScore 12.00
G_M8673_IG05:        ; func=00, offs=000090H, size=0010H, BB05 [0001], nogc, extend
IN0016: 000090  A94F87A0          ldp     x0, x1, [fp,#248]
IN0017: 000094  A90D87A0          stp     x0, x1, [fp,#216]
IN0018: 000098  A95087A0          ldp     x0, x1, [fp,#264]
IN0019: 00009C  A90E87A0          stp     x0, x1, [fp,#232]
                                                ;; size=16 bbWeight=1    PerfScore 8.00
G_M8673_IG06:        ; func=00, offs=0000A0H, size=0010H, BB05 [0001], nogc, extend
IN001a: 0000A0  A94D87A0          ldp     x0, x1, [fp,#216]
IN001b: 0000A4  A90487A0          stp     x0, x1, [fp,#72]
IN001c: 0000A8  A94E87A0          ldp     x0, x1, [fp,#232]
IN001d: 0000AC  A90587A0          stp     x0, x1, [fp,#88]

So this implicates #69202.

cc @kunalspathak

@kunalspathak
Member

In #69202, we mark the IG as nogc only if we know that adjustments will be made to calculate the offset in order to encode it for an instruction, and I don't see that here. Prior to #69202, we would mark the IG as nogc if the destination was on the local stack (which is the case for IG05 and IG06). Regardless, I will have to investigate it further.

@kunalspathak
Member

There was a scenario where I was not marking a region as non-interruptible when the copy block contains GC refs. Fixed in #70053.

@ghost ghost removed the in-pr There is an active PR which will close this issue when it is merged label Jun 1, 2022
@ghost ghost locked as resolved and limited conversation to collaborators Jul 1, 2022