MonoTests.System.Drawing.TestBitmap fail on Linux arm on checked coreclr #37082

Closed
safern opened this issue May 27, 2020 · 26 comments
Labels
arch-arm32 area-System.Drawing os-linux Linux OS (any supported distro) test-run-core Test failures in .NET Core test runs

Comments

@safern
Member

safern commented May 27, 2020

Some tests from TestBitmap fail with:

Assert failure(PID 23 [0x00000017], Thread: 39 [0x0027]): object->HasEmptySyncBlockInfo()
    File: /__w/1/s/src/coreclr/src/vm/jithelpers.cpp Line: 2279
    Image: /root/helix/work/correlation/dotnet

when running on checked coreclr. Some of these tests are:

  • LockBits_ImageLockMode_Invalid
  • LockBits_Double
  • SetResolution_Negative_X

Helix queue: Ubuntu.1804.Armarch.Open
Docker image: mcr.microsoft.com/dotnet-buildtools/prereqs:ubuntu-18.04-helix-arm32v7-bfcd90a-20200121150440

cc: @BruceForstall @jashook

category:correctness
theme:gc-stress
skill-level:expert
cost:medium

@safern safern added arch-arm32 test-run-core Test failures in .NET Core test runs labels May 27, 2020
@Dotnet-GitSync-Bot Dotnet-GitSync-Bot added area-System.Drawing untriaged New issue has not been triaged by the area owner labels May 27, 2020
@ghost

ghost commented May 27, 2020

Tagging subscribers to this area: @safern, @tannergooding
Notify danmosemsft if you want to be subscribed.

@safern safern added area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI and removed area-System.Drawing labels May 27, 2020
@safern
Member Author

safern commented May 27, 2020

I'm disabling these in #36910.

@BruceForstall
Member

I've seen this failure with many tests in #36486. I haven't had the chance to investigate / disable them yet.

@BruceForstall BruceForstall added this to the 5.0 milestone Jun 1, 2020
@BruceForstall BruceForstall removed the untriaged New issue has not been triaged by the area owner label Jun 1, 2020
@BruceForstall
Member

@janvorli I haven't investigated this yet. The assert makes it seem like a VM, not codegen, issue.

@BruceForstall BruceForstall self-assigned this Jun 24, 2020
@BruceForstall
Member

@BruceForstall BruceForstall added the os-linux Linux OS (any supported distro) label Jul 29, 2020
@BruceForstall
Member

I ran the System.Memory.Tests case over 450 times and didn't see this assert.

@safern
Member Author

safern commented Jul 30, 2020

I remember this reproducing pretty consistently on the TestBitmap tests.

@BruceForstall
Member

Unfortunately, due to dotnet/arcade#5786, we aren't getting any Kusto data about Linux arm failures, so we can't look historically and see what the failure rate is.

@AndyAyersMS
Member

@BruceForstall I might poach this one from you...

@BruceForstall
Member

@AndyAyersMS Yes, feel free. Note that #40126 showed this still repros.

@AndyAyersMS
Member

Yep, readily repros at 664d9f8. This seems to be a GC or runtime issue (or perhaps some kind of heap corruption). The assert fires here:

* thread #16, name = 'corerun', stop reason = signal SIGTRAP
  * frame #0: 0xf7ad92e6 libcoreclr.so`DBG_DebugBreak + 2
    frame #1: 0xf7a7a7e6 libcoreclr.so`::DebugBreak() at debug.cpp:410
    frame #2: 0xf7679c6c libcoreclr.so`::DbgAssertDialog(szFile=0xf7b8660a, iLine=2284, szExpr=<unavailable>) at debug.cpp:697
    frame #3: 0xf7946342 libcoreclr.so`JIT_NewS_MP_FastPortable(typeHnd_=0xe4e16c8c) at jithelpers.cpp:2284

which I think checks whether the underlying heap storage for the new object was properly zeroed.
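A toy model may help show why a stray write can surface as this particular assert in an allocation helper. The sketch below is a simplification, not CoreCLR's allocator: it assumes a 32-bit layout in which an object reference points at its MethodTable slot with a 4-byte header immediately before it, and that allocation memory is handed out pre-zeroed. `ToyAllocContext` and `native_write` are hypothetical names; the 0x4C object size matches the BitmapData size SOS reports later in this thread.

```python
HEADER_SIZE = 4  # assumed 32-bit object header (sync block index)
MT_SIZE = 4      # assumed 32-bit MethodTable pointer
FAKE_MT = b"\x01\x00\x00\x00"  # stand-in for a real MethodTable pointer


class ToyAllocContext:
    """Toy bump-pointer allocator over pre-zeroed memory (not CoreCLR code)."""

    def __init__(self, capacity):
        self.mem = bytearray(capacity)   # zero-filled, like fresh allocation memory
        self.alloc_ptr = HEADER_SIZE     # leave room for the first object's header

    def alloc(self, base_size):
        """Carve out base_size bytes; base_size includes one header-sized slot."""
        if self.alloc_ptr + base_size > len(self.mem):
            raise MemoryError("toy allocation context exhausted")
        obj = self.alloc_ptr             # reference points at the MethodTable slot
        header = self.mem[obj - HEADER_SIZE:obj]
        # Analogue of the checked-build assert: a new object's header must still be zero.
        if any(header):
            raise AssertionError("object->HasEmptySyncBlockInfo()")
        self.mem[obj:obj + MT_SIZE] = FAKE_MT
        self.alloc_ptr += base_size
        return obj


def native_write(ctx, obj, field_offset, data):
    """Hypothetical native callee copying data into an object with no bounds check."""
    start = obj + field_offset
    if start + len(data) > len(ctx.mem):
        raise IndexError("write past end of toy memory")
    ctx.mem[start:start + len(data)] = data


if __name__ == "__main__":
    ctx = ToyAllocContext(1024)
    a = ctx.alloc(0x4C)
    native_write(ctx, a, 0x04, b"\xab" * 0x44)  # exactly fills a's 17 four-byte fields
    b = ctx.alloc(0x4C)                          # b's header is still zero: no assert
    native_write(ctx, b, 0x04, b"\xab" * 0x48)  # 4 bytes past b's last field
    try:
        ctx.alloc(0x4C)                          # the next object's header is now dirty
    except AssertionError as err:
        print("assert fired:", err)
```

In this model the overrun is only detected when a later allocation lands on the dirtied header, which is one way an earlier native write could show up in an unrelated allocation helper such as JIT_NewS_MP_FastPortable.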

Re-labelling as a VM issue for now.

I'll try to check for heap corruption, but I'm currently having trouble getting the right SOS (dotnet/diagnostics#1418).

@AndyAyersMS AndyAyersMS added area-VM-coreclr and removed area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI labels Aug 7, 2020
@AndyAyersMS
Member

Got the right SOS, and there is heap corruption.

(lldb) verifyheap
Object f22787ec has an invalid method table.
Last good object: F22787A0.

Let me see if I can track this down.

@AndyAyersMS AndyAyersMS self-assigned this Aug 7, 2020
@AndyAyersMS
Member

I'll keep looking but it might also help to have somebody on the VM side looking too. cc @mangod9

@mangod9
Member

mangod9 commented Aug 13, 2020

Is this still a Linux arm32-specific failure, or does it repro on other platforms?

@AndyAyersMS
Copy link
Member

Linux arm32 only.

@mangod9
Copy link
Member

mangod9 commented Aug 13, 2020

Ok, thx. + @AntonLapounov

@AndyAyersMS
Member

Seeing if I can pin this down to a particular test case. Running serially I get an assert during TestBitmap.LockBits_Double, but this passes if run as the only test.

@AndyAyersMS
Member

AndyAyersMS commented Aug 13, 2020

Running two tests I can repro:

-method MonoTests.System.Drawing.TestBitmap.LockBitmap_Format32bppArgb_Format24bppRgb_ReadWrite_Partial 
-method MonoTests.System.Drawing.TestBitmap.LockBits_Double

Not sure whether it matters which test runs first; checking. (Yes, it seems to matter; not sure on what exactly just yet, but bitmap locking seems likely.)

@AndyAyersMS
Copy link
Member

@safern do we know if this is a regression, or a test that we just started running?

I also suspect my local libgdiplus may not be up to date; some other tests fail with libgdiplus errors. I have:

Package: libgdiplus
Status: install ok installed
Priority: optional
Section: libs
Installed-Size: 366
Maintainer: Debian Mono Group <pkg-mono-group@lists.alioth.debian.org>
Architecture: armhf
Version: 6.0.4-0xamarin1+ubuntu1804b1
Depends: libc6 (>= 2.27), libcairo2 (>= 1.10.0), libexif12 (>= 0.6.21-1~), libfontconfig1 (>= 2.12), libfreetype6 (>= 2.2.1), libgif7 (>= 5.1), libglib2.0-0 (>= 2.31.8), libjpeg8 (>= 8c), libpng16-16 (>= 1.6.2-1), libtiff5 (>= 4.0.3), libx11-6
Description: interface library for System.Drawing of Mono
 This package contains a GDI+ API compatible implementation needed by the
 System.Drawing library of Mono.
Description-md5: 448897d7c1f6d9b0a49096653fa8811b
Homepage: http://www.mono-project.com/Libgdiplus

/usr/lib/libgdiplus.so.0.0.0: ELF 32-bit LSB shared object, ARM, EABI5 version 1 (SYSV), dynamically linked, BuildID[sha1]=2995eb2b944c77cc5e6bb526e1dd81416abeefba, stripped

@AndyAyersMS
Member

The corrupt heap location is just past the end of a BitmapData object allocated by the test, so I'm seeing if I can set a suitable data breakpoint to watch this spot.

(lldb) verifyheap
Object f21277b4 has an invalid method table.
Last good object: F2127768.
(lldb) dumpobj F2127768
Name:        System.Drawing.Imaging.BitmapData
MethodTable: e9db1960
EEClass:     e9d9cb64
Size:        76(0x4c) bytes
File:        /mnt/laptop/repos/runtime/artifacts/tests/coreclr/Linux.arm.Checked/Tests/Core_Root/System.Drawing.Common.dll
Fields:
      MT    Field   Offset                 Type VT     Attr    Value Name
f64536d8  4000658        4         System.Int32  1 instance        4 _width
f64536d8  4000659        8         System.Int32  1 instance        4 _height
f64536d8  400065a        c         System.Int32  1 instance       12 _stride
e9db0d34  400065b       10         System.Int32  1 instance   137224 _pixelFormat
f5ae2410  400065c       14        System.IntPtr  1 instance 00B1A678 _scan0
f64536d8  400065d       18         System.Int32  1 instance        0 _reserved
f5ae2410  400065e       1c        System.IntPtr  1 instance 00000D00 palette
f64536d8  400065f       20         System.Int32  1 instance        0 property_count
f5ae2410  4000660       24        System.IntPtr  1 instance 00000000 property
f5ae0af4  4000661       28        System.Single  1 instance 0.000000 dpi_horz
f5ae0af4  4000662       2c        System.Single  1 instance 0.000000 dpi_vert
f64536d8  4000663       30         System.Int32  1 instance        0 image_flags
f64536d8  4000664       34         System.Int32  1 instance        0 left
f64536d8  4000665       38         System.Int32  1 instance    65536 top
f64536d8  4000666       3c         System.Int32  1 instance        0 x
f64536d8  4000667       40         System.Int32  1 instance        0 y
f64536d8  4000668       44         System.Int32  1 instance        4 transparent
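The addresses from both `verifyheap` runs line up with this dump. A quick check of the arithmetic, assuming the usual 32-bit CoreCLR layout (an object reference points at the MethodTable slot, a 4-byte header sits immediately before it, and the reported size includes one header slot). Only the addresses, sizes, and offsets come from the output above; the helper names are illustrative.

```python
HEADER_SIZE = 4  # assumed 32-bit object header
MT_SIZE = 4      # assumed 32-bit MethodTable pointer
FIELD_SIZE = 4   # every BitmapData field above is 4 bytes on arm32

# Values copied from the SOS output in this thread.
BITMAPDATA_SIZE = 0x4C       # "Size: 76(0x4c) bytes"
FIELD_COUNT = 17             # _width through transparent
LAST_FIELD_OFFSET = 0x44     # offset of "transparent"


def next_object(addr, size):
    """Where the GC expects the following object's MethodTable slot."""
    return addr + size


def next_header(addr, size):
    """Where the following object's 4-byte header lives."""
    return addr + size - HEADER_SIZE


# The reported size is exactly MT slot + fields + one header slot.
assert MT_SIZE + FIELD_COUNT * FIELD_SIZE + HEADER_SIZE == BITMAPDATA_SIZE

last_good = 0xF2127768
print(hex(next_object(last_good, BITMAPDATA_SIZE)))     # 0xf21277b4, the "invalid" object
print(hex(next_header(last_good, BITMAPDATA_SIZE)))     # 0xf21277b0
print(hex(last_good + LAST_FIELD_OFFSET + FIELD_SIZE))  # 0xf21277b0, first byte after "transparent"
print(hex(0xF22787EC - 0xF22787A0))                     # 0x4c, same spacing in the earlier run
```

Under these assumptions, the first bytes past BitmapData's last field are the next object's header, followed by its MethodTable pointer, so a write running only 4 to 8 bytes long would yield either a non-empty header (the allocation assert) or an invalid method table (the `verifyheap` error). That is consistent with, but does not by itself prove, a native overrun.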

@AndyAyersMS
Member

I can get this extracted repro to crash intermittently when run with GCStress=3 HeapVerify=1.

Compiling   97 X::HashLock, IL size = 291, hash=0xd94fe6d8 Tier-0 switched to FullOpts
Compiling   97 Microsoft.Win32.SafeHandles.SafeEvpMdCtxHandle::ReleaseHandle, IL size = 13, hash=0x59b7c068 Tier-0
Compiling   99 System.Drawing.Bitmap::LockBits, IL size = 39, hash=0x030f5ef1 Tier-0
Compiling  100 ILStubClass::IL_STUB_PInvoke, IL size = 109, hash=0xc1f5e930 FullOpts
./repro-ex.sh: line 22:  1563 Aborted                 (core dumped) $CORERUN /home/andy/bugs/r37082/ex.exe

So far this does not repro when run under a debugger.

My guess is that libgdiplus's LockBits implementation introduces the memory corruption, but I haven't confirmed this.
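Because the crash is intermittent, it can help to loop the repro under the same stress settings. A sketch, assuming the `COMPlus_` prefix the runtime used for config knobs at the time (so `GCStress=3 HeapVerify=1` become `COMPlus_GCStress=3` and `COMPlus_HeapVerify=1`) and a `CORERUN` path like the one `repro-ex.sh` uses; `gcstress_env` and `run_repro` are hypothetical helpers, not part of the runtime's test tooling.

```python
import os
import subprocess


def gcstress_env(base=None):
    """Return a copy of the environment with the stress settings from this thread."""
    env = dict(os.environ if base is None else base)
    env["COMPlus_GCStress"] = "3"    # 0x1 (GC on allocations) | 0x2 (GC on transitions)
    env["COMPlus_HeapVerify"] = "1"  # verify the GC heap around each collection
    return env


def run_repro(corerun, app, runs=20):
    """Run the repro several times and return the indices of runs that failed."""
    env = gcstress_env()
    failures = []
    for i in range(runs):
        result = subprocess.run([corerun, app], env=env)
        if result.returncode != 0:
            failures.append(i)
    return failures
```

For example, `run_repro(os.environ["CORERUN"], path_to_repro)`, where `path_to_repro` points at the extracted repro; the default run count is arbitrary.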

@AndyAyersMS
Copy link
Member

The LLDB I'm using can't backtrace from a crash dump, but gdb can...

(gdb) bt
#0  __libc_do_syscall () at ../sysdeps/unix/sysv/linux/arm/libc-do-syscall.S:46
#1  0xf7e1ccaa in __waitpid (pid=1614, stat_loc=0xff8de550, options=0) at ../sysdeps/unix/sysv/linux/waitpid.c:30
#2  0xf79530bc in PROCCreateCrashDump (argv=<optimized out>) at /home/andy/repos/runtime/src/coreclr/src/pal/src/thread/process.cpp:3336
#3  0xf7951218 in PROCAbort () at /home/andy/repos/runtime/src/coreclr/src/pal/src/thread/process.cpp:3471
#4  0xf78fdee2 in invoke_previous_action (action=<optimized out>, code=<optimized out>, siginfo=<optimized out>, context=<optimized out>, 
    signalRestarts=<optimized out>) at /home/andy/repos/runtime/src/coreclr/src/pal/src/exception/signal.cpp:334
#5  0xf78fd400 in sigtrap_handler (code=5, siginfo=0xff8de5d0, context=0xff8de650)
    at /home/andy/repos/runtime/src/coreclr/src/pal/src/exception/signal.cpp:596
#6  <signal handler called>
#7  0xf795a2e6 in DBG_DebugBreak () from /mnt/laptop/repos/runtime/artifacts/tests/coreclr/Linux.arm.Checked/Tests/Core_Root/libcoreclr.so
#8  0xf78fb7e6 in DebugBreak () at /home/andy/repos/runtime/src/coreclr/src/pal/src/debug/debug.cpp:410
#9  0xf78e7d7c in WKS::FATAL_GC_ERROR () at /home/andy/repos/runtime/src/coreclr/src/gc/gcpriv.h:28
#10 WKS::gc_heap::verify_free_lists () at /home/andy/repos/runtime/src/coreclr/src/gc/gc.cpp:34819
#11 0xf78d36b0 in WKS::gc_heap::verify_heap (begin_gc_p=<optimized out>) at /home/andy/repos/runtime/src/coreclr/src/gc/gc.cpp:35245
#12 0xf78d52d0 in WKS::gc_heap::garbage_collect (n=2) at /home/andy/repos/runtime/src/coreclr/src/gc/gc.cpp:17824
#13 0xf78c8d34 in WKS::GCHeap::GarbageCollectGeneration (this=<optimized out>, gen=2, reason=<optimized out>)
    at /home/andy/repos/runtime/src/coreclr/src/gc/gc.cpp:37429
#14 0xf78e9a90 in WKS::GCHeap::GarbageCollectTry (this=<optimized out>, generation=<optimized out>, low_memory_p=<optimized out>, 
    mode=<optimized out>) at /home/andy/repos/runtime/src/coreclr/src/gc/gc.cpp:36682
#15 WKS::GCHeap::GarbageCollect (this=<optimized out>, generation=<optimized out>, low_memory_p=<optimized out>, mode=<optimized out>)
    at /home/andy/repos/runtime/src/coreclr/src/gc/gc.cpp:36616
#16 0xf78e8d9a in WKS::GCHeap::StressHeap (this=0x253d638, context=<optimized out>) at /home/andy/repos/runtime/src/coreclr/src/gc/gc.cpp:36270
#17 0xf77abd66 in _GCStress::StressGcTriggerPolicy::Trigger (acontext=<optimized out>)
    at /home/andy/repos/runtime/src/coreclr/src/vm/gcstress.h:297
#18 _GCStress::GCSBase<(gcs_trigger_points)1, _GCStress::IgnoreFastGcSPolicy, _GCStress::AnyGcModePolicy, _GCStress::StressGcTriggerPolicy>::MaybeTrigger (acontext=0x2581708, minFastGc=0) at /home/andy/repos/runtime/src/coreclr/src/vm/gcstress.h:415
#19 _GCStress::GCStress<(gcs_trigger_points)10, mpl::null_type, mpl::null_type, mpl::null_type>::MaybeTrigger (acontext=0x2581708)
    at /home/andy/repos/runtime/src/coreclr/src/vm/gcstress.h:464
#20 Alloc (size=13, flags=GC_ALLOC_NO_FLAGS) at /home/andy/repos/runtime/src/coreclr/src/vm/gchelpers.cpp:227
#21 0xf77aa964 in AllocateSzArray (pArrayMT=0xf14d60e0, cElements=1, flags=GC_ALLOC_NO_FLAGS)
    at /home/andy/repos/runtime/src/coreclr/src/vm/gchelpers.cpp:483
#22 0xf77c82f8 in JIT_NewArr1 (arrayMT=0xf14d60e0, size=1) at /home/andy/repos/runtime/src/coreclr/src/vm/jithelpers.cpp:2723
#23 0xf4e611a4 in ?? ()

Via lldb we can see that the last method on the stack is X.HashLock. Disassembling it shows that this is the first call site after the call to LockBits.

@AndyAyersMS
Member

@jeffhandley @safern I'm pretty sure the issue here lies in libgdiplus for Linux arm32; can you re-label and reassign?

@ghost

ghost commented Aug 14, 2020

Tagging subscribers to this area: @safern, @tannergooding
See info in area-owners.md if you want to be subscribed.

@danmoseley danmoseley removed this from the 5.0.0 milestone Aug 14, 2020
@safern safern added this to the Future milestone Aug 14, 2020
@safern
Member Author

safern commented Aug 14, 2020

Thanks @AndyAyersMS for investigating.

cc: @marek-safar @akoeplinger

@teo-tsirpanis
Contributor

Since #64084 got merged, System.Drawing.Common does not work anymore on Unix. This issue should be closed.

@safern safern closed this as completed Feb 1, 2022
@ghost ghost locked as resolved and limited conversation to collaborators Mar 3, 2022

7 participants