Double free or corruption on Raspbian #10523

Closed
anmenaga opened this issue Jun 15, 2018 · 72 comments
Labels
arch-arm32 os-linux Linux OS (any supported distro)

Comments

@anmenaga

After the move from .NETCore 2.0 to 2.1 this started happening very frequently.
PowerShell running on Raspberry Pi 3 Model B ("Raspbian GNU/Linux 9 (stretch)")
crashes with:
*** Error in "pwsh": double free or corruption (fasttop): 0x6e800fe0 ***

stack from gdb:

Thread 6076 "pwsh" received signal SIGABRT, Aborted.
[Switching to Thread 0x63647450 (LWP 32303)]
__GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
51	../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
(gdb) bt
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
#1  0x76bc2824 in __GI_abort () at abort.c:89
#2  0x76bfbf78 in __libc_message (do_abort=do_abort@entry=2, fmt=<optimized out>) at ../sysdeps/posix/libc_fatal.c:175
#3  0x76c02ad4 in malloc_printerr (action=<optimized out>, str=0x76cb5120 "double free or corruption (fasttop)", ptr=<optimized out>, ar_ptr=<optimized out>) at malloc.c:5049
#4  0x76c03514 in _int_free (av=0x6e800010, p=0x6e800fd8, have_lock=<optimized out>) at malloc.c:3905
#5  0x768381e0 in HeapFree () from /home/pi/PSP3_2/libcoreclr.so
#6  0x765fd0aa in EEHeapFreeInProcessHeap(unsigned int, void*) () from /home/pi/PSP3_2/libcoreclr.so
#7  0x76532c70 in operator delete(void*) () from /home/pi/PSP3_2/libcoreclr.so
#8  0x765adff0 in Thread::intermediateThreadProc(void*) () from /home/pi/PSP3_2/libcoreclr.so
#9  0x76851966 in CorUnix::CPalThread::ThreadEntry(void*) () from /home/pi/PSP3_2/libcoreclr.so
#10 0x76ecefc4 in start_thread (arg=0x63647450) at pthread_create.c:335
#11 0x76c66c68 in ?? () at ../sysdeps/unix/sysv/linux/arm/clone.S:76 from /lib/arm-linux-gnueabihf/libc.so.6

Can share the core file with above stack.

@SteveL-MSFT
Contributor

SteveL-MSFT commented Jun 15, 2018

This is blocking the PSCore6.1 release and is a regression from dotnetcore 2.0 (and thus PSCore6.0).

@RussKeldorph
Contributor

@janvorli

@janvorli
Member

The code that allocates and frees the affected pointer hasn't changed since .NET Core 1.0, so this is more likely memory corruption than a double free. @anmenaga can you please share the core file with me?
Also, could you provide me with repro steps? Does it crash just when launching PowerShell, is it related to executing a specific command, or can arbitrary commands eventually lead to this?

@anmenaga
Author

@janvorli In our case this is happening on a Pi where PowerShell is working with special hardware. It is a continuously running PS instance where different scripts are launched 24x7.
Previously the system ran successfully for a month uninterrupted, but since we switched to the latest PS version (the first one to use .NET Core 2.1) we get 6 hours of runtime at most before this crash happens.
So reproducing it yourself might be tricky, but I can give you access to the Pi for a live gdb debugging session; let me know.

@janvorli
Member

@anmenaga I have an additional question. When it crashes, is the call stack always the same / similar to the one you've shown or is it completely random?

@anmenaga
Author

@janvorli this is the first time we attached a debugger before the crash. The machine is currently holding a live debug session at the point of the crash.
We can release it and check the call stack the next time it happens.

@anmenaga
Author

Here is another one:

Thread 2272 "pwsh" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x676d1450 (LWP 9094)]
0x75b00048 in ?? ()
(gdb) backtrace
#0  0x75b00048 in ?? ()
#1  0x7685c966 in CorUnix::CPalThread::ThreadEntry(void*) () from /home/pi/PSP3_2/libcoreclr.so
#2  0x76ed9fc4 in start_thread (arg=0x676d1450) at pthread_create.c:335
#3  0x76c71c68 in ?? () at ../sysdeps/unix/sysv/linux/arm/clone.S:76 from /lib/arm-linux-gnueabihf/libc.so.6
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
(gdb) 

@JustArchi
Contributor

JustArchi commented Jun 23, 2018

I've got similar reports from users running my netcoreapp2.1 project on Raspbian, so it looks like a global issue, not just something pwsh-related. They also claim that it didn't happen before on netcoreapp2.0. I didn't get reports from other platforms and setups, so it looks ARM- or Raspbian-related.

Sadly I don't have any stack traces, only generic reports. I just wanted to confirm this one with an entirely different project, thanks.

*** Error in `./ArchiSteamFarm': double free or corruption (fasttop): 0x019e2798 ***

@kpreisser

kpreisser commented Jun 28, 2018

Hi,
we have the same problem with a .NET Core console application running on Raspbian 9 that runs a web server using Kestrel. After targeting netcoreapp2.1 and publishing for linux-arm with .NET Core SDK 2.1.301, the application crashes after some time (e.g. an hour) with the following output:

*** Error in `./CoDaBix-Shell': double free or corruption (fasttop): 0x6bf01570 ***
Aborted

However, when running with .NET Core 2.0, this does not appear. This issue is currently blocking us from moving to .NET Core 2.1.

Unfortunately, at the moment I don't know how to get further diagnostic information about this crash. Is there something I can do to help diagnose this?

Thanks!

@jkotas
Member

jkotas commented Jun 28, 2018

@kpreisser Enable core dumps by running ulimit -c unlimited. This will cause a core dump to be generated the next time you hit this. Open the core dump in a debugger (lldb or gdb), find the stack trace of the crash and share it here if possible. It would be useful to do this several times to see whether there is any pattern. We will then decide on the next steps. Thanks for your help in finding the root cause of this crash!
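
For a native host process that cannot easily be launched from a shell with ulimit -c unlimited, a similar effect can be had from inside the process. The following is a minimal sketch, assuming a POSIX system; the function name enable_core_dumps is made up for illustration and is not part of .NET Core. It raises the soft core-file limit to the hard limit, which only equals "unlimited" if the hard limit already allows that.

```cpp
#include <sys/resource.h>

// Hypothetical helper (not a .NET Core API): raise the soft RLIMIT_CORE
// limit of the current process to its hard limit, so that a crash can
// produce a core file. Returns true on success.
bool enable_core_dumps() {
    struct rlimit lim{};
    if (getrlimit(RLIMIT_CORE, &lim) != 0) {
        return false;  // could not read the current limits
    }
    lim.rlim_cur = lim.rlim_max;  // soft limit := hard limit
    return setrlimit(RLIMIT_CORE, &lim) == 0;
}
```

Calling enable_core_dumps() early in main affects that process and any children it starts afterwards. Where the core file ends up still depends on the system's core pattern settings.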

@kpreisser

kpreisser commented Jul 2, 2018

Hi @jkotas,

thank you! I enabled core dumps and ran the application today, where it crashed after a few hours with:

*** Error in `./CoDaBix-Shell': double free or corruption (fasttop): 0x007b6228 ***
Aborted (core dumped)

This is the output from gdb:

[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/arm-linux-gnueabihf/libthread_db.so.1".
Core was generated by `./CoDaBix-Shell --remoteHttp'.
Program terminated with signal SIGABRT, Aborted.
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
51      ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
[Current thread is 1 (Thread 0x6beff450 (LWP 11113))]
(gdb) bt
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
#1  0x76c0e824 in __GI_abort () at abort.c:89
#2  0x76c47f78 in __libc_message (do_abort=do_abort@entry=2, fmt=<optimized out>) at ../sysdeps/posix/libc_fatal.c:175
#3  0x76c4ead4 in malloc_printerr (action=<optimized out>, str=0x76d01120 "double free or corruption (fasttop)", ptr=<optimized out>,
    ar_ptr=<optimized out>) at malloc.c:5049
#4  0x76c4f514 in _int_free (av=0x76d1d794 <main_arena>, p=0x7b6220, have_lock=<optimized out>) at malloc.c:3905
#5  0x76884244 in HeapFree () from /home/pi/codabix/libcoreclr.so
#6  0x766490aa in EEHeapFreeInProcessHeap(unsigned int, void*) () from /home/pi/codabix/libcoreclr.so
#7  0x7657ec70 in operator delete(void*) () from /home/pi/codabix/libcoreclr.so
#8  0x765f9ff0 in Thread::intermediateThreadProc(void*) () from /home/pi/codabix/libcoreclr.so
#9  0x7689d9ca in CorUnix::CPalThread::ThreadEntry(void*) () from /home/pi/codabix/libcoreclr.so
#10 0x76f1afc4 in start_thread (arg=0x6beff450) at pthread_create.c:335
#11 0x76cb2c68 in ?? () at ../sysdeps/unix/sysv/linux/arm/clone.S:76 from /lib/arm-linux-gnueabihf/libc.so.6
Backtrace stopped: previous frame identical to this frame (corrupt stack?)

I do not yet have more stack traces, but I will continue to run the application and then post the stacks if they are different.

Thanks!

@kpreisser

Yesterday the application crashed with a different error:

Segmentation fault (core dumped)

gdb:

[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/arm-linux-gnueabihf/libthread_db.so.1".
Core was generated by `./CoDaBix-Shell --remoteHttp'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x6c006d78 in ?? ()
[Current thread is 1 (Thread 0x686c6450 (LWP 27243))]
(gdb) bt
#0  0x6c006d78 in ?? ()
#1  0x768569ca in CorUnix::CPalThread::ThreadEntry(void*) () from /home/pi/codabix/libcoreclr.so
#2  0x76ed3fc4 in start_thread (arg=0x686c6450) at pthread_create.c:335
#3  0x76c6bc68 in ?? () at ../sysdeps/unix/sysv/linux/arm/clone.S:76 from /lib/arm-linux-gnueabihf/libc.so.6
Backtrace stopped: previous frame identical to this frame (corrupt stack?)

@js8749

js8749 commented Jul 10, 2018

I have the same problem. I had it in 2.1.300 and updated to 2.1.301 to see if it would stop, but the problem is still there.

@RhavoX

RhavoX commented Jul 11, 2018

In my case this issue was not resolved in 2.1.301

@p3root

p3root commented Jul 16, 2018

Same here.

I have 2 stack traces, and they are the same.

__GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
51 ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
(gdb) bt
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
#1  0x76c73824 in __GI_abort () at abort.c:89
#2  0x76cacf78 in __libc_message (do_abort=do_abort@entry=2, fmt=<optimized out>) at ../sysdeps/posix/libc_fatal.c:175
#3  0x76cb3ad4 in malloc_printerr (action=<optimized out>, str=0x76d66070 "double free or corruption (fasttop)", ptr=<optimized out>, ar_ptr=<optimized out>) at malloc.c:5049
#4  0x76cb4514 in _int_free (av=0x72000010, p=0x7206e298, have_lock=<optimized out>) at malloc.c:3905
#5  0x768e7244 in HeapFree () from /opt/####/libcoreclr.so
#6  0x766ac0aa in EEHeapFreeInProcessHeap(unsigned int, void*) () from /opt/#####/libcoreclr.so
#7  0x765e1c70 in operator delete(void*) () from /opt/####/libcoreclr.so
#8  0x7665cff0 in Thread::intermediateThreadProc(void*) () from /opt/###/libcoreclr.so
#9  0x769009ca in CorUnix::CPalThread::ThreadEntry(void*) () from /opt/#####/libcoreclr.so
#10 0x76f81fc4 in start_thread (arg=0x62854450) at pthread_create.c:335
#11 0x76d17bc8 in ?? () at ../sysdeps/unix/sysv/linux/arm/clone.S:76 from /lib/arm-linux-gnueabihf/libc.so.6
Backtrace stopped: previous frame identical to this frame (corrupt stack?)

Here is my .NET Core version:

`Host (useful for support):
Version: 2.1.2
Commit: 811c3ce6c0

.NET Core SDKs installed:
No SDKs were found.

.NET Core runtimes installed:
Microsoft.AspNetCore.All 2.1.2 [/opt/dotnet/shared/Microsoft.AspNetCore.All]
Microsoft.AspNetCore.App 2.1.2 [/opt/dotnet/shared/Microsoft.AspNetCore.App]
Microsoft.NETCore.App 2.1.2 [/opt/dotnet/shared/Microsoft.NETCore.App]

To install additional .NET Core runtimes or SDKs:
https://aka.ms/dotnet-download
`

Built on a Windows PC with the following version info:

`.NET Core SDK (reflecting any global.json):
Version: 2.1.301
Commit: 59524873d6

Runtime Environment:
OS Name: Windows
OS Version: 10.0.16299
OS Platform: Windows
RID: win10-x64
Base Path: C:\Program Files\dotnet\sdk\2.1.301\

Host (useful for support):
Version: 2.1.1
Commit: 6985b9f684

.NET Core SDKs installed:
2.1.2 [C:\Program Files\dotnet\sdk]
2.1.4 [C:\Program Files\dotnet\sdk]
2.1.301 [C:\Program Files\dotnet\sdk]

.NET Core runtimes installed:
Microsoft.AspNetCore.All 2.1.1 [C:\Program Files\dotnet\shared\Microsoft.AspNetCore.All]
Microsoft.AspNetCore.App 2.1.1 [C:\Program Files\dotnet\shared\Microsoft.AspNetCore.App]
Microsoft.NETCore.App 2.0.3 [C:\Program Files\dotnet\shared\Microsoft.NETCore.App]
Microsoft.NETCore.App 2.1.1 [C:\Program Files\dotnet\shared\Microsoft.NETCore.App]
`

@kouvel
Member

kouvel commented Jul 18, 2018

This could be some kind of corruption or a bad free somewhere. The object being deleted would have been allocated not long before the point of failure. Could someone please share the stacks of all threads when this occurs (the double-free issue)? If it's a bad free that's happening elsewhere the stacks of other threads may help to narrow down where the bad free might be. If someone could share a core dump of the double-free issue that may be useful as well.

@kouvel
Member

kouvel commented Jul 18, 2018

It may also be useful to see what values were extracted from the memory before the delete, and the pointer value itself, to see if there is any obvious corruption, though it may not be reliable info.

@p3root

p3root commented Jul 18, 2018

Here is a backtrace of all threads. I could give you access to my raspberry if needed as well.

Backtrace

@kpreisser

Hi,
For the "double free or corruption" issue:
Backtrace

For the "Segmentation fault" issue:
Backtrace

The core files have about 300 MB; I think I cannot share them publicly as they might contain private code, but maybe I can share them privately.

Additionally to these two crashes, we have also discovered that sometimes the application does not crash but is stuck at 100% CPU usage (which doesn't happen with .NET Core 2.0).

Thank you!

@js8749

js8749 commented Jul 19, 2018 via email

@rquackenbush

We're running into the same scenario as this:

Additionally to these two crashes, we have also discovered that sometimes the application does not crash but is stuck at 100% CPU usage (which doesn't happen with .NET Core 2.0).

Working on getting dumps.

@jesbrd

jesbrd commented Jul 20, 2018

Adding to @rquackenbush's comment: here is what happens in our situation after about 4 hours of runtime:

*** Error in 'Application': double free or corruption (fasttop): 0x73630b18 ***

This particular error was on Raspbian as well, but it occurs on all ARM devices where our applications are executing.

@js8749

js8749 commented Jul 20, 2018 via email

@rquackenbush

This repo contains a gcore dump of the running process. I also included the pdb files from the running code:
https://github.com/Boondocks/boondocks-core-dumps

This is the code that is running:
https://github.com/Boondocks/Boondocks/tree/develop/src/Agent

@kouvel
Member

kouvel commented Jul 21, 2018

In the seg fault stack trace it looks like a different recently heap-allocated object's data is corrupted. The point of failure is close to the one in the double-free issue, so they may be caused by the same underlying issue. I didn't gather much from the other threads' stacks; they look typical. It looks like the point of failure would be too late to show the cause, so I'll look into getting a repro for the PS issue meanwhile. If you have some repro steps that I could use with any of the apps above (even if it takes a while before the crash, just something that works), that would be helpful.

@rquackenbush

@kouvel - thanks for looking at that. Were you able to identify what code the problematic thread was running (or trying to run)? Any help there would help us in creating a repro.

@kouvel
Member

kouvel commented Jul 23, 2018

@rquackenbush I wasn't able to identify a suspicious thread from the native stacks that were posted. I haven't yet figured out how to look at the core dump you shared; it seems like more things would be needed (executable image, dependency modules, maybe more). If you're looking for managed stacks in order to determine what code is running in each thread, it may be easier for you to open the core dump with lldb-5.0 on the same machine, run "plugin load libsosplugin.so" (it should be alongside the loaded libcoreclr.so), and then run "clrstack" for each thread as described here.

@rquackenbush

@kouvel - unfortunately it looks like lldb-5.0 isn't available on arm quite yet. I've tried running lldb-3.9 per the instructions, but the sosplugin doesn't appear to be compatible per this issue:

https://github.com/dotnet/coreclr/issues/18889

I'm attempting to build lldb locally on the pi, but that is taking an eternity. It's also not clear how I'll be able to compile a matching libsosplugin.so given the instructions for building coreclr use a docker container on an x86 machine:

https://github.com/dotnet/coreclr/blob/master/Documentation/building/linux-instructions.md#build-for-armlinux.

@kouvel
Member

kouvel commented Jul 23, 2018

Oh it looks like the cross-build script for arm installs lldb 3.6 dev package by default:
https://github.com/dotnet/coreclr/blob/92d2c4bde42569d2aa22e44550d69f7d743bf9a0/cross/build-rootfs.sh#L19

And it looks like the sos plugin build would use the latest lldb headers that are installed, so I'm not sure which version it would be built against, probably 3.6. I'm also not sure if it actually works: based on other issues linked to the issue above, the sos plugin appears to have problems on arm. I'll have to try this out myself.

@kouvel
Member

kouvel commented Aug 8, 2018

I got a repro now on 2.1.2 with the same stack trace, it just took longer.

@RoySalisbury

It's pretty elusive. I think once we find the root cause, it will be easy to get a test case that reproduces faster. My original thought was that it was a ThreadPool cleanup issue, where a thread was somehow getting cleaned up twice. But I have nothing to base that on other than 5 lines of a stack trace. :)

But, at least you have something now.

@js8749

js8749 commented Aug 8, 2018 via email

@RoySalisbury

@js8749 I'm sure multiple test cases can always help. You should definitely upload something that they can try.

@RoySalisbury

RoySalisbury commented Aug 13, 2018 via email

@RoySalisbury

Hmmm.. Not showing up here for some reason, but just saw a reply in my email from @kouvel that stated he may have found a possible culprit.

It looks like the PAL process thread list is corrupted. Two threads were started with the same PAL thread object, T1 creator stores the new thread args pointer, T2 creator overwrites it with its own new thread args pointer, T1 starts and deletes it, then T2 starts and tries to delete it again. Reviewing if there are any issues that would cause the list to get corrupted.
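
The sequence described above can be replayed in a small, single-threaded C++ sketch. All names here (StartArgs, PalThread, FreeTracker, simulate_shared_thread_object) are hypothetical stand-ins and not the real PAL types. The "delete" is only recorded rather than executed, so the double free can be observed without corrupting the heap.

```cpp
#include <set>

// Hypothetical stand-ins for illustration; not the real PAL types.
struct StartArgs { int creator; };
struct PalThread { StartArgs* startArgs = nullptr; };

// Records "deletes" instead of calling delete, so the bug is observable safely.
struct FreeTracker {
    std::set<const StartArgs*> freed;
    int doubleFrees = 0;
    void release(const StartArgs* p) {
        // insert() reports false in .second if p was already "freed"
        if (!freed.insert(p).second) ++doubleFrees;
    }
};

// Replays the interleaving from the comment above, step by step.
int simulate_shared_thread_object() {
    FreeTracker tracker;
    PalThread shared;        // one thread object wrongly handed to two creators
    StartArgs argsT1{1};
    StartArgs argsT2{2};

    shared.startArgs = &argsT1;          // T1 creator stores its args pointer
    shared.startArgs = &argsT2;          // T2 creator overwrites it
    tracker.release(shared.startArgs);   // T1 starts and deletes what it finds
    tracker.release(shared.startArgs);   // T2 starts and deletes the same pointer
    return tracker.doubleFrees;
}
```

In this replay the function reports one double free, and argsT1 is never released at all, which matches the kind of failure glibc reports as "double free or corruption".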

@kouvel
Member

kouvel commented Aug 13, 2018

Yes that's most likely it, I deleted my comments, will update once I test a fix

@anmenaga
Author

My repro scenario involves very frequent operations with Thread creation/deletion;
so it feels like @kouvel is on the right track.

@kouvel
Member

kouvel commented Aug 21, 2018

I've tested a fix that seems to work, working on getting a fix in

@js8749

js8749 commented Aug 21, 2018 via email

@RoySalisbury

@kouvel That's great news.

How does that typically work on things like this? If approved it just goes into the next release (e.g, 2.1.3), or does it take much longer (e.g., 2.2)?

@kouvel
Member

kouvel commented Aug 21, 2018

@js8749 the timing in my runs was also unpredictable; sometimes it took days and sometimes hours.

@RoySalisbury, it would go into 2.1, 2.2, and master for 3.0. Which version of 2.1 is not clear at the moment; hopefully the next one.

kouvel referenced this issue in kouvel/coreclr Aug 22, 2018
Fixes https://github.com/dotnet/coreclr/issues/18486
- Lock release needs to be at least volatile
kouvel referenced this issue in kouvel/coreclr Aug 22, 2018
Fix for https://github.com/dotnet/coreclr/issues/18486
- Lock release needs to be at least volatile
kouvel referenced this issue in dotnet/coreclr Aug 22, 2018
Fix for https://github.com/dotnet/coreclr/issues/18486
- Lock release needs to be at least volatile
mikem8361 referenced this issue in dotnet/diagnostics Aug 23, 2018
Fix for https://github.com/dotnet/coreclr/issues/18486
- Lock release needs to be at least volatile
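
The commit note "Lock release needs to be at least volatile" is about how a spin lock is released. As a rough illustration only, and not the actual coreclr change, the following sketch shows a spin lock in standard C++ where the release is an explicit atomic store with release ordering. The class name SpinLock is made up for this example. On a weakly ordered CPU such as ARM, a plain non-atomic store in unlock would give no guarantee that writes made inside the critical section are visible before another thread sees the lock as free.

```cpp
#include <atomic>

// Illustrative spin lock; an assumption-based sketch, not the PAL code.
class SpinLock {
public:
    // Returns true if the lock was free and is now held by the caller.
    bool try_lock() {
        // acquire: accesses in the critical section cannot move above this
        return !locked_.exchange(true, std::memory_order_acquire);
    }

    void lock() {
        while (!try_lock()) {
            // busy-wait until the current holder releases the lock
        }
    }

    void unlock() {
        // release: writes made while holding the lock are published
        // before other threads can observe the lock as free
        locked_.store(false, std::memory_order_release);
    }

private:
    std::atomic<bool> locked_{false};
};
```

The class provides lock, try_lock and unlock, so it can be used with std::lock_guard<SpinLock> to protect a shared structure such as a thread list.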

coreclr master PR: dotnet/coreclr#19604
@dleeapho

/cc @MichaelSimons

kouvel referenced this issue in dotnet/coreclr Aug 28, 2018
@kouvel
Member

kouvel commented Aug 28, 2018

The fix is currently targeting a September release for 2.1. Closing based on dotnet/coreclr#19606.

@kouvel kouvel closed this as completed Aug 28, 2018
kouvel referenced this issue in dotnet-maestro-bot/coreclr Aug 28, 2018
@JustArchi
Contributor

Could we get a small notification in this issue when a runtime that includes this fix gets released? Thank you in advance 👍

@JustArchi
Contributor

JustArchi commented Sep 13, 2018

@kouvel Can you confirm that the 2.1.4 runtime includes the dotnet/coreclr#19606 fix? Thank you.

@RoySalisbury

@kouvel
Member

kouvel commented Sep 13, 2018

Based on the commit hash of coreclr.dll, it looks like the fix did not make it into 2.1.4, though at the time it was expected to (the latest commit included was from Aug 13). It should be included in 2.1.5, which is scheduled for October. Apologies for the confusion.

@JustArchi
Contributor

Thank you for further explanation. Let's hope it's going to make it in 2.1.5 then 🙂

@RoySalisbury

Looks like this DID make it into 2.1.5

2018-08-28 - [9663131aec] Fix a PAL spin lock issue (#19606)

@msftgits msftgits transferred this issue from dotnet/coreclr Jan 31, 2020
@ghost ghost locked as resolved and limited conversation to collaborators Dec 16, 2020