Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Segmentation fault on Apple Silicon / arm64 #343

Open
xhochy opened this issue Dec 15, 2020 · 26 comments
Open

Segmentation fault on Apple Silicon / arm64 #343

xhochy opened this issue Dec 15, 2020 · 26 comments

Comments

@xhochy
Copy link
Contributor

xhochy commented Dec 15, 2020

Building mimalloc fails in the TLS initialization with a segmentation fault:

% lldb ./mimalloc-test-api
(lldb) target create "./mimalloc-test-api"
Current executable set to '/Users/uwe/Development/mimalloc/build/mimalloc-test-api' (arm64).
(lldb) run
Process 17703 launched: '/Users/uwe/Development/mimalloc/build/mimalloc-test-api' (arm64)
mimalloc-test-api was compiled with optimization - stepping may behave oddly; variables may not be available.
Process 17703 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x0)
    frame #0: 0x000000010001f264 mimalloc-test-api`_mi_process_init [inlined] mi_tls_slot(slot=0) at mimalloc-internal.h:711:9 [opt]
   708 	#elif defined(__aarch64__)
   709 	  void** tcb; UNUSED(ofs);
   710 	  __asm__ volatile ("mrs %0, tpidr_el0" : "=r" (tcb));
-> 711 	  res = tcb[slot];
   712 	#endif
   713 	  return res;
   714 	}
Target 0: (mimalloc-test-api) stopped.
@devnexen
Copy link
Contributor

devnexen commented Dec 15, 2020

Out of curiosity, would the dev branch still fails ?

@xhochy
Copy link
Contributor Author

xhochy commented Dec 15, 2020

Fails with exactly the same issue:

Process 18154 launched: '/Users/uwe/Development/mimalloc/build/mimalloc-test-api' (arm64)
mimalloc-test-api was compiled with optimization - stepping may behave oddly; variables may not be available.
Process 18154 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x0)
    frame #0: 0x000000010001efa0 mimalloc-test-api`_mi_process_init [inlined] mi_tls_slot(slot=0) at mimalloc-internal.h:712:9 [opt]
   709 	#elif defined(__aarch64__)
   710 	  void** tcb; UNUSED(ofs);
   711 	  __asm__ volatile ("mrs %0, tpidr_el0" : "=r" (tcb));
-> 712 	  res = tcb[slot];
   713 	#endif
   714 	  return res;
   715 	}
Target 0: (mimalloc-test-api) stopped.
(lldb) p slot
(size_t) $0 = 0
(lldb) p tcb
(void **) $1 = 0x0000000000000000

@devnexen
Copy link
Contributor

weird because the assembly is supposed to be different.

@xhochy
Copy link
Contributor Author

xhochy commented Dec 15, 2020

weird because the assembly is supposed to be different.

As far as I see in the diff to master, some of the x86 assembly has changed but the aarch64 assembly stayed the same. As the x86_64 code has special casings for __MACH__, I guess we need that for __MACH__ && __aarch64__, too.

@devnexen
Copy link
Contributor

Right ... what happens if you use this change :

uintptr_t tcb; UNUSED(ofs);UNUSED(slot); __asm__ volatile ("mrs %0, tpidr_el0" : "=r" (tcb)); res = tcb;

Just theorical, I wish I had a M1 to test for real :-)

@xhochy
Copy link
Contributor Author

xhochy commented Dec 15, 2020

Seems to pass and now faults in mi_tls_slot_set where we have the near exact same code path.

@devnexen
Copy link
Contributor

Yes it would be similar-ish solution if it works like :
uintptr_t tcb; UNUSED(ofs);UNUSED(slot) __asm__ volatile ("mrs %0, tpidr_el0" : "=r" (tcb)); tcb = (uintptr_t)value;

Thanks for your patience :-)

@xhochy
Copy link
Contributor Author

xhochy commented Dec 15, 2020

With MI_MALLOC_OVERRIDE=OFF, ./mimalloc-test-api now passes.

With MI_MALLOC_OVERRIDE=ON, we now get to:

(lldb) run
Process 27086 launched: '/Users/uwe/Development/mimalloc/build/mimalloc-test-api' (arm64)
Process 27086 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0xb0)
    frame #0: 0x0000000180cbc6cc libsystem_platform.dylib`_platform_memset + 108
libsystem_platform.dylib`_platform_memset:
->  0x180cbc6cc <+108>: stp    q0, q0, [x0]
    0x180cbc6d0 <+112>: stp    q0, q0, [x0, #0x20]
    0x180cbc6d4 <+116>: add    x3, x0, #0x40             ; =0x40
    0x180cbc6d8 <+120>: and    x3, x3, #0xffffffffffffffc0
Target 0: (mimalloc-test-api) stopped.
(lldb) bt
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0xb0)
  * frame #0: 0x0000000180cbc6cc libsystem_platform.dylib`_platform_memset + 108
    frame #1: 0x0000000100008038 mimalloc-test-api`mi_stats_reset at stats.c:359:35 [opt]
    frame #2: 0x000000010001ec2c mimalloc-test-api`mi_process_init at init.c:476:3 [opt]
    frame #3: 0x000000010007377c dyld`ImageLoaderMachO::doModInitFunctions(ImageLoader::LinkContext const&) + 468
    frame #4: 0x0000000100073b94 dyld`ImageLoaderMachO::doInitialization(ImageLoader::LinkContext const&) + 56
    frame #5: 0x000000010006d84c dyld`ImageLoader::recursiveInitialization(ImageLoader::LinkContext const&, unsigned int, char const*, ImageLoader::InitializerTimingList&, ImageLoader::UninitedUpwards&) + 620
    frame #6: 0x000000010006b300 dyld`ImageLoader::processInitializers(ImageLoader::LinkContext const&, unsigned int, ImageLoader::InitializerTimingList&, ImageLoader::UninitedUpwards&) + 192
    frame #7: 0x000000010006b3cc dyld`ImageLoader::runInitializers(ImageLoader::LinkContext const&, ImageLoader::InitializerTimingList&) + 96
    frame #8: 0x000000010005684c dyld`dyld::initializeMainExecutable() + 220
    frame #9: 0x000000010005cb98 dyld`dyld::_main(macho_header const*, unsigned long, int, char const**, char const**, char const**, unsigned long*) + 7388
    frame #10: 0x0000000100055258 dyld`dyldbootstrap::start(dyld3::MachOLoaded const*, int, char const**, dyld3::MachOLoaded const*, unsigned long*) + 476
    frame #11: 0x0000000100055038 dyld`_dyld_start + 56

@devnexen
Copy link
Contributor

Too bad the changes work on Linux arm64 maybe the M1 gear needs a specific code path in this case.
Just an idea, what happens if you comment line 298 in include/mimalloc-internal.h

Just wanted to see while mimalloc uses the thread slot maybe that s not such good thing for the M1 case. Maybe I m wrong tough, lot of maybe at the moment :-)

@xhochy
Copy link
Contributor Author

xhochy commented Dec 15, 2020

When I comment line 298, the mimalloc-api-test passes but the mimalloc-test-stress still fails which seems to be related to the commenting of line 298.

(lldb) run
Process 30829 launched: '/Users/uwe/Development/mimalloc/build/mimalloc-test-stress' (arm64)
Process 30829 stopped
* thread #1, stop reason = signal SIGABRT
    frame #0: 0x0000000180c43cec libsystem_kernel.dylib`__pthread_kill + 8
libsystem_kernel.dylib`__pthread_kill:
->  0x180c43cec <+8>:  b.lo   0x180c43d0c               ; <+40>
    0x180c43cf0 <+12>: pacibsp
    0x180c43cf4 <+16>: stp    x29, x30, [sp, #-0x10]!
    0x180c43cf8 <+20>: mov    x29, sp
Target 0: (mimalloc-test-stress) stopped.
(lldb) bt
* thread #1, stop reason = signal SIGABRT
  * frame #0: 0x0000000180c43cec libsystem_kernel.dylib`__pthread_kill + 8
    frame #1: 0x0000000180c74c24 libsystem_pthread.dylib`pthread_kill + 292
    frame #2: 0x0000000180bbc864 libsystem_c.dylib`abort + 104
    frame #3: 0x0000000180c93d58 libdyld.dylib`_tlv_bootstrap + 16
    frame #4: 0x000000010012b208 libmimalloc-debug.1.6.dylib`mi_calloc [inlined] mi_get_default_heap at mimalloc-internal.h:346:10 [opt]
    frame #5: 0x000000010012b1f8 libmimalloc-debug.1.6.dylib`mi_calloc(count=<unavailable>, size=80) at alloc.c:584 [opt]
    frame #6: 0x000000018a2b5a7c libkeymgr.dylib`dwarf2_unwind_dyld_add_image_hook + 40
    frame #7: 0x0000000100018d94 dyld`dyld::registerAddCallback(void (*)(mach_header const*, long)) + 400
    frame #8: 0x0000000180c7b1c8 libdyld.dylib`_dyld_register_func_for_add_image + 132
    frame #9: 0x000000018a2b5798 libkeymgr.dylib`__keymgr_initializer + 32
    frame #10: 0x000000018a2c67d0 libSystem.B.dylib`libSystem_initializer + 196
    frame #11: 0x000000010003390c dyld`ImageLoaderMachO::doModInitFunctions(ImageLoader::LinkContext const&) + 868
    frame #12: 0x0000000100033b94 dyld`ImageLoaderMachO::doInitialization(ImageLoader::LinkContext const&) + 56
    frame #13: 0x000000010002d84c dyld`ImageLoader::recursiveInitialization(ImageLoader::LinkContext const&, unsigned int, char const*, ImageLoader::InitializerTimingList&, ImageLoader::UninitedUpwards&) + 620
    frame #14: 0x000000010002d794 dyld`ImageLoader::recursiveInitialization(ImageLoader::LinkContext const&, unsigned int, char const*, ImageLoader::InitializerTimingList&, ImageLoader::UninitedUpwards&) + 436
    frame #15: 0x000000010002d794 dyld`ImageLoader::recursiveInitialization(ImageLoader::LinkContext const&, unsigned int, char const*, ImageLoader::InitializerTimingList&, ImageLoader::UninitedUpwards&) + 436
    frame #16: 0x000000010002b300 dyld`ImageLoader::processInitializers(ImageLoader::LinkContext const&, unsigned int, ImageLoader::InitializerTimingList&, ImageLoader::UninitedUpwards&) + 192
    frame #17: 0x000000010002b3cc dyld`ImageLoader::runInitializers(ImageLoader::LinkContext const&, ImageLoader::InitializerTimingList&) + 96
    frame #18: 0x000000010001684c dyld`dyld::initializeMainExecutable() + 220
    frame #19: 0x000000010001cb98 dyld`dyld::_main(macho_header const*, unsigned long, int, char const**, char const**, char const**, unsigned long*) + 7388
    frame #20: 0x0000000100015258 dyld`dyldbootstrap::start(dyld3::MachOLoaded const*, int, char const**, dyld3::MachOLoaded const*, unsigned long*) + 476
    frame #21: 0x0000000100015038 dyld`_dyld_start + 56

@devnexen
Copy link
Contributor

Right so maybe M1 needs a specific code path (specific register like x86 case ?). I see on other projects issues popping up about M1 here and there seems to be really a particular animal.

@daanx
Copy link
Collaborator

daanx commented Dec 15, 2020

Hi @xhochy , lucky you with an M1 :-) Thanks @devnexen for looking into this.
Unfortunately, there have been various issues on macOS and it may be related to that and not directly to the M1; I currently have a mac mini to develop on so I will first go through previous issues and then get back to this. The main trouble on macOS is that the loader itself (dyld_ImageLoader) calls malloc to allocate memory for thread local storage -- causing an infinite loop if any thread-local storage is accessed. Currently, we avoid this by using a dedicated thread slot but that is not guaranteed to work on future OS's and may be part of the problem -- maybe you can try another slot number? a few good ones are in the comments in mimalloc-internal.h.

If you get to it, can you submit the assembly changes together with the correct pre-processor test for the M1?

Hope we can get this working!
(I tested mimalloc extensively on aarch64 on Linux, including long TSAN tests and there was recent work with model-checking the atomics so pretty sure that part is working ok)

@daanx
Copy link
Collaborator

daanx commented Dec 16, 2020

@xhochy, I just pushed an update for macOS ; can you try again with the latest dev ? (may not make a difference but then we know at least it is not a bug in the malloc zones overriding)

@xhochy
Copy link
Contributor Author

xhochy commented Dec 16, 2020

@xhochy, I just pushed an update for macOS ; can you try again with the latest dev ? (may not make a difference but then we know at least it is not a bug in the malloc zones overriding)

Latest dev didn't make a difference. Exactly the same behaviour as before.

@xhochy
Copy link
Contributor Author

xhochy commented Dec 16, 2020

Currently, we avoid this by using a dedicated thread slot but that is not guaranteed to work on future OS's and may be part of the problem -- maybe you can try another slot number?

Also didn't make any difference, with or without the suggestions by @devnexen.

In general this looks like the same issue as people are facing on iOS (see #262 and I tried to use the fixes suggested there.

Thus I defined MI_TLS_RECURSE_GUARD in my C(XX)FLAGS and changed the tcb acess to happen when tcb is not-NULL. Also I commented out MI_TLS_SLOT.

This then leads to a slight different error:

(lldb) run
Process 61782 launched: '/Users/uwe/Development/mimalloc/build/mimalloc-test-stress' (arm64)
objc[61782]: realized class 0x1e8e712c8 has corrupt data pointer 0x40000040090
Process 61782 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGABRT
    frame #0: 0x0000000180c64130 libsystem_kernel.dylib`__abort_with_payload + 8
libsystem_kernel.dylib`__abort_with_payload:
->  0x180c64130 <+8>:  b.lo   0x180c64150               ; <+40>
    0x180c64134 <+12>: pacibsp
    0x180c64138 <+16>: stp    x29, x30, [sp, #-0x10]!
    0x180c6413c <+20>: mov    x29, sp
Target 0: (mimalloc-test-stress) stopped.
(lldb) bt
* thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGABRT
  * frame #0: 0x0000000180c64130 libsystem_kernel.dylib`__abort_with_payload + 8
    frame #1: 0x0000000180c66a20 libsystem_kernel.dylib`abort_with_payload_wrapper_internal + 104
    frame #2: 0x0000000180c669b8 libsystem_kernel.dylib`abort_with_reason + 32
    frame #3: 0x0000000180b311dc libobjc.A.dylib`_objc_fatalv(unsigned long long, unsigned long long, char const*, char*) + 120
    frame #4: 0x0000000180b31164 libobjc.A.dylib`_objc_fatal(char const*, ...) + 44
    frame #5: 0x0000000180b20f7c libobjc.A.dylib`realizeClassWithoutSwift(objc_class*, objc_class*) + 112
    frame #6: 0x0000000180b21048 libobjc.A.dylib`realizeClassWithoutSwift(objc_class*, objc_class*) + 316
    frame #7: 0x0000000180b21060 libobjc.A.dylib`realizeClassWithoutSwift(objc_class*, objc_class*) + 340
    frame #8: 0x0000000180b21048 libobjc.A.dylib`realizeClassWithoutSwift(objc_class*, objc_class*) + 316
    frame #9: 0x0000000180b0b564 libobjc.A.dylib`_read_images + 2684
    frame #10: 0x0000000180b0a44c libobjc.A.dylib`map_images_nolock + 2468
    frame #11: 0x0000000180b1b770 libobjc.A.dylib`map_images + 92
    frame #12: 0x000000010001a16c dyld`dyld::notifyBatchPartial(dyld_image_states, bool, char const* (*)(dyld_image_states, unsigned int, dyld_image_info const*), bool, bool) + 1672
    frame #13: 0x000000010001a358 dyld`dyld::registerObjCNotifiers(void (*)(unsigned int, char const* const*, mach_header const* const*), void (*)(char const*, mach_header const*), void (*)(char const*, mach_header const*)) + 80
    frame #14: 0x0000000180c7bd88 libdyld.dylib`_dyld_objc_notify_register + 164
    frame #15: 0x0000000180b09a08 libobjc.A.dylib`_objc_init + 1384
    frame #16: 0x0000000180ac7768 libdispatch.dylib`_os_object_init + 24
    frame #17: 0x0000000180ad632c libdispatch.dylib`libdispatch_init + 476
    frame #18: 0x000000018a2c67e8 libSystem.B.dylib`libSystem_initializer + 220
    frame #19: 0x000000010003390c dyld`ImageLoaderMachO::doModInitFunctions(ImageLoader::LinkContext const&) + 868
    frame #20: 0x0000000100033b94 dyld`ImageLoaderMachO::doInitialization(ImageLoader::LinkContext const&) + 56
    frame #21: 0x000000010002d84c dyld`ImageLoader::recursiveInitialization(ImageLoader::LinkContext const&, unsigned int, char const*, ImageLoader::InitializerTimingList&, ImageLoader::UninitedUpwards&) + 620
    frame #22: 0x000000010002d794 dyld`ImageLoader::recursiveInitialization(ImageLoader::LinkContext const&, unsigned int, char const*, ImageLoader::InitializerTimingList&, ImageLoader::UninitedUpwards&) + 436
    frame #23: 0x000000010002d794 dyld`ImageLoader::recursiveInitialization(ImageLoader::LinkContext const&, unsigned int, char const*, ImageLoader::InitializerTimingList&, ImageLoader::UninitedUpwards&) + 436
    frame #24: 0x000000010002b300 dyld`ImageLoader::processInitializers(ImageLoader::LinkContext const&, unsigned int, ImageLoader::InitializerTimingList&, ImageLoader::UninitedUpwards&) + 192
    frame #25: 0x000000010002b3cc dyld`ImageLoader::runInitializers(ImageLoader::LinkContext const&, ImageLoader::InitializerTimingList&) + 96
    frame #26: 0x000000010001684c dyld`dyld::initializeMainExecutable() + 220
    frame #27: 0x000000010001cb98 dyld`dyld::_main(macho_header const*, unsigned long, int, char const**, char const**, char const**, unsigned long*) + 7388
    frame #28: 0x0000000100015258 dyld`dyldbootstrap::start(dyld3::MachOLoaded const*, int, char const**, dyld3::MachOLoaded const*, unsigned long*) + 476
    frame #29: 0x0000000100015038 dyld`_dyld_start + 56

I guess that the fix for the M1 and iOS will be the exact same as they both use similar execution models and the same chip design ;)

@devnexen
Copy link
Contributor

I just did a run with an iPhone 6 :
image
image

with the actual dev but with these little changes already mentioned
image

@xhochy
Copy link
Contributor Author

xhochy commented Dec 16, 2020

Yes, that also works me for. Only mimalloc-test-stress is failing for me.

@devnexen
Copy link
Contributor

Was wondering if, only in the case of darwin ARM devices, we would use, tpidrr0_el0 TID register instead ? tpidr_el0 always return 0 otherwise.

@daanx
Copy link
Collaborator

daanx commented Dec 17, 2020

Hmm, I wish I had an M1 to test with; It is always tricky on the mac as the loader calls malloc to initialize thread local state but with the right tcb slot it should be ok. Perhaps the tpidrr0_el0 register should be used as @devnexen suggests?
This is tricky to debug but it would be great if you can find the culprit :-) Without an M1 I available it is hard for me to guess what is going wrong :-(

@xhochy
Copy link
Contributor Author

xhochy commented Dec 30, 2020

tpidrro_el0 seems to work fine on macOS, Linux only seems to set tpidr_el0 so I special cased that.

@jrtc27
Copy link

jrtc27 commented Jan 20, 2021

They're completely unrelated registers that are both available. Linux and FreeBSD leave it entirely up to userspace to manage TLS how it sees fit, and so userspace needs a writable register not a read-only one. On macOS/Mach-O TLS used to be more bolted on the side (and came extremely late) unlike the implementation for ELF, and it appears it still is. But aside from that, my guess is that for dubious "security" reasons they decided to have the kernel play a larger part in controlling userspace's TLS and, whilst userspace can read it quickly, writing it needs to go through the kernel (which isn't exactly a hot path, but also there's not much good reason for it, it's just unnecessary overhead, especially since you still need to context switch tpidr_el0 given it's writable, yet if you have userspace use that one you can avoid needing to context switch tpidrro_el0). The existence of both in the Arm architecture is a bit of a wart that should never have happened (and especially not with so similar names), but here we are.

So, both choose which register to use based on their requirements and design. Neither is right or wrong, nor is either a special case, they're just different. Having said that, macOS is the odd one out for the Unixes.

@daanx
Copy link
Collaborator

daanx commented Jan 30, 2021

I believe we figured out a way to make this work -- thanks so much! Really nice to see how this could be patched and merged without me having an M1 available -- the power of the internet :-) Thanks again!

@daanx daanx closed this as completed Jan 30, 2021
@noloader
Copy link

noloader commented Apr 20, 2021

@daanx,

I bought an M1 for testing because I contribute to Open Source projects. You are welcomed to use it. Send over your authorized_keys for SSH access (sorry, public key only). My email is noloader, gmail account.

The GCC Compile Farm also has a M1 for testing. You can get an account at https://cfarm.tetaneutral.net/; see the "Request and Account" link at the top of the page. Through the Compile Farm you get access to a lot of machines including ARM64, POWER8, POWER9 and Sparc. OSes include AIX, Linux, OS X, the BSDs and Solaris.

@daanx
Copy link
Collaborator

daanx commented Jun 18, 2021

Hi @xhochy -- this has taken awhile but I think I fixed the issue in the latest releases (v1.7.2, v2.0.2).
The issue was two-fold; we need indeed the tldrro_el0 register but it seems we need to clear the bottom 3 bits; moreover, we need to not use interpose anymore; i think other fixes made the zones overriding now work reliably and on the M1 interpose causes some sort of problems now (not sure what). With interpose off, everything seems to work :-)

I did some quick testing and mimalloc is often twice as fast (up to 10x in xmalloc-test) than the system allocator:

# --------------------------------------------------
# benchmark allocator elapsed rss user sys page-faults page-reclaims
      
cfrac sys 06.13 2208 6.12 0.00 4 304
cfrac xmi 03.61 2464 3.61 0.00 0 334
cfrac je 04.23 2864 4.22 0.00 0 364
      
espresso sys 05.26 2208 5.25 0.01 12 296
espresso xmi 03.73 4480 3.71 0.01 0 460
espresso je 04.06 3968 4.05 0.01 0 433
      
xmalloc-testN sys 7.854 49760 29.27 10.52 3 5632
xmalloc-testN xmi 0.601 116096 39.53 0.15 0 7452
xmalloc-testN je 0.794 70272 32.77 1.78 0 4596
      
mstressN sys 03.24 1061712 4.39 2.10 0 657980
mstressN xmi 01.65 717056 3.28 0.18 0 68190
mstressN je 04.05 732992 3.37 3.49 0 1327397

@daanx daanx reopened this Jun 18, 2021
@jserv
Copy link
Contributor

jserv commented Jun 18, 2021

I tested the latest dev branch (commit 076f81) with -DMI_OSX_ZONE=ON -DMI_INTERPOSE=OFF, and both Release and Debug build worked well.

Environment: Apple M1 (MacBook Air), macOS v11.4

@whisper-bye
Copy link

any progress on it?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

7 participants