segfault on Libdl.dllist() on 32-bit Linux #24643

Closed
staticfloat opened this issue Nov 17, 2017 · 23 comments
Labels
system:linux Affects only Linux system:32-bit Affects only 32-bit systems

Comments

@staticfloat
Member

Latest master segfaults if you run Libdl.dllist():

$ gdb --args ./julia -e 'Libdl.dllist()'
...
(gdb) r
...

Program received signal SIGSEGV, Segmentation fault.
0xe14048a6 in jlcapi_dl_phdr_info_callback_65440 ()
Missing separate debuginfos, use: debuginfo-install glibc-2.12-1.209.el6_9.2.i686 zlib-1.2.3-29.el6.i686
(gdb) bt
#0  0xe14048a6 in jlcapi_dl_phdr_info_callback_65440 ()
#1  0xf7c5a56d in dl_iterate_phdr () from /lib/libc.so.6
#2  0xe1404696 in japi1_dllist_65438 ()
#3  0xf7d38965 in jl_apply_generic () from /buildworker/worker/package_linux32/build/usr/bin/../lib/libjulia.so.0.7
#4  0xf7e95e93 in do_call () at /buildworker/worker/package_linux32/build/src/interpreter.c:323
#5  0xf7e96d05 in jl_interpret_toplevel_expr_in_callback () at /buildworker/worker/package_linux32/build/src/interpreter.c:395
#6  0xf7d4dbdb in Lenter_interpreter_frame_start_val () from /buildworker/worker/package_linux32/build/usr/bin/../lib/libjulia.so.0.7
#7  0xf7e98673 in jl_interpret_toplevel_expr_in () at /buildworker/worker/package_linux32/build/src/interpreter.c:777
#8  0xf7d6f8cf in jl_toplevel_eval_flex.constprop.8 () from /buildworker/worker/package_linux32/build/usr/bin/../lib/libjulia.so.0.7
#9  0xf7d4883f in jl_toplevel_eval_in () from /buildworker/worker/package_linux32/build/usr/bin/../lib/libjulia.so.0.7
#10 0xf013ccf6 in japi1_eval_1541 () from /buildworker/worker/package_linux32/build/usr/lib/julia/sys.so
#11 0xf7d38965 in jl_apply_generic () from /buildworker/worker/package_linux32/build/usr/bin/../lib/libjulia.so.0.7
#12 0xe14040b3 in julia_process_options_65438 ()
#13 0xe1403234 in japi1__start_65436 ()
#14 0xf7d38965 in jl_apply_generic () from /buildworker/worker/package_linux32/build/usr/bin/../lib/libjulia.so.0.7
#15 0x0804965f in true_main () at /buildworker/worker/package_linux32/build/ui/../src/julia.h:1475
#16 0x08048f67 in main () at /buildworker/worker/package_linux32/build/ui/repl.c:237
julia> versioninfo()
Julia Version 0.7.0-DEV.2520
Commit 81e245c850 (2017-11-16 22:43 UTC)
Platform Info:
  OS: Linux (i686-pc-linux-gnu)
  CPU: Intel(R) Xeon(R) CPU E3-1241 v3 @ 3.50GHz
  WORD_SIZE: 32
  BLAS: libopenblas (DYNAMIC_ARCH NO_AFFINITY Nehalem)
  LAPACK: libopenblas
  LIBM: libopenlibm
  LLVM: libLLVM-3.9.1 (ORCJIT, haswell)
Environment:
@staticfloat staticfloat added system:32-bit Affects only 32-bit systems bug Indicates an unexpected problem or unintended behavior labels Nov 17, 2017
@ararslan ararslan added the system:linux Affects only Linux label Nov 17, 2017
@yuyichao
Contributor

Good that you can reproduce it. Jameson mentioned that in #21849 (comment) but I couldn't reproduce locally. Were you able to figure out what actually segfaults in the debugger?

@staticfloat
Member Author

No, I haven't figured out why the segfault is occurring. Unfortunately, packaging up the binary and running it outside of the docker environment doesn't trigger the bug. If you like, I can get you SSH access.

@staticfloat
Member Author

staticfloat commented Nov 18, 2017

I'm actually thinking this might be an alignment issue. I had some time to dig back into this; the segmentation fault is occurring within this chunk of code, dl_phdr_info_callback, which is passed to dl_iterate_phdr as a callback. The disassembly from gdb (with a pointer to the faulting instruction):

(gdb) disas
Dump of assembler code for function jlcapi_dl_phdr_info_callback_65624:
   0xe0ea78c0 <+0>:     push   %ebp
   0xe0ea78c1 <+1>:     push   %ebx
   0xe0ea78c2 <+2>:     push   %edi
   0xe0ea78c3 <+3>:     push   %esi
   0xe0ea78c4 <+4>:     sub    $0x3c,%esp
   0xe0ea78c7 <+7>:     movl   $0x0,0xc(%esp)
   0xe0ea78cf <+15>:    movl   $0x0,0x8(%esp)
   0xe0ea78d7 <+23>:    mov    %gs:0x0,%edx
   0xe0ea78de <+30>:    mov    %edx,0x10(%esp)
   0xe0ea78e2 <+34>:    lea    -0x15b0(%edx),%eax
   0xe0ea78e8 <+40>:    mov    %eax,0x14(%esp)
   0xe0ea78ec <+44>:    movl   $0x2,0x4(%esp)
   0xe0ea78f4 <+52>:    mov    -0x15b0(%edx),%ecx
   0xe0ea78fa <+58>:    mov    %ecx,0x8(%esp)
   0xe0ea78fe <+62>:    lea    0x4(%esp),%ecx
   0xe0ea7902 <+66>:    mov    %ecx,-0x15b0(%edx)
   0xe0ea7908 <+72>:    lea    -0x15ac(%edx),%ecx
   0xe0ea790e <+78>:    test   %eax,%eax
   0xe0ea7910 <+80>:    lea    0x1c(%esp),%edi
   0xe0ea7914 <+84>:    cmovne %ecx,%edi
   0xe0ea7917 <+87>:    mov    0xf7f1fde4,%ecx
   0xe0ea791d <+93>:    mov    0xee5247f0,%esi
   0xe0ea7923 <+99>:    cmp    %ecx,%esi
   0xe0ea7925 <+101>:   mov    %esi,%edx
   0xe0ea7927 <+103>:   cmovae %ecx,%edx
   0xe0ea792a <+106>:   mov    (%edi),%eax
   0xe0ea792c <+108>:   mov    %eax,0x18(%esp)
   0xe0ea7930 <+112>:   test   %eax,%eax
   0xe0ea7932 <+114>:   mov    %edx,%ebp
   0xe0ea7934 <+116>:   cmovne %ecx,%ebp
   0xe0ea7937 <+119>:   mov    $0xe0ea79a0,%eax
   0xe0ea793c <+124>:   mov    $0xe0ea7750,%ebx
   0xe0ea7941 <+129>:   cmove  %ebx,%eax
   0xe0ea7944 <+132>:   cmpl   $0x0,0x14(%esp)
   0xe0ea7949 <+137>:   cmove  %edx,%ebp
   0xe0ea794c <+140>:   mov    $0xe0ea7750,%edx
   0xe0ea7951 <+145>:   cmove  %edx,%eax
   0xe0ea7954 <+148>:   cmp    %ecx,%esi
   0xe0ea7956 <+150>:   mov    %ebp,(%edi)
   0xe0ea7958 <+152>:   mov    0x50(%esp),%ecx
   0xe0ea795c <+156>:   movups (%ecx),%xmm0
   0xe0ea795f <+159>:   cmovae %edx,%eax
   0xe0ea7962 <+162>:   mov    0x58(%esp),%ecx
=> 0xe0ea7966 <+166>:   movaps %xmm0,0x20(%esp)
   0xe0ea796b <+171>:   mov    %ecx,0xc(%esp)
   0xe0ea796f <+175>:   mov    0x54(%esp),%edx
   0xe0ea7973 <+179>:   sub    $0x4,%esp
   0xe0ea7976 <+182>:   lea    0x24(%esp),%esi
   0xe0ea797a <+186>:   push   %ecx
   0xe0ea797b <+187>:   push   %edx
   0xe0ea797c <+188>:   push   %esi
   0xe0ea797d <+189>:   call   *%eax
   0xe0ea797f <+191>:   add    $0x10,%esp
   0xe0ea7982 <+194>:   mov    0x18(%esp),%ecx
   0xe0ea7986 <+198>:   mov    %ecx,(%edi)
   0xe0ea7988 <+200>:   mov    0x8(%esp),%ecx
   0xe0ea798c <+204>:   mov    0x10(%esp),%edx
   0xe0ea7990 <+208>:   mov    %ecx,-0x15b0(%edx)
   0xe0ea7996 <+214>:   add    $0x3c,%esp
   0xe0ea7999 <+217>:   pop    %esi
   0xe0ea799a <+218>:   pop    %edi
   0xe0ea799b <+219>:   pop    %ebx
   0xe0ea799c <+220>:   pop    %ebp
   0xe0ea799d <+221>:   ret
End of assembler dump.

Inspecting the movaps docs, it looks like it requires 16-byte alignment, but the address we're moving to isn't (unless I'm misunderstanding how these addresses are considered "aligned"):

(gdb) p $esp + 0x20
$8 = (void *) 0xffffcb14
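
The alignment question above comes down to one remainder. A minimal sketch of the check, using the address gdb printed; the helper name `is_aligned` is only for illustration:

```python
ALIGNMENT = 16  # movaps requires its memory operand to be 16-byte aligned


def is_aligned(address: int, alignment: int = ALIGNMENT) -> bool:
    """Return True if address is a multiple of alignment."""
    return address % alignment == 0


target = 0xffffcb14  # value of $esp + 0x20 printed by gdb above

print(hex(target % ALIGNMENT))  # 0x4
print(is_aligned(target))       # False
```

A 16-byte-aligned address always ends in the hex digit 0, so an address ending in 4 is off by four bytes and movaps faults on it.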

I don't have a particularly good way of tying this native code back to Julia source, unfortunately. My best attempts to get this are shown here, but it appears that @code_native is giving me different results than cfunction().

@vtjnash
Member

vtjnash commented Nov 18, 2017

Since $esp is required by the platform ABI to end in 0, can you walk up the backtrace stack and see when it became invalid?

@yuyichao
Contributor

How old is the buildbot? I would guess that this is due to the stack alignment requirement not being met on old glibc versions or code compiled with old gcc. I'm pretty sure this requirement is much younger than i386, and I've also seen a comment in glibc mentioning the stack realignment requirement in order to support code compiled with old gcc.

If that's true, just stepping back up two levels should give an aligned stack and disassembling the dl_iterate_phdr function should probably show when it got misaligned.

@staticfloat
Member Author

Yes, @yuyichao is correct, stepping back up to japi1_dllist_*() results in a properly aligned stack. The disassembly of dl_iterate_phdr (up to the call into Julia code) follows:

   0xf7c5a470 <+0>:     push   %ebp
   0xf7c5a471 <+1>:     mov    %esp,%ebp
   0xf7c5a473 <+3>:     push   %edi
   0xf7c5a474 <+4>:     push   %esi
   0xf7c5a475 <+5>:     push   %ebx
   0xf7c5a476 <+6>:     call   0xf7b4fb5f <__i686.get_pc_thunk.bx>
   0xf7c5a47b <+11>:    add    $0x71b79,%ebx
   0xf7c5a481 <+17>:    sub    $0x48,%esp
   0xf7c5a484 <+20>:    mov    -0x30(%ebx),%edx
   0xf7c5a48a <+26>:    lea    0x4dc(%edx),%eax
   0xf7c5a490 <+32>:    mov    %eax,(%esp)
   0xf7c5a493 <+35>:    call   *0x7f4(%edx)
   0xf7c5a499 <+41>:    mov    -0x30(%ebx),%eax
   0xf7c5a49f <+47>:    mov    0x4(%ebp),%ecx
   0xf7c5a4a2 <+50>:    movl   $0x0,-0x44(%ebp)
   0xf7c5a4a9 <+57>:    mov    0x4(%eax),%edi
   0xf7c5a4ac <+60>:    mov    %eax,%edx
   0xf7c5a4ae <+62>:    mov    0x4c0(%eax),%eax
   0xf7c5a4b4 <+68>:    sub    $0x1,%eax
   0xf7c5a4b7 <+71>:    test   %eax,%eax
   0xf7c5a4b9 <+73>:    mov    %eax,-0x40(%ebp)
   0xf7c5a4bc <+76>:    jle    0xf7c5a517 <dl_iterate_phdr+167>
   0xf7c5a4be <+78>:    imul   $0x4c,%eax,%eax
   0xf7c5a4c1 <+81>:    add    %edx,%eax
   0xf7c5a4c3 <+83>:    mov    %eax,-0x3c(%ebp)
   0xf7c5a4c6 <+86>:    xchg   %ax,%ax
   0xf7c5a4c8 <+88>:    mov    -0x3c(%ebp),%eax
   0xf7c5a4cb <+91>:    mov    (%eax),%esi
   0xf7c5a4cd <+93>:    test   %esi,%esi
   0xf7c5a4cf <+95>:    je     0xf7c5a508 <dl_iterate_phdr+152>
   0xf7c5a4d1 <+97>:    nopl   0x0(%eax)
   0xf7c5a4d8 <+104>:   mov    -0x3c(%ebp),%edx
   0xf7c5a4db <+107>:   add    0x4(%edx),%edi
   0xf7c5a4de <+110>:   cmp    0x1ac(%esi),%ecx
   0xf7c5a4e4 <+116>:   jb     0xf7c5a501 <dl_iterate_phdr+145>
   0xf7c5a4e6 <+118>:   cmp    0x1b0(%esi),%ecx
   0xf7c5a4ec <+124>:   jae    0xf7c5a501 <dl_iterate_phdr+145>
   0xf7c5a4ee <+126>:   testb  $0x40,0x195(%esi)
   0xf7c5a4f5 <+133>:   je     0xf7c5a5e8 <dl_iterate_phdr+376>
   0xf7c5a4fb <+139>:   mov    -0x40(%ebp),%eax
   0xf7c5a4fe <+142>:   mov    %eax,-0x44(%ebp)
   0xf7c5a501 <+145>:   mov    0xc(%esi),%esi
   0xf7c5a504 <+148>:   test   %esi,%esi
   0xf7c5a506 <+150>:   jne    0xf7c5a4d8 <dl_iterate_phdr+104>
   0xf7c5a508 <+152>:   subl   $0x1,-0x40(%ebp)
   0xf7c5a50c <+156>:   mov    -0x40(%ebp),%eax
   0xf7c5a50f <+159>:   subl   $0x4c,-0x3c(%ebp)
   0xf7c5a513 <+163>:   test   %eax,%eax
   0xf7c5a515 <+165>:   jg     0xf7c5a4c8 <dl_iterate_phdr+88>
   0xf7c5a517 <+167>:   imul   $0x4c,-0x44(%ebp),%eax
   0xf7c5a51b <+171>:   mov    -0x30(%ebx),%edx
   0xf7c5a521 <+177>:   mov    (%edx,%eax,1),%esi
   0xf7c5a524 <+180>:   xor    %eax,%eax
   0xf7c5a526 <+182>:   test   %esi,%esi
   0xf7c5a528 <+184>:   je     0xf7c5a604 <dl_iterate_phdr+404>
   0xf7c5a52e <+190>:   mov    -0xbc(%ebx),%eax
   0xf7c5a534 <+196>:   mov    %edi,-0x3c(%ebp)
   0xf7c5a537 <+199>:   lea    -0x34(%ebp),%edi
   0xf7c5a53a <+202>:   mov    %edi,-0x40(%ebp)
   0xf7c5a53d <+205>:   mov    0xc(%ebp),%edi
   0xf7c5a540 <+208>:   movl   $0x0,-0x38(%ebp)
   0xf7c5a547 <+215>:   mov    0x1b4(%eax),%eax
   0xf7c5a54d <+221>:   mov    %eax,-0x44(%ebp)
   0xf7c5a550 <+224>:   jmp    0xf7c5a580 <dl_iterate_phdr+272>
   0xf7c5a552 <+226>:   nopw   0x0(%eax,%eax,1)
   0xf7c5a558 <+232>:   mov    -0x40(%ebp),%eax
   0xf7c5a55b <+235>:   mov    %edi,0x8(%esp)
   0xf7c5a55f <+239>:   movl   $0x28,0x4(%esp)
   0xf7c5a567 <+247>:   mov    %eax,(%esp)
=> 0xf7c5a56a <+250>:   call   *0x8(%ebp)

I think it's pretty obvious that $esp isn't going to be 16-byte aligned due to the sub $0x48,%esp, so this does seem to be the culprit. The glibc version on this buildbot is 2.12-1.209.el6_9.2, built using GCC 4.4.7.

Do you think this is a bug in glibc or GCC? E.g. if I use a newer GCC to build glibc 2.12, do you think that will fix the problem?

@yuyichao
Contributor

due to the sub $0x48,%esp

Note that sp isn't aligned to 16 when entering the function (sp + 4 is).
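
The arithmetic behind this can be sketched by tracking only esp modulo 16 through the two prologues shown in the disassemblies. This is a rough model, not a debugger trace. It assumes the caller of dl_iterate_phdr had esp 16-byte aligned at its call instruction (as reported for japi1_dllist above), and the helper name `esp_mod16` is made up for illustration:

```python
def esp_mod16(entry_mod16: int, pushes: int, frame_bytes: int) -> int:
    """esp mod 16 after `pushes` 4-byte pushes and `sub $frame_bytes, %esp`."""
    return (entry_mod16 - 4 * pushes - frame_bytes) % 16


# Assumption: the caller was aligned at its call instruction, so the pushed
# return address leaves esp at 12 (mod 16) on entry, i.e. esp + 4 is aligned.
ENTRY_OK = (0 - 4) % 16

# dl_iterate_phdr prologue: push ebp/edi/esi/ebx, then sub $0x48,%esp.
# (The call to __i686.get_pc_thunk.bx pushes and pops a return address,
# so it has no net effect on esp.)
at_callback_call = esp_mod16(ENTRY_OK, pushes=4, frame_bytes=0x48)

# call *0x8(%ebp) pushes the return address before entering the callback.
callback_entry = (at_callback_call - 4) % 16

# Callback prologue: push ebp/ebx/edi/esi, then sub $0x3c,%esp.
callback_esp = esp_mod16(callback_entry, pushes=4, frame_bytes=0x3c)
movaps_target = (callback_esp + 0x20) % 16

# The same callback prologue when entered with the expected alignment.
expected_target = (esp_mod16(ENTRY_OK, pushes=4, frame_bytes=0x3c) + 0x20) % 16

print(at_callback_call, callback_entry, movaps_target, expected_target)  # 4 0 4 0
```

Under these assumptions esp is 4 (mod 16) at the indirect call instead of 0, and the movaps destination lands on 4 (mod 16), which agrees with the 0xffffcb14 reported by gdb. The callback's own frame would be correctly aligned if it were entered with the expected alignment.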

Do you think this is a bug in glibc or GCC

It's a C function so glibc should be fine.

if I use a newer GCC to build glibc 2.12, do you think that will fix the problem?

Not sure...

@staticfloat
Member Author

staticfloat commented Nov 18, 2017 via email

@vtjnash
Member

vtjnash commented Nov 18, 2017

Seems like a gcc bug. I didn’t think 4.4.7 was that old to pre-date the ABI change. Sounds like a good idea to try recompiling glibc with a newer gcc.

@yuyichao
Contributor

It should be GCC's job though IIRC it can be controlled by command line flags too.

I wasn't sure since I wasn't able to reproduce an issue on gcc 4.4.7 on https://gcc.godbolt.org/

@vtjnash
Member

vtjnash commented Nov 18, 2017

(To clarify, the Linux user space ABI changed sometime after the introduction of AVX. While we could alter our cfunction stub to be backwards compatible, in practice that isn’t usually necessary, since the ABI change is generally assumed to have occurred a while ago)

@vtjnash
Member

vtjnash commented Nov 18, 2017

Google confirmed for me: this was a breaking change for the gcc 4.5 release (2010).

@staticfloat
Member Author

Hmmm, I tried to build my own libc.so.6, which I did, but it now segfaults any binary I try to use it with. :P

$ /opt/i686-linux-gnu/i686-linux-gnu/sys-root/lib/ld-2.12.2.so --library-path /opt/i686-linux-gnu/i686-linux-gnu/sys-root/lib /bin/true
Segmentation fault (core dumped)

Here's the output with LD_DEBUG=all, which isn't particularly helpful to me, but maybe it's more helpful to others. I'm not sure how to debug this, as I can't really figure out why it's crashing. The built glibc seems, on the surface, to be okay. I can use the newly built loader to load binaries and use the old libc (e.g. omitting the --library-path argument) but when I try to use the new libc, things break. Additionally, just copying the whole /opt/i686-linux-gnu/i686-linux-gnu/sys-root/lib directory to /lib breaks the whole system.

@staticfloat
Member Author

If anyone wants to try this out at home, you can docker pull staticfloat/libctest. Among other things, you will find a custom-built glibc sitting in /opt/i686-linux-gnu/i686-linux-gnu/sys-root/lib.

@yuyichao
Contributor

yuyichao commented Nov 19, 2017

Compile glibc with -mstackrealign ?
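
For context, realigning a stack at function entry amounts to rounding the stack pointer down to the next 16-byte boundary, typically with an instruction like `and $-16,%esp`. A minimal sketch of that rounding follows. The helper name `realign_down` is hypothetical, and the prologue GCC actually emits also has to preserve the original stack pointer so it can be restored on return:

```python
def realign_down(esp: int, alignment: int = 16) -> int:
    """Round esp down to a multiple of alignment (the stack grows downward)."""
    return esp & ~(alignment - 1)


misaligned = 0xffffcb14  # the misaligned address seen earlier in this thread
print(hex(realign_down(misaligned)))  # 0xffffcb10
```

Rounding down rather than up matters here: the stack grows toward lower addresses, so moving esp down only wastes a few bytes, while moving it up would overlap data already on the stack.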

@yuyichao yuyichao added upstream The issue is with an upstream dependency, e.g. LLVM and removed bug Indicates an unexpected problem or unintended behavior labels Nov 19, 2017
@vtjnash
Member

vtjnash commented Nov 20, 2017

Can you try using this patch on the x86 buildbot:

diff --git a/src/codegen.cpp b/src/codegen.cpp
index 9f4a02ef10..105c926a2f 100644
--- a/src/codegen.cpp
+++ b/src/codegen.cpp
@@ -6646,6 +6646,7 @@ extern "C" void *jl_init_llvm(void)
     // to ensure compatibility with GCC codes
     options.StackAlignmentOverride = 16;
 #endif
+    options.StackAlignmentOverride = 4;
     EngineBuilder eb((std::unique_ptr<Module>(engine_module)));
     std::string ErrorStr;
     eb  .setEngineKind(EngineKind::JIT)

If this works as it should, I'll make a PR to put this behind a compiler flag (something like #ifdef _OLD_X86_STACK_ALIGN)

@vtjnash vtjnash removed the upstream The issue is with an upstream dependency, e.g. LLVM label Nov 20, 2017
@yuyichao
Contributor

It seems that it would be dangerous, since it may not give 16-byte alignment for other functions anymore (judging from the comment)?

@vtjnash
Member

vtjnash commented Nov 20, 2017

Yeah, we'll also have to fix the rest of the build, but that's a secondary issue.

@yuyichao
Contributor

But it'll make the built binary unusable everywhere else.

@vtjnash
Member

vtjnash commented Nov 20, 2017

That seems OK? If we want to build for & distribute a binary with the new (post-gcc 4.5) ABI, we need to build on a buildbot OS that uses it.

@yuyichao
Contributor

Then it seems that we should just drop the support or provide multiple generic binaries.

@staticfloat
Member Author

I agree that we should just drop support for the older ABI going forward. I just built a new buildbot based on Debian 8 that still uses a relatively old glibc (2.19) and is built with the proper ABI flags. It's also having trouble with libgit2 issues right now though.

@staticfloat
Member Author

This is confirmed fixed by bumping up to Debian 8. Yay.
