segmentation fault in openib with failover enabled #2228

Closed
davidklaftenegger opened this issue Oct 17, 2016 · 19 comments

@davidklaftenegger

davidklaftenegger commented Oct 17, 2016

When openmpi-2.0.1 (or the nightly from last Wednesday) is compiled with --enable-btl-openib-failover, every MPI application segfaults on its first MPI communication call when using the openib BTL.

[jason0:15469] *** Process received signal ***
[jason0:15469] Signal: Segmentation fault (11)
[jason0:15469] Signal code: Address not mapped (1)
[jason0:15469] Failing at address: (nil)
0 pings 1
[jason0:15469] [ 0] /lib64/libpthread.so.0(+0x10d70)[0x7f40652b2d70]
[jason0:15469] [ 1] /usr/lib64/openmpi/mca_btl_openib.so(mca_btl_openib_sendi+0x66b)[0x7f405a46cceb]
[jason0:15469] [ 2] /usr/lib64/openmpi/mca_pml_ob1.so(+0xae18)[0x7f4059e2ae18]
[jason0:15469] [ 3] /usr/lib64/openmpi/mca_pml_ob1.so(mca_pml_ob1_isend+0x365)[0x7f4059e2b945]
[jason0:15469] [ 4] /usr/lib/libmpi.so.20(ompi_coll_base_barrier_intra_two_procs+0xb5)[0x7f406555b0f5]
[jason0:15469] [ 5] /usr/lib/libmpi.so.20(MPI_Barrier+0xb6)[0x7f4065518576]
[jason0:15469] [ 6] ./pingtest[0x4013a3]
[jason0:15469] [ 7] /lib64/libc.so.6(__libc_start_main+0xf0)[0x7f4064f26620]
[jason0:15469] [ 8] ./pingtest[0x400de9]
[jason0:15469] *** End of error message ***
[jason1:20431] *** Process received signal ***
[jason1:20431] Signal: Segmentation fault (11)
[jason1:20431] Signal code: Address not mapped (1)
[jason1:20431] Failing at address: (nil)
[jason1:20431] [ 0] /lib64/libpthread.so.0(+0x10d70)[0x7f6e12d28d70]
[jason1:20431] [ 1] /usr/lib64/openmpi/mca_btl_openib.so(mca_btl_openib_sendi+0x66b)[0x7f6e07dcbceb]
[jason1:20431] [ 2] /usr/lib64/openmpi/mca_pml_ob1.so(+0xae18)[0x7f6e0c16fe18]
[jason1:20431] [ 3] /usr/lib64/openmpi/mca_pml_ob1.so(mca_pml_ob1_isend+0x365)[0x7f6e0c170945]
[jason1:20431] [ 4] /usr/lib/libmpi.so.20(ompi_coll_base_barrier_intra_two_procs+0xb5)[0x7f6e12fd10f5]
[jason1:20431] [ 5] /usr/lib/libmpi.so.20(MPI_Barrier+0xb6)[0x7f6e12f8e576]
[jason1:20431] [ 6] ./pingtest[0x4013a3]
[jason1:20431] [ 7] /lib64/libc.so.6(__libc_start_main+0xf0)[0x7f6e1299c620]
[jason1:20431] [ 8] ./pingtest[0x400de9]
[jason1:20431] *** End of error message ***

This is a regression from openmpi-10.0.2, where this worked without incident.

When we configure without --enable-btl-openib-failover, our setup seems to work again.

In case it matters, we have an mlx4 InfiniBand interconnect.
If you need any additional information, please tell me.

Yours,
David

@jsquyres jsquyres added the bug label Oct 17, 2016
@jsquyres jsquyres added this to the v2.0.2 milestone Oct 17, 2016
@jsquyres
Member

@hjelmn Are you aware of any BTL changes that would cause this kind of behavior?

@jsquyres
Member

@larrystevenwise Can you please investigate?

@larrystevenwise

I'll see if I can reproduce on cxgb4...

@larrystevenwise

@davidklaftenegger Please provide the configure command you used to configure ompi. Thanx!

@larrystevenwise

reproduced it...

Program terminated with signal 11, Segmentation fault.
#0  0x00007fb6453865be in mca_btl_openib_sendi (btl=0x810f20, ep=0x88aa10, convertor=0x7fff6e0eaa00, header=0x7fff6e0eab10, header_size=14, payload_size=8, order=255 '\377', flags=3, tag=65 'A', descriptor=0x0)
    at btl_openib.c:1841
1841                *descriptor = (struct mca_btl_base_descriptor_t *) frag;
Missing separate debuginfos, use: debuginfo-install glibc-2.12-1.132.el6.x86_64 libgcc-4.4.7-4.el6.x86_64 libudev-147-2.51.el6.x86_64 numactl-2.0.7-8.el6.x86_64 tcp_wrappers-libs-7.6-57.el6.x86_64
(gdb) bt
#0  0x00007fb6453865be in mca_btl_openib_sendi (btl=0x810f20, ep=0x88aa10, convertor=0x7fff6e0eaa00, header=0x7fff6e0eab10, header_size=14, payload_size=8, order=255 '\377', flags=3, tag=65 'A', descriptor=0x0)
    at btl_openib.c:1841
#1  0x00007fb644b3e0fa in mca_bml_base_sendi (bml_btl=0x88a290, convertor=0x7fff6e0eaa00, header=0x7fff6e0eab10, header_size=14, payload_size=8, order=255 '\377', flags=3, tag=65 'A', descriptor=0x0)
    at ../../../../ompi/mca/bml/bml.h:301
#2  0x00007fb644b3f125 in mca_pml_ob1_send_inline (buf=0x897990, count=1, datatype=0x611960, dst=1, tag=-12, seqn=30173, dst_proc=0x889f10, endpoint=0x88a130, comm=0x894cc0) at pml_ob1_isend.c:119
#3  0x00007fb644b3f240 in mca_pml_ob1_isend (buf=0x897990, count=1, datatype=0x611960, dst=1, tag=-12, sendmode=MCA_PML_BASE_SEND_STANDARD, comm=0x894cc0, request=0x7fff6e0eac68) at pml_ob1_isend.c:156
#4  0x00007fb64fb2a33c in ompi_coll_base_allreduce_intra_recursivedoubling (sbuf=0x7fff6e0eae68, rbuf=0x7fff6e0eae60, count=1, dtype=0x611960, op=0x610960, comm=0x894cc0, module=0x895990)
    at base/coll_base_allreduce.c:221
#5  0x00007fb63fdecb45 in ompi_coll_tuned_allreduce_intra_dec_fixed (sbuf=0x7fff6e0eae68, rbuf=0x7fff6e0eae60, count=1, dtype=0x611960, op=0x610960, comm=0x894cc0, module=0x895990) at coll_tuned_decision_fixed.c:66
#6  0x00007fb64fabee36 in PMPI_Allreduce (sendbuf=0x7fff6e0eae68, recvbuf=0x7fff6e0eae60, count=1, datatype=0x611960, op=0x610960, comm=0x894cc0) at pallreduce.c:107
#7  0x000000000040500c in IMB_init_buffers_iter ()
#8  0x0000000000401e06 in main ()
(gdb)

@larrystevenwise

So this:

1841                *descriptor = (struct mca_btl_base_descriptor_t *) frag;

barfed because descriptor is NULL:

(gdb) p descriptor
$3 = (mca_btl_base_descriptor_t **) 0x0

@larrystevenwise

dunno if this is the right fix or not?

diff --git a/opal/mca/btl/openib/btl_openib.c b/opal/mca/btl/openib/btl_openib.c
index 05c15e1..00839de 100644
--- a/opal/mca/btl/openib/btl_openib.c
+++ b/opal/mca/btl/openib/btl_openib.c
@@ -1838,7 +1838,8 @@ int mca_btl_openib_sendi( struct mca_btl_base_module_t* btl,
 #if BTL_OPENIB_FAILOVER_ENABLED
         else {
             /* Return up in case needed for failover */
-            *descriptor = (struct mca_btl_base_descriptor_t *) frag;
+            if (descriptor)
+                *descriptor = (struct mca_btl_base_descriptor_t *) frag;
         }
 #endif
         OPAL_THREAD_UNLOCK(&ep->endpoint_lock);

@larrystevenwise

Looks like mca_pml_ob1_send_inline() passes down a descriptor of NULL...
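
To illustrate the failure mode discussed above, here is a minimal, self-contained sketch of an optional out-parameter that is only written in failover builds. It is not the Open MPI code: `SKETCH_FAILOVER_ENABLED`, `sketch_descriptor_t`, `sketch_frag` and `sketch_sendi` are hypothetical stand-ins for `BTL_OPENIB_FAILOVER_ENABLED`, `mca_btl_base_descriptor_t`, the fragment and `mca_btl_openib_sendi`. It only shows why a caller that passes NULL is harmless when the flag is off and crashes when the flag is on, unless the callee checks the pointer first.

```c
#include <stddef.h>

/* Hypothetical stand-in for BTL_OPENIB_FAILOVER_ENABLED (assumption). */
#ifndef SKETCH_FAILOVER_ENABLED
#define SKETCH_FAILOVER_ENABLED 1
#endif

/* Hypothetical stand-in for mca_btl_base_descriptor_t (assumption). */
typedef struct sketch_descriptor {
    int tag;
} sketch_descriptor_t;

/* Stands in for the fragment that the real sendi path allocates. */
static sketch_descriptor_t sketch_frag;

/*
 * Models a sendi-style call with an optional out-parameter.
 * Callers that do not need the descriptor back may pass NULL,
 * as the inline send path in the backtrace does (descriptor=0x0).
 * Returns 0 on success.
 */
static int sketch_sendi(int tag, sketch_descriptor_t **descriptor)
{
    sketch_frag.tag = tag;

#if SKETCH_FAILOVER_ENABLED
    /* Without this check, a NULL argument is dereferenced here,
     * which is the kind of crash shown in the backtrace. */
    if (NULL != descriptor) {
        *descriptor = &sketch_frag;
    }
#else
    /* Non-failover builds never touch the pointer, so NULL is harmless. */
    (void) descriptor;
#endif

    return 0;
}
```

With the guard in place, `sketch_sendi(65, NULL)` returns 0 without touching memory, and `sketch_sendi(65, &out)` still hands the fragment back to callers that ask for it. Whether a guard or a change in the caller is the better fix for the real code is a separate question.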

@larrystevenwise

@davidklaftenegger can you please try the patch above? It fixes the issue on my setup...

@larrystevenwise

Created PR #2323

@davidklaftenegger
Author

We have disabled this configure flag for now, but we might be able to test the patch later this week.

@hjelmn
Member

hjelmn commented Oct 31, 2016

Does failover even work in Open MPI anymore? The bfo pml is long since gone and the support in openib is bitrotten.

@jsquyres
Member

jsquyres commented Nov 1, 2016

Fair point. Is the real issue that --enable-btl-openib-failover should be deleted as a configure option (and the corresponding support in the openib BTL removed)?

@larrystevenwise

I'm pretty sure this doesn't work with iWARP and cxgb4, especially if it leverages IB inter-port failover. If folks wish it, I'll remove the code entirely. We should hear what MLNX thinks though.

@jsquyres
Member

jsquyres commented Nov 1, 2016

@larrystevenwise On the call today, we decided that since the bfo PML is no longer included in Open MPI, we should remove all the failover code from the openib BTL (since it's stale, anyway). If you could update (or replace? whatever is easier) #2323 with commits to remove all the failover code from the openib BTL, that would be great.

@davidklaftenegger I'm sorry; this probably isn't the answer that you want, but the bfo PML was removed due to a lack of interest and the lack of a maintainer. Sorry! 😢

@davidklaftenegger
Author

@jsquyres I have no stake in this, I only had it enabled because I saw no reason not to, until the resulting openmpi did not work.

@larrystevenwise

Closed #2323 - I'll create a new PR to remove the failover code.

@larrystevenwise

#2336 opened to remove openib failover code.

@jsquyres
Member

jsquyres commented Nov 7, 2016

Fixed in #2336.

@jsquyres jsquyres closed this as completed Nov 7, 2016