segmentation fault in openib with failover enabled #2228
Comments
@hjelmn Are you aware of any BTL changes that would cause this kind of behavior? |
@larrystevenwise Can you please investigate? |
I'll see if I can reproduce on cxgb4... |
@davidklaftenegger Please provide the configure command you used to configure ompi. Thanx! |
reproduced it...
Program terminated with signal 11, Segmentation fault.
#0 0x00007fb6453865be in mca_btl_openib_sendi (btl=0x810f20, ep=0x88aa10, convertor=0x7fff6e0eaa00, header=0x7fff6e0eab10, header_size=14, payload_size=8, order=255 '\377', flags=3, tag=65 'A', descriptor=0x0) at btl_openib.c:1841
1841 *descriptor = (struct mca_btl_base_descriptor_t *) frag;
Missing separate debuginfos, use: debuginfo-install glibc-2.12-1.132.el6.x86_64 libgcc-4.4.7-4.el6.x86_64 libudev-147-2.51.el6.x86_64 numactl-2.0.7-8.el6.x86_64 tcp_wrappers-libs-7.6-57.el6.x86_64
(gdb) bt
#0 0x00007fb6453865be in mca_btl_openib_sendi (btl=0x810f20, ep=0x88aa10, convertor=0x7fff6e0eaa00, header=0x7fff6e0eab10, header_size=14, payload_size=8, order=255 '\377', flags=3, tag=65 'A', descriptor=0x0) at btl_openib.c:1841
#1 0x00007fb644b3e0fa in mca_bml_base_sendi (bml_btl=0x88a290, convertor=0x7fff6e0eaa00, header=0x7fff6e0eab10, header_size=14, payload_size=8, order=255 '\377', flags=3, tag=65 'A', descriptor=0x0) at ../../../../ompi/mca/bml/bml.h:301
#2 0x00007fb644b3f125 in mca_pml_ob1_send_inline (buf=0x897990, count=1, datatype=0x611960, dst=1, tag=-12, seqn=30173, dst_proc=0x889f10, endpoint=0x88a130, comm=0x894cc0) at pml_ob1_isend.c:119
#3 0x00007fb644b3f240 in mca_pml_ob1_isend (buf=0x897990, count=1, datatype=0x611960, dst=1, tag=-12, sendmode=MCA_PML_BASE_SEND_STANDARD, comm=0x894cc0, request=0x7fff6e0eac68) at pml_ob1_isend.c:156
#4 0x00007fb64fb2a33c in ompi_coll_base_allreduce_intra_recursivedoubling (sbuf=0x7fff6e0eae68, rbuf=0x7fff6e0eae60, count=1, dtype=0x611960, op=0x610960, comm=0x894cc0, module=0x895990) at base/coll_base_allreduce.c:221
#5 0x00007fb63fdecb45 in ompi_coll_tuned_allreduce_intra_dec_fixed (sbuf=0x7fff6e0eae68, rbuf=0x7fff6e0eae60, count=1, dtype=0x611960, op=0x610960, comm=0x894cc0, module=0x895990) at coll_tuned_decision_fixed.c:66
#6 0x00007fb64fabee36 in PMPI_Allreduce (sendbuf=0x7fff6e0eae68, recvbuf=0x7fff6e0eae60, count=1, datatype=0x611960, op=0x610960, comm=0x894cc0) at pallreduce.c:107
#7 0x000000000040500c in IMB_init_buffers_iter ()
#8 0x0000000000401e06 in main ()
(gdb) |
So this:
1841 *descriptor = (struct mca_btl_base_descriptor_t *) frag;
barfed because descriptor is NULL:
(gdb) p descriptor
$3 = (mca_btl_base_descriptor_t **) 0x0 |
dunno if this is the right fix or not?
|
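The patch from the comment above is not preserved in this thread. Purely as an illustration of the failure mode shown in the backtrace (a write through a NULL out-parameter), here is a minimal sketch in C. The names `frag_t`, `pending_frag`, and `sendi_sketch` are hypothetical and are not the Open MPI code or the proposed fix; the sketch assumes callers are allowed to pass NULL when they do not want the fragment back.

```c
#include <stddef.h>

/* Hypothetical stand-in for a BTL fragment; NOT the real Open MPI type. */
typedef struct frag {
    int payload;
} frag_t;

/* Hypothetical fragment that a failed inline send would hand back. */
static frag_t pending_frag = { 0 };

/*
 * Simplified "send immediate" routine (illustrative only).
 * Returns 0 if the message went out inline, -1 otherwise.
 * On failure the prepared fragment is returned through `descriptor`,
 * but only if the caller supplied a non-NULL pointer. Writing
 * `*descriptor` unconditionally is the pattern that crashes when
 * descriptor == NULL, as in frame #0 of the backtrace.
 */
int sendi_sketch(int can_send_inline, frag_t **descriptor)
{
    if (can_send_inline) {
        return 0;
    }
    if (descriptor != NULL) {
        *descriptor = &pending_frag;
    }
    return -1;
}
```

With this guard, a call such as `sendi_sketch(0, NULL)` returns -1 instead of faulting. Whether the real code path should tolerate a NULL descriptor or the caller should always supply one is exactly the question raised in the thread.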
Looks like |
@davidklaftenegger can you please try the patch above? It fixes the issue on my setup... |
Created PR #2323 |
we currently disabled the compilation flag for this, but we might be able to test it later this week. |
Does failover even work in Open MPI anymore? The bfo pml is long since gone and the support in openib is bitrotten. |
Fair point. Is the real issue that |
I'm pretty sure this doesn't work with iWARP and cxgb4, especially if it leverages IB inter-port failover. If folks wish it, I'll remove the code entirely. We should hear what MLNX thinks though. |
@larrystevenwise On the call today, we decided that since the @davidklaftenegger I'm sorry; this probably isn't the answer that you want, but the |
@jsquyres I have no stake in this, I only had it enabled because I saw no reason not to, until the resulting openmpi did not work. |
Closed #2323 - I'll create a new PR to remove the failover code. |
#2336 opened to remove openib failover code. |
Fixed in #2336. |
When compiling openmpi-2.0.1 (or the nightly from last Wednesday) with
--enable-btl-openib-failover
we experience a segmentation fault on the first use of MPI communication in all MPI applications when using openib. This is a regression from openmpi-1.10.2, where this worked without incident.
When not setting
--enable-btl-openib-failover
our setup seems to work again. In case that matters, we have an mlx4 InfiniBand interconnect.
If you need any additional information, please tell me.
Yours,
David