
Segfault in mca_io_romio321.so #9432

Open
amckinstry opened this issue Sep 27, 2021 · 18 comments

@amckinstry

Background information

This is with 4.1.2~rc1 on multiple architectures on Debian unstable (development branch).
It is triggered by the tests in mpi4py:

e.g.
https://buildd.debian.org/status/fetch.php?pkg=mpi4py&arch=ppc64el&ver=3.1.1-8&stamp=1632692000&raw=0

The source as installed is at:
https://sources.debian.org/src/openmpi/4.1.2%7Erc1-2/
For configuration information, see the rules file:
https://sources.debian.org/src/openmpi/4.1.2%7Erc1-2/debian/rules/

@edgargabriel
Member

The romio segfault looks legit, but there is another issue here: why was romio used instead of ompio?

[ppc64el-osuosl-01:1381766] mca_base_component_repository_open: unable to open mca_io_ompio: libmca_common_ompio.so.41: cannot open shared object file: No such file or directory (ignored)

Not sure whether to open another ticket for this or not. Will try to reproduce.

@edgargabriel
Member

edgargabriel commented Sep 27, 2021

At least for the missing libmca_common_ompio problem, the issue seems to be in your scripts:
---snip--- (file debian/rules, line 275)

if test -f $(DESTDIR)/$(LIBDIR)/openmpi/lib/libmca_common_ompio.so.41.29.1; then
dh_install -p libopenmpi3 $(LIBDIR)/openmpi/lib/libmca_common_ompio.so.41.29.1 $(LIBDIR) ; \

---snip---

The file generated on my system, however, is libmca_common_ompio.so.41.29.2.

@ggouaillardet
Contributor

@amckinstry Thanks for the report!

I am unable to reproduce the issue on my CentOS 7 x86_64 VM.

Is this issue specific to ppc64le?

FWIW

[gilles@ws mpi4py-3.1.1]$ python3 test/runtests.py -v -i test_io
[0@ws] Python 3.6 (/usr/bin/python3)
[0@ws] MPI 3.1 (Open MPI 4.1.2)
[0@ws] mpi4py 3.1.1 (build/lib.linux-x86_64-3.6/mpi4py)
testIReadIWrite (test_io.TestIOSelf) ... ok
testIReadIWriteAll (test_io.TestIOSelf) ... ok
testIReadIWriteAt (test_io.TestIOSelf) ... ok
testIReadIWriteAtAll (test_io.TestIOSelf) ... ok
testIReadIWriteShared (test_io.TestIOSelf) ... ok
testReadWrite (test_io.TestIOSelf) ... ok
testReadWriteAll (test_io.TestIOSelf) ... ok
testReadWriteAllBeginEnd (test_io.TestIOSelf) ... ok
testReadWriteAt (test_io.TestIOSelf) ... ok
testReadWriteAtAll (test_io.TestIOSelf) ... ok
testReadWriteAtAllBeginEnd (test_io.TestIOSelf) ... ok
testReadWriteOrdered (test_io.TestIOSelf) ... ok
testReadWriteOrderedBeginEnd (test_io.TestIOSelf) ... ok
testReadWriteShared (test_io.TestIOSelf) ... ok
testIReadIWrite (test_io.TestIOWorld) ... ok
testIReadIWriteAll (test_io.TestIOWorld) ... ok
testIReadIWriteAt (test_io.TestIOWorld) ... ok
testIReadIWriteAtAll (test_io.TestIOWorld) ... ok
testIReadIWriteShared (test_io.TestIOWorld) ... ok
testReadWrite (test_io.TestIOWorld) ... ok
testReadWriteAll (test_io.TestIOWorld) ... ok
testReadWriteAllBeginEnd (test_io.TestIOWorld) ... ok
testReadWriteAt (test_io.TestIOWorld) ... ok
testReadWriteAtAll (test_io.TestIOWorld) ... ok
testReadWriteAtAllBeginEnd (test_io.TestIOWorld) ... ok
testReadWriteOrdered (test_io.TestIOWorld) ... ok
testReadWriteOrderedBeginEnd (test_io.TestIOWorld) ... ok
testReadWriteShared (test_io.TestIOWorld) ... ok

----------------------------------------------------------------------
Ran 28 tests in 40.097s

OK

@drew-parsons

drew-parsons commented Sep 28, 2021

No, it's affecting nearly all architectures, see https://buildd.debian.org/status/package.php?p=mpi4py
I can reproduce it easily on amd64 with openmpi debian package 4.1.2~rc1-2.

There are some more details at https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=995150
e.g. a gdb backtrace (running test_io.py only):

Thread 1 "python3" received signal SIGSEGV, Segmentation fault.
0x0000000000000000 in ?? ()
(gdb) bt
#0  0x0000000000000000 in ?? ()
#1  0x00007fffad13ea84 in MPIOI_File_iwrite_all (fh=<optimized out>, offset=offset@entry=0, file_ptr_type=file_ptr_type@entry=101, buf=<optimized out>, count=13, datatype=0x7ffff3d27700 <ompi_mpi_signed_char>, 
    myname=0x7fffad17be80 <myname> "MPI_FILE_IWRITE_ALL", request=0x7fffad57ca10) at ../../../../../../../ompi/mca/io/romio321/romio/mpi-io/iwrite_all.c:124
#2  0x00007fffad13ebd3 in mca_io_romio_dist_MPI_File_iwrite_all (fh=<optimized out>, buf=<optimized out>, count=<optimized out>, datatype=<optimized out>, request=<optimized out>)
    at ../../../../../../../ompi/mca/io/romio321/romio/mpi-io/iwrite_all.c:58
#3  0x00007fffad13c4c3 in mca_io_romio321_file_iwrite_all (fh=<optimized out>, buf=<optimized out>, count=<optimized out>, datatype=<optimized out>, request=<optimized out>)
    at ../../../../../../ompi/mca/io/romio321/src/io_romio321_file_write.c:203
#4  0x00007ffff3c86b4c in PMPI_File_iwrite_all (fh=0x1911f40, buf=0x7fffad6072b0, count=13, datatype=<optimized out>, request=request@entry=0x7fffad57ca10) at pfile_iwrite_all.c:83
#5  0x00007ffff3e80d0c in __pyx_pf_6mpi4py_3MPI_4File_62Iwrite_all (__pyx_v_buf=<optimized out>, __pyx_v_self=0x7fffacf8d990) at src/mpi4py.MPI.c:159094
#6  __pyx_pw_6mpi4py_3MPI_4File_63Iwrite_all (__pyx_v_self=<mpi4py.MPI.File at remote 0x7fffacf8d990>, __pyx_args=__pyx_args@entry=(<array.array at remote 0x7fffad57c930>,), __pyx_kwds=__pyx_kwds@entry=0x0)
    at src/mpi4py.MPI.c:27957
#7  0x000000000052b141 in method_vectorcall_VARARGS_KEYWORDS (func=<optimized out>, args=0x1a406b8, nargsf=<optimized out>, kwnames=0x0) at ../Objects/descrobject.c:346
#8  0x00000000005127fb in _PyObject_VectorcallTstate (kwnames=0x0, nargsf=<optimized out>, args=0x1a406b8, callable=<method_descriptor at remote 0x7ffff3f05450>, tstate=0x968fb0)
    at ../Include/cpython/abstract.h:118
#9  PyObject_Vectorcall (kwnames=0x0, nargsf=<optimized out>, args=0x1a406b8, callable=<method_descriptor at remote 0x7ffff3f05450>) at ../Include/cpython/abstract.h:127
#10 call_function (kwnames=0x0, oparg=<optimized out>, pp_stack=<synthetic pointer>, tstate=0x968fb0) at ../Python/ceval.c:5075
#11 _PyEval_EvalFrameDefault (tstate=<optimized out>, f=<optimized out>, throwflag=<optimized out>) at ../Python/ceval.c:3504
#12 0x00000000005291c3 in _PyEval_EvalFrame (throwflag=0, 
    f=Frame 0x1a404e0, for file /projects/python/build/mpi4py/test/test_io.py, line 306, in testIReadIWriteAll (self=<TestIOSelf(_testMethodName='testIReadIWriteAll', _outcome=<_Outcome(expecting_failure=False, result=<TestCaseFunction(keywords=<NodeKeywords(node=<...>, parent=<UnitTestCase(keywords=<NodeKeywords(node=<...>, parent=<Module(keywords=<NodeKeywords(node=<...>, parent=<Session(keywords=<NodeKeywords(node=<...>, parent=None, _markers={'mpi4py': True}) at remote 0x7fffb6a28b80>, own_markers=[], extra_keyword_matches=set(), testsfailed=1, testscollected=28, shouldstop=False, shouldfail=False, trace=<TagTracerSub(root=<TagTracer(_tags2proc={}, _writer=None, indent=0) at remote 0x7ffff7865be0>, tags=('collection',)) at remote 0x7fffb6a28a30>, startdir=<LocalPath(strpath='/projects/python/build/mpi4py/test') at remote 0x7fffb6a28ac0>, _initialpaths=frozenset({<LocalPath(strpath='/projects/python/build/mpi4py/test/test_io.py') at remote 0x7fffb69e7a60>}), _bestrelpathcache=<_best...(truncated), 
    tstate=0x968fb0) at ../Include/internal/pycore_ceval.h:40

Debian unstable has just upgraded to libc6 2.32, though I'm not sure that's relevant.

@drew-parsons

for the missing libmca_common_ompio problem, the issue seems to be in your scripts

The missing file also shows up in the amd64 log, https://buildd.debian.org/status/fetch.php?pkg=mpi4py&arch=amd64&ver=3.1.1-8&stamp=1632748520&raw=0, so it sounds like fixing the openmpi debian/rules as you suggested will get mpi4py passing its tests again.

@jsquyres
Member

@amckinstry Should we close this issue here on the Open MPI side, since it looks like the issue is in the debian packaging?

@ggouaillardet
Contributor

Thanks @drew-parsons for the backtrace!
I was indeed testing incorrectly ...

Here is what happens:

  • because of a Debian packaging error, io/romio321 is selected instead of the default io/ompio
  • there is indeed a crash in io/romio321 because ROMIO asynchronous collectives have not been "ported" to Open MPI
  • this is not a new bug; it was likely exposed by the packaging issue and/or the new mpi4py tests

Going forward:

  • fixing the Debian packaging issue will have Open MPI use the io/ompio component (unless running on Lustre?) and hence hide the issue
  • I currently do not have the bandwidth to implement the missing bits in the v4.1.x series
  • I will check the status in the v5.0.x series

So the bug report is legit and the root cause is not in the Debian packaging.

@amckinstry
Author

amckinstry commented Sep 28, 2021

There is a bug in the Debian packaging (which I'm currently fixing in the next upload).
I will leave this bug report open for the additional problem @ggouaillardet points out.

I'm happy to see this bug merged with another or closed.

@jsquyres
Member

@ggouaillardet With the merge of #8371, is this issue now resolved?

@roblatham00
Contributor

Huh. ROMIO is supposed to set some flags at configure time to not use the grequest extensions. It should be a one- or two-line fix to ROMIO's "built as part of Open MPI" case. I will look more closely at this in the morning.

@ggouaillardet
Contributor

Thanks @roblatham00

What is the expected behavior if the request extensions are not available/used?
Does ROMIO still implement non-blocking collectives?

If not, this is more of an Open MPI integration issue: we still have to implement the "frontend" (e.g. MPI_File_iwrite_all()) but should do a better job at supporting a backend (e.g. ROMIO) that does not implement/support such primitives.
I looked at it a few years ago, so maybe things have changed quite a lot in that area!
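
For illustration, a guard in the frontend along these lines would turn the NULL-pointer jump into a proper MPI error. This is only a minimal sketch; the names io_module and file_iwrite_all are hypothetical and not the actual Open MPI io/romio321 sources:

#include <stddef.h>
#include <mpi.h>

/* Hypothetical function-pointer type for the backend's non-blocking
 * collective write; it mirrors the MPI_File_iwrite_all() signature. */
typedef int (*iwrite_all_fn)(MPI_File fh, const void *buf, int count,
                             MPI_Datatype datatype, MPI_Request *request);

/* Hypothetical per-component module: the entry point is left NULL when
 * the backend (e.g. ROMIO built without the Grequest extensions) does
 * not provide the primitive. */
struct io_module {
    iwrite_all_fn iwrite_all;
};

static int file_iwrite_all(struct io_module *module, MPI_File fh,
                           const void *buf, int count,
                           MPI_Datatype datatype, MPI_Request *request)
{
    if (NULL == module->iwrite_all) {
        /* Report "not supported" instead of jumping to address 0x0. */
        return MPI_ERR_UNSUPPORTED_OPERATION;
    }
    return module->iwrite_all(fh, buf, count, datatype, request);
}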

@roblatham00
Contributor

roblatham00 commented Oct 12, 2021 via email

@ggouaillardet
Contributor

@roblatham00 I did a quick check and here is what I found/remember.

MPI_File_iwrite_all() ends up calling MPIOI_File_iwrite_all() and then ADIO_IwriteStridedColl(),
which is a macro for fd->fns->ADIOI_xxx_IwriteStridedColl.

My understanding is that this is expected to be (a pointer to) ADIOI_GEN_IwriteStridedColl(), which is implemented in src/mpi/romio/adio/common/ad_iwrite_coll.c.

From the Open MPI point of view, that won't work (or even compile), because this file uses MPIX_Grequest_class_allocate(), which is not (yet?) implemented in Open MPI's glue for ROMIO.

In order to move forward, I did two things:

+#ifdef HAVE_MPI_GREQUEST_EXTENSIONS
 void ADIOI_GEN_IwriteStridedColl(ADIO_File fd, const void *buf, int count,
                                  MPI_Datatype datatype, int file_ptr_type,
                                  ADIO_Offset offset, MPI_Request * request, int *error_code);
+#else
+#define ADIOI_GEN_IwriteStridedColl NULL
+#endif

and added some #ifdef HAVE_MPI_GREQUEST_EXTENSIONS around the uses of the MPICH Grequest extensions
(this became dead code anyway; a better option would have been to conditionally compile these files).

So when I did the integration, I concluded ROMIO could not be compiled without the Grequest extensions.
I chose the fastest path and left non-blocking collectives unimplemented in the ROMIO module for Open MPI.
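
The net effect can be sketched as follows (illustrative only; the struct and field names are simplified stand-ins for ADIOI_Fns_struct, not the real ROMIO sources). The call through the NULL table entry is the "0x0000000000000000 in ?? ()" frame seen in the backtraces above:

typedef void (*IwriteStridedColl_fn)(void *fd, const void *buf, int count);

/* Simplified stand-in for the per-driver function table (ADIOI_Fns_struct). */
struct adio_fns {
    IwriteStridedColl_fn IwriteStridedColl;
};

#ifndef HAVE_MPI_GREQUEST_EXTENSIONS
/* The workaround from the diff above: without the Grequest extensions the
 * generic implementation cannot be built, so the symbol degenerates to NULL. */
#define ADIOI_GEN_IwriteStridedColl NULL
#endif

static struct adio_fns generic_fns = {
    ADIOI_GEN_IwriteStridedColl,   /* NULL in the Open MPI build */
};

static void do_iwrite_all(struct adio_fns *fns, void *fd, const void *buf, int count)
{
    /* No NULL check here, so this is a call through a NULL function pointer. */
    fns->IwriteStridedColl(fd, buf, count);
}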

Your previous reply suggests this should only be a compilation issue, and I will investigate that.

@roblatham00
Contributor

roblatham00 commented Oct 12, 2021 via email

@bwbarrett modified the milestones: v4.1.2, v4.1.3 (Nov 24, 2021)
@bwbarrett modified the milestones: v4.1.3, v4.1.4 (Mar 31, 2022)
@drew-parsons

We're seeing a new segfault (new with OpenMPI 4.1.3) in the mpi4py test test_io.TestIOSelf on amd64 (Debian builds). Debian CI test log at https://ci.debian.net/data/autopkgtest/testing/amd64/m/mpi4py/20603810/log.gz

It looks the same as a segfault already experienced with earlier OpenMPI versions on i386, discussed at mpi4py/mpi4py#105. We verified that mpich passes the same mpi4py tests, so it seems to be a problem with OpenMPI I/O.

A backtrace with OpenMPI 4.1.3 indicates the problem is in romio321, so I'm wondering if it's essentially the same as the problem reported here. Valgrind output from mpi4py tests:

$ OMPI_MCA_io=romio321 valgrind -q python test/main.py -q -i test_io -k TestIOSelf.testIReadIWriteAll -v
[0@kw61149] Python 3.10.4 (/usr/bin/python)
[0@kw61149] numpy 1.21.5 (/usr/lib64/python3.10/site-packages/numpy)
[0@kw61149] MPI 3.1 (Open MPI 4.1.3)
[0@kw61149] mpi4py 4.0.0.dev0 (/home/dalcinl/Devel/mpi4py/build/lib.linux-x86_64-3.10/mpi4py)
testIReadIWriteAll (test_io.TestIOSelf) ... ==3623039== Jump to the invalid address stated on the next line
==3623039==    at 0x0: ???
==3623039==    by 0x1DA8E8B3: mca_io_romio_dist_MPI_File_iwrite_all (iwrite_all.c:58)
==3623039==    by 0x1DA8BAF5: mca_io_romio321_file_iwrite_all (io_romio321_file_write.c:204)
==3623039==    by 0x1600D5D5: PMPI_File_iwrite_all (pfile_iwrite_all.c:83)
==3623039==    by 0x15DBACA9: PyMPI_File_iwrite_all_c (largecnt.h:2377)
==3623039==    by 0x15ECAA01: __pyx_pf_6mpi4py_3MPI_4File_62Iwrite_all (MPI.c:171765)
==3623039==    by 0x15ECA843: __pyx_pw_6mpi4py_3MPI_4File_63Iwrite_all (MPI.c:171700)
==3623039==    by 0x498A50F: method_vectorcall_VARARGS_KEYWORDS (descrobject.c:344)
==3623039==    by 0x497CBA2: UnknownInlinedFun (abstract.h:114)
==3623039==    by 0x497CBA2: UnknownInlinedFun (abstract.h:123)
==3623039==    by 0x497CBA2: UnknownInlinedFun (ceval.c:5867)
==3623039==    by 0x497CBA2: _PyEval_EvalFrameDefault (ceval.c:4198)
==3623039==    by 0x497B5FF: UnknownInlinedFun (pycore_ceval.h:46)
==3623039==    by 0x497B5FF: _PyEval_Vector (ceval.c:5065)
==3623039==    by 0x49918F7: UnknownInlinedFun (call.c:342)
==3623039==    by 0x49918F7: UnknownInlinedFun (abstract.h:114)
==3623039==    by 0x49918F7: method_vectorcall (classobject.c:53)
==3623039==    by 0x497C7C6: UnknownInlinedFun (abstract.h:114)
==3623039==    by 0x497C7C6: UnknownInlinedFun (abstract.h:123)
==3623039==    by 0x497C7C6: UnknownInlinedFun (ceval.c:5867)
==3623039==    by 0x497C7C6: _PyEval_EvalFrameDefault (ceval.c:4213)
==3623039==  Address 0x0 is not stack'd, malloc'd or (recently) free'd
==3623039== 
[kw61149:3623039] *** Process received signal ***
[kw61149:3623039] Signal: Segmentation fault (11)
[kw61149:3623039] Signal code: Invalid permissions (2)
[kw61149:3623039] Failing at address: (nil)
[kw61149:3623039] [ 0] /lib64/libc.so.6(+0x42750)[0x4bff750]
[kw61149:3623039] *** End of error message ***
Segmentation fault (core dumped)

@ggouaillardet
Contributor

@drew-parsons The root cause is still the same: non-blocking MPI-IO collectives are not implemented in the Open MPI ROMIO component.
That being said, the default component for MPI-IO is now the "native" ompio component regardless of the filesystem (the default used to be to fall back on ROMIO on Lustre filesystems, but this is no longer the case).

Is there any reason why you are explicitly requesting the ROMIO component?

@drew-parsons

Is there any reason why you are explicitly requesting the ROMIO component?

It's a good question; @dalcinl from mpi4py raised it too. @amckinstry might be able to answer.

@drew-parsons

We've figured out why openmpi was using ROMIO instead of ompio in the new Debian build. There was a version bump in libmca_common_ompio.so (and libmca_common_ucx.so), and symlinks were left dangling, which is why openmpi didn't find ompio and therefore fell back to romio321. With that fixed, the mpi4py tests are now passing with openmpi 4.1.3 on amd64; i386 continues to fail the MPI-IO tests.

@bwbarrett modified the milestones: v4.1.4, v4.1.5 (May 25, 2022)
@bwbarrett modified the milestones: v4.1.5, v4.1.6 (Feb 23, 2023)
@bwbarrett modified the milestones: v4.1.6, v4.1.7 (Sep 30, 2023)