Segfault in mca_io_romio321.so #9432
The romio segfault looks legit, but there is another issue here, namely why romio was used instead of ompio:
[ppc64el-osuosl-01:1381766] mca_base_component_repository_open: unable to open mca_io_ompio: libmca_common_ompio.so.41: cannot open shared object file: No such file or directory (ignored)
Not sure whether to open another ticket for this or not. Will try to reproduce.
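The log line above suggests that a component which fails to open is ignored and selection continues among the rest. A minimal Python model of that fallback, assuming ompio would normally win on priority; the `Component` class, the `select_io_component` helper, and the priority numbers are hypothetical and are not Open MPI code:

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Component:
    name: str
    priority: int                 # hypothetical value, for illustration only
    try_open: Callable[[], bool]  # False models a failed shared-object load


def select_io_component(components: List[Component]) -> Optional[Component]:
    """Skip components that fail to open, then pick the highest priority."""
    usable = []
    for comp in components:
        if comp.try_open():
            usable.append(comp)
        else:
            print(f"unable to open {comp.name} (ignored)")
    if not usable:
        return None
    return max(usable, key=lambda comp: comp.priority)


# ompio cannot be loaded (missing dependency), so romio321 is selected.
candidates = [
    Component("mca_io_ompio", priority=30, try_open=lambda: False),
    Component("mca_io_romio321", priority=10, try_open=lambda: True),
]
selected = select_io_component(candidates)
```

In this model the failure is silent apart from the "(ignored)" message, which is why the switch to romio is easy to miss.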
At least for the missing libmca_common_ompio problem, the issue seems to be in your scripts:
if test -f ---snip---
The file generated on my system is, however, libmca_common_ompio.so.41.29.2
@amckinstry Thanks for the report! I am unable to reproduce the issue on my CentOS 7 Is this issue specific to FWIW
|
No, it's affecting nearly all architectures; see https://buildd.debian.org/status/package.php?p=mpi4py There are some more details at https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=995150
Debian unstable has just upgraded to libc6 2.32, though I'm not sure that that's relevant. |
The missing file also shows up in the amd64 log, https://buildd.debian.org/status/fetch.php?pkg=mpi4py&arch=amd64&ver=3.1.1-8&stamp=1632748520&raw=0 , so it sounds like fixing the openmpi debian/rules as you suggested will get mpi4py passing tests again.
@amckinstry Should we close this issue here on the Open MPI side, since it looks like the issue is in the debian packaging? |
Thanks @drew-parsons for the backtrace! Here is what happens:
From now:
So the bug report is legit and the root cause is not in the Debian packaging. |
There is a bug in the Debian packaging (which I'm currently fixing in the next upload). I'm happy to see this bug merged with another or closed. |
@ggouaillardet with the merge of #8371 is this issue now resolved? |
Huh. ROMIO is supposed to set some flags at configure time to not use the grequest extensions. It should be a one- or two-line fix to ROMIO's "built as part of openmpi" case. I will look more closely at this in the morning.
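Thanks @roblatham00. What is the expected behavior if request extensions are not available/used? Does ROMIO still implement non-blocking collectives? If not, this is more of an Open MPI integration issue: we still have to implement the "frontend" (e.g. MPI_File_iwrite_all()) but should do a better job at supporting a backend (e.g. ROMIO) that does not implement/support such primitives. I looked at it a few years ago, so maybe things have changed quite a lot in that area!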
ROMIO will still implement the immediate routines, except that all the work happens when the operation is posted. Test/wait return immediately with a completed grequest object.
…On Mon, Oct 11, 2021, 20:40 Gilles Gouaillardet ***@***.***> wrote:
Thanks @roblatham00 <https://github.com/roblatham00>
What is the expected behavior if request extensions are not available/used?
Does ROMIO still implement non blocking collectives?
If not, this is more of an Open MPI integration issue: we still have to
implement the "frontend" (e.g. MPI_File_iwrite_all()) but should do a
better job at supporting a backend (e.g. ROMIO) that does not
implement/support such primitives.
I looked at it a few years ago, so maybe things have changed quite a lot
in that area!
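The behavior described in the reply above (all work done when the operation is posted, with test/wait returning immediately) can be pictured with a small Python model. This is only a sketch of the idea; `CompletedRequest` and `iwrite_eager` are hypothetical names and do not correspond to ROMIO or MPI symbols:

```python
class CompletedRequest:
    """Hypothetical stand-in for a request that is complete when returned."""

    def __init__(self, result: int) -> None:
        self._result = result

    def test(self):
        # Always reports completion, because the work already happened.
        return True, self._result

    def wait(self) -> int:
        # Nothing to wait for; just hand back the stored result.
        return self._result


def iwrite_eager(target: bytearray, data: bytes) -> CompletedRequest:
    """'Non-blocking' write that does all the work at post time."""
    target.extend(data)                  # the whole write happens here
    return CompletedRequest(len(data))   # request is born completed


storage = bytearray()
request = iwrite_eager(storage, b"abc")
done, nbytes = request.test()
```

The caller still sees the non-blocking interface, so no overlap of I/O and computation is gained, but the call sequence remains valid.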
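@roblatham00 I did a quick check and here is what I found/remember.
MPI_File_iwrite_all() ends up calling MPIOI_File_iwrite_all() and then ADIO_IwriteStridedColl(), which is a macro for fd->fns->ADIOI_xxx_IwriteStridedColl.
In my understanding, the expectation is that this should be (a pointer to) ADIOI_GEN_IwriteStridedColl(), which is implemented in src/mpi/romio/adio/common/ad_iwrite_coll.c. From the Open MPI point of view, that won't work (or even compile) because this file uses MPIX_Grequest_class_allocate(), which is not (yet?) implemented in Open MPI (glue for ROMIO).
In order to move forward, I did two things: I declared ADIOI_GEN_IwriteStridedColl() only under #ifdef HAVE_MPI_GREQUEST_EXTENSIONS and defined it as NULL otherwise, and I added some #ifdef HAVE_MPI_GREQUEST_EXTENSIONS around the uses of the MPICH Grequest extensions (this became dead code anyway; a better option would have been to conditionally compile these files).
So when I did the integration, I concluded ROMIO could not be compiled without the Grequest extensions. I chose the fastest way and left non-blocking collectives unimplemented in the ROMIO module for Open MPI. Your previous reply suggests this should only be a compilation issue and I will investigate that.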
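To illustrate why an unimplemented entry in a function table is dangerous, here is a Python model of a dispatch table in which one slot is left empty, plus a guard that reports an error instead of calling through it. In C, calling through a NULL function pointer would presumably crash rather than raise. The table layout, `dispatch` helper, and error code are hypothetical and are not the actual ROMIO or Open MPI code:

```python
from typing import Callable, Dict, Optional

SUCCESS = 0
ERR_UNSUPPORTED_OPERATION = 1  # hypothetical error code for this sketch


def gen_write_strided_coll(data: bytes) -> int:
    """Stand-in for an implemented blocking collective write."""
    return SUCCESS


# One slot is deliberately empty, mirroring an entry defined as NULL.
fns: Dict[str, Optional[Callable[[bytes], int]]] = {
    "WriteStridedColl": gen_write_strided_coll,
    "IwriteStridedColl": None,
}


def dispatch(table: Dict[str, Optional[Callable[[bytes], int]]],
             name: str, data: bytes) -> int:
    """Call the table entry if present, otherwise return an error code."""
    fn = table.get(name)
    if fn is None:
        return ERR_UNSUPPORTED_OPERATION
    return fn(data)


blocking_rc = dispatch(fns, "WriteStridedColl", b"payload")
nonblocking_rc = dispatch(fns, "IwriteStridedColl", b"payload")
```

A guard of this kind would turn the crash into an error the caller can handle, though it does not make the non-blocking collectives work.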
The romio non-blocking code is largely unchanged since I wrote it 15 years
ago. I'm sure we can come up with a better approach today. Will be glad
to make any changes in romio that make open-mpi life easier.
…On Mon, Oct 11, 2021 at 10:10 PM Gilles Gouaillardet < ***@***.***> wrote:
@roblatham00 <https://github.com/roblatham00> I did a quick check and
here is what I found/remember
MPI_File_iwrite_all() ends up calling MPIOI_File_iwrite_all() and then
ADIO_IwriteStridedColl(),
this is a macro for fd->fns->ADIOI_xxx_IwriteStridedColl
In my understanding, the expectation is that should be (a pointer to)
ADIOI_GEN_IwriteStridedColl() that is implemented in
src/mpi/romio/adio/common/ad_iwrite_coll.c.
From the Open MPI point of view, that won't work (or even compile) because
this file uses MPIX_Grequest_class_allocate() which is not (yet?)
implemented in Open MPI (glue for ROMIO).
In order to move forward, I did two things
+#ifdef HAVE_MPI_GREQUEST_EXTENSIONS
void ADIOI_GEN_IwriteStridedColl(ADIO_File fd, const void *buf, int count,
MPI_Datatype datatype, int file_ptr_type,
ADIO_Offset offset, MPI_Request * request, int *error_code);
+#else
+#define ADIOI_GEN_IwriteStridedColl NULL
+#endif
and add some #ifdef HAVE_MPI_GREQUEST_EXTENSIONS around the uses of the
MPICH Grequest extensions
(this became dead code anyway, a better option would have been to
conditionally compile these files)
So when I did the integration, I concluded ROMIO could not be compiled
without the Grequest extensions.
I chose the fastest way and left non blocking collectives unimplemented in
the ROMIO module for Open MPI.
Your previous reply suggests this should only be a compilation issue and I
will investigate that.
We're seeing a new segfault (new with OpenMPI 4.1.3) in the mpi4py test test_io.TestIOSelf on amd64 (Debian builds). The Debian CI test log is at https://ci.debian.net/data/autopkgtest/testing/amd64/m/mpi4py/20603810/log.gz It looks like the same segfault already seen with earlier OpenMPI versions on i386, discussed at mpi4py/mpi4py#105. We tested that mpich passes the same mpi4py tests, so it seems to be a problem with OpenMPI IO. A backtrace with OpenMPI 4.1.3 indicates the problem is in romio321, so I'm wondering if it's essentially the same problem as the one reported here. Valgrind output from mpi4py tests:
@drew-parsons The root cause is still the same: non-blocking MPI-IO collectives are not implemented in the Open MPI ROMIO component. Is there any reason why you are explicitly requesting the ROMIO component?
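For anyone who wants to rule out an accidental component choice, one option is to request the io component explicitly through an MCA environment variable when launching the tests. A small sketch, assuming the usual `OMPI_MCA_<param>` environment form of `--mca <param> <value>`; the `mca_io_env` helper is hypothetical:

```python
import os
from typing import Dict, Optional


def mca_io_env(component: str,
               base: Optional[Dict[str, str]] = None) -> Dict[str, str]:
    """Return a copy of the environment that requests one io component."""
    env = dict(os.environ if base is None else base)
    env["OMPI_MCA_io"] = component  # environment form of "--mca io <component>"
    return env


# The resulting mapping could be passed as env= to subprocess.run() when
# starting the test run; the base mapping here is only an example.
env = mca_io_env("ompio", base={"PATH": "/usr/bin"})
```

If the requested component cannot be opened, the run should then fail visibly instead of quietly using another one.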
It's a good question, @dalcini from mpi4py raised it too. @amckinstry might be able to answer. |
We've figured out why openmpi was using ROMIO instead of ompio in the new Debian build. There was a version bump in libmca_common_ompio.so (and libmca_common_ucx.so). Symlinks were left dangling, which is why openmpi didn't find ompio and therefore fell back to romio321. With that fixed, mpi4py tests are now passing with openmpi 4.1.3 on amd64. i386 continues to fail tests in MPI-IO.
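A packaging check for this kind of breakage could look roughly like the following: list symlinks in a library directory whose targets no longer exist. This is a generic sketch using only the Python standard library; the `find_dangling_symlinks` helper is hypothetical and is not part of the Debian packaging:

```python
from pathlib import Path
from typing import List


def find_dangling_symlinks(directory: str) -> List[Path]:
    """Return symlinks in `directory` whose targets do not exist."""
    dangling = []
    for entry in Path(directory).iterdir():
        # exists() follows the link, so it is False for a broken symlink.
        if entry.is_symlink() and not entry.exists():
            dangling.append(entry)
    return sorted(dangling)
```

Running it over the directory that holds the MCA libraries after a build would flag a link such as libmca_common_ompio.so.41 if it points at a versioned file name that was not actually produced.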
Background information
This is with 4.1.2~rc1 on multiple architectures on Debian unstable (development branch).
It is triggered by the tests in mpi4py, e.g.:
https://buildd.debian.org/status/fetch.php?pkg=mpi4py&arch=ppc64el&ver=3.1.1-8&stamp=1632692000&raw=0
The source as installed is at:
https://sources.debian.org/src/openmpi/4.1.2%7Erc1-2/
For configuration information, see the rules file:
https://sources.debian.org/src/openmpi/4.1.2%7Erc1-2/debian/rules/