NetBSD: Runtime error using Open MPI 2.1.0rc2 #3117

hppritcha · 2017-03-06T23:07:19Z

@PHHargrove reports hitting a runtime error (using ring_c.c) on NetBSD when using the Open MPI 2.1.0rc2 tarball. The error is in PMIx.

See https://www.mail-archive.com/devel@lists.open-mpi.org//msg19987.html

rhc54 · 2017-03-07T03:33:50Z

Here is the error report:

2.1.0rc2 tarball on NetBSD7/amd64.
Configured with only --prefix=... and --disable-mpi-fortran

Getting past the lack of a struct timeval definition required a small source change, described in a previous email.
Once past that, I can build Open MPI and compile the examples.
However, I cannot run them.

Output below.

-Paul

$ mpirun -mca btl sm,self -np 2 examples/ring_c
[netbsd-amd64.kvm:20873] PMIX ERROR: ERROR in file /home/phargrov/OMPI/openmpi-2.1.0rc2-netbsd7-amd64/openmpi-2.1.0rc2/opal/mca/pmix/pmix112/pmix/src/dstore/pmix_esh.c at line 1651
[netbsd-amd64.kvm:20873] PMIX ERROR: OUT-OF-RESOURCE in file /home/phargrov/OMPI/openmpi-2.1.0rc2-netbsd7-amd64/openmpi-2.1.0rc2/opal/mca/pmix/pmix112/pmix/src/dstore/pmix_esh.c at line 820
[netbsd-amd64.kvm:20873] PMIX ERROR: ERROR in file /home/phargrov/OMPI/openmpi-2.1.0rc2-netbsd7-amd64/openmpi-2.1.0rc2/opal/mca/pmix/pmix112/pmix/src/dstore/pmix_esh.c at line 1468
[netbsd-amd64.kvm:20873] PMIX ERROR: ERROR in file /home/phargrov/OMPI/openmpi-2.1.0rc2-netbsd7-amd64/openmpi-2.1.0rc2/opal/mca/pmix/pmix112/pmix/src/server/pmix_server.c at line 592
[netbsd-amd64.kvm:16632] PMIX ERROR: ERROR in file /home/phargrov/OMPI/openmpi-2.1.0rc2-netbsd7-amd64/openmpi-2.1.0rc2/opal/mca/pmix/pmix112/pmix/src/dstore/pmix_esh.c at line 349
[netbsd-amd64.kvm:16632] PMIX ERROR: ERROR in file /home/phargrov/OMPI/openmpi-2.1.0rc2-netbsd7-amd64/openmpi-2.1.0rc2/opal/mca/pmix/pmix112/pmix/src/dstore/pmix_esh.c at line 839
[netbsd-amd64.kvm:16632] PMIX ERROR: ERROR in file /home/phargrov/OMPI/openmpi-2.1.0rc2-netbsd7-amd64/openmpi-2.1.0rc2/opal/mca/pmix/pmix112/pmix/src/dstore/pmix_esh.c at line 1021
--------------------------------------------------------------------------
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

  ompi_mpi_init: ompi_rte_init failed
  --> Returned "(null)" (-43) instead of "Success" (0)
--------------------------------------------------------------------------
*** An error occurred in MPI_Init
*** on a NULL communicator
[netbsd-amd64.kvm:19230] PMIX ERROR: ERROR in file /home/phargrov/OMPI/openmpi-2.1.0rc2-netbsd7-amd64/openmpi-2.1.0rc2/opal/mca/pmix/pmix112/pmix/src/dstore/pmix_esh.c at line 349
[netbsd-amd64.kvm:19230] PMIX ERROR: ERROR in file /home/phargrov/OMPI/openmpi-2.1.0rc2-netbsd7-amd64/openmpi-2.1.0rc2/opal/mca/pmix/pmix112/pmix/src/dstore/pmix_esh.c at line 839
[netbsd-amd64.kvm:19230] PMIX ERROR: ERROR in file /home/phargrov/OMPI/openmpi-2.1.0rc2-netbsd7-amd64/openmpi-2.1.0rc2/opal/mca/pmix/pmix112/pmix/src/dstore/pmix_esh.c at line 1021
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

  ompi_mpi_init: ompi_rte_init failed
  --> Returned "(null)" (-43) instead of "Success" (0)
--------------------------------------------------------------------------
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[netbsd-amd64.kvm:19230] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
[netbsd-amd64.kvm:16632] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
-------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code.. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[31721,1],0]
  Exit code:    1
--------------------------------------------------------------------------

@artpol84 Can you please take a look?

artpol84 · 2017-03-07T15:33:14Z

The only place that can cause this error is posix_fallocate.
https://github.com/open-mpi/ompi/blob/v2.x/opal/mca/pmix/pmix112/pmix/src/sm/pmix_mmap.c#L69

@PHHargrove can you check whether the /tmp directory has any leftover files?
Are there any restrictions on that directory? For example, is it a ramdisk with a fixed size?

PHHargrove · 2017-03-07T15:46:10Z

Thanks, Artem.
Indeed /tmp is a ram disk that is too small.
Setting TMPDIR resolved the issue.

I would suggest that in the future a better error message would be helpful, but is not critical.

PHHargrove · 2017-03-07T15:59:00Z

> Setting TMPDIR resolved the issue.

I spoke too soon.
Setting TMPDIR=$HOME (on a filesystem with 24GB of free space) did not help.

PHHargrove · 2017-03-07T16:04:44Z

I see a part of the problem: posix_fallocate does not set errno on failure.

Here is a relevant portion of the posix_fallocate manpage on NetBSD:

RETURN VALUES
     If successful, the posix_fallocate() function will return zero.
     Otherwise an error number will be returned, without setting errno.

And Linux says much the same:

RETURN VALUE
       posix_fallocate()  returns  zero on success, or an error number on failure.  Note that errno
       is not set.
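
A minimal sketch of what both manpages imply, assuming a hypothetical helper named reserve_space (it is not a function from PMIx or Open MPI): the error must be read from the return value of posix_fallocate(), because errno is left untouched.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>

/* Hypothetical helper: reserve `size` bytes of storage for `fd`.
 * Returns 0 on success, or an errno-style error number on failure. */
static int reserve_space(int fd, off_t size)
{
    assert(size > 0); /* a non-positive length is a caller bug */

    /* posix_fallocate() reports failure through its return value.
     * errno is not set, so perror() or strerror(errno) would be
     * misleading here. */
    int rc = posix_fallocate(fd, 0, size);
    if (rc != 0) {
        fprintf(stderr, "posix_fallocate failed: %s (%d)\n",
                strerror(rc), rc);
    }
    return rc;
}
```

Calling reserve_space() on an invalid descriptor returns EBADF directly, without going through errno.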

artpol84 · 2017-03-07T16:12:32Z

But still it fails for some reason.
Can you print the value that posix_fallocate returns?

PHHargrove · 2017-03-07T16:33:47Z

> But still it fails for some reason.
> Can you print the value that posix_fallocate returns?

I am working on it, but I am traveling this week (sitting in the middle of a large meeting).

PHHargrove · 2017-03-07T17:06:08Z

> But still it fails for some reason.
> Can you print the value that posix_fallocate returns?

> I am working on it, but I am traveling this week (sitting in the middle of a large meeting).

The return value is EOPNOTSUPP, which POSIX documents as "Operation not supported on socket". This makes no sense to me.
It is possible that ENOTSUP (Operation not supported) was intended.

Perhaps the configure test for HAVE_POSIX_FALLOCATE should verify that the call actually succeeds.
Alternatively (likely simpler) the PMIX code could ignore ENOTSUP and EOPNOTSUPP return codes (not errno). This would result in behavior indistinguishable from the case when HAVE_POSIX_FALLOCATE is not defined.
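
The second alternative could look roughly like the sketch below. It is an illustration only, not the patch that was later proposed: size_segment_file is a hypothetical name, and the fallback to ftruncate() is assumed from the description of the HAVE_POSIX_FALLOCATE-undefined path.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: give `fd` a length of at least `size` bytes.
 * Returns 0 on success, or an error number on failure. */
static int size_segment_file(int fd, off_t size)
{
    assert(size > 0); /* a non-positive length is a caller bug */

    /* The result is in the return value, not in errno. */
    int rc = posix_fallocate(fd, 0, size);
    if (rc == 0) {
        return 0;
    }

    /* Some systems define ENOTSUP and EOPNOTSUPP to the same value and
     * some do not, so compare against both. Any other error is real. */
    if (rc != ENOTSUP && rc != EOPNOTSUPP) {
        return rc;
    }

    /* Preallocation is unsupported on this filesystem: behave as if
     * posix_fallocate() were not available at all. ftruncate() does
     * report its error through errno. */
    if (ftruncate(fd, size) != 0) {
        return errno;
    }
    return 0;
}
```

With this shape, a genuine failure such as ENOSPC on a full ramdisk is still reported, while "not supported" quietly takes the fallback path.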

PHHargrove · 2017-03-07T17:42:01Z

To completely discharge my role as the "portability police", I want to note that when HAVE_POSIX_FALLOCATE is not defined, the code uses ftruncate() instead.
However, this is a non-portable use of that function.
Here is what the NetBSD manpage says on the subject:

STANDARDS
     Use of truncate() to extend a file is an IEEE Std 1003.1-2004
     (``POSIX.1'') extension, and is thus not portable.  Files can be extended
     in a portable way seeking (using lseek(2)) to the required size and
     writing a single character with write(2).

The Mac OS X, FreeBSD and OpenBSD manpages say something similar in their "BUGS" section:

Use of truncate() to extend a file is not portable.

The Linux manpage provides the most useful information on the subject:

       [...] the
       POSIX standard allows two behaviors for ftruncate() when  length  exceeds  the  file  length
       (note  that  truncate() is not specified at all in such an environment): either returning an
       error, or extending the file.

So, note that "returning an error" is a POSIX-compliant behavior when ftruncate() is used to extend a file.

However, in reality this is likely not an issue, for two reasons:

1. I have yet to see any system on which ftruncate() failed to extend a file, except for FAT/VFAT filesystems on Linux (which are not a suitable $TMPDIR anyway).
2. The current POSIX.1 (IEEE Std 1003.1-2008, 2016 Edition) no longer lists extension as optional: "If the file previously was smaller than this size, ftruncate() shall increase the size of the file."
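
For reference, the portable extension method described in the NetBSD manpage (seek to the required size with lseek(2) and write a single byte) can be sketched as follows. The names extend_file and extended_size_demo are hypothetical, and the scratch-file location /tmp is an assumption made for the usage example.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: grow `fd` to at least `size` bytes without
 * relying on ftruncate() to extend the file. Returns 0 on success or an
 * error number on failure. Note that it moves the file offset. */
static int extend_file(int fd, off_t size)
{
    assert(size > 0); /* a non-positive size is a caller bug */

    off_t end = lseek(fd, 0, SEEK_END);
    if (end == (off_t)-1) {
        return errno;
    }
    if (end >= size) {
        return 0; /* already large enough, nothing to do */
    }

    /* Seek to the last byte of the desired size and write one zero byte. */
    if (lseek(fd, size - 1, SEEK_SET) == (off_t)-1) {
        return errno;
    }
    ssize_t n;
    do {
        n = write(fd, "", 1); /* "" supplies a single NUL byte */
    } while (n == -1 && errno == EINTR);
    if (n != 1) {
        return (n == -1) ? errno : EIO;
    }
    return 0;
}

/* Usage sketch: create a scratch file in /tmp, extend it, and return the
 * resulting size, or -1 on any failure. */
static off_t extended_size_demo(off_t size)
{
    char path[] = "/tmp/extend_demo_XXXXXX";
    int fd = mkstemp(path);
    if (fd < 0) {
        return (off_t)-1;
    }
    off_t result = (off_t)-1;
    if (extend_file(fd, size) == 0) {
        result = lseek(fd, 0, SEEK_END);
    }
    close(fd);
    unlink(path);
    return result;
}
```

On most filesystems this produces a sparse file, so like ftruncate() it does not guarantee that the blocks are actually reserved.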

rhc54 · 2017-03-07T18:06:21Z

So it sounds like we need to do one or two things, and perhaps a third:

1. Update the check for HAVE_POSIX_FALLOCATE to test functionality as well as presence.
2. Check the return code instead of errno when calling it. We should do this regardless, though it may also serve as the only required action.
3. Perhaps look at using lseek to extend the file instead of ftruncate, though as you say, ftruncate extending the file may no longer be optional anyway. I'd put this at low priority.
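
As a rough illustration of the first item, a functional probe that a configure run test could wrap might look like the sketch below. The function name posix_fallocate_works and the probe size of 4096 bytes are assumptions, not anything taken from the Open MPI or PMIx build system.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Hypothetical probe: returns 1 if posix_fallocate() succeeds on a
 * scratch file created in `dir`, and 0 otherwise (including the case
 * where the scratch file cannot be created). */
static int posix_fallocate_works(const char *dir)
{
    assert(dir != NULL);

    char path[4096];
    int n = snprintf(path, sizeof(path), "%s/fallocate_probe_XXXXXX", dir);
    if (n < 0 || (size_t)n >= sizeof(path)) {
        return 0; /* directory name too long for the buffer */
    }

    int fd = mkstemp(path);
    if (fd < 0) {
        return 0;
    }

    /* The return value carries the result; errno is not consulted. */
    int rc = posix_fallocate(fd, 0, 4096);

    close(fd);
    unlink(path);
    return rc == 0;
}
```

A probe like this only exercises the filesystem it is pointed at when the build runs. The directory used at run time may sit on a different filesystem, which is one reason checking the return code at run time is still needed.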

PHHargrove · 2017-03-07T18:59:45Z

@rhc54 I agree on all three points above, including placing low priority on allowing for failure of ftruncate().

I agree that with the proper checks item 2 alone should be sufficient. There is a 50%+ chance that a tested-and-signed-off patch will appear here in the next two hours that implements item 2.

PHHargrove · 2017-03-07T20:25:35Z

Proposed patch which has been tested (via ring_c) on the NetBSD/amd64 system where the problem was first observed.

artpol84 · 2017-03-07T20:48:12Z

Thanks, Paul. Can you create a PR against PMIx master, or do you want me to do that?

PHHargrove · 2017-03-07T20:54:35Z

@artpol84 I would really appreciate it if you could handle the PR.
I am at (or over?) the limit of what I can do while still keeping up with the meeting I am sitting in.

artpol84 · 2017-03-07T20:56:15Z

Thanks, Paul. We will take care of it.

artpol84 · 2017-03-08T06:31:14Z

@karasevb please create a PR.

hppritcha · 2017-03-09T15:02:10Z

@jsquyres I vote for moving this to 2.1.1

PHHargrove · 2017-03-09T15:04:49Z

I only test on the affected system(s) - I don't use them.
So waiting for 2.1.1 is not a problem for me.

jsquyres · 2017-03-09T15:06:28Z

@hppritcha I just moved the milestone, but then I looked closer / remembered what this one was -- wasn't it fixed in #3130 ?

rhc54 · 2017-03-09T15:28:06Z

yes, it was, so closing this one

guziy · 2020-03-03T15:01:43Z

Thanks @artpol84 this saved me a lot of time and frustration. Redefined TMPDIR and it works fine.

Cheers

hppritcha added the bug label Mar 6, 2017

hppritcha added this to the v2.1.0 milestone Mar 6, 2017

jjhursey assigned artpol84 Mar 7, 2017

rhc54 mentioned this issue Mar 8, 2017

Fix a problem that surfaced on NetBSD/AMD64. #3127

Closed

karasevb mentioned this issue Mar 8, 2017

dstore/sm: added the check posix_fallocate return code openpmix/openpmix#328

Merged

jsquyres modified the milestones: v2.1.1, v2.1.0 Mar 9, 2017

rhc54 closed this as completed Mar 9, 2017
