NetBSD: Runtime error using Open MPI 2.1.0rc2 #3117

Closed
hppritcha opened this issue Mar 6, 2017 · 21 comments

@hppritcha
Member

@PHHargrove reports hitting a runtime error (running ring_c.c) on NetBSD when using the Open MPI 2.1.0rc2 tarball. The error is in PMIx.

See https://www.mail-archive.com/devel@lists.open-mpi.org//msg19987.html

@hppritcha hppritcha added the bug label Mar 6, 2017
@hppritcha hppritcha added this to the v2.1.0 milestone Mar 6, 2017
@rhc54
Contributor

rhc54 commented Mar 7, 2017

Here is the error report:

2.1.0rc2 tarball on NetBSD7/amd64.
Configured with only --prefix=... and --disable-mpi-fortran

Getting past the lack of a struct timeval definition required the small source change described in a previous email.
Once past that, I can build Open MPI and compile the examples.
However, I cannot run them.

Output below.

-Paul

$ mpirun -mca btl sm,self -np 2 examples/ring_c
[netbsd-amd64.kvm:20873] PMIX ERROR: ERROR in file /home/phargrov/OMPI/openmpi-2.1.0rc2-netbsd7-amd64/openmpi-2.1.0rc2/opal/mca/pmix/pmix112/pmix/src/dstore/pmix_esh.c at line 1651
[netbsd-amd64.kvm:20873] PMIX ERROR: OUT-OF-RESOURCE in file /home/phargrov/OMPI/openmpi-2.1.0rc2-netbsd7-amd64/openmpi-2.1.0rc2/opal/mca/pmix/pmix112/pmix/src/dstore/pmix_esh.c at line 820
[netbsd-amd64.kvm:20873] PMIX ERROR: ERROR in file /home/phargrov/OMPI/openmpi-2.1.0rc2-netbsd7-amd64/openmpi-2.1.0rc2/opal/mca/pmix/pmix112/pmix/src/dstore/pmix_esh.c at line 1468
[netbsd-amd64.kvm:20873] PMIX ERROR: ERROR in file /home/phargrov/OMPI/openmpi-2.1.0rc2-netbsd7-amd64/openmpi-2.1.0rc2/opal/mca/pmix/pmix112/pmix/src/server/pmix_server.c at line 592
[netbsd-amd64.kvm:16632] PMIX ERROR: ERROR in file /home/phargrov/OMPI/openmpi-2.1.0rc2-netbsd7-amd64/openmpi-2.1.0rc2/opal/mca/pmix/pmix112/pmix/src/dstore/pmix_esh.c at line 349
[netbsd-amd64.kvm:16632] PMIX ERROR: ERROR in file /home/phargrov/OMPI/openmpi-2.1.0rc2-netbsd7-amd64/openmpi-2.1.0rc2/opal/mca/pmix/pmix112/pmix/src/dstore/pmix_esh.c at line 839
[netbsd-amd64.kvm:16632] PMIX ERROR: ERROR in file /home/phargrov/OMPI/openmpi-2.1.0rc2-netbsd7-amd64/openmpi-2.1.0rc2/opal/mca/pmix/pmix112/pmix/src/dstore/pmix_esh.c at line 1021
--------------------------------------------------------------------------
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

  ompi_mpi_init: ompi_rte_init failed
  --> Returned "(null)" (-43) instead of "Success" (0)
--------------------------------------------------------------------------
*** An error occurred in MPI_Init
*** on a NULL communicator
[netbsd-amd64.kvm:19230] PMIX ERROR: ERROR in file /home/phargrov/OMPI/openmpi-2.1.0rc2-netbsd7-amd64/openmpi-2.1.0rc2/opal/mca/pmix/pmix112/pmix/src/dstore/pmix_esh.c at line 349
[netbsd-amd64.kvm:19230] PMIX ERROR: ERROR in file /home/phargrov/OMPI/openmpi-2.1.0rc2-netbsd7-amd64/openmpi-2.1.0rc2/opal/mca/pmix/pmix112/pmix/src/dstore/pmix_esh.c at line 839
[netbsd-amd64.kvm:19230] PMIX ERROR: ERROR in file /home/phargrov/OMPI/openmpi-2.1.0rc2-netbsd7-amd64/openmpi-2.1.0rc2/opal/mca/pmix/pmix112/pmix/src/dstore/pmix_esh.c at line 1021
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

  ompi_mpi_init: ompi_rte_init failed
  --> Returned "(null)" (-43) instead of "Success" (0)
--------------------------------------------------------------------------
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[netbsd-amd64.kvm:19230] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
[netbsd-amd64.kvm:16632] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
-------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code.. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[31721,1],0]
  Exit code:    1
--------------------------------------------------------------------------

@artpol84 Can you please take a look?

@artpol84
Contributor

artpol84 commented Mar 7, 2017

The only place that can cause this error is posix_fallocate.
https://github.com/open-mpi/ompi/blob/v2.x/opal/mca/pmix/pmix112/pmix/src/sm/pmix_mmap.c#L69

@PHHargrove can you check whether the /tmp directory has any leftover files?
Are there any restrictions on that directory? For example, is it a ramdisk with a fixed size?

@PHHargrove
Member

Thanks, Artem.
Indeed /tmp is a ram disk that is too small.
Setting TMPDIR resolved the issue.

I would suggest that in the future a better error message would be helpful, but is not critical.
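
A better error message for the "temporary directory is too small" case could report how much space is actually available. The following is a minimal sketch of such a check using the POSIX statvfs() call; the helper name available_bytes is hypothetical and is not part of PMIx or Open MPI.

```c
#include <stdint.h>
#include <sys/statvfs.h>

/* Hypothetical helper (not PMIx code): store the number of bytes available
 * to unprivileged users on the filesystem that holds `dir`.
 * Returns 0 on success, -1 if statvfs() fails (errno is set by statvfs). */
static int available_bytes(const char *dir, uint64_t *out)
{
    struct statvfs vfs;

    if (statvfs(dir, &vfs) != 0) {
        return -1;
    }
    /* f_bavail counts free blocks in units of f_frsize bytes. */
    *out = (uint64_t)vfs.f_bavail * (uint64_t)vfs.f_frsize;
    return 0;
}
```

A caller could compare the result against the size of the shared-memory segment it is about to create and name the directory in the message if the space is insufficient.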

@PHHargrove
Member

> Setting TMPDIR resolved the issue.

I spoke too soon.
Setting TMPDIR=$HOME (on a filesystem with 24GB of free space) did not help.

@PHHargrove
Member

I see a part of the problem: posix_fallocate does not set errno on failure.

Here is a relevant portion of the posix_fallocate manpage on NetBSD:

RETURN VALUES
     If successful, the posix_fallocate() function will return zero.
     Otherwise an error number will be returned, without setting errno.

And Linux says much the same:

RETURN VALUE
       posix_fallocate()  returns  zero on success, or an error number on failure.  Note that errno
       is not set.
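
Both manpages imply that the error must be decoded from the return value, not from errno. A minimal sketch of that pattern follows; the helper name reserve_space is hypothetical and this is not the actual PMIx code.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>

/* Hypothetical helper (not PMIx code): try to reserve `size` bytes for `fd`.
 * Returns 0 on success, otherwise the error number returned by
 * posix_fallocate(). errno is never consulted. */
static int reserve_space(int fd, off_t size)
{
    int rc = posix_fallocate(fd, 0, size);

    if (rc != 0) {
        /* Decode the return value itself; errno is not set by this call. */
        fprintf(stderr, "posix_fallocate failed: %s (%d)\n", strerror(rc), rc);
    }
    return rc;
}
```

Code that prints strerror(errno) after a failed posix_fallocate() may report a stale or unrelated error, which could explain the uninformative messages in the log above.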

@artpol84
Contributor

artpol84 commented Mar 7, 2017

But still it fails for some reason.
Can you print what posix_fallocate returns?

@PHHargrove
Member

> But still it fails for some reason.
> Can you print what posix_fallocate returns?

I am working on it, but I am traveling this week (sitting in the middle of a large meeting).

@PHHargrove
Member

> > But still it fails for some reason.
> > Can you print what posix_fallocate returns?
>
> I am working on it, but I am traveling this week (sitting in the middle of a large meeting).

The return value is EOPNOTSUPP, which POSIX documents as "Operation not supported on socket". This makes no sense to me.
It is possible that ENOTSUP (Operation not supported) was intended.

Perhaps the configure test for HAVE_POSIX_FALLOCATE should verify that the call actually succeeds.
Alternatively (likely simpler), the PMIx code could ignore ENOTSUP and EOPNOTSUPP return codes (not errno). This would result in behavior indistinguishable from the case when HAVE_POSIX_FALLOCATE is not defined.
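
The second alternative could look roughly like the sketch below. It is an illustration only, not the patch that was eventually applied; the helper name size_segment is hypothetical.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper (not PMIx code): size the file behind `fd` to `size`
 * bytes. Returns 0 on success or an error number on failure. */
static int size_segment(int fd, off_t size)
{
    int rc = posix_fallocate(fd, 0, size);

    if (rc == 0) {
        return 0;
    }
    /* Compare the return value, not errno. An if-chain is used instead of a
     * switch because ENOTSUP and EOPNOTSUPP have the same value on some
     * systems (Linux/glibc, for example), which would produce duplicate
     * case labels. */
    if (rc != ENOTSUP && rc != EOPNOTSUPP) {
        return rc;
    }
    /* "Not supported": fall back to what the code does when
     * HAVE_POSIX_FALLOCATE is not defined. */
    if (ftruncate(fd, size) != 0) {
        return errno;
    }
    return 0;
}
```

Any other error, such as ENOSPC on a full ram disk, is still reported to the caller.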

@PHHargrove
Member

To completely discharge my role as the "portability police", I want to note that when HAVE_POSIX_FALLOCATE is not defined, the code uses ftruncate() instead.
However, this is a non-portable use of that function.
Here is what the NetBSD manpage says on the subject:

STANDARDS
     Use of truncate() to extend a file is an IEEE Std 1003.1-2004
     (``POSIX.1'') extension, and is thus not portable.  Files can be extended
     in a portable way seeking (using lseek(2)) to the required size and
     writing a single character with write(2).

The Mac OS X, FreeBSD and OpenBSD manpages say something similar in their "BUGS" section:

Use of truncate() to extend a file is not portable.

The Linux manpage provides the most useful information on the subject:

       [...] the
       POSIX standard allows two behaviors for ftruncate() when  length  exceeds  the  file  length
       (note  that  truncate() is not specified at all in such an environment): either returning an
       error, or extending the file.

So, note that "returning an error" is a POSIX-compliant behavior when ftruncate() is used to extend a file.

However, in reality this is likely not an issue for 2 reasons.

  1. I have yet to see any system on which ftruncate() failed to extend except for FAT/VFAT filesystems on Linux (which is not a suitable $TMPDIR anyway).
  2. The current POSIX.1 (IEEE Std 1003.1-2008, 2016 Edition) no longer lists extension as optional: "If the file previously was smaller than this size, ftruncate() shall increase the size of the file."
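
For reference, the portable approach described in the NetBSD manpage (seek to the required size and write a single byte) can be sketched as follows. The helper name create_sized_file is hypothetical and this is not the PMIx code.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper (not PMIx code): create (or truncate) `path` and
 * extend it to `size` bytes without relying on ftruncate().
 * Returns an open file descriptor, or -1 with errno set. */
static int create_sized_file(const char *path, off_t size)
{
    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);

    if (fd < 0) {
        return -1;
    }
    if (size > 0) {
        /* Seek to the last byte of the desired size and write one NUL byte
         * there; the filesystem fills the gap with zeros. */
        if (lseek(fd, size - 1, SEEK_SET) == (off_t)-1 ||
            write(fd, "", 1) != 1) {
            int saved = errno;
            close(fd);
            unlink(path);
            errno = saved;
            return -1;
        }
    }
    return fd;
}
```

A file extended this way is typically sparse, so unlike posix_fallocate() it may not reserve disk blocks up front; running out of space could then surface later, when the mapped pages are first written.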

@rhc54
Contributor

rhc54 commented Mar 7, 2017

So it sounds like we need to do one or two things, and perhaps a third:

  1. update the check for HAVE_POSIX_FALLOCATE to test functionality as well as presence
  2. check the return code instead of errno when calling it - we should do this regardless, though it may also serve as the only required action
  3. perhaps look at using lseek to extend the file instead of ftruncate, though as you say, ftruncate extending the file may no longer be optional anyway. I'd put this at low priority
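
Item 1 amounts to running posix_fallocate() on a scratch file and checking the result. Below is a minimal sketch of the probe such a configure-time run test could wrap; the function name posix_fallocate_works and the probe file name are hypothetical.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Hypothetical probe (not the actual configure test): returns 1 if
 * posix_fallocate() succeeds on a scratch file created in `dir`, else 0. */
static int posix_fallocate_works(const char *dir)
{
    char path[1024];
    int n = snprintf(path, sizeof(path), "%s/fallocate-probe-XXXXXX", dir);

    if (n < 0 || (size_t)n >= sizeof(path)) {
        return 0;               /* directory name too long for the buffer */
    }
    int fd = mkstemp(path);     /* replaces XXXXXX with a unique suffix */
    if (fd < 0) {
        return 0;
    }
    int rc = posix_fallocate(fd, 0, 4096);  /* error comes back in rc */
    close(fd);
    unlink(path);
    return rc == 0;
}
```

One caveat with this design: the result may depend on the filesystem that holds the probe file, so a check run in the build directory does not necessarily predict the behavior in the $TMPDIR used at run time. That is one reason checking the return code at run time (item 2) may be sufficient on its own.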

@PHHargrove
Member

@rhc54 I agree on all three points above, including placing low priority on allowing for failure of ftruncate().

I agree that, with the proper checks, item 2 alone should be sufficient. There is a 50%+ chance that a tested-and-signed-off patch implementing item 2 will appear here in the next two hours.

@PHHargrove
Member

Proposed patch which has been tested (via ring_c) on the NetBSD/amd64 system where the problem was first observed.

@artpol84
Contributor

artpol84 commented Mar 7, 2017

Thanks Paul, can you create a PR against PMIx master, or do you want me to do that?

@PHHargrove
Member

@artpol84 I would really appreciate it if you could handle the PR.
I am at (or over?) the limit of what I can do while still keeping up with the meeting I am sitting in.

@artpol84
Contributor

artpol84 commented Mar 7, 2017

Thanks, Paul. We will take care of it.

@artpol84
Contributor

artpol84 commented Mar 8, 2017

@karasevb please create a PR

@hppritcha
Member Author

@jsquyres I vote for moving this to 2.1.1

@PHHargrove
Member

I only test on the affected system(s) - I don't use them.
So waiting for 2.1.1 is not a problem for me.

@jsquyres jsquyres modified the milestones: v2.1.1, v2.1.0 Mar 9, 2017
@jsquyres
Member

jsquyres commented Mar 9, 2017

@hppritcha I just moved the milestone, but then I looked closer / remembered what this one was -- wasn't it fixed in #3130?

@rhc54
Contributor

rhc54 commented Mar 9, 2017

yes, it was, so closing this one

@rhc54 rhc54 closed this as completed Mar 9, 2017
@guziy

guziy commented Mar 3, 2020

Thanks @artpol84 this saved me a lot of time and frustration. Redefined TMPDIR and it works fine.

Cheers
