
Issue #6258 still occurring on 4.0.1 - on OSX, hang when message size is larger than a few MB #6568

Closed
steve-ord opened this issue Apr 4, 2019 · 16 comments

Comments

@steve-ord

steve-ord commented Apr 4, 2019

Thank you for taking the time to submit an issue!

Background information

What version of Open MPI are you using? (e.g., v1.10.3, v2.1.0, git branch name and hash, etc.)

4.0.1 for the release

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

I edited the brew formula to pull 4.0.1 instead of 4.0.0 and built from source. The compiler flags are the defaults from the brew formula for 4.0.0.

Please describe the system on which you are running

  • Operating system/version: OSX Mojave 10.14.3
  • Computer hardware:
  • Network type:

Details of the problem

This is a continuation of ticket #6258, which is now closed; I was advised to open a new one. I still see exactly the same issue. If I drop the size of the messages, the application runs fine.

I replicated the last test, and the magic number for failure is 19145881 bytes. The test code, replicated from #6258, is:

#include <stdio.h>

#include "mpi.h"

static const MPI_Datatype Datatype = MPI_PACKED;
static const int Tag = 42;
static const int RecvProc = 0;
static const int SendProc = 1;

// 19145880 does not hang 19145881  does hang in the Wait()s
#define MessageSize (19145880+1)
static unsigned char data[MessageSize] = {0};

int main(int argc, char *argv[])
{
	MPI_Init(&argc, &argv);
	MPI_Comm comm = MPI_COMM_WORLD;

	int myID = 0;
	int nProcs = 1;
	MPI_Comm_size(comm, &nProcs);
	MPI_Comm_rank(comm, &myID);

	if (nProcs != 2)
	{
		if (myID == 0)
			printf("Must be run on 2 procs\n");
		MPI_Finalize();
		return -1;
	}

	int result = 0;
	if (myID == RecvProc)
	{
		MPI_Status probeStatus;
		result = MPI_Probe(SendProc, MPI_ANY_TAG, comm, &probeStatus);
		printf("[%i] MPI_Probe => %i\n", myID, result);
		int size = 0;
		result = MPI_Get_count(&probeStatus, Datatype, &size);
		printf("[%i] MPI_Get_count => %i, size = %i\n", myID, result, size);

		MPI_Request recvRequest;
		result = MPI_Irecv(data, size, Datatype, SendProc, Tag, comm, &recvRequest);
		printf("[%i] MPI_Irecv(size = %i) => %i\n", myID, size, result);
		MPI_Status recvStatus;
		result = MPI_Wait(&recvRequest, &recvStatus);
		printf("[%i] MPI_Wait => %i\n", myID, result);
	}
	else
	{ // myID == SendProc
		MPI_Request sendRequest;
		result = MPI_Isend(data, MessageSize, Datatype, RecvProc, Tag, comm, &sendRequest);
		printf("[%i] MPI_Isend(size = %i) => %i\n", myID, MessageSize, result);
		MPI_Status sendStatus;
		result = MPI_Wait(&sendRequest, &sendStatus);
		printf("[%i] MPI_Wait => %i\n", myID, result);
	}

	printf("[%i] Done\n", myID);
	MPI_Finalize();
	return 0;
}
@hppritcha
Member

hppritcha commented Apr 10, 2019

#6550 went in to v4.0.x a couple of days ago. Sorry, wrong issue; ignore.

@jsquyres
Member

You should see the fix in the v4.0.x nightly snapshot as of last night (i.e., the April 10 or later snapshot on https://www.open-mpi.org/nightly/v4.0.x/).

@gpaulsen
Member

@hppritcha misspoke about #6550 fixing this. This issue probably persists on the v4.0.x branch and may not be fixed in the nightly snapshot.

@hppritcha
Member

@steve-ord could you test with the 4.0.x nightly tarball (https://www.open-mpi.org/nightly/v4.0.x/)? Your test case passes for me at 97aa434 on v4.0.x.

@jsquyres
Member

I confirm:

  • I see this problem happen on my macOS 10.14.3 laptop with the 4.0.1 release
  • I also see this problem happen on my macOS 10.14.3 laptop with the HEAD of the v4.0.x branch

Per openpmix/openpmix#1210, I can't test Open MPI master at the moment, but I wanted to put on the record that I can reproduce the issue on both v4.0.1 and the HEAD of the v4.0.x branch.

@hppritcha
Member

ufff... picked up an incorrect libmpi. I see the problem now with HEAD of v4.0.x.

@jsquyres
Member

jsquyres commented May 2, 2019

@bosilca mentioned to me verbally the other day that he was able to replicate this issue.

@bosilca
Member

bosilca commented May 2, 2019

The problem appears in mca_pml_ob1_recv_request_progress_rget due to the decision to divide the incoming request into smaller fragments (because of a small btl_get_limit). As an example, vader has a 32k limit for get, so any large message quickly translates into an explosion of fragments on the receiver side. Unfortunately, vader seems to be overwhelmed by such a large number of fragments and stops progressing (not sure if it is really a deadlock, but the fragment rate trickles down to nothing). So this issue is in fact two independent issues:

  1. at the PML level we should limit the injection rate for fragments in the RGET protocol;
  2. in vader we should fix the deadlock when fragments are issued at a fast rate.

I can provide a fix for (1), but I am not familiar enough with vader to wander in there. In any case, a fix at the PML level should be enough to address this issue, but I think we can replicate the vader deadlock with an intensive loop of slightly-larger-than-eager isends (basically an injection-rate test).
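For a rough sense of scale, using only the numbers quoted above (and assuming the 32k get limit means 32768 bytes), the 19145881-byte message from the reproducer splits into roughly 585 get fragments. A minimal sketch of that arithmetic:

#include <stdio.h>

int main(void)
{
	/* Numbers taken from this thread: the reproducer's message size and the
	 * 32k vader get limit mentioned above. This only illustrates the fragment
	 * explosion; the real ob1/vader bookkeeping is more involved. */
	const unsigned long message_size = 19145881UL;
	const unsigned long get_limit = 32UL * 1024UL;

	unsigned long nfragments = (message_size + get_limit - 1) / get_limit; /* ceiling division */
	printf("%lu bytes / %lu-byte get limit => %lu fragments\n",
	       message_size, get_limit, nfragments);
	return 0;
}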

@leopoldcambier

leopoldcambier commented May 3, 2019

Just to add a datapoint, since I believe I encountered this bug this week.

I use Homebrew's latest version on OSX 10.14.4

mpirun --version
mpirun (Open MPI) 4.0.1

I use this elementary code (that I hope is correct...)

#include <vector>
#include <cassert>
#include <cstdlib>
#include <stdio.h>
#include <mpi.h>

using namespace std;

int main(int argc, char** argv) {

    assert(argc == 3);
    int size = atoi(argv[1]);
    int repeat = atoi(argv[2]);
    printf("size = %d, repeat = %d\n", size, repeat);

    int rank;
    int err = MPI_Init(NULL, NULL);
    assert(err == MPI_SUCCESS);    
    err = MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    assert(err == MPI_SUCCESS);
    vector<MPI_Request> rqsts(repeat);
    vector<bool> rqst_OK(repeat, false);
    vector<int> buffer(size);
    int nrqsts_OK = 0;
    int other = (rank + 1)%2;
    if(rank == 0) {
        printf("[%d] Started!\n", rank);
        for(int i = 0; i < repeat; i++) {
            err = MPI_Isend(buffer.data(), size, MPI_INT, other, 0, MPI_COMM_WORLD, &rqsts[i]);
            assert(err == MPI_SUCCESS);
        }
    } else {
        printf("[%d] Started!\n", rank);
        for(int i = 0; i < repeat; i++) {
            err = MPI_Irecv(buffer.data(), size, MPI_INT, other, 0, MPI_COMM_WORLD, &rqsts[i]);
            assert(err == MPI_SUCCESS);
        }
    }
    int flag;
    while(nrqsts_OK != repeat) {
        printf("[%d] nrqsts_OK %d\n", rank, nrqsts_OK);
        for(int i = 0; i < repeat; i++) {
            if(!rqst_OK[i]) {
                err = MPI_Test(&rqsts[i], &flag, MPI_STATUS_IGNORE);
                assert(err == MPI_SUCCESS);
                if(flag) {
                    rqst_OK[i] = true;
                    nrqsts_OK++;
                }
            }
        }
    }
    printf("[%d] OK!\n", rank);
    MPI_Finalize();
}

Basically, rank 0 Isends repeat messages of size ints each, which rank 1 Irecvs, and then both ranks MPI_Test all the requests one by one.

This hangs:
mpirun -n 2 ./mpi_bug 4790000 1

This works:
mpirun -n 2 ./mpi_bug 4780000 1

But these also hang, for instance:
mpirun -n 2 ./mpi_bug 50000 100
mpirun -n 2 ./mpi_bug 50000 10000
In those runs, the last 5 messages never get processed (MPI_Test never succeeds).

It doesn't seem to show up at all for size 5000 or less.
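A quick cross-check against the first reproducer (my arithmetic, assuming a 4-byte MPI_INT): 4780000 and 4790000 ints are 19120000 and 19160000 bytes respectively, which bracket the 19145881-byte boundary reported at the top of this issue.

#include <stdio.h>

int main(void)
{
	/* Counts taken from the mpirun examples above; MPI_INT assumed to be 4 bytes. */
	printf("4780000 ints = %d bytes (works)\n", 4780000 * 4);  /* 19120000 */
	printf("4790000 ints = %d bytes (hangs)\n", 4790000 * 4);  /* 19160000 */
	printf("boundary from the first reproducer: 19145881 bytes\n");
	return 0;
}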

@jsquyres
Member

We need to audit and see whether this issue is also present on v2.1.x, v3.0.x, and v3.1.x.

@gpaulsen
Member

gpaulsen commented Jun 7, 2019

We don't know when this might get fixed. Removing the v4.0.2 milestone.

@gpaulsen gpaulsen removed this from the v4.0.2 milestone Jun 7, 2019
@jsquyres
Member

We discussed this again on the 25 June 2019 webex: there is still an outstanding OB1 PUT pipelining protocol issue, not addressed by:

@hjelmn
Member

hjelmn commented Sep 5, 2019

Feel free to tag me on issues like these. This is a bug and didn't come up during testing. Will take a look and see if something can be done to fix the issue in btl/vader.

@hjelmn
Member

hjelmn commented Sep 6, 2019

This is why you need to tag me on these bugs. A fix is being tested now, and a PR is incoming after the testing is complete.

I had no idea there was a problem until @jsquyres pinged me about it yesterday.

@hjelmn
Member

hjelmn commented Sep 6, 2019

hjelmn-macbookpro:build hjelmn$ mpirun -n 2 ./6568
[0] MPI_Probe => 0
[0] MPI_Get_count => 0, size = 19145881
[1] MPI_Isend(size = 19145881) => 0
[0] MPI_Irecv(size = 19145881) => 0
[0] MPI_Wait => 0
[1] MPI_Wait => 0
[1] Done
[0] Done

@hjelmn hjelmn self-assigned this Sep 6, 2019
hjelmn added a commit to hjelmn/ompi that referenced this issue Sep 6, 2019
This commit changes how the single-copy emulation in the vader btl
operates. Before this change the BTL set its put and get limits
based on the max send size. After this change the limits are unset
and the put or get operation is fragmented internally.

References open-mpi#6568

Signed-off-by: Nathan Hjelm <hjelmn@google.com>
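A conceptual sketch of the internal fragmentation described in that commit message (not the actual btl/vader code; the function name, the memcpy stand-in, and the chunking here are illustrative only):

#include <stdio.h>
#include <string.h>

/* Illustrative only: copy one large emulated "get" in chunks no larger than
 * the max send size, so the BTL no longer has to advertise a small get limit
 * that forces the PML to generate hundreds of fragments. */
static void emulated_get_fragmented(void *dst, const void *src, size_t len,
                                    size_t max_send_size)
{
	size_t offset = 0;
	while (offset < len) {
		size_t chunk = len - offset;
		if (chunk > max_send_size)
			chunk = max_send_size;
		/* In a real BTL this transfer would go through the shared-memory
		 * backing segment; memcpy stands in for it here. */
		memcpy((char *)dst + offset, (const char *)src + offset, chunk);
		offset += chunk;
	}
}

int main(void)
{
	static char src[19145881], dst[19145881]; /* the message size from this issue */
	memset(src, 7, sizeof src);
	emulated_get_fragmented(dst, src, sizeof src, 32 * 1024);
	printf("copied %zu bytes, match = %d\n", sizeof src,
	       memcmp(dst, src, sizeof src) == 0);
	return 0;
}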
hppritcha pushed a commit to hppritcha/ompi that referenced this issue Sep 18, 2019
(cherry picked from commit ae91b11)
bosilca pushed a commit to bosilca/ompi that referenced this issue Dec 27, 2019
mgduda added several commits to mgduda/SMIOL that referenced this issue between Mar 19 and Mar 27, 2020
The OpenMPI library has a bug on macOS in which large messages
cause the code to hang; see, e.g.,
open-mpi/ompi#6568

This commit reduces the size of some arrays used in unit tests
for SMIOL_create_decomp to work around this issue.
@gpaulsen
Member

@steve-ord, we think this was fixed back in September 2019. Closing this issue.
