Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Neighborhood Collectives #2324

Closed
dalcinl opened this issue Oct 31, 2016 · 11 comments
Closed

Neighborhood Collectives #2324

dalcinl opened this issue Oct 31, 2016 · 11 comments
Labels
Milestone

Comments

@dalcinl
Copy link
Contributor

dalcinl commented Oct 31, 2016

@jsquyres I've found serious issues in the implementation of neighborhood collectives. These issues showed up after a memory allocation optimization I've implemented in mpi4py in commit c2eaa292, which failed badly with Open MPI 2.0.1 in this Bitbucket Pipelines build.

After checking the source code in ompi/v2.x, I found a major issue for the case of intracommunicators with a topology. The code simply do not check the topology to determine the number or sources and destinations, and use ompi_comm_size() instead. This leads to wrong checks and reads after end of arrays.

In mpi4py, I'm using the following code to determine the number of sources and destinations. This piece of code only handles intracommunicators, but could be extended to intercommunicators.

BTW, the wording of the MPI standard seems to require a topology for neighborhood collectives, although supporting intracomms and intercomms seems possible and the result should be equivalent to regular allgather and alltoall. Or maybe the right thing to do is to just error in case of intercomms or intracomms with no topology.

Am I missing something?

PS: MPICH seems to implement neighbor collectives just for Cartesian/graph topologies, the helper routine MPIR_Topo_canon_nhb_count is quite similar to what I'm using in mpi4py (though the MPICH one fails in case of no topology). @roblatham00 Any comments to you competitors :-) about this?

$[dalcinl@localhost mpich.git]$ git grep MPIR_Topo_canon_nhb_count
src/include/mpir_topo.h:int MPIR_Topo_canon_nhb_count(MPIR_Comm *comm_ptr, int *indegree, int *outdegree, int *weighted);
src/mpi/topo/inhb_allgather.c:    mpi_errno = MPIR_Topo_canon_nhb_count(comm_ptr, &indegree, &outdegree, &weighted);
src/mpi/topo/inhb_allgatherv.c:    mpi_errno = MPIR_Topo_canon_nhb_count(comm_ptr, &indegree, &outdegree, &weighted);
src/mpi/topo/inhb_alltoall.c:    mpi_errno = MPIR_Topo_canon_nhb_count(comm_ptr, &indegree, &outdegree, &weighted);
src/mpi/topo/inhb_alltoallv.c:    mpi_errno = MPIR_Topo_canon_nhb_count(comm_ptr, &indegree, &outdegree, &weighted);
src/mpi/topo/inhb_alltoallw.c:    mpi_errno = MPIR_Topo_canon_nhb_count(comm_ptr, &indegree, &outdegree, &weighted);
src/mpi/topo/topoutil.c:#define FUNCNAME MPIR_Topo_canon_nhb_count
src/mpi/topo/topoutil.c:int MPIR_Topo_canon_nhb_count(MPIR_Comm *comm_ptr, int *indegree, int *outdegree, int *weighted)
@dalcinl dalcinl changed the title Neighborhood Collective totally broken Neighborhood Collectives Oct 31, 2016
@dalcinl
Copy link
Contributor Author

dalcinl commented Oct 31, 2016

Oh, sorry. The problem is just in MPI_Neighbor_allgatherv() and MPI_Ineighbor_allgatherv(). You guys should implement it using ompi_comm_neighbors_count(), like you do in MPI_Neighbor_alltoallv().

@ggouaillardet
Copy link
Contributor

@dalcinl can you please point to the test case ?
/* a C version is ideal, but i guess i will be able to figure it out with a python version */

i briefly check the source code, and https://github.com/open-mpi/ompi/blob/v2.x/ompi/mca/coll/libnbc/nbc_ineighbor_allgatherv.c nor https://github.com/open-mpi/ompi/blob/v2.x/ompi/mca/coll/basic/coll_basic_neighbor_alltoallv.c use ompi_comm_size()
do you have some other piece(s) of code in mind ?

i do not think Neighbor collective work on intercomm.
in Open MPI, MPI_Cart_create MPI_Graph_create MPI_Dist_graph_create nor MPI_Dist_graph_create_adjacent can accept an inter communicator, so i do not think you can
create an intercomm with a topology.

@hjelmn
Copy link
Member

hjelmn commented Nov 1, 2016

Are are attempting a neighborhood collective on a communicator without a topology?

@ggouaillardet
Copy link
Contributor

@hjelmn this erroneous case is tested when checking params

if (! OMPI_COMM_IS_TOPO(comm)) {
OMPI_ERRHANDLER_INVOKE(MPI_COMM_WORLD, MPI_ERR_TOPOLOGY, FUNC_NAME);

@hjelmn
Copy link
Member

hjelmn commented Nov 1, 2016

@ggouaillardet That is only enabled if the mpi parameter check is turned on. That is not on by default.

@hjelmn hjelmn closed this as completed Nov 1, 2016
@hjelmn hjelmn reopened this Nov 1, 2016
@hjelmn
Copy link
Member

hjelmn commented Nov 1, 2016

Hate that close and comment button.

@jsquyres
Copy link
Member

jsquyres commented Nov 1, 2016

@hjelmn I don't think you're right -- I believe the MPI param check is set to "runtime" by default, and is enabled at runtime.

@dalcinl
Copy link
Contributor Author

dalcinl commented Nov 1, 2016

A reproducer is a little hard to write. Moreover, I'm still not sure why my Python tests are failing, maybe there are additional issues. Graph topologies seem to work, but I get errors with Cartesian topologies. I think there is a bug in Open MPI in the case of Cartesian topologies, but I have to investigate further and open a new issue.

Meanwhile, please see the following, you still have to fix neighbor allgatherv for reads past the end of the recvcount array.

See the C reproducer at the end. The code is a bit contrived, but it is attempting to send and receive zero length messages. I get the following running under valgrind control. Indeed, valgrind complains in the loop checking for the recvcounts array.

[dalcinl@kw14821 openmpi]$ mpicc ngh_allgatherv.c 
[dalcinl@kw14821 openmpi]$ mpiexec -n 2 valgrind -q ./a.out 
==17973== Invalid read of size 4
==17973==    at 0x4EC9592: PMPI_Neighbor_allgatherv (pneighbor_allgatherv.c:125)
==17973==    by 0x400B74: main (in /home/dalcinl/Devel/BUGS-MPI/openmpi/a.out)
==17973==  Address 0x131ad104 is 0 bytes after a block of size 4 alloc'd
==17973==    at 0x4C2BBAD: malloc (vg_replace_malloc.c:299)
==17973==    by 0x400B1B: main (in /home/dalcinl/Devel/BUGS-MPI/openmpi/a.out)
==17973== 
==17974== Invalid read of size 4
==17974==    at 0x4EC9592: PMPI_Neighbor_allgatherv (pneighbor_allgatherv.c:125)
==17974==    by 0x400B74: main (in /home/dalcinl/Devel/BUGS-MPI/openmpi/a.out)
==17974==  Address 0x6a6e7a4 is 0 bytes after a block of size 4 alloc'd
==17974==    at 0x4C2BBAD: malloc (vg_replace_malloc.c:299)
==17974==    by 0x400B1B: main (in /home/dalcinl/Devel/BUGS-MPI/openmpi/a.out)
==17974== 
#include <mpi.h>
#include <stdlib.h>
#include <assert.h>

int main(int argc, char *argv[])
{
  MPI_Init(&argc, &argv);

  int size,rank;
  MPI_Comm_size(MPI_COMM_WORLD, &size);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  int left  = rank!=0      ? rank-1 : size-1;
  int right = rank!=size-1 ? rank+1 : 0;

  MPI_Comm comm;
  int indegree  = 1, sources[]      = { left  };
  int outdegree = 1, destinations[] = { right };
  MPI_Dist_graph_create_adjacent(MPI_COMM_WORLD,
                                 indegree,  sources,      MPI_UNWEIGHTED,
                                 outdegree, destinations, MPI_UNWEIGHTED,
                                 MPI_INFO_NULL, 0, &comm); 

  int sendbuf[2] = { rank, -1 };
  int sendcount  = 0;
  int recvbuf[2] = { -1, -1};
  int *recvcounts = malloc(1*sizeof(int)); recvcounts[0] = 0;
  int *recvdispls = malloc(1*sizeof(int)); recvdispls[0] = 0;
  MPI_Neighbor_allgatherv(sendbuf, sendcount,              MPI_INT,
                          recvbuf, recvcounts, recvdispls, MPI_INT,
                          comm);
  assert(recvbuf[0] == -1);
  assert(recvbuf[1] == -1);

  MPI_Comm_free(&comm);
  MPI_Finalize();
  return 0;
}

@hjelmn
Copy link
Member

hjelmn commented Nov 1, 2016

Ah, ok. This is a bug up in the binding itself. Looks like a C&P error.

@hjelmn
Copy link
Member

hjelmn commented Nov 1, 2016

Shows how well-used these functions are :D.

hjelmn added a commit to hjelmn/ompi that referenced this issue Nov 1, 2016
This commit adds a helper function to get the inbound and outbound
neighbor count and updates the neighbor_allgatherv bindings to use the
correct count when checking the input parameters.

Fixes open-mpi#2324

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
@hppritcha hppritcha added the bug label Nov 12, 2016
@hppritcha hppritcha added this to the v2.0.2 milestone Nov 12, 2016
dalcinl added a commit to mpi4py/mpi4py that referenced this issue Nov 14, 2016
@jsquyres jsquyres modified the milestones: v2.0.3, v2.0.2 Dec 13, 2016
hjelmn added a commit to hjelmn/ompi that referenced this issue May 26, 2017
This commit adds a helper function to get the inbound and outbound
neighbor count and updates the neighbor_allgatherv bindings to use the
correct count when checking the input parameters.

Fixes open-mpi#2324

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
@hppritcha hppritcha modified the milestones: v2.0.3, v2.0.4 Jun 1, 2017
@hppritcha
Copy link
Member

hppritcha commented Jul 3, 2017

@hjelmn could you open a PR based off of #2334 on 2.x, 2.0.x, and 3.0 to fix this issue? Or if you've already done so, please close this issue.

hjelmn added a commit to hjelmn/ompi that referenced this issue Jul 12, 2017
This commit adds a helper function to get the inbound and outbound
neighbor count and updates the neighbor_allgatherv bindings to use the
correct count when checking the input parameters.

Fixes open-mpi#2324

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
hjelmn added a commit that referenced this issue Jul 12, 2017
This commit adds a helper function to get the inbound and outbound
neighbor count and updates the neighbor_allgatherv bindings to use the
correct count when checking the input parameters.

Fixes #2324

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
hjelmn added a commit to hjelmn/ompi that referenced this issue Jul 12, 2017
This commit adds a helper function to get the inbound and outbound
neighbor count and updates the neighbor_allgatherv bindings to use the
correct count when checking the input parameters.

Fixes open-mpi#2324

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
(cherry picked from commit 3c0e94a)
Signed-off-by: Nathan Hjelm <hjelmn@me.com>
hjelmn added a commit to hjelmn/ompi that referenced this issue Jul 12, 2017
This commit adds a helper function to get the inbound and outbound
neighbor count and updates the neighbor_allgatherv bindings to use the
correct count when checking the input parameters.

Fixes open-mpi#2324

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
(cherry picked from commit 3c0e94a)
Signed-off-by: Nathan Hjelm <hjelmn@me.com>
hjelmn added a commit to hjelmn/ompi that referenced this issue Jul 12, 2017
This commit adds a helper function to get the inbound and outbound
neighbor count and updates the neighbor_allgatherv bindings to use the
correct count when checking the input parameters.

Fixes open-mpi#2324

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
(cherry picked from commit 3c0e94a)
Signed-off-by: Nathan Hjelm <hjelmn@me.com>
hjelmn added a commit to hjelmn/ompi that referenced this issue Jul 12, 2017
This commit adds a helper function to get the inbound and outbound
neighbor count and updates the neighbor_allgatherv bindings to use the
correct count when checking the input parameters.

Fixes open-mpi#2324

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
(cherry picked from commit 3c0e94a)
Signed-off-by: Nathan Hjelm <hjelmn@me.com>
hjelmn added a commit to hjelmn/ompi that referenced this issue Jul 12, 2017
This commit adds a helper function to get the inbound and outbound
neighbor count and updates the neighbor_allgatherv bindings to use the
correct count when checking the input parameters.

Fixes open-mpi#2324

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
(cherry picked from commit 3c0e94a)
Signed-off-by: Nathan Hjelm <hjelmn@me.com>
hjelmn added a commit to hjelmn/ompi that referenced this issue Jul 18, 2017
This commit adds a helper function to get the inbound and outbound
neighbor count and updates the neighbor_allgatherv bindings to use the
correct count when checking the input parameters.

Fixes open-mpi#2324

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
(cherry picked from commit 3c0e94a)
Signed-off-by: Nathan Hjelm <hjelmn@me.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

5 participants