Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

opal/datatype: fix handling of unused opt_desc #3439

Conversation

ggouaillardet
Copy link
Contributor

No description provided.

@ggouaillardet
Copy link
Contributor Author

@hppritcha this issue was reported on the ML so i put the "blocker" label. i have no objection you update the milestone if you want to release a new version asap

@ggouaillardet
Copy link
Contributor Author

@bosilca could you please review this (related to #3438)

here is a reproducer

#include <stdio.h>

#include <mpi.h>

int
main (int argc, char **argv)
{
  int rank;
  int bsize = 1024 * 1024;
  MPI_Datatype vec_type;
  MPI_Datatype sub_type;
  MPI_Datatype dup_type;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  if (0 != rank) bsize *= 1024;


  MPI_Type_contiguous(bsize, MPI_BYTE, &vec_type);
  MPI_Type_commit(&vec_type);

  int sizes[1] = {4};
  int subsizes[1] = {4};
  int starts[1] = { 0 };

  MPI_Type_create_subarray(1, sizes, subsizes, starts, MPI_ORDER_C,
                           vec_type, &sub_type);
  MPI_Type_commit(&sub_type);

  MPI_Type_dup(sub_type, &dup_type);
  MPI_Type_commit(&dup_type);

  MPI_Type_free(&vec_type);
  MPI_Type_free(&sub_type);
  MPI_Type_free(&dup_type);

  MPI_Finalize();
}

currently, the program fails on non zero rank.
i initially wrote the first commit (which is correct imho), but after a bit of digging, i also wrote the second one (which is correct too imho).
the second commit could make the first one useless, or the first commit could be replaced by some assert() since we should not run into this case (e.g. unused 0 == opt_desc.used but non NULL != opt_desc.desc)

@bosilca
Copy link
Member

bosilca commented May 1, 2017

@ggouaillardet the original design always initialized the desc (it is basically built while we construct that datatype, and should always be used in a heterogenous case). As a result, the desc was also always supposed to be released.

The opt_desc should only be constructed if there is any benefit of building one. Moreover, it is set to NULL during the OBJ_NEW, so there should be no need to set it to NULL in other part of the code.

@ggouaillardet
Copy link
Contributor Author

@bosilca, so that means that the first commit is an overkill, right ?

opal_datatype_commit() invokes (unless a bozo case) opal_datatype_optimize_short(&pData->opt_desc) which first malloc() the (optimized) description and might set the used field to 0 later, so i guess the second commit (free and nulllify the optimized description) is correct, i will update the patch and add an assert() from now

Thanks George for the guidance

Refs open-mpi#3438

Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
@ggouaillardet ggouaillardet force-pushed the topic/opal_datatype_opt_desc branch from 4358233 to 83706c0 Compare May 1, 2017 04:15
@bosilca
Copy link
Member

bosilca commented May 1, 2017

I think the problem is a little deeper than that. If the datatype is contiguous, then there is no point of having an optimized description, as the default one (the desc) should be good enough. Thus, we should have the opt_desc point back to the desc, instead to pointing to an allocated but empty description.

I'm trying to see where we miss the detection of the contiguous data to fix it there.

@bosilca
Copy link
Member

bosilca commented May 1, 2017

In addition to the previous comment, as your example is only using subarray all ranks should generate exactly the same datatype, thus they are all supposed to segfault.

@ggouaillardet
Copy link
Contributor Author

in my example, bsize is different on rank 0 (1024*1024) and other ranks (1024*1024*1024).
on non zero rank, there is an internal integer overflow that results in the optimized description not being used. (e.g. the optimized description would be 4G OPAL_UNIT8, but 4G does not fit an uint32_t)

@bosilca
Copy link
Member

bosilca commented May 1, 2017

I see now the problem. I missed the fact that the length was different on the other ranks.

@bosilca bosilca mentioned this pull request May 2, 2017
@bosilca bosilca closed this May 5, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants