
SEACAS: crashes on KNL with OpenMP enabled during compilation. #125

Closed · crtrott opened this issue Feb 2, 2016 · 28 comments

Labels: pkg: seacas, stage: in progress
Comments

@crtrott (Member) commented Feb 2, 2016

This is the start of tracking an issue on KNL where the mesh load in Nalu crashes when SEACAS is compiled with OpenMP enabled (even though SEACAS doesn't appear to actually use OpenMP). I tracked it down to libIoss.a: if I link Nalu against the serial version of that library (while using the OpenMP versions of everything else), it works.

@nmhamster (Contributor)

There is a NALU bug filed with SIERRA on this from two weeks ago as well.

@gsjaardema (Contributor)

Is that bug https://prod.sandia.gov/sierra-trac/ticket/14339 (Duplicate node global id in NALU/KNL)? I had not been looking into that anymore, since I thought it was related to the general build/configuration issues with netcdf, hdf5, and pnetcdf.

Is that bug still relevant even after the TPL configuration issues were resolved?
@trilinos/seacas

@nmhamster (Contributor)

Yes. We are working through some more details on this, but I think it may apply broadly to threaded code built with the Intel compiler (it may not be just KNL).

@crtrott (Member Author) commented Feb 2, 2016

I am going to test with Intel 15 on my box with OpenMP on. That said, can we get the same compiler/TPL configs on Shepard so we can test there on Haswell?

@nmhamster (Contributor)

You have most of the TPLs on Shepard as needed.

@gsjaardema (Contributor)

I've been able to replicate this on my blade. I will debug and see what I can find.

@crtrott (Member Author) commented Feb 3, 2016

Great. Hopefully we (I mean you ;-) ) can track that down.

@gsjaardema (Contributor)

Have you had any issues with std::sort? I have this code:

  ReverseMapContainer new_ids(num_to_get); // vector of <global_id, local_id> pairs
  for (int64_t i = 0; i < num_to_get; i++) {
    int64_t local_id = offset + i + 1;
    new_ids[i] = std::make_pair(map[local_id], local_id);
  }
  std::sort(new_ids.begin(), new_ids.end(), IdPairCompare());
  verify_no_duplicate_ids(new_ids, processor, entityType);

And it is failing -- finding a duplicate local_id. I replaced std::sort with my own qsort and the tests are now passing... This is with intel-16.0, openmpi-1.8.8, and the "-fopenmp" option.
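
[Editor's note: for readers without the Ioss sources at hand, here is a minimal sketch of the shapes these helpers presumably have. The bodies below are an assumption, not the actual Ioss definitions: the comparator orders pairs by global id, so the duplicate check only needs to scan adjacent entries after the sort.]

  #include <cstdint>
  #include <utility>
  #include <vector>

  using IdPair = std::pair<int64_t, int64_t>;  // <global_id, local_id>
  using ReverseMapContainer = std::vector<IdPair>;

  // Assumed shape: order strictly by global id (pair.first).
  struct IdPairCompare {
    bool operator()(const IdPair &lhs, const IdPair &rhs) const {
      return lhs.first < rhs.first;
    }
  };

  // Assumed shape: after a correct sort, duplicate global ids are adjacent.
  bool no_duplicate_ids(const ReverseMapContainer &ids) {
    for (size_t i = 1; i < ids.size(); i++) {
      if (ids[i - 1].first == ids[i].first)
        return false;  // same global id assigned to two local ids
    }
    return true;
  }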

@gsjaardema (Contributor)

This is definitely looking like a std::sort issue. I was still getting some failing tests (but with different error messages). I replaced more std::sort calls with my qsort and they are now passing...

@gsjaardema (Contributor)

I changed two files: packages/seacas/libraries/ioss/src/Ioss_Map.C and packages/seacas/libraries/ioss/src/exo_par/Iopx_DecompositionData.C. The modified versions are in the fix-sort.zip file if you want to see whether this fixes the issues for you as well.

fix-sort.zip
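
[Editor's note: a rough sketch of the kind of swap those modified files make; this is a reconstruction, not the actual patch. The std::sort call is replaced with C's qsort and a C-style comparator, which is legal here because the pair type is trivially copyable, and it sidesteps whatever the Intel-built std::sort is doing wrong.]

  #include <cstdint>
  #include <cstdlib>
  #include <utility>
  #include <vector>

  using IdPair = std::pair<int64_t, int64_t>;

  // C-style comparator on the global id (first element of the pair).
  static int id_pair_compare(const void *a, const void *b) {
    const IdPair *pa = static_cast<const IdPair *>(a);
    const IdPair *pb = static_cast<const IdPair *>(b);
    if (pa->first < pb->first) return -1;
    if (pa->first > pb->first) return 1;
    return 0;
  }

  void sort_ids(std::vector<IdPair> &new_ids) {
    // std::sort(new_ids.begin(), new_ids.end(), IdPairCompare());  // misbehaves here
    std::qsort(new_ids.data(), new_ids.size(), sizeof(IdPair), id_pair_compare);
  }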

@nmhamster (Contributor)

Greg, thanks for the fix. I tried merging this into my Trilinos checkout, but I'm getting errors. I haven't spent long drilling into this yet, but wanted to check: do I need anything else?

/home/sdhammo/git/trilinos-github-repo/packages/seacas/libraries/ioss/src/exo_par/Iopx_DecompositionData.C(1809): error: namespace "Ioss" has no member "mpi_type"
MPI_Alltoall(TOPTR(recv_count), 1, Ioss::mpi_type((INT)0),
^
detected during:
instantiation of "void Iopx::DecompositionData::decompose_model(int) [with INT=int]" at line 390
instantiation of "Iopx::DecompositionData::DecompositionData(const Ioss::PropertyManager &, MPI_Comm) [with INT=int]" at line 381

/home/sdhammo/git/trilinos-github-repo/packages/seacas/libraries/ioss/src/exo_par/Iopx_DecompositionData.C(1810): error: namespace "Ioss" has no member "mpi_type"
TOPTR(send_count), 1, Ioss::mpi_type((INT)0), comm_);
^
detected during:
instantiation of "void Iopx::DecompositionData::decompose_model(int) [with INT=int]" at line 390
instantiation of "Iopx::DecompositionData::DecompositionData(const Ioss::PropertyManager &, MPI_Comm) [with INT=int]" at line 381

/home/sdhammo/git/trilinos-github-repo/packages/seacas/libraries/ioss/src/exo_par/Iopx_DecompositionData.C(1845): error: namespace "Ioss" has no member "MY_Alltoallv"
Ioss::MY_Alltoallv(node_comm_recv, recv_count, recv_disp,
^
detected during:
instantiation of "void Iopx::DecompositionData::decompose_model(int) [with INT=int]" at line 390
instantiation of "Iopx::DecompositionData::DecompositionData(const Ioss::PropertyManager &, MPI_Comm) [with INT=int]" at line 381

/home/sdhammo/git/trilinos-github-repo/packages/seacas/libraries/ioss/src/exo_par/Iopx_DecompositionData.C(1892): error: namespace "Ioss" has no member "MY_Alltoallv"
Ioss::MY_Alltoallv(coord_send, send_count, send_disp,
^
detected during:
instantiation of "void Iopx::DecompositionData::decompose_model(int) [with INT=int]" at line 390
instantiation of "Iopx::DecompositionData::DecompositionData(const Ioss::PropertyManager &, MPI_Comm) [with INT=int]" at line 381

/home/sdhammo/git/trilinos-github-repo/packages/seacas/libraries/ioss/src/exo_par/Iopx_DecompositionData.C(995): error: namespace "Ioss" has no member "mpi_type"
MPI_Alltoall(TOPTR(export_conn_size), 1, Ioss::mpi_type((INT)0),
^
detected during:
instantiation of "void Iopx::DecompositionData::decompose_model(int) [with INT=int]" at line 390
instantiation of "Iopx::DecompositionData::DecompositionData(const Ioss::PropertyManager &, MPI_Comm) [with INT=int]" at line 381

/home/sdhammo/git/trilinos-github-repo/packages/seacas/libraries/ioss/src/exo_par/Iopx_DecompositionData.C(996): error: namespace "Ioss" has no member "mpi_type"
TOPTR(import_conn_size), 1, Ioss::mpi_type((INT)0), comm_);
^
detected during:
instantiation of "void Iopx::DecompositionData::decompose_model(int) [with INT=int]" at line 390
instantiation of "Iopx::DecompositionData::DecompositionData(const Ioss::PropertyManager &, MPI_Comm) [with INT=int]" at line 381

@gsjaardema (Contributor)

Bummer. I forgot that Trilinos was not current with Sierra. I will get you a consistent set tomorrow; a patch instead of raw files will probably work better. Sorry for the misstep.


@gsjaardema (Contributor)

[Modified comment to attach the patch here; it was attached to the email and didn't show up here.]
Try the attached patch. I think it will get you going.
..Greg

0001-IOSS-Potential-fix-for-intel-openmp-issues-with-std-.patch.txt

@nmhamster (Contributor)

Greg, this seems to be working in my initial test with NALU. I will let the runs make some more progress and update you further. Thanks for the patch!

@gsjaardema (Contributor)

Good. Sorry it took so long to get started on the bug, and then for the mishaps with the patches.

@nmhamster (Contributor)

Greg, runs completed successfully for the NALU milestone inputs provided by Stefan. Thanks again for your help.

@gsjaardema (Contributor)

I will try to create a small code that illustrates the bug so it can be submitted to Intel. It would be good to get this fixed if possible.

@nmhamster (Contributor)

I wonder if it's that we are somehow enabling the GNU threaded/parallel algorithms in our builds. I don't know how this would happen, but if it did, we might get this weird behavior. If you get a small test case, we should be able to take a look at it and make sure we aren't doing something odd in the build system.
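
[Editor's note: one quick way to test that theory, sketched here under the assumption of GNU libstdc++ headers (which the Intel compiler uses on Linux): check whether _GLIBCXX_PARALLEL is defined in the build, since defining it silently swaps std::sort for the threaded __gnu_parallel::sort.]

  #include <cstdio>

  int main() {
  #ifdef _GLIBCXX_PARALLEL
    // Defined by -D_GLIBCXX_PARALLEL; std algorithms become threaded variants.
    std::puts("GNU parallel mode IS enabled");
  #else
    std::puts("GNU parallel mode is NOT enabled");
  #endif
    return 0;
  }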

@gsjaardema (Contributor)

I have a relatively small code that seems to replicate the issue on my blade:

test-std-sort.C.txt [Use new version below. This one has bugs]

//
// Compile line:  mpiCC -DUSE_MPI -fopenmp -std=c++11  -O2  -o test-std-sort test-std-sort.C
//
// Compiler:
// ceerws2703a:io(master)> which icpc
//      /sierra/sntools/SDK/compilers/intel/composer_xe_2016.1.150/compilers_and_libraries/linux/bin/intel64/icpc
// ceerws2703a:io(master)> which mpiCC
//      /sierra/sntools/SDK/mpi/openmpi/1.8.8-intel-16.0-2016.1.150-RHEL6/bin/mpiCC
// ceerws2703a:io(master)> mpiCC --version
//      icpc (ICC) 16.0.1 20151021
//      Copyright (C) 1985-2015 Intel Corporation.  All rights reserved.
//
// 
// ceerws2703a:io(master)> ./test-std-sort 420000
//
// Sorting 420000 values on processor 0
//  ==================================================================================
//  using std::sort ==================================================================
//  
//  ERROR: map detected non-unique global id/local id pair on processor 0.
//  
//  ERROR: Duplicate global id detected on processor 0.
//         Global id 0 assigned to local 420002 and 420000.
//  
//  ERROR: Duplicate global id detected on processor 0.
//         Global id 0 assigned to local 420003 and 420002.
//  
//    .... delete ...
//  
//  ERROR: Duplicate global id detected on processor 0.
//         Global id 180697 assigned to local 118396 and 118396.
//  
//  ERROR: Duplicate global id detected on processor 0.
//         Global id 180697 assigned to local 118396 and 118396.
//  
//  ERROR: Duplicate global id detected on processor 0.
//         Global id 180697 assigned to local 118396 and 118396.
//  
//  ERROR: Duplicate global id detected on processor 0.
//         Global id 180814 assigned to local 116874 and 116874.
//  
//  ERROR: Duplicate global id detected on processor 0.
//         Global id 180814 assigned to local 116874 and 116874.
//  
//  ERROR: Duplicate global id detected on processor 0.
//         Global id 181804 assigned to local 240255 and 240255.
//  
//  ERROR: Duplicate global id detected on processor 0.
//         Global id 181804 assigned to local 240255 and 240255.
//  ==================================================================================
//  using gds_qsort ==================================================================

NOTE:

  • If compiled without -DUSE_MPI, it works correctly.
  • If compiled with -DUSE_MPI, it fails for certain sizes even when run on a single processor.
  • Fails with -O2; works correctly with -O1 and -O0.
  • Works with g++.
  • Works at 323507, fails at 323508, then starts working again at higher values (e.g. 520000).
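
[Editor's note: for anyone who wants the gist without the attachment, a stripped-down sketch of what the test does; this is a reconstruction, not the attached file. It builds a vector of <global_id, local_id> pairs with unique global ids, sorts it with std::sort, and scans for adjacent duplicate global ids that a correct sort could never produce.]

  #include <algorithm>
  #include <cstdint>
  #include <cstdio>
  #include <cstdlib>
  #include <utility>
  #include <vector>

  using IdPair = std::pair<int64_t, int64_t>;

  int main(int argc, char **argv) {
    // Size comes from the command line (e.g. 420000, the failing case above).
    int64_t n = (argc > 1) ? std::atoll(argv[1]) : 420000;

    // Seed with distinct global ids in reverse order; a correct sort can
    // never yield adjacent duplicates.
    std::vector<IdPair> ids(n);
    for (int64_t i = 0; i < n; i++)
      ids[i] = std::make_pair(n - i, i);

    std::sort(ids.begin(), ids.end(),
              [](const IdPair &a, const IdPair &b) { return a.first < b.first; });

    // Any adjacent equal global ids after sorting indicate the bug.
    int64_t dups = 0;
    for (int64_t i = 1; i < n; i++)
      if (ids[i - 1].first == ids[i].first)
        dups++;
    std::printf("Sorted %lld values, %lld duplicate(s) found\n",
                (long long)n, (long long)dups);
    return dups == 0 ? 0 : 1;
  }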

@ambrad (Contributor) commented Feb 4, 2016

Greg, in your example, I think there is an access error; on lines 188-189 and again on 210-211, map[local_id], where map is a std::vector, is accessed for local_id = offset + i, where offset is 42 and i can be as large as num_to_get-1. But map.size() == num_to_get, so there are invalid reads. I assert this condition as follows:

for (int64_t i = 0; i < num_to_get; i++) {
  int64_t local_id = offset + i;
  assert(local_id >= 0 && local_id < map.size());
  new_ids.push_back(std::make_pair(map[local_id], local_id));
}

@gsjaardema (Contributor)

You are right. Thanks for catching that. I think there is also a problem in that the values in map run from 1..num_to_get inclusive instead of 0..num_to_get-1. I removed the offset, fixed the map seeding, and I still get the invalid behavior...

@gsjaardema (Contributor)

Here is the new version... See how many bugs I can have in 250 lines of code...

test-std-sort.C.txt

@gsjaardema (Contributor)

Has anyone had a chance to look at this? Should we report it to Intel?

@nmhamster (Contributor)

We are working on this with Intel.

@gsjaardema (Contributor)

OK, thanks. I didn't want to assume someone was looking into it while they were assuming I was doing something about it.

@gsjaardema added the pkg: seacas and stage: in progress labels on Feb 11, 2016
@crtrott (Member Author) commented Feb 17, 2016

The error only seems to happen with GCC 4.7.2 loaded alongside the Intel compiler; with GCC 4.8.4 and above it does not occur.

@gsjaardema (Contributor)

Is there a reason to keep this issue open? It has been two years since the last comment, and I don't think we are using gcc-4.7.2 anymore, especially since I don't think it has the C++11 support needed to build Trilinos.

@gsjaardema (Contributor)

I don't think this is an issue anymore, and there have been no comments in over two years. Closing, but feel free to reopen if anyone thinks it should stay open.
