SEACAS: crashes on KNL with OpenMP enabled during compilation. #125
Comments
There is a NALU bug filed with SIERRA on this from two weeks ago as well. |
Is that bug: https://prod.sandia.gov/sierra-trac/ticket/14339 (Duplicate node global id in NALU/KNL)? I had not been looking into that any more since I thought it was related to the general build/configuration issues related to netcdf, hdf5, and pnetcdf. Is that bug still relevant even after the TPL configuration issues were resolved? |
Yes. We are still working through this, but I think it may apply broadly to threaded code built with the Intel compiler (it may not be just KNL). |
I am going to test with Intel 15 on my box with OpenMP on. That said, can we get the same compiler/TPL configs on Shepard to test there on Haswell? |
You have most of the TPLs on Shepard as needed. |
I've been able to replicate on my blade. Will debug and see what I can find. |
Great. Hopefully we (I mean you ;-) ) can track that down. |
Have you had any issues with std::sort? I have this code:
And it is failing -- finding a duplicate |
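The code from the comment above is not preserved in this thread. As a rough illustration only, a sort-then-scan duplicate check of the kind described might look like the sketch below. The name `has_duplicate_id` and the `std::int64_t` id type are assumptions made for this example and are not taken from IOSS.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical helper, not from IOSS: returns true if any id appears more
// than once. The vector is taken by value so the caller's ordering is kept.
bool has_duplicate_id(std::vector<std::int64_t> ids)
{
  // After a correct sort, equal ids must be neighbours.
  std::sort(ids.begin(), ids.end());
  // adjacent_find returns end() when no two neighbouring elements are equal.
  return std::adjacent_find(ids.begin(), ids.end()) != ids.end();
}
```

If the sort itself misbehaves, a check like this can report duplicates that are not in the input, or miss ones that are, which matches the kind of symptom discussed here.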
This is definitely looking like a std::sort issue. Was still getting some failing tests (but different error messages). Replaced more std::sort with my qsort and they are now passing... |
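The replacement qsort mentioned above is in the attached patch and is not reproduced in this thread. For readers who want to try the same experiment, here is a minimal sketch of a hand-written quicksort that could stand in for `std::sort` on a vector. The name `simple_qsort` is hypothetical, and this is not the IOSS implementation.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical stand-in for std::sort, not the routine from the patch.
// Sorts v[lo..hi] (inclusive bounds) in ascending order.
template <typename T>
void simple_qsort(std::vector<T> &v, std::ptrdiff_t lo, std::ptrdiff_t hi)
{
  while (lo < hi) {
    // Copy the pivot so that swaps below cannot change it.
    const T pivot = v[lo + (hi - lo) / 2];
    std::ptrdiff_t i = lo;
    std::ptrdiff_t j = hi;
    while (i <= j) {
      while (v[i] < pivot) { ++i; }
      while (pivot < v[j]) { --j; }
      if (i <= j) {
        std::swap(v[i], v[j]);
        ++i;
        --j;
      }
    }
    // Recurse into the smaller half and loop on the larger one,
    // which keeps the recursion depth logarithmic.
    if (j - lo < hi - i) {
      simple_qsort(v, lo, j);
      lo = i;
    }
    else {
      simple_qsort(v, i, hi);
      hi = j;
    }
  }
}

// Convenience overload that sorts the whole vector.
template <typename T>
void simple_qsort(std::vector<T> &v)
{
  if (!v.empty()) {
    simple_qsort(v, 0, static_cast<std::ptrdiff_t>(v.size()) - 1);
  }
}
```

Swapping a call such as `std::sort(v.begin(), v.end())` for `simple_qsort(v)` is one way to check whether a failure follows the library sort or the surrounding code.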
I changed two files |
Greg, thanks for the fix. I tried merging this into my Trilinos checkout and I'm getting errors. I haven't spent too long drilling into this yet, but wanted to check: do I need anything else?
/home/sdhammo/git/trilinos-github-repo/packages/seacas/libraries/ioss/src/exo_par/Iopx_DecompositionData.C(1809): error: namespace "Ioss" has no member "mpi_type"
/home/sdhammo/git/trilinos-github-repo/packages/seacas/libraries/ioss/src/exo_par/Iopx_DecompositionData.C(1810): error: namespace "Ioss" has no member "mpi_type"
/home/sdhammo/git/trilinos-github-repo/packages/seacas/libraries/ioss/src/exo_par/Iopx_DecompositionData.C(1845): error: namespace "Ioss" has no member "MY_Alltoallv"
/home/sdhammo/git/trilinos-github-repo/packages/seacas/libraries/ioss/src/exo_par/Iopx_DecompositionData.C(1892): error: namespace "Ioss" has no member "MY_Alltoallv"
/home/sdhammo/git/trilinos-github-repo/packages/seacas/libraries/ioss/src/exo_par/Iopx_DecompositionData.C(995): error: namespace "Ioss" has no member "mpi_type"
/home/sdhammo/git/trilinos-github-repo/packages/seacas/libraries/ioss/src/exo_par/Iopx_DecompositionData.C(996): error: namespace "Ioss" has no member "mpi_type" |
Bummer. I forgot that trilinos was not current with Sierra. I will get you On Wednesday, February 3, 2016, Si Hammond notifications@github.com wrote:
|
[Modified comment to attach patch here. Was attached to email and didn't show here] 0001-IOSS-Potential-fix-for-intel-openmp-issues-with-std-.patch.txt |
Greg, this seems to be working in my initial test with NALU. I will let the runs make some more progress and update you further. Thanks for the patch! |
Good. Sorry it took so long to get started on the bug and then the mishaps with the patches. |
Greg, runs completed successfully for the NALU milestone inputs provided by Stefan. Thanks again for your help. |
I will try to create a small code that illustrates the bug so it can be submitted to intel. Would be good to get this fixed if possible. |
I wonder if it's that we are somehow enabling the GNU threaded/parallel algorithms in our builds. I don't know for sure how this would happen, but if it did we might get the weird behavior. If you get a small test case we should be able to take a look at it and just ensure we aren't doing something odd in the build system. |
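One way to check the hypothesis above from inside a translation unit is to test for the `_GLIBCXX_PARALLEL` macro, which is what switches libstdc++ algorithms such as `std::sort` to their parallel-mode (OpenMP) versions. The sketch below assumes the build uses libstdc++. The function name is made up for this example.

```cpp
#include <algorithm>  // any libstdc++ header; included so the check sits next to std::sort

// Hypothetical helper: true when this translation unit was compiled with
// libstdc++ parallel mode requested (for example via -D_GLIBCXX_PARALLEL).
constexpr bool libstdcxx_parallel_mode_enabled()
{
#if defined(_GLIBCXX_PARALLEL)
  return true;
#else
  return false;
#endif
}
```

A `static_assert(!libstdcxx_parallel_mode_enabled(), "parallel mode is on")` placed in a suspect source file would make the build fail if the macro leaked in through the build system.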
Have a relatively small code that seems to replicate the issue on my blade: test-std-sort.C.txt [Use new version below. This one has bugs]
NOTE:
|
Greg, in your example, I think there is an access error; on lines 188-189 and again on 210-211, map[local_id], where map is a std::vector, is accessed for local_id = offset + i, where offset is 42 and i can be as large as num_to_get-1. But map.size() == num_to_get, so there are invalid reads. I assert this condition as follows:
|
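The assertion from the comment above is not preserved in this thread. A minimal sketch of a bounds check for the access pattern described (reading `map[offset + i]` for every `i` below `num_to_get`) could look like this. The function name is hypothetical, and the `int` element type is an assumption.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical check, not the commenter's original code: true when every
// index offset + i, for i in [0, num_to_get), is a valid index into map.
bool indices_in_range(const std::vector<int> &map, std::size_t offset, std::size_t num_to_get)
{
  // Written without computing offset + num_to_get so it cannot overflow.
  return num_to_get <= map.size() && offset <= map.size() - num_to_get;
}
```

With `map.size() == num_to_get` and an offset of 42, as in the test case discussed, this returns false, which is the invalid-read condition being pointed out.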
You are right. Thanks for catching that. I think there is also a problem in that the values in map run from 1..num_to_get inclusive instead of 0..num_to_get-1. I removed the offset and fixed the map seeding and I still get the invalid behavior... |
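The corrected test file is attached rather than shown inline. As an illustration of the seeding fix described, assuming the map is meant to hold each zero-based index exactly once, `std::iota` fills a vector with 0..num_to_get-1. The helper name below is hypothetical.

```cpp
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical helper: builds a map holding 0, 1, ..., num_to_get - 1.
// Passing 1 instead of 0 to std::iota would give the 1..num_to_get range
// that was identified as a bug in the first version of the test.
std::vector<int> make_zero_based_map(std::size_t num_to_get)
{
  std::vector<int> map(num_to_get);
  std::iota(map.begin(), map.end(), 0);
  return map;
}
```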
Here is the new version... See how many bugs I can have in 250 lines of code... |
Has anyone had a chance to look at this? Should we report to intel? |
We are working on this with Intel. |
OK, thanks. I didn't want to assume someone was looking into it while they were assuming I was doing something. |
The error seems to happen only when GCC 4.7.2 is loaded alongside the Intel compiler, not with GCC 4.8.4 and above. |
Is there a reason to keep this issue open? It has been 2 years since the last comment, and I don't think we are using gcc-4.7.2 anymore, especially since I don't think it has the C++11 support that is needed to build Trilinos. |
I don't think this is an issue anymore and there have been no comments in over 2 years. Closing, but feel free to reopen if anyone thinks it should stay open. |
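Since the behavior appears to depend on which GCC installation the Intel compiler picks up for its standard library, it can help to print what a given build actually sees. The sketch below relies only on predefined macros (`__INTEL_COMPILER`, `__GNUC__` and friends, and libstdc++'s `__GLIBCXX__` date stamp). The function name is made up for this example, and the assumption is that the Intel compiler reports the GCC version it is emulating through the `__GNUC__` macros.

```cpp
#include <cstddef>  // any standard header, so libstdc++ defines __GLIBCXX__ if present
#include <sstream>
#include <string>

// Hypothetical helper: summarizes the compiler and libstdc++ seen at compile time.
std::string describe_toolchain()
{
  std::ostringstream out;
#if defined(__INTEL_COMPILER)
  out << "Intel compiler " << __INTEL_COMPILER << "; ";
#endif
#if defined(__GNUC__)
  out << "GCC compatibility " << __GNUC__ << '.' << __GNUC_MINOR__ << '.'
      << __GNUC_PATCHLEVEL__ << "; ";
#endif
#if defined(__GLIBCXX__)
  out << "libstdc++ date stamp " << __GLIBCXX__;
#else
  out << "libstdc++ not detected";
#endif
  return out.str();
}
```

Printing this string from the failing and the passing builds gives a quick way to confirm which GCC headers each one was compiled against.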
This is the start for tracking an issue on KNL where the mesh load in Nalu crashes when SEACAS is compiled with OpenMP enabled (even though it seems SEACAS doesn't actually use OpenMP). I tracked it down to libIoss.a. If I just link Nalu against the serial version of that library (while using the OpenMP ones for everything else) it works.