STK_Mesh failures in Nalu code base #7828
Comments
@trilinos/stk |
@alanw0 and @jhux2, testing the Nalu-based codes under clang has been a struggle, with Clang seemingly revolting. My guess is that we will learn more tomorrow morning, after the Nalu and Nalu-Wind regression test suites report. However, the failures are widespread on Nalu, and I would guess Nalu-Wind will again track. The bisect definitely points to this jvo push. We may want to hope for a gcc/intel failure, since those builds are easier to access at SNL. I will report back tomorrow, as the Clang state is a bit odd.
@alanw0, otherwise, I have a full installation of Trilinos and other TPLs on gcc/intel and it's easy to build Nalu by pointing to this pre-installed path. |
@spdomin ok, let's see what tomorrow brings, and if possible I'll run this through totalview with a gcc build and see what's going on. jvo's stk snapshot was primarily to fix a few clang compiler errors, but it obviously also brought along other recent commits and I suspect something's wrong with one of those. Although it's puzzling because all of our commits go through gerrit which tests a full slate of sierra tests on 3 platforms... Anyway, we'll sort it out. |
@alanw0 - gcc seems fine after this commit: However, I thought that this change was incompatible with Clang. I probably need to look at EntitySorterBase to see if something is up there... Perhaps in the morning, things will be more clear for me:) |
Yes, resolving the missing header in gcc results in the following for Clang: In file included from /Users/naluIt/gitHubWork/nightlyBuildAndTest/Nalu/src/Realm.C:21: |
@spdomin ok that's a good clue. EntitySorterBase was moved from EntityLess.hpp into its own header. In sierra, EntitySorterBase no longer appears in EntityLess.hpp. |
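For anyone building against both the old and new STK snapshots while develop and master are out of sync, a guarded include is one way to tolerate either layout. This is a minimal sketch, not something taken from Nalu or STK: the header path `stk_mesh/base/EntitySorterBase.hpp` is an assumption about where the class landed after the move, and the macro and function names are hypothetical.

```cpp
// Sketch only: the header path below is an ASSUMPTION about where
// EntitySorterBase lives after the move; adjust it to match your STK snapshot.
#if defined(__has_include)
#  if __has_include(<stk_mesh/base/EntitySorterBase.hpp>)
#    include <stk_mesh/base/EntitySorterBase.hpp>
#    define NALU_HAS_ENTITY_SORTER_BASE_HEADER 1  // hypothetical macro name
#  endif
#endif

// Fall back to 0 when the standalone header is not on the include path
// (older snapshots declared the class inside EntityLess.hpp instead).
#ifndef NALU_HAS_ENTITY_SORTER_BASE_HEADER
#  define NALU_HAS_ENTITY_SORTER_BASE_HEADER 0
#endif

// True when the standalone header was found at compile time.
constexpr bool has_entity_sorter_base_header() {
  return NALU_HAS_ENTITY_SORTER_BASE_HEADER != 0;
}
```

With a guard like this, the same source file should compile whether or not the snapshot already contains the new header, at the cost of carrying the conditional until both branches agree.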
Interesting. How could that happen if a sync was just processed? There must be some subtlety in how these two repos communicate:) Let me know when I can test something. I think that when I remove the header include, the code builds on clang but segfaults. However, let’s wait until we have a clean build under clang before we worry about its testing behavior.
It looks like develop was merged into master just after midnight, 12:07 this morning. So I guess it was just a window where the stk update had gone into develop and hadn't made it into master yet. But they appear to be in sync now. So we should be ready to debug the runtime error you show at the top of this bug report, assuming that is still happening and wasn't caused by a bad build. |
I will check the Mac build now and report back. Debugging off this platform could be painful. |
I think so: https://my.cdash.org/index.php?project=Nalu-Wind |
Looks like the build is clean; however, clear failures are noted. I thought the pattern might be due to the face-consolidated approach, where we sort by exposed face topology; however, I see edge-based and non-consolidated element-based cases failing. As such, it is hard to see any pattern aside from the fact that it's the same STK signature. Do we have any ability to debug Clang off of the Mac? What about loading a Clang module off of my RHEL7 box? Is that an option?
0% tests passed, 2 tests failed out of 2 |
Also, I am using a rather new clang (11) while I see Nalu-Wind is testing with 9-ish. |
@spdomin is this run-time error only happening with clang? Or is it also happening with gcc etc? |
Only clang. The gcc 8.3 was clean. |
ok, I think my first move will be to do a sierra stk build with clang. We have a number of clang modules for sierra, but no dashboard lines for it. If I can reproduce there, that's the easiest for me. I'll let you know asap what I come up with. |
I will check intel 19 now. I generally only run gcc and mac nightly. However it’s no prob to check intel. Sierra looks like it may use clang 10. I guess I would use the highest one. |
Looks like Intel 19 is also clean. There seems to be a secret recipe to map between the Mac clang version and the official clang version. Perhaps someone could fire off a Nalu-Wind Darwin build/test to see if the suite is clean after the latest header PR?
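Since the Apple and upstream clang version numbers do not line up, it can help to have each build report exactly which compiler produced it. The following is a small sketch using only predefined compiler macros; the function name `compiler_id` is hypothetical and not part of Nalu or STK.

```cpp
#include <string>

// Report which compiler built this translation unit, using predefined macros.
// Order matters: clang and classic Intel also define __GNUC__, so GCC is last.
std::string compiler_id() {
#if defined(__apple_build_version__)
  // Apple clang: the version follows Xcode numbering, not upstream LLVM.
  return "AppleClang " + std::to_string(__clang_major__) + "." +
         std::to_string(__clang_minor__) + "." +
         std::to_string(__clang_patchlevel__);
#elif defined(__clang__)
  return "Clang " + std::to_string(__clang_major__) + "." +
         std::to_string(__clang_minor__) + "." +
         std::to_string(__clang_patchlevel__);
#elif defined(__INTEL_COMPILER)
  return "Intel " + std::to_string(__INTEL_COMPILER);
#elif defined(__GNUC__)
  return "GCC " + std::to_string(__GNUC__) + "." +
         std::to_string(__GNUC_MINOR__) + "." +
         std::to_string(__GNUC_PATCHLEVEL__);
#else
  return "unknown";
#endif
}
```

Printing `compiler_id()` once at startup would make it easier to compare a Darwin build against a Linux clang build without guessing at the version mapping.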
I am also trying to update open mpi on my Mac platform. |
Too bad:( I upgraded my Mac build to the latest Open MPI (4.0.4) and still see these core dumps. Note that if I run the failing test in serial, it runs and passes. I am also looking into a similar Clang version on my Linux RHEL7 system. Here, it seems that the conversion is Mac/Clang 11.0.3 --> Clang 9.0. We may want to wait to see what Nalu-Wind does, although the fact that Nalu-proper is failing on this platform suggests something is off. As far as I can tell, my Mac environment is sane and the new change definitely caused the failures. Perhaps I should try running an STK unit test on my Trilinos build? How might I do that?
@spdomin I believe the stk_mesh unit tests get built under |
Can you run totalview on that mac machine? If you can get a debug build and run totalview, maybe I could log in and try to see what's going on. I agree it seems like a code problem, I'm just getting low on ideas for reproducing it.. |
I do not have an effective debugger on my Mac system. I circled this topic with Anthony for a while but was never able to obtain a viable TV option. This fact makes it very hard to debug. I may have ddd:) I will work on the stk unit tests. I seem to recall I had to turn these off to avoid adding more dependencies to my build, e.g., loadbal.
Final report for the day: My RHEL7-based Clang 9.0 + Open MPI 4.0.3 build also behaves well. Thus far, gcc 8.3.0 + Open MPI 4.0.3, Intel 19 + Open MPI 4.0.3, and Clang 9.0 + Open MPI 4.0.3 all work. It is only the Darwin build using Clang 11.0.3 with a few flavors of Open MPI (including the most recent 4.0.4 version) that is failing. I plan on trying to run the STK unit test suite tomorrow PM and reporting back. In the meantime, we will see what the nightly Nalu-Wind test results show.
ok, tomorrow is another day, hopefully we can get to the bottom of this... |
@spdomin valgrind is supposed to run on MacOS, maybe that would be helpful? Maybe old fashioned gdb? |
It looks like nalu-wind tests are failing on a linux clang platform with address sanitizer. I'm now trying a build with address sanitizer, hopefully that will reproduce this. |
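When comparing sanitizer and non-sanitizer runs, it is easy to lose track of which binary was actually instrumented. The sketch below detects AddressSanitizer at compile time (the instrumentation is enabled with `-fsanitize=address` on clang and gcc); the macro and function names are hypothetical.

```cpp
// Detect at compile time whether this translation unit is instrumented by
// AddressSanitizer (enabled with -fsanitize=address).
#if defined(__SANITIZE_ADDRESS__)
#  define BUILT_WITH_ASAN 1            // defined by gcc under -fsanitize=address
#elif defined(__has_feature)
#  if __has_feature(address_sanitizer)
#    define BUILT_WITH_ASAN 1          // clang feature check
#  endif
#endif

// Default to 0 when neither check fired.
#ifndef BUILT_WITH_ASAN
#  define BUILT_WITH_ASAN 0            // hypothetical macro name
#endif

// True only when the build was instrumented with AddressSanitizer.
constexpr bool built_with_asan() { return BUILT_WITH_ASAN != 0; }
```

Logging `built_with_asan()` alongside the test output gives a quick confirmation that a failing or passing run really came from the sanitizer build.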
@alanw0, yes:
The Darwin platform for Nalu-Wind worked fine. Argh - at least we have something common here. |
I'm currently beefing up the unit-test for that function a little more, and also trying an address sanitizer build. |
Looks promising... After cloning Alan's fork and checking out his patch, selective testing (spot checking here and there) shows the previously failing tests now passing. I suppose all this means is that the selective revert based on the call stack resulted in a sane build. As you (the STK team) debug this clang issue, let me know if I can help.
@spdomin thanks for checking that. Yes I will keep working to reproduce the problem with a clang build in the sierra system, because I would like to get that code un-reverted if possible, since it was a nice optimization. |
Agreed. I will work on that debug-option that Si suggested. There is still the complexity of getting that Mac platform to your eyes.... I think that the local Clang replication is our best hope. If we find the exact details of that platform, I can replicate it locally. |
This issue has had no activity for 365 days and is marked for closure. It will be closed after an additional 30 days of inactivity. |
This issue was closed due to inactivity for 395 days. |
Bug Report
@trilinos/stk
@alanw0 and @jvo1012 - new clang core dumps - will know tomorrow if this extends to other platforms
Bad: commit 1f7e138
Author: Johnathan Vo jvo@sandia.gov
Good: commit 3c206dd
typical dump:
[Nalus:21801] [ 0] 0 libsystem_platform.dylib 0x00007fff72abc5fd _sigtramp + 29
[Nalus:21801] [ 1] 0 ??? 0x00007ff100001000 0x0 + 140673063849984
[Nalus:21801] [ 2] 0 naluX 0x0000000109ef34d4 _ZNK3stk4mesh8BulkData25shared_procs_intersectionERKNSt3__16vectorINS0_6EntityENS2_9allocatorIS4_EEEERNS3_IiNS5_IiEEEE + 100
[Nalus:21801] [ 3] 0 naluX 0x000000010a06dd41 _ZN3stk4mesh37fill_shared_entities_that_need_fixingERKNS0_8BulkDataE + 1009
[Nalus:21801] [ 4] 0 naluX 0x0000000109f0636b _ZN3stk4mesh8BulkData33resolve_parallel_side_connectionsERNSt3__16vectorINS0_15SideSharingDataENS2_9allocatorIS4_EEEES8_ + 59
[Nalus:21801] [ 5] 0 naluX 0x0000000109f06c0b _ZN3stk4mesh8BulkData48use_elem_elem_graph_to_determine_shared_entitiesERNSt3__16vectorINS0_6EntityENS2_9allocatorIS4_EEEE + 59
[Nalus:21801] [ 6] 0 naluX 0x0000000109f06e74 _ZN3stk4mesh8BulkData56fill_shared_entities_of_rank_while_updating_sharing_infoENS_8topology6rank_tERNSt3__16vectorINS0_6EntityENS4_9allocatorIS6_EEEE + 68
[Nalus:21801] [ 7] 0 naluX 0x0000000109f08cd8 _ZN3stk4mesh8BulkData32internal_resolve_parallel_createERKNSt3__16vectorINS_8topology6rank_tENS2_9allocatorIS5_EEEE + 504
[Nalus:21801] [ 8] 0 naluX 0x0000000109f088ab _ZN3stk4mesh8BulkData48internal_resolve_parallel_create_edges_and_facesEv + 331
[Nalus:21801] [ 9] 0 naluX 0x000000010a04201f _ZN3stk4mesh4impl16MeshModification55internal_modification_end_after_node_sharing_resolutionENS2_25modification_optimizationE + 47
[Nalus:21801] [10] 0 naluX 0x0000000109ebd061 _ZN3stk2io15StkMeshIoBroker22populate_mesh_sidesetsEb + 321
[Nalus:21801] [11] 0 naluX 0x00000001093d2b1a _ZN6sierra4nalu5Realm10initializeEv + 362
[Nalus:21801] [12] 0 naluX 0x00000001093f667a _ZN6sierra4nalu6Realms10initializeEv + 42
[Nalus:21801] [13] 0 naluX 0x00000001094144d2 _ZN6sierra4nalu10Simulation10initializeEv + 18
[Nalus:21801] [14] 0 naluX 0x000000010914e8ef main + 5727
[Nalus:21801] [15] 0 libdyld.dylib 0x00007fff728bfcc9 start + 1
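The frames above are mangled C++ symbols; the `NSt3__1` fragments are libc++'s `std::__1` inline namespace. To read them as function signatures, a small helper around `abi::__cxa_demangle` from `<cxxabi.h>` (available with gcc and clang) can be used. This is a sketch; the helper name `demangle` is hypothetical.

```cpp
#include <cxxabi.h>
#include <cstdlib>
#include <string>

// Demangle an Itanium-ABI symbol name; returns the input unchanged on failure.
std::string demangle(const std::string& mangled) {
  int status = 0;
  char* out = abi::__cxa_demangle(mangled.c_str(), nullptr, nullptr, &status);
  if (status != 0 || out == nullptr) {
    return mangled;  // not a valid mangled name, or allocation failed
  }
  std::string result(out);
  std::free(out);  // __cxa_demangle allocates the buffer with malloc
  return result;
}
```

For example, `demangle("_ZN6sierra4nalu5Realm10initializeEv")` gives `sierra::nalu::Realm::initialize()`, which is frame 11 of the dump. Piping the whole trace through the `c++filt` command-line tool achieves the same result without writing code.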
Description
Steps to Reproduce