
Framework: Create Build for ETI OFF #5729

Closed
ZUUL42 opened this issue Aug 16, 2019 · 52 comments

@ZUUL42
Contributor

ZUUL42 commented Aug 16, 2019

Enhancement

@trilinos/framework

Create a build that disables explicit template instantiation (ETI).
This may first be tried in dev->master.
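
For reference, the Trilinos-level CMake switch for ETI is Trilinos_ENABLE_EXPLICIT_INSTANTIATION. A minimal sketch of what such a build would add to its configuration (the settings-fragment placement mirrors the PullRequestLinux*TestingSettings.cmake files discussed later in this thread and is an assumption, not part of this request):

# Sketch only: configure this build (and its tests) with ETI disabled.
set (Trilinos_ENABLE_EXPLICIT_INSTANTIATION OFF CACHE BOOL "Disable ETI for this build")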

@ZUUL42 ZUUL42 added the type: enhancement Issue is an enhancement, not a bug label Aug 16, 2019
@ZUUL42 ZUUL42 self-assigned this Aug 16, 2019
@mhoemmen
Contributor

mhoemmen commented Aug 16, 2019

btw Sierra now enables ETI by default (as of Tue 13 Aug 2019), though this would still be helpful as long as we permit users to disable ETI. @rrdrake @sebrowne

@jwillenbring
Member

With the Trilinos default changing to ETI ON, we will need to add a build with ETI OFF, or we will lose the ability to build with ETI OFF. @kddevin mentioned this recently as well. I suggest one of the dev->master builds; I propose the Clang 9 build, but I am open to other suggestions.

@ZUUL42
Contributor Author

ZUUL42 commented Nov 16, 2020

We discussed this at the framework stand up today.
We're going to try a couple of test builds with ETI OFF to add to dev->master initially: GCC 7 and/or 8, and Clang 9.
With those results, we'll decide what we want in place long term.

@mhoemmen
Contributor

Best practice with an ETI OFF test would be to enable "weird" types, just to make sure that downstream dependency inversion for factories is working correctly.

@ZUUL42
Contributor Author

ZUUL42 commented Dec 2, 2020

@mhoemmen what are you referring to exactly with "weird types"?

@ZUUL42
Contributor Author

ZUUL42 commented Dec 2, 2020

I've tested a few things with ETI OFF using GCC 7.2.0 and clang 9.0.0. There are a significant number of errors for both.
To minimize the number coming from GCC, I turned off -Werror, but I'm still getting 50+ build errors.
GCC 7.2.0 ETI OFF CDash
clang 9.0.0 ETI OFF CDash

It seems we're primarily running into issues in @trilinos/tempus, with a few problems in @trilinos/piro and @trilinos/rol and some odd issues in @trilinos/muelu, @trilinos/shylu, @trilinos/ifpack2, and @trilinos/xpetra. Though there are likely more hiding behind the 50 error limiter.
Could the teams take a look and let me know if there is something I can change in my configuration, or if there are changes we're going to need to make in the packages? Thanks.

@jhux2
Member

jhux2 commented Dec 2, 2020

@ZUUL42 I randomly sampled the MueLu errors, which are all in the tests. They are of the type

g++: internal compiler error: Killed (program cc1plus)
Please submit a full bug report,
with preprocessed source if appropriate.
See <https://gcc.gnu.org/bugs/> for instructions.

I don't know what could be causing this. You might try reducing the make parallelism, just an idea.

@ccober6
Contributor

ccober6 commented Dec 3, 2020

FYI, I built Tempus with ETI=OFF on my mac and got the same errors that are produced above. We are investigating.

@mhoemmen
Contributor

mhoemmen commented Dec 3, 2020

@ZUUL42 wrote:

what are you referring to exactly with "weird types"?

I should have clarified: types that are not enabled by default in ETI=ON builds. This will test run-time registration of solver factories for the ETI=OFF case.
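
As an illustration (a sketch under assumptions: the Tpetra_INST_* options are the usual knobs for requesting extra Scalar types, but whether they also drive the test type set in an ETI=OFF build is not confirmed here), one could pair ETI OFF with a Scalar type that the default ETI=ON configuration leaves disabled:

# Sketch: ETI off plus a non-default Scalar type, to exercise implicit
# instantiation and run-time solver-factory registration in the tests.
set (Trilinos_ENABLE_EXPLICIT_INSTANTIATION OFF CACHE BOOL "ETI off")
set (Tpetra_INST_FLOAT ON CACHE BOOL "Enable a Scalar type outside the default set")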

@ZUUL42
Contributor Author

ZUUL42 commented Feb 4, 2021

@jhux2, I've tried setting CMAKE_BUILD_PARALLEL_LEVEL to 2 with no change in results.
Build 1468 w/Tempus OFF
Build 1471 w/Tempus OFF & parallel = 2
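
Note that CMAKE_BUILD_PARALLEL_LEVEL is an environment variable that cmake --build reads for its default -j level, not a CMake cache variable, so if it was set as a cache variable in the configuration, that could explain the lack of effect. A minimal sketch of exporting it from the driving CMake/CTest process instead (whether the PR driver actually propagates this to the build tool is an assumption):

# Sketch: export the variable into the environment of the process launching the build.
set (ENV{CMAKE_BUILD_PARALLEL_LEVEL} 2)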

@ccober6
Contributor

ccober6 commented Feb 4, 2021

FYI, the Tempus failures should be fixed in the next day or so with #8549. Just waiting on a review.

@ccober6
Contributor

ccober6 commented Feb 11, 2021

#8549 has been merged, so you should not see any Tempus failures for ETI=OFF.

@ZUUL42
Contributor Author

ZUUL42 commented May 24, 2021

I've run a build of ETI Off with the Master Merge PR Clang 9.0.0 build.
Here are the results: CDash
Unfortunately it hit the 50 error limiter, but we can see build error counts for:

  • @trilinos/muelu - 24
  • @trilinos/zoltan2 - 13
  • @trilinos/stk - 6
  • @trilinos/shylu (ShyLU_DD) - 2
  • @trilinos/ifpack2 - 1
  • @trilinos/panzer - 1
  • @trilinos/piro - 1
  • @trilinos/trilinoscouplings - 1
  • @trilinos/xpetra - 1
  • (other build errors that didn't appear in the CDash report)

@cgcgcg
Contributor

cgcgcg commented May 25, 2021

@ZUUL42 MueLu build errors should be fixed.

@ZUUL42
Contributor Author

ZUUL42 commented May 25, 2021

@cgcgcg thanks.
I'll run it again today and see where we are.

@ZUUL42
Contributor Author

ZUUL42 commented May 25, 2021

Query to show all eti_off test builds. CDash

@cgcgcg
Contributor

cgcgcg commented May 26, 2021

Looks like some issue in KokkosKernels is now causing a lot of stuff in MueLu to fail?
@trilinos/kokkos-kernels

@ZUUL42
Contributor Author

ZUUL42 commented Jun 3, 2021

@cgcgcg
Building with ETI OFF for the latest update to develop. CDash

Looks like we are getting close on this now. We're down to 2 build errors from @trilinos/trilinoscouplings and some test failures from @trilinos/muelu, @trilinos/trilinoscouplings, & @trilinos/panzer.

If we can at least get the @trilinos/trilinoscouplings error fixed, we have the option of fixing or turning off the tests failing for this build.

@jhux2
Member

jhux2 commented Jun 4, 2021

@ZUUL42 These cmake warnings look a bit scary:


Total time to configure Trilinos: 1m16.166s
-- Configuring done
CMake Warning:
  Value of Trilinos_ENABLE_TESTS contained a newline; truncating


CMake Warning:
  Value of Trilinos_PARALLEL_LINK_JOBS_LIMIT contained a newline; truncating


-- Generating done
CMake Warning:
  Value of Trilinos_ENABLE_TESTS contained a newline; truncating


CMake Warning:
  Value of Trilinos_PARALLEL_LINK_JOBS_LIMIT contained a newline; truncating

@ZUUL42
Contributor Author

ZUUL42 commented Jun 4, 2021

@jhux2, I'll look into that.
Though they seem to have been there even without this change, at least as far back as the CDash reports on record go (2021 Feb 04).

@cgcgcg
Contributor

cgcgcg commented Jun 8, 2021

@ZUUL42 I fixed one of the 2 build errors. The other one looks like the OOM killer might have gotten the compiler. On my workstation, I had to reduce the build parallelism to avoid the issue.
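
One related knob already visible in this thread's configure output is Trilinos_PARALLEL_LINK_JOBS_LIMIT (see the CMake warnings quoted above). A sketch of capping it is below, with the caveat that the suspected out-of-memory failures occur during compilation, so limiting link jobs alone may not be enough:

# Sketch: illustrative value only; this caps simultaneous link jobs, not compile jobs.
set (Trilinos_PARALLEL_LINK_JOBS_LIMIT 2 CACHE STRING "Cap parallel link jobs")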

@cgcgcg
Contributor

cgcgcg commented Jun 8, 2021

@trilinos/belos @hkthorn @jennloe @iyamazaki I don't understand the Belos solver registration well enough to fix these test failures. Looks like essentially no Belos solvers are registered, yet none of the Belos tests fail. Are Xpetra solvers different from Epetra/Tpetra solvers?

@ikalash
Contributor

ikalash commented Jun 15, 2021

FYI, it appears there are new errors from Ifpack2 if tests are enabled; they come from the unit tests. I opened a separate issue on this, #9280. These just cropped up at the end of last week with an Ifpack2 PR that went in.

@ZUUL42
Contributor Author

ZUUL42 commented Jun 22, 2021

I ran another test build the other day but forgot to post it right away (whoops, sorry). This is from Jun 18.
CDash
I can kick off another as well.

@cgcgcg
Contributor

cgcgcg commented Jun 22, 2021

Looks like OOM, no?

@ZUUL42
Contributor Author

ZUUL42 commented Jun 30, 2021

@cgcgcg, that's what I'm finding.
I've tried setting parallel_level to 8 and then 2, but I'm still getting the same errors.
Might you have any other suggestions?
Otherwise, it seems we'll have to turn off packages or find some other creative solution, which won't make for a very good PR build.

@jhux2
Member

jhux2 commented Jul 1, 2021

@ZUUL42 I assume parallel_level is akin to make -j. Here are a couple of ideas to try:

  1. Try gcc instead of clang to see if the former is better with memory.
  2. Try building with "-O0" and without "-g". Perhaps this would need less memory than the current RELEASE build.

@jwillenbring
Member

Based on feedback from the Trilinos Product Lead team, we are pressing pause on this effort and will likely discontinue trying to support this option. We need to socialize this more before making a final decision.

@jwillenbring
Member

I was unable to see the latest CDash results. @ZUUL42 do you remember how many packages were failing? Could we turn a few packages off and get a clean build? There is renewed concern about this.

@ZUUL42
Contributor Author

ZUUL42 commented Jul 31, 2021

@jwillenbring here are a couple of new ETI OFF builds.
The clang 9 build still has parallel level set to 1. The clang 10 build has no other changes besides having ETI set to off.
CDash

@ZUUL42
Contributor Author

ZUUL42 commented Aug 2, 2021

Kicked off a couple more builds. The same link will pick them up.
The Clang 10 build has the 34 failing tests turned off.
The Clang 9 build, so far, just has the parallel level line removed.

@jhux2
Copy link
Member

jhux2 commented Aug 2, 2021

@ZUUL42 You might try a different compiler, such as gcc, which presumably would have different memory requirements (if memory is what's causing the internal compiler error).

@ZUUL42
Contributor Author

ZUUL42 commented Aug 2, 2021

@jhux2 I can do that, though I'm not sure whether there was a reason Clang was specifically chosen for the ETI OFF build.
@jwillenbring it appears you had proposed clang, at least in this thread. Do you recall if there was a specific reason for that?
The test results are gone, but it appears I had run ETI OFF with clang 9 & gcc 7.2. I seem to remember clang 9 appeared to be closer to building clean, so I went that direction. However, there has been a significant amount of change since then.
I can run it through with GCC 7.2 & 8.3 and see what we get.

@jhux2
Member

jhux2 commented Aug 3, 2021

@ZUUL42 I'm just thinking of how to avoid what appears to be an out-of-memory error with the compiler.

The builds are all RELEASE. A clang build with -O0 and without -g might relieve some of the memory pressure and allow you to reenable those MueLu tests.

@cgcgcg
Contributor

cgcgcg commented Aug 3, 2021

The MueLu tests were failing due to a bug in Belos, not due to OOM, no?

@jhux2
Member

jhux2 commented Aug 3, 2021

The MueLu tests were failing due to a bug in Belos, not due to OOM, no?

@cgcgcg The results in @ZUUL42's comment still have internal compiler errors for MueLu, which I interpret as OOM.

@ZUUL42
Contributor Author

ZUUL42 commented Aug 3, 2021

@jhux2 I'm not familiar with those flags.
For CMake, would it be:
set(CMAKE_CXX_FLAGS "-O0 -g" CACHE STRING "")
in PullRequestLinuxClang10.0.0TestingSettings.cmake?

@jhux2
Member

jhux2 commented Aug 3, 2021

@ZUUL42 You only want -O0, without the -g option. -O0 sets the compiler optimization level to zero, i.e., no optimization. Adding -g would cause the compiler to include debug symbols and thus increase the executable size. My thinking is that the compiler needs additional memory to do code optimizations.

@cgcgcg
Contributor

cgcgcg commented Aug 3, 2021

Looks like debug symbols are also switched on here:

set (Trilinos_ENABLE_DEBUG_SYMBOLS ON CACHE BOOL "Set by default for PR testing")

@ZUUL42
Contributor Author

ZUUL42 commented Aug 4, 2021

Clang 10 Build 3272 adds:
set(CMAKE_CXX_FLAGS_RELEASE "-O0")
and removes:
set (Trilinos_ENABLE_DEBUG_SYMBOLS ON CACHE BOOL "Set by default for PR testing")
With tests back on ... 35 of them still fail. See CDash

@jhux2
Member

jhux2 commented Aug 5, 2021

Clang 10 Build 3272 adds:
set(CMAKE_CXX_FLAGS_RELEASE "-O0")
and removes:
set (Trilinos_ENABLE_DEBUG_SYMBOLS ON CACHE BOOL "Set by default for PR testing")
With tests back on ... 35 of them still fail. See CDash

@ZUUL42 This is progress, right? The MueLu tests that previously didn't compile now do, right?

@wadeburgess

Migrated to Jira.
