
Set up new Intel 17.x build to use as auto PR build #2463

Closed
bartlettroscoe opened this issue Mar 27, 2018 · 14 comments
Labels
client: ATDM (Any issue primarily impacting the ATDM project)
type: enhancement (Issue is an enhancement, not a bug)

Comments

@bartlettroscoe
Member

bartlettroscoe commented Mar 27, 2018

CC: @trilinos/framework, @mhoemmen, @rppawlo, @ibaned, @crtrott

Next Action Status

Intel 17.0.1 PR builds running since 6/1/2018

Description

This Issue is to scope out and track efforts to create an Intel 17.x build that matches the auto PR build described in #2317 (comment).

The settings for this build are:

  • Intel 17.x with GCC 4.9.x standard C++ headers using the SEMS env
  • TPL_ENABLE_MPI=ON (OpenMPI 2.x)
  • Primary Tested Packages
  • Primary Tested TPLs
  • BUILD_SHARED_LIBS=ON
  • CMAKE_BUILD_TYPE=RELEASE
  • Trilinos_ENABLE_DEBUG=OFF
  • Trilinos_ENABLE_EXPLICIT_TEMPLATE_INSTANTIATION=ON
  • Xpetra_ENABLE_Experimental=ON
  • MueLu_ENABLE_Experimental=ON
  • Trilinos_TRACE_ADD_TEST=ON
  • Trilinos_TEST_CATEGORIES=BASIC

The existing GCC 4.8.4 CI build (shown here), which has been running for 1.5+ years, may be a good foundation for this build since it already sets most of these options and the Trilinos/cmake/load_sems_env.sh script already allows selecting different compilers.

Tasks:

  1. Select the version of Intel and OpenMPI from the SEMS env:
    a. NOTE: SEMS only provides sems-intel/17.0.1.
    b. NOTE: The highest version of OpenMPI provided by SEMS is sems-openmpi/1.10.1.
  2. Set up a trial build using these settings and test locally ...
  3. Set up a Nightly Jenkins build submitting to the "Specialized" CDash Track/Group ...
  4. Clean up all failures in the new build ...
  5. ???

Related Issues:

@bartlettroscoe
Member Author

Found a problem with this plan for the Intel 17.x build. It looks like the SEMS env does not provide any builds of OpenMPI 2.x. All that seems to be available is:

sems-openmpi/1.10.1
sems-openmpi/1.6.5
sems-openmpi/1.8.7

Now they do have:

sems-mpich/3.2

It seems like it would be a good idea to use a different MPI implementation for one of the builds. I know that in CASL we switched to MPICH because it caught more errors than OpenMPI at the time. Is this still true?

As for Intel 17.x, it looks like the only version supported by SEMS is:

sems-intel/17.0.1

Is it a problem then if we go with Intel 17.0.1 + OpenMPI 1.8.7? Is there really any value in going with OpenMPI 1.10.1? I can try it if people think that is useful. Otherwise, should we ask SEMS to install a software stack with OpenMPI 2.x? That could take a while and it seems like they need to retire an OpenMPI version (like 1.6.5) before they add another OpenMPI version.

@nmhamster
Contributor

@bartlettroscoe - we have generally found OpenMPI to be every bit as good as MPICH. More importantly, we use OpenMPI on the testbeds and CTS, and it will underpin IBM Spectrum MPI on ATS-2. In an ideal world we would test both MPI variants, but if we have to pick one I would select OpenMPI because of the CTS use.

@bartlettroscoe
Member Author

I just looked and it seems that openmpi/1.10.4 is being used for the ATDM builds on 'white' and 'ride' and openmpi/2.1.1 is being used on 'hansen' and 'shiller'. It would be nice to be able to test with OpenMPI 2.x in this Intel build, but it is not available. In that case, it seems we should use OpenMPI 1.10.1, which is provided by the SEMS env, for this Intel 17.x build.

@bartlettroscoe
Member Author

we have generally found OpenMPI to be every bit as good as MPICH.

@nmhamster, how do you define "good"? In the CASL case, we found that MPICH caught errors in the usage of MPI that OpenMPI did not. I would have to dig up which versions those were where that was our experience. We did not care whether OpenMPI ran faster than MPICH or vice versa because this was just for our test env. If I remember right (since it was many years ago), there was a defect in Tpetra's MPI usage that OpenMPI let pass, but when we ran CASL VERA on another machine, it bombed. It took a long time to debug and find the issue.

Anyway, given that OpenMPI is the target for ATS-2, it seems like a good choice for our testing.

@mhoemmen
Contributor

mhoemmen commented Mar 28, 2018

OpenMPI 1.10.x implements the bits of MPI 3 that Tpetra optionally uses (with macros for the MPI version). For GPU builds, it's better to use newer versions of OpenMPI, but for CPU builds, I'm less worried about that for now.
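As a rough sketch (not Tpetra's actual code, and the helper name below is made up), an MPI-3 call can be guarded behind the standard MPI_VERSION macro from mpi.h so the same source still compiles against a pre-MPI-3 implementation:

#include <mpi.h>

// Sum 'count' doubles across all ranks of 'comm' into 'result'.
void sumAll (const double* localVals, double* result, int count, MPI_Comm comm)
{
#if MPI_VERSION >= 3
  // MPI-3 path (available in OpenMPI 1.10.x): nonblocking all-reduce.
  MPI_Request req;
  MPI_Iallreduce (localVals, result, count, MPI_DOUBLE, MPI_SUM, comm, &req);
  MPI_Wait (&req, MPI_STATUS_IGNORE);
#else
  // Pre-MPI-3 fallback: blocking all-reduce (MPI-2 declares the send buffer non-const).
  MPI_Allreduce (const_cast<double*> (localVals), result, count, MPI_DOUBLE, MPI_SUM, comm);
#endif
}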

@rppawlo
Contributor

rppawlo commented Mar 28, 2018

In the CASL case, we found that MPICH found errors in the usage of MPI that OpenMPI did not.

I can't remember all of the CASL errors, but one of the easier ones to diagnose was that OpenMPI allowed aliasing of send and receive arrays, which is technically not allowed by the MPI standard, while MPICH automatically flagged those uses in the CASL code.
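For reference, here is a minimal sketch (illustrative only, not the actual CASL code) of the kind of aliasing in question: passing the same buffer as both the send and receive arguments of a collective is not allowed by the MPI standard, and MPICH flags it at run time, while the OpenMPI versions used at the time accepted it. MPI_IN_PLACE is the conforming way to reduce in place:

#include <mpi.h>

void reduceInPlace (double* buf, int count, MPI_Comm comm)
{
  // Non-conforming: the send and receive buffers alias each other.
  //   MPI_Allreduce (buf, buf, count, MPI_DOUBLE, MPI_SUM, comm);

  // Conforming in-place reduction:
  MPI_Allreduce (MPI_IN_PLACE, buf, count, MPI_DOUBLE, MPI_SUM, comm);
}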

@mhoemmen
Contributor

@rppawlo That's a good point -- it would be helpful to have an extra Dashboard test for other MPI implementations.

@bartlettroscoe
Member Author

That's a good point -- it would be helpful to have an extra Dashboard test for other MPI implementations.

So should we try MPICH for this Intel 17.0.1 build or the GCC 4.8.4 build? Note that OpenMPI 1.8.7 causes 30 test timeouts with the GCC 4.8.4 build as described in #2462 (comment). I am currently testing OpenMPI 1.10.1 with that GCC 4.8.4 build.

@mhoemmen
Contributor

@bartlettroscoe I very deliberately said "Dashboard" not necessarily PR ;-) . I would welcome more MPI options for PR testing, but I would rather have mandatory PR testing sooner than have multiple MPIs in PR testing later :-) .

I would say, OpenMPI 1.10.x w/ GCC 4.8.4, and MPICH w/ Intel 17.0.1.

@bartlettroscoe added the type: enhancement label Apr 3, 2018
@prwolfe
Contributor

prwolfe commented Apr 17, 2018

Working on this now - notes

Matches our setup:

Intel 17.x with GCC 4.9.x standard C++ headers using the SEMS env
TPL_ENABLE_MPI=ON (mpich 3.2)
Primary Tested Packages
Primary Tested TPLs
BUILD_SHARED_LIBS=ON
CMAKE_BUILD_TYPE=RELEASE
Trilinos_ENABLE_DEBUG=_ON_
Trilinos_ENABLE_EXPLICIT_TEMPLATE_INSTANTIATION=ON (This is actually Trilinos_ENABLE_EXPLICIT_INSTANTIATION=ON)
Trilinos_TRACE_ADD_TEST=ON
Trilinos_TEST_CATEGORIES=BASIC

New stuff to look at. Why experimental code in PR instead of specialized or experimental tracks?

Xpetra_ENABLE_Experimental=ON
MueLu_ENABLE_Experimental=ON

I will take a few days to get this much set up and working, as I want to refactor the existing driver script.

Paul

@mhoemmen
Contributor

@prwolfe I don't think we actually need the MueLu and Xpetra "experimental" options enabled:

#2317 (comment)

but I'm not sure if the ATDM builds have disabled these options (yet).

@prwolfe
Contributor

prwolfe commented Apr 19, 2018

Thanks for the reference to that discussion @mhoemmen. That matches my instincts as well!

@prwolfe self-assigned this Apr 19, 2018
bartlettroscoe added a commit to bartlettroscoe/Trilinos that referenced this issue Apr 21, 2018
This could be used, for example, for the Intel 17 build in trilinos#2463.
bartlettroscoe added a commit to bartlettroscoe/Trilinos that referenced this issue Apr 23, 2018
Kokkos is not using Pthread so don't name the build 'Pthread'.  The Pthread
TPL is enabled to allow other testing.
bartlettroscoe added a commit to bartlettroscoe/Trilinos that referenced this issue Apr 23, 2018
This could be used, for example, for the Intel 17 build in trilinos#2463.
bartlettroscoe added a commit to bartlettroscoe/Trilinos that referenced this issue Apr 24, 2018
It was requested that we use GCC 4.9.3 headers with Intel 17.0.1 builds of
Trilinos (see trilinos#2317 and trilinos#2463).
@bartlettroscoe
Member Author

One option for this Intel build is to use the SEMS Dev Env setup documented in:

You basically just source:

$ source <trilinos-base-dir>/cmake/load_sems_dev_env.sh sems-intel/17.0.1

and then configure Trilinos using the option:

  -C <trilinos-base-dir>/Trilinos/cmake/std/MpiReleaseSharedPtSerial.cmake \

Using the new aggregate file MpiReleaseSharedPtSerial.cmake in PR #2609, you just configure with:

$ source <trilinos-base-dir>/cmake/load_sems_dev_env.sh sems-intel/17.0.1

$ cmake \
  -C <trilinos-base-dir>/cmake/std/MpiReleaseSharedPtSerial.cmake \
  -DTrilinos_ENABLE_TESTS=ON -DTrilinos_ENABLE_<PKG0>=ON -DTrilinos_ENABLE_<PKG1>=ON ... \
  -DTrilinos_ENABLE_ALL_FORWARD_DEP_PACKAGES=ON \
  <trilinos-base-dir>

and that is it!

If we want to allow for tweaks (like some specific tests that need to be temporarily disabled), then we might create a new file called something like Intel-17.0.1-PrBuild.cmake that contains the same includes as MpiReleaseSharedPtSerial.cmake.

Yesterday I tested the full set of Primary Tested packages and TPLs on my machine crf450 with an experimental all-at-once build, test, and submit (with CMake 3.10.1) and it submitted to:

This showed 7 failing tests for the packages:

  • MueLu: 4
  • Kokkos: 1
  • Zoltan: 1
  • Zoltan2: 1

That build took:

  • Configure: 2m 41s
  • Build: 3h 27m 19s
  • Test: 22m 5s

That is nearly 4 hours to run. Is that too long for an auto PR build?

That ran the NIGHTLY test category. The Zoltan tests alone took a "Proc Time" of 1h 38m. Should we run only the BASIC test category instead? That would cut down on the time a little bit. But perhaps saving that little bit of time is not worth it since we are building from scratch every time?

See details below.

We could set up a "Specialized" build for this that runs nightly and then get this cleaned up.

DETAILS:
$ cd ~/Trilinos.base/BUILDS/INTEL-17.0.1/MPI_RELEASE_DEBUG_SHARED_PT/

$ rm -r CMake*

$ source ~/Trilinos.base/Trilinos/cmake/load_sems_dev_env.sh sems-intel/17.0.1
        WARNING: sems-gcc dependency already found but does not match listed dependency sems-gcc/4.7.2
        I will use the sems-gcc you have loaded but correct behavior is not guaranteed

$ export PATH=/home/vera_env/common_tools/cmake-3.10.1/bin:$PATH

$ which cmake
/home/vera_env/common_tools/cmake-3.10.1/bin/cmake

$ cmake --version
cmake version 3.10.1

$ time cmake \
  -C ../../../Trilinos/cmake/std/MpiReleaseDebugSharedPtSerial.cmake \
  -DTrilinos_CTEST_DO_ALL_AT_ONCE=ON \
  -DTrilinos_CTEST_USE_NEW_AAO_FEATURES=ON \
  -DCTEST_BUILD_FLAGS=-j16 \
  -DCTEST_PARALLEL_LEVEL=16 \
  -DTrilinos_ENABLE_ALL_PACKAGES=ON \
  ../../../Trilinos \
  &> configure.out

real    5m30.441s
user    0m22.995s
sys     0m17.851s

$ time make dashboard &> make.dashboard.out

real    234m34.838s
user    2752m21.065s
sys     143m9.997s

That submitted to:

@bartlettroscoe
Member Author

This has been done since about 6/1/2018 as shown in this query run just now.

Closing as complete.
