
Select set of builds for initial mandatory auto PR testing process #2317

Closed
bartlettroscoe opened this issue Mar 1, 2018 · 65 comments
Labels
  • client: ATDM (Any issue primarily impacting the ATDM project)
  • Framework tasks (used internally by Framework team)
  • type: enhancement (Issue is an enhancement, not a bug)

Comments

@bartlettroscoe
Member

bartlettroscoe commented Mar 1, 2018

CC: @trilinos/framework

Next Action Status

This ship has sailed on "Initial" a long time ago. The only remaining build is the CUDA PR build, and that is being tracked in #2464.

Description

This story is for interested members of the Trilinos team to collaborate on selecting the best build configurations so that the new Trilinos auto PR testing and merging tool and process (#1155) can become the mandatory way to test and push to the main Trilinos ‘develop’ branch (#2312). This selection must account for the limited computing resources (on the Jenkins build farms) available to run automated builds. So while we would like to run many different useful builds as part of pre-push automated PR testing, we have to be strategic about which builds we run where so as not to overwhelm current capacity. As more build machines are added to the Jenkins build farm, more builds can be added to the auto PR testing process.

This story is a follow-on from the action item:

  • ACTION (Ross): Set up a separate meeting to discuss what that build / those builds (if more than one) should be

in the 2018-02-26 Trilinos Planning Meeting.

The set of Trilinos team members interested in being part of this discussion (and meetings) includes:

Since the selection of these builds will impact every Trilinos developer and every close customer and collaborator of Trilinos, it is important that we get input from many different people in making this selection.

Definition of Done

  • Document conversation between Trilinos developers on the selection of these builds
  • New build configurations selected (with actual Trilinos configuration files)
  • Trial builds of Trilinos posted to CDash for the chosen configurations

Related Issues

Task

  1. Find initial selection of team members interested to discuss this topic [DONE]
  2. Set up and have meeting with working group [DONE]
  3. Create initial set of builds in meeting [DONE]
  4. Create concrete *.cmake files for each proposed configuration and set up nightly builds submitting to the "Specialized" CDash Track/Group (a minimal sketch of what such a fragment could look like is shown after this list):
    a. GCC 4.8.4: See Modify existing GCC 4.8.4 CI build to match selected auto PR build #2462
    b. Intel 17.x: See Set up new Intel 17.x build to use as auto PR build #2463
    c. CUDA: See Set up a CUDA build for an auto PR build #2464
  5. ???
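
For orientation only, a per-build *.cmake fragment for the PR system is just a collection of CMake cache settings. The file name and option set below are hypothetical placeholders, not the agreed-upon configurations; the real files were worked out in #2462, #2463, and #2464:

```cmake
# Hypothetical sketch of one PR build settings fragment; the compiler and
# option choices here are illustrative only.
set(CMAKE_CXX_COMPILER g++ CACHE FILEPATH "")
set(TPL_ENABLE_MPI ON CACHE BOOL "")
set(Trilinos_ENABLE_OpenMP ON CACHE BOOL "")
set(Trilinos_ENABLE_ALL_OPTIONAL_PACKAGES ON CACHE BOOL "")
```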
@bartlettroscoe
Member Author

bartlettroscoe commented Mar 1, 2018

I sent the following email. I will give it until the week of 3/12/2018, after the SIAM PP conference, to have this meeting. That should be urgent enough.


To: sandia-trilinos-developers@software.sandia.gov
Cc: trilinos-framework@software.sandia.gov
Subject: [Trilinos-Framework] Interested in selection of builds used in mandatory auto PR testing and merge process?

Hello Sandia Trilinos Developers,

Following up on the action item I was given at the Trilinos Developers Planning meeting on Monday with notes listed at:

https://docs.google.com/document/d/1JClLSR3n79XJT_yLPPoQv_e3rJToHxl8fUzRYojtY5Y

I am going to organize a discussion on the selection of the set of builds the auto PR testing process should be using before it becomes the required way to test and push to the Trilinos ‘develop’ branch.

Therefore, if you are interested in being part of this discussion, please add your name in the first comment in the existing list under the “Description” section in the new issue:

#2317

and then I will try to organize a meeting of all the interested parties.

Since this selection of builds will impact every Trilinos developer and every close customer and collaborator of Trilinos, it is important that we get input from many different people in making this selection.

Thanks,

-Ross

@csiefer2
Member

csiefer2 commented Mar 2, 2018

Any machine used for PR testing must be accessible (either directly or by an identical system) by all developers so they can fix issues that PR testing exposes.

@mhoemmen
Contributor

mhoemmen commented Mar 3, 2018

I'm with @csiefer2 -- for example, if we intend to support Windows builds, the Windows Dashboard build should be a VM that we can access (either log into, or download and run on our own) and use for testing.

@bartlettroscoe
Member Author

I don't think there is any way that we would make a Windows build an auto PR build or even a CI build anytime soon. There are so many other builds that are more important than that.

But I think the rule should be that before any build can enter the "Nightly", "Clean", or "Auto PR" category, it must be easy for any Sandia staff member to access a machine where the build can be exactly reproduced. I agree with @csiefer2 and @mhoemmen on this point.

@jwillenbring
Member

@bartlettroscoe Thank you for the reminder about the MueLu, Ifpack2, and Anasazi tests. These packages had test failures a while back and were disabled for that reason (needed 100% passing). Some of these issues I believe have been resolved. I am testing that now.

@bartlettroscoe
Member Author

Thank you for the reminder about the MueLu, Ifpack2, and Anasazi tests. These packages had test failures a while back and were disabled for that reason (needed 100% passing). Some of these issues I believe have been resolved. I am testing that now.

@jwillenbring,

Just the failing tests should be disabled, not all the tests for a package. You write a GitHub issue for the failing test(s), then disable just those tests. Generally one wants to disable with a scalpel, not a machete.

@jwillenbring
Member

Just the failing tests should be disabled, not all the tests for a package. You write a GitHub issue for the failing test(s), then disable just those tests. Generally one wants to disable with a scalpel, not a machete.

Agreed. At the time I didn't want the issues to get lost, with no one noticing that the tests were disabled. Much of this has to do with ETI.

@bartlettroscoe
Member Author

At the time I didn't want the issues to be lost and not notice that the tests were disabled. Much of this has to do with ETI.

If the developers of those packages don't care to fix the broken tests tracked in the GitHub issues for their packages, then that is on them. And you only disable the tests for that particular build, not for all builds. Look at the git log for the file cmake/std/BasicCiTestingSettings.cmake to see how to do this. These tests will still show up failing in the "Clean", "Nightly", and other builds, but at least they will not mess up other developers' PR testing. That way, no one will forget these failed tests. But if they let the tests fail for more than a day or two, then these tests need to be disabled in "Clean" and "Nightly" as well.

It is important to fix this quickly because developers might be thinking that the auto PR builds are testing Ifpack2, Anasazi, and MueLu when they are not. This means there is an even higher chance of PR merges breaking tests in these packages.
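
For reference, a minimal sketch of what such targeted, build-specific disables can look like in a settings file like cmake/std/BasicCiTestingSettings.cmake. The test names below are made up for illustration; the mechanism assumed here is TriBITS skipping any test whose <fullTestName>_DISABLE cache variable is set:

```cmake
# Disable only the individual failing tests (each tracked in its own GitHub
# issue), not the whole package.  Test names here are hypothetical examples.
set(Ifpack2_SomeFailingUnitTest_MPI_4_DISABLE ON CACHE BOOL
  "Disabled only for CI/PR builds; see the tracking GitHub issue")
set(MueLu_SomeFailingUnitTest_MPI_1_DISABLE ON CACHE BOOL
  "Disabled only for CI/PR builds; see the tracking GitHub issue")
```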

@mhoemmen
Contributor

mhoemmen commented Mar 7, 2018

Wait, what?!? Ifpack2 and MueLu tests were totally disabled?!?

@bartlettroscoe
Member Author

Hello, it is after SIAM PP. Can we set up a meeting on this?

@bartlettroscoe
Member Author

@jwillenbring and/or @bmpersc,

Do you guys want to be included in this meeting? If so, please add your GitHub IDs to the list in the above Description field. I will then try to set up a meeting using people's SNL calendars to select a time that works for everyone. (Or if I can't find a time, I will set up a dreaded doodle.com poll.)

@bartlettroscoe
Member Author

One issue to consider is that according to:

GCC 4.8.4 should implement OpenMP 3.1 while GCC 4.9.3 should implement OpenMP 4.0.

Is that an issue for testing OpenMP code with Trilinos? How much difference is there in the Kokkos OpenMP threading implementation between OpenMP 3.1 and OpenMP 4.0? Or is this not a concern because updates to Kokkos are tested with many different compilers before merging to Trilinos 'develop'?

If we are not worried about the OpenMP 3.1 vs. OpenMP 4.0 issue, then a single GCC 4.8.4 build for auto PR testing would seem to be okay.

@nmhamster
Contributor

@bartlettroscoe - OpenMP 3.1 would be good for now because we have a range of platforms where it is the most up-to-date fully supported standard.

@bartlettroscoe
Member Author

@trilinos/shylu developers,

Why are there no tests being run for any of the ShyLU packages in the existing auto PR and CI builds:

?

These packages are listed as Primary Tested (PT) packages, but they don't have any tests?

@ndellingwood
Contributor

ndellingwood commented Mar 13, 2018

@bartlettroscoe each package in each of ShyLU's subpackages is marked either ST or EX:

shylu_dd

shylu_node

I'm not sure why they are being listed as PT packages; the use of

TRIBITS_PACKAGE_DEFINE_DEPENDENCIES(
  SUBPACKAGES_DIRS_CLASSIFICATIONS_OPTREQS
...

in the shylu_node and shylu_dd subpackages seems to fit the example from the TriBITS developers guide. Does anything stand out that clearly needs to be changed?

Opening a new issue, #2375, after speaking with @srajama1 to get the packages migrated to PT testing.

Edit: Added reference to ShyLU issue.
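
For context, the test-group classification (PT/ST/EX) of each subpackage lives in the third column of that macro's subpackage rows. A generic sketch with placeholder subpackage names and directories, patterned after the TriBITS developers guide:

```cmake
TRIBITS_PACKAGE_DEFINE_DEPENDENCIES(
  SUBPACKAGES_DIRS_CLASSIFICATIONS_OPTREQS
    # <name>     <dir>   <classification>  <required/optional>
    CoreStuff    core    PT                REQUIRED
    ExtraStuff   extra   EX                OPTIONAL
  )
```

Only subpackages classified PT here get enabled in PT-only builds (like the auto PR and CI builds), which is consistent with ShyLU showing up with no tests when all of its subpackages are marked ST or EX.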

@bartlettroscoe
Member Author

NOTE: Defects like #2391 show why we need to be enabling OpenMP in our auto PR build.

@bartlettroscoe
Member Author

bartlettroscoe commented Mar 16, 2018

CC: @trilinos/zoltan2

I was just reminded today by #2397 that the auto PR builds based on SEMS should also enable the Scotch TPL. That CI build (and the builds derived from it) are the only automated builds of Trilinos that enable Scotch and therefore are the only builds that run these tests that depend on Scotch. See #2065.
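
For concreteness, enabling that TPL in the SEMS-based PR/CI configuration amounts to a few cache settings like the following. The environment variables used for the include/library locations are an assumption about the SEMS modules, not taken verbatim from the PR configuration:

```cmake
# Hedged sketch: enable the Scotch TPL so the tests that depend on it
# (e.g. in Zoltan2) actually get exercised in the PR/CI builds.
set(TPL_ENABLE_Scotch ON CACHE BOOL "")
# Locations would come from the SEMS environment modules; the variable
# names below are assumptions for illustration only.
set(Scotch_INCLUDE_DIRS "$ENV{SEMS_SCOTCH_INCLUDE_PATH}" CACHE PATH "")
set(Scotch_LIBRARY_DIRS "$ENV{SEMS_SCOTCH_LIBRARY_PATH}" CACHE PATH "")
```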

@bartlettroscoe
Member Author

@mhoemmen said:

The Pthread TPL gets automatically detected and enabled by default. (There are reasons to enable it that have nothing to do with thread parallelism.) Users found it unpleasantly surprising to get thread parallelism when they didn't ask for it explicitly. Thus, Kokkos and Tpetra historically did not enable the Pthreads back-end unless explicitly requested. Tpetra treats Kokkos::Threads as a last resort in terms of defaults.

That makes sense. I just remembered that GCC's C++11 threads library requires you to link in the pthread lib in order to work. This is needed, for example, for the threaded tests of the new thread-safe mode of the Teuchos Memory Management classes.

What is the impact then of enabling the Pthread TPL and OpenMP as long as Kokkos just uses OpenMP?
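
As a stand-alone illustration of the pthread point (plain CMake, not the Trilinos/TriBITS TPL machinery): with GCC on Linux, an executable that uses std::thread or other C++11 threading facilities needs the pthread library linked in, which is what find_package(Threads) handles:

```cmake
# Minimal self-contained sketch showing why std::thread-based tests need
# pthread linked in with GCC on Linux (independent of any Kokkos backend).
cmake_minimum_required(VERSION 3.10)
project(Cxx11ThreadDemo CXX)
set(CMAKE_CXX_STANDARD 11)

find_package(Threads REQUIRED)                 # resolves to -pthread / libpthread
add_executable(thread_test thread_test.cpp)    # hypothetical test source
target_link_libraries(thread_test Threads::Threads)
```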

@ibaned
Contributor

ibaned commented Apr 23, 2018

@bartlettroscoe enabling the Pthread TPL and (only) the Kokkos OpenMP backend should be just fine; I think that's what currently happens in most Trilinos builds that enable OpenMP.

Edit: the most common configuration enables the Kokkos OpenMP and Serial backends.

@bartlettroscoe
Member Author

Edit: the most common configuration enables the Kokkos OpenMP and Serial backends.

@ibaned, does this mean that we should or should not allow the Pthread TPL to be enabled when we configure Trilinos with the OpenMP or Serial Kokkos backends?

@mhoemmen, other than some tests that need std C++11 threading to run, what other reasons are there for enabling the Pthread TPL when configuring Trilinos, even if Kokkos will use a different (or no) threading backend?

bartlettroscoe added a commit to bartlettroscoe/Trilinos that referenced this issue Apr 24, 2018
It was requested that we use GCC 4.9.3 headers with Intel 17.0.1 builds of
Trilinos (see trilinos#2317 and trilinos#2463).
@ibaned
Contributor

ibaned commented Apr 24, 2018

The Pthread TPL is fine to enable regardless of Kokkos backends. It doesn't cause any issues with OpenMP or otherwise. It is also fine to disable, unless the Kokkos Threads backend is used.

@mhoemmen
Contributor

mhoemmen commented Apr 24, 2018

@bartlettroscoe wrote:

... what other reasons are there for enabling the Pthread TPL when configuring Trilinos, even if Kokkos will use a different (or no) threading backend?

I actually searched Trilinos for pthread_ and found nothing. Teuchos requires C++11 so we should all use the C++ Standard Library stuff like std::call_once and std::mutex if we need coarse-grained thread synchronization.

@mhoemmen
Contributor

@bartlettroscoe My above comments suggest that maybe we don't need the Pthread TPL at all. I've been fine with it enabled but perhaps it's not necessary. On the other hand, sometimes OpenMP implementations or other TPLs need it, so I'm not sure we should just turn it off by default.

@bartlettroscoe
Member Author

My above comments suggest that maybe we don't need the Pthread TPL at all. I've been fine with it enabled but perhaps it's not necessary. On the other hand, sometimes OpenMP implementations or other TPLs need it, so I'm not sure we should just turn it off by default.

@mhoemmen, the experience with the thread-safety work on the Teuchos MM classes was that you need to explicitly link in the -lpthread lib if you want to use C++11 threading support. This is needed to run the multi-threaded unit tests.

bartlettroscoe added a commit that referenced this issue Apr 24, 2018
@mhoemmen
Contributor

@bartlettroscoe good point; not sure if actual C++11 implementations are supposed to require that (vs. the -std=gnu-c++0x stuff) but it's worth checking

@bartlettroscoe
Member Author

FYI: As pointed out by @etphipp in #2628 (comment), setting:

 -D MPI_EXEC_PRE_NUMPROCS_FLAGS="--bind-to;none"

seems to fix the problem of OpenMP threads all binding to the same core on a RHEL6 machine. Perhaps this will let us enable OpenMP for the GCC 4.8.4 auto PR build being set up in #2462?
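
A hedged sketch of how that setting would sit alongside the OpenMP enable in a PR build configuration fragment (all other options omitted):

```cmake
# Enable OpenMP and keep OpenMPI from binding every rank's threads to the
# same core on the RHEL6 test machines (see the discussion in #2628).
set(Trilinos_ENABLE_OpenMP ON CACHE BOOL "")
set(MPI_EXEC_PRE_NUMPROCS_FLAGS "--bind-to;none" CACHE STRING "")
```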

bartlettroscoe added the "Framework tasks" and "client: ATDM" labels May 3, 2018
@bartlettroscoe
Member Author

FYI: As mentioned in the new issue #2674, EMPIRE is now enabling the options:

-D MueLu_ENABLE_Kokkos_Refactor:BOOL=ON \
-D Xpetra_ENABLE_Kokkos_Refactor:BOOL=ON \
-D MueLu_ENABLE_Kokkos_Refactor_Use_By_Default:BOOL=ON \

Therefore, the PR builds should enable these as well. This is the same argument for why we agreed to add the "Experimental" enables for Xpetra and MueLu described above.

bartlettroscoe added a commit to bartlettroscoe/Trilinos that referenced this issue May 4, 2018
This was the agreement as part of trilinos#2317.

NOTE: This is using 'mpiexec --bind-to none ...' to avoid pinning the threads
in different MPI ranks to the same cores.  See trilinos#2422.
bartlettroscoe added a commit to bartlettroscoe/Trilinos that referenced this issue May 4, 2018
This was the agreement as part of trilinos#2317.

NOTE: This is using 'mpiexec --bind-to none ...' to avoid pinning the threads
in different MPI ranks to the same cores.  See trilinos#2422.
bartlettroscoe added a commit that referenced this issue May 8, 2018
….1-and-ninja

Use atdm-cmake/3.11.1 module and Ninja for GCC 4.8.4 + OpenMPI 1.10.1 + OpenMP build.  This should be the build that satisfies the GCC auto PR build in #2317 and #2462.
bartlettroscoe added the "type: enhancement" label May 22, 2018
@jwillenbring
Member

@bartlettroscoe Since the PR builds have been running for some time I propose we close this ticket and deal with additional issues in other tickets.

@bartlettroscoe
Member Author

@bartlettroscoe Since the PR builds have been running for some time I propose we close this ticket and deal with additional issues in other tickets.

@jwillenbring, sure, the only remaining build is the CUDA build in #2464 and we have a plan for that (see the issue).

Closing.
