Tpetra::defaultArgNode fails in serial build #3033

Closed
kddevin opened this issue Jun 28, 2018 · 13 comments
Labels
pkg: Kokkos pkg: Teuchos Issues primarily dealing with the Teuchos Package pkg: Tpetra

Comments

@kddevin
Contributor

kddevin commented Jun 28, 2018

@trilinos/tpetra @trilinos/teuchos @trilinos/kokkos

Expectations

Calls to defaultArgNode should run whether Teuchos::Comm is serial or MPI.

Current Behavior

A simple test program (attached) calling defaultArgNode works when built with MPI, but not when built without MPI.

Without MPI, I get
terminate called after throwing an instance of 'std::system_error'
what(): Unknown error 18446744073709551615

Here are the relevant bits of the stack trace:
0x0000000001862a1c in KokkosCompat::Details::initializeKokkos() ()
at /home/.../packages/teuchos/kokkoscompat/src/KokkosCompat_Details_KokkosInit.cpp:95
0x000000000185fbdb in Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Serial, Kokkos::HostSpace>::KokkosDeviceWrapperNode(Teuchos::ParameterList&) ()
at /home/.../packages/teuchos/kokkoscompat/src/KokkosCompat_ClassicNodeAPI_Wrapper.cpp:171
0x00000000016aaaf9 in Teuchos::RCP<Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Serial, Kokkos::HostSpace> > KokkosClassic::Details::getNode<Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Serial, Kokkos::HostSpace> >(Teuchos::RCP<Teuchos::ParameterList> const&) ()
at /home/.../packages/tpetra/classic/NodeAPI/Kokkos_DefaultNode.cpp:56
0x00000000010cd808 in Teuchos::RCP<Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Serial, Kokkos::HostSpace> > Tpetra::defaultArgNode<Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Serial, Kokkos::HostSpace> >() ()
at /home/.../packages/tpetra/core/src/Tpetra_Map_decl.hpp:91

Motivation and Context

Tpetra::MatrixMarket::Reader::readSparseFile calls this function when it attempts to create maps (e.g., makeRangeMap()).
Thus, the reader does not work for my serial build.

Definition of Done

Maybe my serial environment is wrong -- please advise. My script used to work, so if the environment is wrong, backward compatibility was lost somewhere along the line.
Otherwise, the test program should run with TPL_ENABLE_MPI=ON or OFF.

Possible Solution

Steps to Reproduce

See attached test program, which demonstrates the fault.
It reads a matrix-market file simple.mtx; you can use any matrix-market file with this name.
The test calls defaultArgNode directly, which a user wouldn't usually do. readSparseFile calls it internally when it creates Maps.
mmReader.cpp.txt

Your Environment

module purge
module load sems-env
module load sems-gcc/4.9.3
cmake \
-D TPL_ENABLE_Pthread:BOOL=OFF \
-D CMAKE_BUILD_TYPE:STRING="DEBUG" \
-D CMAKE_VERBOSE_MAKEFILE:BOOL=OFF \
-D TPL_ENABLE_MPI:BOOL=OFF \
-D Trilinos_ENABLE_EXPLICIT_INSTANTIATION:BOOL=ON \
-D Trilinos_ENABLE_ALL_OPTIONAL_PACKAGES:BOOL=OFF \
-D Trilinos_ENABLE_TESTS:BOOL=OFF \
-D Trilinos_ENABLE_EXAMPLES:BOOL=OFF \
-D Trilinos_VERBOSE_CONFIGURE:BOOL=OFF \
-D Trilinos_ENABLE_Zoltan2:BOOL=ON \
-D Zoltan2_ENABLE_TESTS:BOOL=ON \
..

Additional Information

@kddevin kddevin added pkg: Kokkos pkg: Tpetra pkg: Teuchos Issues primarily dealing with the Teuchos Package labels Jun 28, 2018
@mhoemmen
Contributor

Nobody -- including Matrix Market I/O -- should be creating Node instances explicitly. I can fix Matrix Market I/O not to do this, and I think I can also get rid of the Node instances so they have no power to affect things.

@mhoemmen
Contributor

Oh wait, nothing in the Matrix Market I/O routines calls defaultArgNode explicitly.

@kddevin The issue could be that you're creating Tpetra objects at main() scope. Try doing this:

#include <Teuchos_Comm.hpp>
#include <Teuchos_DefaultComm.hpp>
#include <Teuchos_RCP.hpp>

#include <Tpetra_CrsMatrix.hpp>
#include <MatrixMarket_Tpetra.hpp>

#include <exception>
#include <iostream>
#include <string>
#include <sstream>

int main(int narg, char *arg[])
{
  Teuchos::GlobalMPISession session(&narg, &arg, NULL);
  { 
    Teuchos::RCP<const Teuchos::Comm<int> > comm = Teuchos::DefaultComm<int>::getComm();
    int rank = comm->getRank();

    typedef Tpetra::CrsMatrix<> tcrsMatrix_t;
    typedef typename tcrsMatrix_t::node_type node_t;
    typedef Tpetra::MatrixMarket::Reader<tcrsMatrix_t> reader_t;

    try {
      std::cout << rank << " Calling defaultArgNode" << std::endl;
      Teuchos::RCP<node_t> defNode = Tpetra::defaultArgNode<node_t>();
    }
    catch (std::exception &e) {
      std::cout << "FAIL  Exception caught: " << e.what() << std::endl;
      return -1;
    }

    std::string basename("simple");
    std::ostringstream fname;
    fname << basename << ".mtx";

    Teuchos::RCP<tcrsMatrix_t> mat;
    try{
      if (rank == 0) 
        std::cout << "Trying to read file " << fname.str() << std::endl;
      mat = reader_t::readSparseFile(fname.str(), comm, true, false, false);
    }
    catch (std::exception &e) {
      std::cout << "FAIL  Exception caught: " << e.what() << std::endl;
      return -1;
    }

    std::cout << "PASS Matrix #rows=" << mat->getGlobalNumRows() 
              << " " << mat->getNodeNumRows() << std::endl;
  }
  return 0;
}

@kddevin
Contributor Author

kddevin commented Jun 29, 2018

Thanks, @mhoemmen. No, the problem isn't the scoping. With the suggestion above, I still get the error (and I assume I would have gotten it in the MPI case as well if scoping were the problem).

readSparseFile creates the node as an argument to the map constructor.

How does the Kokkos node differ between MPI builds and non-MPI builds?

If the node instances are not needed in readSparseFile, I can remove them. Were they there just to ensure Kokkos::initialize got called?

@kddevin
Contributor Author

kddevin commented Jun 29, 2018

This serial test on the dashboard is passing. Let me see where my environment differs.
https://testing.sandia.gov/cdash/viewTest.php?buildid=3658811

@mhoemmen
Contributor

@kddevin wrote:

How does the Kokkos node differ between MPI builds and non-MPI builds?

It doesn't at all. Tpetra's Node creation initializes Kokkos if it hasn't already been initialized. It tries to get command-line arguments from Teuchos::GlobalMPISession, and if there are any, it passes them down into Kokkos::initialize.

@kddevin
Contributor Author

kddevin commented Jul 2, 2018

Oddly, the code works without error if I build with -DBUILD_SHARED_LIBS:BOOL=ON as on the nightly test dashboard, but fails without that build option.

Is BUILD_SHARED_LIBS=ON now required for serial builds? (If so, backward compatibility was broken somewhere along the line.) And if it is required, why is the default FALSE?

@kddevin
Contributor Author

kddevin commented Jul 2, 2018

Here's the configuration that worked; without the BUILD_SHARED_LIBS line, the test throws an exception. Note that I am not using #3044 for these tests.
cmake \
-DBUILD_SHARED_LIBS:BOOL=ON \
-DTPL_ENABLE_Pthread:BOOL=OFF \
-DTrilinos_ENABLE_SECONDARY_TESTED_CODE:BOOL=ON \
-DTrilinos_ENABLE_ALL_OPTIONAL_PACKAGES:BOOL=OFF \
-DTrilinos_ENABLE_TESTS:BOOL=OFF \
-DTrilinos_ENABLE_EXPLICIT_INSTANTIATION:BOOL=ON \
-DTrilinos_ENABLE_Zoltan2:BOOL=ON \
-DZoltan2_ENABLE_TESTS:BOOL=ON \
..

mhoemmen added a commit that referenced this issue Jul 2, 2018
@mhoemmen
Contributor

mhoemmen commented Jul 2, 2018

@kddevin wrote:

Oddly, the code works without error if I build with -DBUILD_SHARED_LIBS:BOOL=ON as on the nightly test dashboard, but fails without that build option.

That's really quite weird. The current defaultArgNode implementation relies on a static RCP<Node> in a function that lives in a different (upstream) package, but that should work regardless of whether the build uses dynamic shared libraries. I'm hoping that #3044 kills this issue, because it removes the static RCP<Node> in favor of letting Kokkos manage its own "Am I initialized?" state.

@kddevin
Contributor Author

kddevin commented Jul 3, 2018

After further investigation, I think the problem is that I have TPL_ENABLE_Pthread=OFF.
The error appears to come from std::call_once in the initializeKokkos functionality. Indeed, a small test program exercising call_once works only with the -pthread option to g++.

I tested #3044 with and without shared libraries; both cases threw an unknown error.
With TPL_ENABLE_Pthread=OFF, the test compiles but throws an error.
Without TPL_ENABLE_Pthread=OFF, the test succeeds, with or without shared libraries.

Is Pthread now required to build Trilinos? If so, backward compatibility was broken somewhere along the line. Also, we shouldn't give users the option to disable it if it is now required (especially since the code compiles without it). If not, perhaps there is still something wrong with my configuration.
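
A small check along the lines described above might look like the following sketch. It is not the original test program: the helper name countCallOnceRuns and the build command in the comment are assumptions made for illustration. It calls std::call_once twice on the same flag and reports whether the callable ran exactly once or whether std::system_error was thrown, which is the failure mode seen in the stack trace.

```cpp
// Minimal sketch of a std::call_once check; not the original test program.
// Assumed build command: g++ -std=c++11 -pthread call_once_check.cpp
// (dropping -pthread is what triggered the error with the GCC used here).
#include <cassert>
#include <mutex>
#include <system_error>

// Hypothetical helper: invokes std::call_once twice on the same flag.
// Returns the number of times the callable actually ran (1 on success),
// or -1 if std::call_once threw std::system_error.
int countCallOnceRuns()
{
  std::once_flag flag;
  int runs = 0;
  try {
    std::call_once(flag, [&runs]() { ++runs; });
    std::call_once(flag, [&runs]() { ++runs; });  // must be a no-op
  }
  catch (const std::system_error &) {
    return -1;  // the toolchain could not service call_once
  }
  assert(runs == 1);  // call_once guarantees a single execution
  return runs;
}
```

Calling countCallOnceRuns() from main() and printing the result gives a quick way to compare builds with and without -pthread on a given toolchain.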

@mhoemmen
Contributor

mhoemmen commented Jul 3, 2018

@kddevin There should be no need to set TPL_ENABLE_Pthread explicitly. Kokkos' Pthreads-based back-end is disabled by default.

std::call_once is in C++11, but it could be that GCC nevertheless requires libpthread to make it work. It may be overkill to do std::call_once, and I could protect it with a macro.

@kddevin
Contributor Author

kddevin commented Jul 3, 2018

Thanks, @mhoemmen . Are there cases where std::call_once is needed to handle threading? If not, could we instead check whether Kokkos is initialized and, if not, initialize it?

@mhoemmen
Contributor

mhoemmen commented Jul 3, 2018

@kddevin wrote:

Are there cases where std::call_once is needed to handle threading?

The only use case that would require std::call_once is multiple user threads each creating an independent Tpetra::Map without first having called Kokkos::initialize. I've never seen anyone exercise that use case. I actually don't think the code is correct for that use case anyway, since it would register Kokkos::finalize multiple times as atexit handlers.
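
The "check whether it is initialized and, if not, initialize it" approach discussed here can be sketched as follows. This is only an illustration under stated assumptions, not the Trilinos implementation: FakeRuntime and ensureInitialized are hypothetical names, and FakeRuntime stands in for Kokkos so that the example compiles on its own. The guard avoids std::call_once, registers the finalize handler only once, and is deliberately not thread-safe, which matches the trade-off described above.

```cpp
// Sketch of an initialize-on-first-use guard without std::call_once.
// FakeRuntime is a hypothetical stand-in; these are NOT Kokkos calls.
#include <cassert>
#include <cstdlib>

namespace FakeRuntime {
  bool initialized = false;
  int initializeCalls = 0;  // counts real initializations, for illustration

  bool is_initialized() { return initialized; }
  void initialize() { initialized = true; ++initializeCalls; }
  void finalize() { initialized = false; }
}

// Hypothetical guard: initializes the runtime only if nobody has done so,
// and registers finalize as an atexit handler at most once.
// Not thread-safe: two threads racing here could both call initialize().
void ensureInitialized()
{
  static bool registeredFinalize = false;
  if (!FakeRuntime::is_initialized()) {
    FakeRuntime::initialize();
    assert(FakeRuntime::is_initialized());
    if (!registeredFinalize) {
      std::atexit(FakeRuntime::finalize);
      registeredFinalize = true;
    }
  }
}
```

Calling ensureInitialized() repeatedly from a single thread initializes the stand-in runtime once; a multi-threaded caller would still need a lock or std::call_once around the check.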

@kddevin
Contributor Author

kddevin commented Jul 3, 2018

Closing; #3057 contains the true issue. The Tpetra behavior was only a result.

@kddevin kddevin closed this as completed Jul 3, 2018
mhoemmen pushed a commit that referenced this issue Jul 3, 2018
@trilinos/tpetra @trilinos/teuchos

Tpetra::Map was and still is responsible for initializing Kokkos, if
the user hasn't done it already.  This commit moves the initialization
code out of Teuchos into Tpetra.  It also removes the dependency on
std::call_once.  It appears that with GCC, std::call_once only works
if linking with libpthread.  Thus, setting TPL_ENABLE_Pthread=OFF
(which we don't recommend -- Trilinos autodetects this) could break
std::call_once.

This change could break a possible use case in which
Kokkos::initialize has not been called, and different user threads
each create different Tpetra::Map instances.  However, Trilinos does
not test this use case, nor do the applications we support appear to
exercise it.

I also took the liberty to purge some unnecessary header includes.
This was referenced Jul 3, 2018