Added timeout back to MpiHandshake #2359

Merged Jul 6, 2020 (3 commits)
13 changes: 2 additions & 11 deletions docs/user_guide/source/engines/ssc.rst
@@ -6,23 +6,14 @@ The SSC engine is designed specifically for strong code coupling. Currently SSC

The SSC engine takes the following parameters:

1. ``RendezvousAppCount``: Default **2**. The number of applications, including both writers and readers, that will work on this stream. The SSC engine's open function will block until all these applications reach the open call. If there are multiple applications in a workflow, this parameter needs to be set respectively for every application. For example, in a three-app coupling scenario: App 0 writes Stream A to App 1; App 1 writes Stream B to App 0; App 2 writes Stream C to App 1; App 1 writes Stream D to App 2, the parameter RendezvousAppCount for engine instances of every stream should be all set to 2, because for each of the streams, two applications will work on it. In another example, where App 0 writes Stream A to App 1 and App 2; App 1 writes Stream B to App 2, the parameter RendezvousAppCount for engine instances of Stream A and B should be set to 3 and 2 respectively, because three applications will work on Stream A, while two applications will work on Stream B.
1. ``OpenTimeoutSecs``: Default **10**. Timeout in seconds for opening a stream. The SSC engine's open function will block until the RendezvousAppCount is reached, or timeout, whichever comes first. If it reaches the timeout, SSC will throw an exception.

2. ``MaxStreamsPerApp``: Default **1**. The maximum number of streams that all applications sharing this MPI_COMM_WORLD can possibly open. This number must be consistent across all ranks from all applications. It is used for pre-allocating the vectors that hold MPI handshake information. Due to the fundamental communication mechanism of MPI, this information must be set statically through engine parameters, and the SSC engine cannot provide any mechanism to check whether this parameter is set correctly. If this parameter is set incorrectly, the SSC engine's open function will either exit earlier than expected without gathering all applications' handshake information, or it will block until timeout. It may cause other unpredictable errors too.

3. ``OpenTimeoutSecs``: Default **10**. Timeout in seconds for opening a stream. The SSC engine's open function will block until the RendezvousAppCount is reached, or timeout, whichever comes first. If it reaches the timeout, SSC will throw an exception.

4. ``MaxFilenameLength``: Default **128**. The maximum length of filenames across all ranks from all applications. It is used for allocating the handshake buffer. Due to the limitation of MPI communication, this number must be set statically. The default number should work for most use cases. SSC will throw an exception if any rank opens a stream with a filename longer than this number.

5. ``MpiMode``: Default **TwoSided**. MPI communication modes to use. Besides the default TwoSided mode using two sided MPI communications, MPI_Isend and MPI_Irecv, for data transport, there are four one sided MPI modes: OneSidedFencePush, OneSidedPostPush, OneSidedFencePull, and OneSidedPostPull. Modes with **Push** are based on the push model and use MPI_Put for data transport, while modes with **Pull** are based on the pull model and use MPI_Get. Modes with **Fence** use MPI_Win_fence for synchronization, while modes with **Post** use MPI_Win_start, MPI_Win_complete, MPI_Win_post and MPI_Win_wait.
2. ``MpiMode``: Default **TwoSided**. MPI communication modes to use. Besides the default TwoSided mode using two sided MPI communications, MPI_Isend and MPI_Irecv, for data transport, there are four one sided MPI modes: OneSidedFencePush, OneSidedPostPush, OneSidedFencePull, and OneSidedPostPull. Modes with **Push** are based on the push model and use MPI_Put for data transport, while modes with **Pull** are based on the pull model and use MPI_Get. Modes with **Fence** use MPI_Win_fence for synchronization, while modes with **Post** use MPI_Win_start, MPI_Win_complete, MPI_Win_post and MPI_Win_wait.
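
The ``RendezvousAppCount`` examples in the parameter descriptions above can be checked mechanically: for each stream, count the distinct applications that write or read it. The following is a minimal sketch under that reading. ``StreamTopology`` and ``RendezvousCounts`` are hypothetical names introduced for illustration and are not part of ADIOS2.

```cpp
#include <map>
#include <set>
#include <string>
#include <vector>

// Hypothetical description of one stream: the app that writes it and the
// apps that read it. Not an ADIOS2 type.
struct StreamTopology
{
    std::string name;
    int writerApp;
    std::vector<int> readerApps;
};

// For every stream, count the distinct applications (writer plus readers)
// that work on it. Under the description above, this is the value each
// engine instance of that stream would use for RendezvousAppCount.
std::map<std::string, int>
RendezvousCounts(const std::vector<StreamTopology> &streams)
{
    std::map<std::string, int> counts;
    for (const auto &s : streams)
    {
        std::set<int> apps(s.readerApps.begin(), s.readerApps.end());
        apps.insert(s.writerApp);
        counts[s.name] = static_cast<int>(apps.size());
    }
    return counts;
}
```

For the second scenario in the text (App 0 writes Stream A to App 1 and App 2; App 1 writes Stream B to App 2), this yields 3 for Stream A and 2 for Stream B.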

=============================== ================== ================================================
 **Key**                         **Value Format**   **Default** and Examples
=============================== ================== ================================================
 RendezvousAppCount              integer            **2**, 3, 5, 10
 MaxStreamsPerApp                integer            **1**, 2, 4, 8
 OpenTimeoutSecs                 integer            **10**, 2, 20, 200
 MaxFilenameLength               integer            **128**, 32, 64, 512
 MpiMode                         string             **TwoSided**, OneSidedFencePush, OneSidedPostPush, OneSidedFencePull, OneSidedPostPull
=============================== ================== ================================================
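
To illustrate how string-valued engine parameters with defaults, such as ``OpenTimeoutSecs`` and ``MpiMode`` in the table above, might be resolved, here is a self-contained sketch. It does not use the real ``helper::GetParameter``; ``Params``, ``GetIntParameter``, ``SscOpenSettings`` and ``ReadSettings`` are hypothetical stand-ins, and only the key names and default values come from the table.

```cpp
#include <map>
#include <stdexcept>
#include <string>

// Assumption: parameters arrive as a string-to-string map.
using Params = std::map<std::string, std::string>;

// Hypothetical lookup helper: leaves `value` untouched when the key is
// absent, so the caller's default survives. Returns true if the key was set.
bool GetIntParameter(const Params &params, const std::string &key, int &value)
{
    const auto it = params.find(key);
    if (it == params.end())
    {
        return false;
    }
    try
    {
        value = std::stoi(it->second);
    }
    catch (const std::exception &)
    {
        throw std::invalid_argument("Parameter " + key +
                                    " expects an integer, got '" +
                                    it->second + "'");
    }
    return true;
}

// Defaults taken from the table: OpenTimeoutSecs 10, MpiMode TwoSided.
struct SscOpenSettings
{
    int openTimeoutSecs = 10;
    std::string mpiMode = "TwoSided";
};

// Start from the defaults and override only the keys the user provided.
SscOpenSettings ReadSettings(const Params &params)
{
    SscOpenSettings settings;
    GetIntParameter(params, "OpenTimeoutSecs", settings.openTimeoutSecs);
    const auto it = params.find("MpiMode");
    if (it != params.end())
    {
        settings.mpiMode = it->second;
    }
    return settings;
}
```

Calling ``ReadSettings`` with an empty map returns the defaults; a map containing ``OpenTimeoutSecs`` set to ``"20"`` overrides only the timeout.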

6 changes: 0 additions & 6 deletions source/adios2/engine/ssc/SscReader.cpp
@@ -30,12 +30,6 @@ SscReader::SscReader(IO &io, const std::string &name, const Mode mode,

    helper::GetParameter(m_IO.m_Parameters, "MpiMode", m_MpiMode);
    helper::GetParameter(m_IO.m_Parameters, "Verbose", m_Verbosity);
    helper::GetParameter(m_IO.m_Parameters, "MaxFilenameLength",
                         m_MaxFilenameLength);
    helper::GetParameter(m_IO.m_Parameters, "RendezvousAppCount",
                         m_RendezvousAppCount);
    helper::GetParameter(m_IO.m_Parameters, "MaxStreamsPerApp",
                         m_MaxStreamsPerApp);
    helper::GetParameter(m_IO.m_Parameters, "OpenTimeoutSecs",
                         m_OpenTimeoutSecs);

3 changes: 0 additions & 3 deletions source/adios2/engine/ssc/SscReader.h
@@ -90,9 +90,6 @@ class SscReader : public Engine
ssc::RankPosMap &allOverlapRanks);

    int m_Verbosity = 0;
    int m_MaxFilenameLength = 128;
    int m_MaxStreamsPerApp = 1;
    int m_RendezvousAppCount = 2;
    int m_OpenTimeoutSecs = 10;
};

6 changes: 0 additions & 6 deletions source/adios2/engine/ssc/SscWriter.cpp
@@ -28,12 +28,6 @@ SscWriter::SscWriter(IO &io, const std::string &name, const Mode mode,

    helper::GetParameter(m_IO.m_Parameters, "MpiMode", m_MpiMode);
    helper::GetParameter(m_IO.m_Parameters, "Verbose", m_Verbosity);
    helper::GetParameter(m_IO.m_Parameters, "MaxFilenameLength",
                         m_MaxFilenameLength);
    helper::GetParameter(m_IO.m_Parameters, "RendezvousAppCount",
                         m_RendezvousAppCount);
    helper::GetParameter(m_IO.m_Parameters, "MaxStreamsPerApp",
                         m_MaxStreamsPerApp);
    helper::GetParameter(m_IO.m_Parameters, "OpenTimeoutSecs",
                         m_OpenTimeoutSecs);

3 changes: 0 additions & 3 deletions source/adios2/engine/ssc/SscWriter.h
@@ -88,9 +88,6 @@ class SscWriter : public Engine
ssc::RankPosMap &allOverlapRanks);

    int m_Verbosity = 0;
    int m_MaxFilenameLength = 128;
    int m_MaxStreamsPerApp = 1;
    int m_RendezvousAppCount = 2;
    int m_OpenTimeoutSecs = 10;
};

11 changes: 11 additions & 0 deletions source/adios2/helper/adiosMpiHandshake.cpp
@@ -68,11 +68,22 @@ const std::vector<std::vector<int>> Handshake(const std::string &filename,
    fsc << "completed";
    fsc.close();

    auto startTime = std::chrono::system_clock::now();
    while (true)
    {
        std::ifstream fs;
        try
        {
            auto nowTime = std::chrono::system_clock::now();
            auto duration =
                std::chrono::duration_cast<std::chrono::seconds>(
                    nowTime - startTime);
            if (duration.count() > timeoutSeconds)
            {
                throw(std::runtime_error(
                    "Mpi handshake timeout for Stream " + filename));
            }

            fs.open(filename + ".w.c");
            std::string line;
            std::getline(fs, line);
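
The hunk above polls a marker file and gives up once a timeout has elapsed. As a standalone illustration of that poll-with-timeout pattern, here is a minimal sketch. It is not the ADIOS2 implementation: ``WaitForFirstLine`` and its parameters are hypothetical, the timeout is in milliseconds rather than seconds, and it deliberately uses ``std::chrono::steady_clock`` instead of the ``system_clock`` seen in the diff, because a monotonic clock is not affected by wall-clock adjustments.

```cpp
#include <chrono>
#include <fstream>
#include <stdexcept>
#include <string>
#include <thread>

// Hypothetical helper: poll until `path` can be opened and its first line
// read, or throw std::runtime_error once `timeout` has elapsed.
std::string WaitForFirstLine(
    const std::string &path, std::chrono::milliseconds timeout,
    std::chrono::milliseconds pollInterval = std::chrono::milliseconds(10))
{
    // steady_clock is monotonic, so elapsed time never goes backwards.
    const auto start = std::chrono::steady_clock::now();
    while (true)
    {
        std::ifstream fs(path);
        std::string line;
        if (fs.is_open() && std::getline(fs, line))
        {
            return line;
        }
        if (std::chrono::steady_clock::now() - start > timeout)
        {
            throw std::runtime_error("Handshake timeout for stream " + path);
        }
        // Sleep briefly so the loop does not spin at full speed.
        std::this_thread::sleep_for(pollInterval);
    }
}
```

The timeout check sits outside any ``try`` block in this sketch, so the exception always propagates to the caller instead of being caught by the retry logic.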