Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[IDAG] Communication & Receive Arbitration #252

Merged
merged 4 commits into from
May 21, 2024
Merged

[IDAG] Communication & Receive Arbitration #252

merged 4 commits into from
May 21, 2024

Conversation

fknorr
Copy link
Contributor

@fknorr fknorr commented Mar 12, 2024

This is the second PR in the Instruction Graph series, and the first one to touch the execution side of the new runtime. It adds the necessary infrastructure to perform in-place sends and receives from (host) buffers and to match pilots to their corresponding payloads. See #249 for context.

This does not really interfere with #249 nor is it based on that PR, the only overlap being pilot.h.

communicator

class communicator is an abstraction of MPI_Comm that provides all operations to transmit and receive pilots and perform asynchronous sends / receives on strided buffer-allocation data with known message ids. It does not spawn its own thread but relies on polling from the executor thread (which currently is the main thread in tests).

Edit: clang-tidy gives me a heap of MPI check warnings about re-using requests and memory leaks, but I'm convinced all of these are false positives (see the second github-actions review below).

receive_arbiter

Instead of linearizing and packing transfer data into frames, the IDAG runtime transfers metadata ahead-of-time through pilot_messages and performs in-place MPI_Isend/Irecv operations (via the communicator abstraction). Since receive_instructions and pilots, and MPI_Recv'd data can appear in any order, a state machine must be maintained for every transfer at execution time to drive every receive operation to completion.

I call this method receive arbitration. It requires some added complexity to deal with the fact that incoming send-boxes do not need to match received regions in the IDAG exactly. In this regard, it is similar to the legacy buffer_transfer_manager. Like communicators, the receive arbiter is also polled from the executor thread and does not add its own concurrency.

The receive-arbiter tests mock the communicator to generate all possible sequences of events observable at execution time for each test case, to account for the non-deterministic behavior in the cluster.

async_event

Type-erased events are needed for many different situations in IDAG execution. Since this PR is the first to require them, it adds async_event as a generic construct.

@fknorr fknorr added this to the 0.6.0 milestone Mar 12, 2024
@fknorr fknorr requested review from psalz and PeterTh March 12, 2024 16:23
@fknorr fknorr self-assigned this Mar 12, 2024
Copy link

Check-perf-impact results: (19e9c0e63e2eff8d602cc7a81b622c19)

❓ No new benchmark data submitted. ❓
Please re-run the microbenchmarks and include the results if your commit could potentially affect performance.

Copy link

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

clang-tidy made some suggestions

src/mpi_communicator.cc Show resolved Hide resolved
src/mpi_communicator.cc Show resolved Hide resolved
src/mpi_communicator.cc Show resolved Hide resolved
src/mpi_communicator.cc Show resolved Hide resolved
src/mpi_communicator.cc Show resolved Hide resolved
src/receive_arbiter.cc Outdated Show resolved Hide resolved
test/receive_arbiter_tests.cc Show resolved Hide resolved
test/receive_arbiter_tests.cc Show resolved Hide resolved
test/system/mpi_tests.cc Show resolved Hide resolved
test/system/mpi_tests.cc Show resolved Hide resolved
@coveralls
Copy link

coveralls commented Mar 12, 2024

Pull Request Test Coverage Report for Build 8340717687

Details

  • 299 of 301 (99.34%) changed or added relevant lines in 6 files are covered.
  • 1 unchanged line in 1 file lost coverage.
  • Overall coverage increased (+0.09%) to 94.724%

Changes Missing Coverage Covered Lines Changed/Added Lines %
src/mpi_communicator.cc 138 139 99.28%
src/receive_arbiter.cc 129 130 99.23%
Files with Coverage Reduction New Missed Lines %
include/grid.h 1 96.81%
Totals Coverage Status
Change from base Build 8329571092: 0.09%
Covered Lines: 6872
Relevant Lines: 7071

💛 - Coveralls

Copy link

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

clang-tidy made some suggestions

src/mpi_communicator.cc Show resolved Hide resolved
src/mpi_communicator.cc Show resolved Hide resolved
src/mpi_communicator.cc Show resolved Hide resolved
src/mpi_communicator.cc Show resolved Hide resolved
src/mpi_communicator.cc Show resolved Hide resolved
src/mpi_communicator.cc Show resolved Hide resolved
src/mpi_communicator.cc Show resolved Hide resolved
src/mpi_communicator.cc Show resolved Hide resolved
Copy link

Check-perf-impact results: (7b849de16ff11660b98988ab0b032db7)

❓ No new benchmark data submitted. ❓
Please re-run the microbenchmarks and include the results if your commit could potentially affect performance.

...if corresponding fragments have already been received.
Copy link
Member

@psalz psalz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! I particularly like the combinatorial testing setup for the receive_arbiter, very elegant!

include/mpi_communicator.h Show resolved Hide resolved
test/system/mpi_tests.cc Show resolved Hide resolved
src/mpi_communicator.cc Show resolved Hide resolved
src/mpi_communicator.cc Show resolved Hide resolved
test/system/mpi_tests.cc Outdated Show resolved Hide resolved
test/system/mpi_tests.cc Outdated Show resolved Hide resolved
include/receive_arbiter.h Outdated Show resolved Hide resolved
src/receive_arbiter.cc Outdated Show resolved Hide resolved
src/receive_arbiter.cc Outdated Show resolved Hide resolved
src/receive_arbiter.cc Show resolved Hide resolved
Copy link
Contributor

@PeterTh PeterTh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

src/mpi_communicator.cc Outdated Show resolved Hide resolved
Copy link

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

clang-tidy made some suggestions

src/mpi_communicator.cc Show resolved Hide resolved
@fknorr
Copy link
Contributor Author

fknorr commented May 21, 2024

CI is unreliable at the moment (either a Github or slurmactiond issue or both). Everything passes locally for me and formatting is also correct, so I'm doing a blind merge here.

@fknorr fknorr merged commit d91cfea into master May 21, 2024
39 of 46 checks passed
@fknorr fknorr deleted the idag-comm branch May 21, 2024 16:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants