Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Switch to distributed scheduling model #186

Merged
merged 2 commits into from
Jul 13, 2023
Merged

Conversation

psalz
Copy link
Member

@psalz psalz commented Jun 26, 2023

As discussed today this is an early draft of the new distributed scheduling model implementation. I apologize for the state of the codebase right now, but maybe you can still take a quick look if you can spot anything major. As I'm still in the process of migrating all existing command-graph-related tests to the new architecture (with the help of @fknorr - thanks!), the codebase currently has to support both the legacy graph_generator and the new distributed_graph_generator side by side, so there is quite a bit of code duplication and hackery.

The most interesting/relevant parts to review are probably distributed_graph_generator (obviously), as well as maybe buffer_transfer_manager and the new testing infrastructure. Figuring out all data dependencies is still quite the mess, although one could argue it's slightly better because we have one level of indirection less (no "per node" data maps anymore). I hear @fknorr has some ideas of how this could be improved at some point in the future.

@psalz psalz added this to the 0.4.0 milestone Jun 26, 2023
github-actions[bot]

This comment was marked as outdated.

Copy link
Contributor

@PeterTh PeterTh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have gone through everything (that isn't already covered by the r-tree PR), but not fully in-depth. I have purposefully not commented on most things already marked as TODO or NOCOMMIT in the code, since that is obviously already known anyway.

Overall, this looks good for the alpha release to me. distributed_graph_generator::generate_distributed_commands has lots of excellent comments, but it's still not easy to follow because it does so much. Post-0.4 it would be nice if we managed to somehow refactor or split that up.

include/distributed_graph_generator.h Outdated Show resolved Hide resolved
assert(pkg.get_command_type() == command_type::await_push);
const auto& data = std::get<await_push_data>(pkg.data);

GridRegion<3> expected_region;
for(size_t i = 0; i < data.num_subranges; ++i) {
expected_region = GridRegion<3>::merge(expected_region, subrange_to_grid_box(data.region[i]));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This warning seems a bit extreme for an HPC (rather than security critical domain) codebase.

src/distributed_graph_generator.cc Outdated Show resolved Hide resolved
test/accessor_tests.cc Outdated Show resolved Hide resolved
test/graph_gen_transfer_tests_2.cc Outdated Show resolved Hide resolved
@psalz psalz force-pushed the distributed-scheduling branch from 4655582 to 48f293c Compare June 27, 2023 23:18
Copy link

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

clang-tidy made some suggestions

src/graph_serializer.cc Show resolved Hide resolved
test/distributed_graph_generator_test_utils.h Outdated Show resolved Hide resolved
test/distributed_graph_generator_test_utils.h Show resolved Hide resolved
test/distributed_graph_generator_test_utils.h Show resolved Hide resolved
test/distributed_graph_generator_test_utils.h Show resolved Hide resolved
test/distributed_graph_generator_test_utils.h Show resolved Hide resolved
test/distributed_graph_generator_test_utils.h Outdated Show resolved Hide resolved
Copy link
Contributor

@fknorr fknorr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately I do not have time at the moment to review this. One thing I would like to see for the final release is distributed_graph_generator being renamed to command_graph_generator (or similar, +bikeshedding) because we have two (and in the future three) different graph representations!

include/distributed_graph_generator.h Show resolved Hide resolved
include/executor.h Show resolved Hide resolved
@psalz psalz force-pushed the distributed-scheduling branch from 48f293c to 4dde523 Compare June 28, 2023 20:07
Copy link
Contributor

@fknorr fknorr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Without reviewing the tests:

include/distributed_graph_generator.h Outdated Show resolved Hide resolved
include/distributed_graph_generator.h Show resolved Hide resolved
src/executor.cc Outdated Show resolved Hide resolved
src/print_graph.cc Show resolved Hide resolved
src/print_graph.cc Outdated Show resolved Hide resolved
test/distributed_graph_generator_test_utils.h Outdated Show resolved Hide resolved
test/distributed_graph_generator_test_utils.h Show resolved Hide resolved
test/distributed_graph_generator_test_utils.h Outdated Show resolved Hide resolved
test/distributed_graph_generator_test_utils.h Show resolved Hide resolved
test/distributed_graph_generator_test_utils.h Outdated Show resolved Hide resolved
@github-actions
Copy link

Check-perf-impact results: (cc07e36dc65a2e9d9a3deec584afa9e9)

❓ No new benchmark data submitted. ❓
Please re-run the microbenchmarks and include the results if your commit could potentially affect performance.

@psalz psalz marked this pull request as ready for review July 13, 2023 11:48
Copy link

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

clang-tidy made some suggestions

test/distributed_graph_generator_test_utils.h Show resolved Hide resolved
@psalz psalz force-pushed the distributed-scheduling branch from 0d327c1 to c1f8111 Compare July 13, 2023 13:26
@github-actions
Copy link

Check-perf-impact results: (9e900a306dc9f17e4a27439205a7680c)

⚠️ Significant slowdown in some microbenchmark results: 5 individual benchmarks affected
🚀 Significant speedup in some microbenchmark results: 38 individual benchmarks affected

Overall relative execution time: 0.72x (mean of relative medians)

Base automatically changed from 0.4.0-alpha to master July 13, 2023 14:02
@psalz psalz force-pushed the distributed-scheduling branch from c1f8111 to 2fbc0ff Compare July 13, 2023 15:03
psalz and others added 2 commits July 13, 2023 17:56
This replaces Celerity's master/worker scheduling model with a fully
distributed approach: Each node still processes all chunks, but commands
are only generated for locally executed chunks, as well as to push data
to other nodes as needed. Instead of tracking where the newest data
resides within the distributed system at all times, each node now only
determines which parts of the data are currently up to date (based on
task-level write requirements) as well as which parts it "owns" (based
on writes by locally executed chunks). Read requirements on stale data
by local chunks generate await push commands, which now can be satisfied
by any number of incoming transfers (as nodes do not know who currently
owns the data). Conversely, reads on locally owned data by remote chunks
generate push commands.

Co-authored-by: Fabian Knorr <fabian.knorr@dps.uibk.ac.at>
@psalz psalz force-pushed the distributed-scheduling branch from 2fbc0ff to 68433b4 Compare July 13, 2023 15:57
@psalz psalz merged commit b0ebd00 into master Jul 13, 2023
@psalz psalz deleted the distributed-scheduling branch July 13, 2023 16:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants