Switch to distributed scheduling model #186
Conversation
I have gone through everything (that isn't already covered by the r-tree PR), but not fully in depth. I have purposely not commented on most things already marked as TODO or NOCOMMIT in the code, since those are obviously already known.
Overall, this looks good for the alpha release to me. distributed_graph_generator::generate_distributed_commands has lots of excellent comments, but it is still not easy to follow because it does so much. Post-0.4 it would be nice if we managed to somehow refactor or split it up.
src/buffer_transfer_manager.cc
Outdated
assert(pkg.get_command_type() == command_type::await_push);
const auto& data = std::get<await_push_data>(pkg.data);

GridRegion<3> expected_region;
for(size_t i = 0; i < data.num_subranges; ++i) {
	expected_region = GridRegion<3>::merge(expected_region, subrange_to_grid_box(data.region[i]));
}
This warning seems a bit extreme for an HPC (rather than security-critical) codebase.
Force-pushed from 4655582 to 48f293c.
clang-tidy made some suggestions
Unfortunately I do not have time at the moment to review this. One thing I would like to see for the final release is distributed_graph_generator being renamed to command_graph_generator (or similar, +bikeshedding), because we have two (and in the future three) different graph representations!
Force-pushed from 48f293c to 4dde523.
Without reviewing the tests:
Force-pushed from dc8de68 to b77a20d.
Check-perf-impact results: (cc07e36dc65a2e9d9a3deec584afa9e9) ❓ No new benchmark data submitted. ❓
clang-tidy made some suggestions
Force-pushed from 0d327c1 to c1f8111.
Check-perf-impact results: (9e900a306dc9f17e4a27439205a7680c)
Overall relative execution time: 0.72x (mean of relative medians)
Force-pushed from c1f8111 to 2fbc0ff.
This replaces Celerity's master/worker scheduling model with a fully distributed approach: each node still processes all chunks, but commands are only generated for locally executed chunks, as well as for pushing data to other nodes as needed.

Instead of tracking where the newest data resides within the distributed system at all times, each node now only determines which parts of the data are currently up to date (based on task-level write requirements), as well as which parts it "owns" (based on writes by locally executed chunks). Read requirements on stale data by local chunks generate await-push commands, which can now be satisfied by any number of incoming transfers (as nodes do not know who currently owns the data). Conversely, reads of locally owned data by remote chunks generate push commands.

Co-authored-by: Fabian Knorr <fabian.knorr@dps.uibk.ac.at>
Force-pushed from 2fbc0ff to 68433b4.
As discussed today, this is an early draft of the new distributed scheduling model implementation. I apologize for the state of the codebase right now, but maybe you can still take a quick look to see if you can spot anything major. As I'm still in the process of migrating all existing command-graph-related tests to the new architecture (with the help of @fknorr - thanks!), the codebase currently has to support both the legacy graph_generator and the new distributed_graph_generator side by side, so there is quite a bit of code duplication and hackery.

The most interesting/relevant parts to review are probably distributed_graph_generator (obviously), as well as maybe buffer_transfer_manager and the new testing infrastructure. Figuring out all data dependencies is still quite the mess, although one could argue it's slightly better because we have one level of indirection less (no "per node" data maps anymore). I hear @fknorr has some ideas for how this could be improved at some point in the future.