Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Re-implement box and region types #204

Merged
merged 17 commits into from
Sep 15, 2023
Merged

Re-implement box and region types #204

merged 17 commits into from
Sep 15, 2023

Conversation

fknorr
Copy link
Contributor

@fknorr fknorr commented Aug 15, 2023

What started as academic curiosity turned into a classic month-long summertime deep-dive into a topic that is not even close to the top of our priority list. Oh well.

Currently, Celerity uses AllScale's GridBox and GridRegion types to describe point sets that originate from boxes or tilings of boxes. Out of these, GridRegion is particularly interesting: Internally a vector<GridBox>, it ensures that the tiling does not contain overlaps, and also opportunistically merges adjacent boxes when computing unions.

On the path to this PR, I set out on the following mission:

  1. Concisely re-implement box and region types in a way that fits current Celerity, namely through an int dimensions parameter, unsigned coordinate types and support for 0-dimensional instances.
  2. Investigate whether there's a better, or even optimal way of computing minimal, non-overlapping tilings for regions.
  3. Optimize the implementation of union, intersection and difference operations without resorting to a full R-tree.

The user-facing API is mostly self-explanatory and similar to the old one, but the inner workings require some explanation. I have made an effort to document all algorithmic decisions as well as possible.

Finally, this PR is able to remove the entirety of AllScale from our dependencies (some 20k LOC) 🥳

Region Normalization

The key innovation in the new region implementation is normalization. While regions are still represented as vectors of boxes, the new region type normalizes the tiling such that any two regions that cover the same points have identical box sequences. Normalization of an arbitrary user-provided box set works as follows:

  1. Remove all empty boxes
  2. In all dimensions except the last ("fastest"), collect the set of box boundaries (minima and maxima) in that dimension as the set of dissection lines
  3. Dissect all boxes along the obtained lines
  4. Remove all pairwise fully-covered boxes
  5. Merge connected / overlapping boxes, first along the the last ("fastest") dimension, up until the first ("slowest") dimension
  6. Sort boxes according to their coordinates

The resulting representation is unique for the covered set of points and maximizes the extent along the last ("fastest") dimension while minimizing the number of total boxes. A small example in 2D:

region-normalization

All region set operations (union, intersection, difference) work on these normalized regions and take the appropriate step to ensure their result remains normalized as well.

Effective Dimensionality

Like we saw in the implementation of the new region map, treating every region as three-dimensional for the purpose of "dimensionality erasure" can be costly. This is why this PR introduces the concept of effective dimensionality:

If the last k coordinates of an n-dimensional range are 1, of an id are 0 or of a box are (0, 1), its effective dimensionality is n-k.

Practically this means that the result of a box_cast<3>(box<1>) has storage dimensionality 3 but retains effective dimensionality 1, and it can be box_cast back to a 1-dimensional box without data loss.

The efficiency of region algorithms profits from knowledge about the data's effective dimensionality. Instead of true dimensionality erasure through a pimpl or similar, the region algorithms will detect the effective dimensionality on the data itself and dispatch to the correct, optimal (templated) implementation.

To improve robustness, this PR turns {range,id,chunk,subrange,box,region}_cast into checked casts, meaning that a debug build will assert that no type can be cast to a storage dimensionality below its effective dimensionality. Multiple tests were using the cast functions to create lower-dimensionality ranges in generic tests, for these, I have introduced test_utils::truncate_{range,id,...} as a replacement.

Grid Microbenchmarks

For development, I have implemented a bunch of benchmarks on both old and new implementations (for the old impl, benchmark code is added in Re-implement grid data structures with normalized regions and removed again in Remove inclusion of old grid implementation).

Grid Microbenchmark Results

The new impl appears to be about 4x faster on average across a wide variety of scenarios, but there are also individual instances where it is >5x slower. Bear in mind that this is not an apples-to-apples comparison because the resulting tilings differ between old and new impl.

Future Work

  • Use the 0-dimensional region type in our new region_map implementation
  • Make region_map determine the impl dimensionality from its range::get_min_dimensions
  • Return regions instead of boxes in region_map::get_region_values
  • Investigate if region_map can be made to operate in the same normalized representation to guarantee determinism

@fknorr fknorr mentioned this pull request Aug 15, 2023
3 tasks
@fknorr fknorr self-assigned this Aug 15, 2023
Copy link

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

clang-tidy made some suggestions

include/ranges.h Show resolved Hide resolved
src/grid.cc Show resolved Hide resolved
@github-actions
Copy link

github-actions bot commented Aug 15, 2023

Benchmark results with std::vector regions

Check-perf-impact results: (ef93035622456aad56ecc5a640eb0c87)

⚠️ Significant slowdown in some microbenchmark results: 4 individual benchmarks affected
🚀 Significant speedup in some microbenchmark results: 31 individual benchmarks affected
Added microbenchmark(s): 55 individual benchmarks affected

Overall relative execution time: 0.93x (mean of relative medians)

@fknorr fknorr marked this pull request as ready for review August 15, 2023 11:41
@fknorr fknorr requested review from psalz and PeterTh August 15, 2023 11:42
src/grid.cc Outdated Show resolved Hide resolved
@fknorr fknorr force-pushed the grid-from-scratch branch from 7a1f0dd to 0d17c4c Compare August 23, 2023 09:40
fknorr added a commit to fknorr/celerity-runtime that referenced this pull request Aug 23, 2023
Copy link
Contributor

@PeterTh PeterTh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good overall, with a few change requests/comments.

As previously discussed offline, I am a bit concerned about the increase in the number of benchmark cases, since the current infrastructure has no way concepts of groups, or the importance of individual tests. But this is something which can be fixed independently in a future PR.

include/buffer_transfer_manager.h Show resolved Hide resolved
include/grid.h Outdated Show resolved Hide resolved
include/grid.h Outdated Show resolved Hide resolved
include/ranges.h Outdated Show resolved Hide resolved
include/workaround.h Show resolved Hide resolved
src/distributed_graph_generator.cc Outdated Show resolved Hide resolved
src/grid.cc Outdated Show resolved Hide resolved
@celerity celerity deleted a comment from github-actions bot Aug 25, 2023
@celerity celerity deleted a comment from github-actions bot Aug 25, 2023
@celerity celerity deleted a comment from github-actions bot Aug 25, 2023
@github-actions
Copy link

github-actions bot commented Aug 25, 2023

Benchmark results with gch::small_vector regions

Check-perf-impact results: (734d1429560e5faf06afaacb00f6674a)

⚠️ Significant slowdown in some microbenchmark results: benchmark intrusive graph dependency handling with N nodes - 10 / checking for dependencies, benchmark intrusive graph dependency handling with N nodes - 100 / creating nodes, generating large command graphs for N nodes - 16 / contracting tree topology
🚀 Significant speedup in some microbenchmark results: 37 individual benchmarks affected
Added microbenchmark(s): 55 individual benchmarks affected

Overall relative execution time: 0.91x (mean of relative medians)

@fknorr
Copy link
Contributor Author

fknorr commented Aug 25, 2023

Regions appear to be marginally faster when using gch::small_vector as the coordinate storage, even with mimalloc active. This is probably due to many regions just being single boxes in practice.

@fknorr fknorr force-pushed the grid-from-scratch branch from 1559539 to c24bdf3 Compare August 28, 2023 08:01
@PeterTh
Copy link
Contributor

PeterTh commented Aug 28, 2023

Regions appear to be marginally faster when using gch::small_vector as the coordinate storage, even with mimalloc active. This is probably due to many regions just being single boxes in practice.

This makes sense to me, even with mimalloc no allocation is better than an allocation. In that case, I'd go with that storage type, it's not like it changes anything about the interface or usage.

@fknorr fknorr added this to the 0.5.0 milestone Sep 8, 2023
Copy link
Member

@psalz psalz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very nice! That's an impressive level of optimization. I cannot say that I've groked all the algorithms yet, will have to take another look on Monday.

include/region_map.h Show resolved Hide resolved
src/buffer_transfer_manager.cc Outdated Show resolved Hide resolved
src/buffer_manager.cc Outdated Show resolved Hide resolved
test/grid_test_utils.cc Outdated Show resolved Hide resolved
include/print_utils.h Outdated Show resolved Hide resolved
include/ranges.h Outdated Show resolved Hide resolved
include/ranges.h Outdated Show resolved Hide resolved
test/grid_test_utils.h Outdated Show resolved Hide resolved
src/grid.cc Outdated Show resolved Hide resolved
Copy link
Member

@psalz psalz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few more notes!

src/grid.cc Outdated Show resolved Hide resolved
src/grid.cc Outdated Show resolved Hide resolved
src/grid.cc Outdated Show resolved Hide resolved
src/grid.cc Outdated Show resolved Hide resolved
test/grid_benchmarks.cc Outdated Show resolved Hide resolved
Copy link

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

clang-tidy made some suggestions

test/integration/backend.cc Outdated Show resolved Hide resolved
@github-actions
Copy link

Check-perf-impact results: (5a19ced85f862a00d0114dd241122462)

⚠️ Significant slowdown in some microbenchmark results: building command graphs in a dedicated scheduler thread for N nodes - 1 > immediate submission to a scheduler thread / contracting tree topology, benchmark independent task pattern with N tasks - 100 / task generation
🚀 Significant speedup in some microbenchmark results: 28 individual benchmarks affected
Added microbenchmark(s): 55 individual benchmarks affected

Overall relative execution time: 0.93x (mean of relative medians)

@celerity celerity deleted a comment from github-actions bot Sep 14, 2023
@fknorr
Copy link
Contributor Author

fknorr commented Sep 14, 2023

The huge "slowdown" in the independent-task-generation benchmark looks spurious to me.

@PeterTh
Copy link
Contributor

PeterTh commented Sep 14, 2023

Probably. You'd think a test where a run takes ~10ms wouldn't have such big variance, but the previous run already had huge error bars. A bit surprising.

@fknorr fknorr requested review from PeterTh and psalz September 14, 2023 13:58
Copy link
Member

@psalz psalz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🚢

@fknorr fknorr merged commit 3d7c59c into master Sep 15, 2023
@fknorr fknorr deleted the grid-from-scratch branch September 15, 2023 08:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants