Detect uninitialized reads and overlapping writes #224

fknorr · 2023-11-02T13:01:52Z

During IDAG development, asserting that instructions do not read from uninitialized memory turned out to be an invaluable tool for finding bugs in dependency tracking and data migration. This PR is a back-port of that detection logic, extended to also find overlapping writes between chunks on CDAG generation.

By default, task_manager and distributed_graph_generator throw when detecting an uninitialized read or overlapping writes in order to catch internal errors and report tests with bogus access patterns. In the runtime, results from these checks are reported through warning (uninitialized read) and error (overlapping writes) logs instead - both for backwards compatibillity, and because uninitialized reads are expected when submitting kernels with read_write access in a loop (see also celerity/meta#74).

All tests with incorrect access patterns were fixed, and the reporting feature is tested both with the TDAG / CDAG test context infrastructure as well as the actual runtime.

PeterTh

This is great change which should help prevent a common source of issues/UB. Some thoughts in the comments.

src/runtime.cc

test/print_graph_tests.cc

github-actions · 2023-11-02T17:34:28Z

no optimizations

Check-perf-impact results: (1d1f40270fc25a329bb16053c8fa15f2)

⚠️ Significant slowdown (>1.25x) in some microbenchmark results: building command graphs in a dedicated scheduler thread for N nodes - 1 > immediate submission to a scheduler thread / expanding tree topology, generating large command graphs for N nodes - 16 / wave_sim topology, benchmark independent task pattern with N tasks - 100 / task generation
🚀 Significant speedup (<0.80x) in some microbenchmark results: normalizing a fully mergeable tiling of boxes - 1 / large, embedded in 3d

Relative execution time per category: (mean of relative medians)

command-graph : 1.17x ⚠️
graph-nodes : 0.99x
grid : 1.00x
scheduler : 1.12x ⚠️
system : 1.10x
task-graph : 1.01x

github-actions · 2023-11-03T07:23:34Z

distributed overlapping-write detection

Check-perf-impact results: (7e3c8bfa480dfad2a4f08c4fabe32cc2)

⚠️ Significant slowdown (>1.25x) in some microbenchmark results: normalizing randomized box sets - 3d / small - native, building command graphs in a dedicated scheduler thread for N nodes - 1 > immediate submission to a scheduler thread / expanding tree topology, benchmark independent task pattern with N tasks - 100 / task generation
🚀 Significant speedup (<0.80x) in some microbenchmark results: benchmark stencil pattern with N time steps - 50 / iterations

Relative execution time per category: (mean of relative medians)

command-graph : 1.09x
graph-nodes : 0.99x
grid : 1.02x
scheduler : 1.09x
system : 1.08x
task-graph : 1.02x

github-actions · 2023-11-03T11:40:58Z

distributed overlapping-write detection + bounding-box uninitialized-read detection

Check-perf-impact results: (385bf6fbfe375253106eaa9f9b53a88e)

⚠️ Significant slowdown (>1.25x) in some microbenchmark results: benchmark independent task pattern with N tasks - 100 / task generation
🚀 Significant speedup (<0.80x) in some microbenchmark results: benchmark stencil pattern with N time steps - 1000 / iterations

Relative execution time per category: (mean of relative medians)

command-graph : 1.10x
graph-nodes : 0.98x
grid : 1.00x
scheduler : 1.06x
system : 1.05x
task-graph : 0.98x

fknorr · 2023-11-03T11:49:02Z

I managed to optimize this a bit further, but CDAG generation still takes a 10% performance hit.

test/graph_gen_transfer_tests.cc

psalz

Good change, thank you! A few minor notes:

include/utils.h

src/distributed_graph_generator.cc

test/graph_generation_tests.cc

test/runtime_tests.cc

test/task_graph_tests.cc

test/graph_gen_transfer_tests.cc

github-actions

clang-tidy made some suggestions

test/dag_benchmarks.cc

fknorr

I have addressed most concerns, but realized some potential inconsistencies in other features.

CMakeLists.txt

test/dag_benchmarks.cc

github-actions · 2023-11-14T21:09:03Z

disabling diagnostics by default in release builds

Check-perf-impact results: (9c7d95e3b2850796c2d810616c0f1aff)

⚠️ Significant slowdown (>1.25x) in some microbenchmark results: building command graphs in a dedicated scheduler thread for N nodes - 1 > immediate submission to a scheduler thread / expanding tree topology
🚀 Significant speedup (<0.80x) in some microbenchmark results: normalizing a fully mergeable tiling of boxes - 1 / large, embedded in 3d

Relative execution time per category: (mean of relative medians)

command-graph : 1.02x
graph-nodes : 1.01x
grid : 1.00x
scheduler : 1.02x
system : 1.00x
task-graph : 1.00x

github-actions · 2023-11-18T11:45:54Z

do not maintain initialized-region when diagnostics are disabled

Check-perf-impact results: (a1f01e5fe178a68d42a7063603612487)

⚠️ Significant slowdown (>1.25x) in some microbenchmark results: building command graphs in a dedicated scheduler thread for N nodes - 1 > immediate submission to a scheduler thread / expanding tree topology
🚀 Significant speedup (<0.80x) in some microbenchmark results: building command graphs in a dedicated scheduler thread for N nodes - 1 > immediate submission to a scheduler thread / chain topology

Relative execution time per category: (mean of relative medians)

command-graph : 1.01x
graph-nodes : 0.99x
grid : 1.01x
scheduler : 1.02x
system : 0.99x
task-graph : 1.00x

fknorr · 2023-11-18T12:36:47Z

I have turned the task-manager / graph-generator policy settings into constructor arguments; that way I can skip updating the initialized-region of a buffer when diagnostics are disabled (previously this region had to be maintained in anticipation of a change in polices, which of course never happened). I've also added a bit more documentation.

task_manager and distributed_graph_generator by default throw when detecting an uninitialized read or overlapping writes in order to catch invalid test code or internal bugs. In `runtime`, these conditions are reported through warning / error logs instead.

PeterTh

Looks good to me, just some smaller remarks.

include/task.h

include/utils.h

src/distributed_graph_generator.cc

test/print_graph_tests.cc

github-actions

clang-tidy made some suggestions

src/distributed_graph_generator.cc

github-actions · 2023-11-29T13:23:10Z

Check-perf-impact results: (3b34e58e3c100f4c3541a1ed59580f72)

❓ No new benchmark data submitted. ❓
Please re-run the microbenchmarks and include the results if your commit could potentially affect performance.

github-actions

clang-tidy made some suggestions

test/utils_tests.cc

github-actions · 2023-11-29T15:50:03Z

Check-perf-impact results: (54ff34045c40bf12cbc2ca95f475792b)

⚠️ Significant slowdown (>1.25x) in some microbenchmark results: building command graphs in a dedicated scheduler thread for N nodes - 1 > immediate submission to a scheduler thread / expanding tree topology
🚀 Significant speedup (<0.80x) in some microbenchmark results: building command graphs in a dedicated scheduler thread for N nodes - 1 > immediate submission to a scheduler thread / chain topology, building command graphs in a dedicated scheduler thread for N nodes - 1 > immediate submission to a scheduler thread / contracting tree topology

Relative execution time per category: (mean of relative medians)

command-graph : 1.00x
graph-nodes : 0.98x
grid : 1.01x
scheduler : 1.01x
system : 0.98x
task-graph : 1.01x

fknorr · 2023-11-29T15:52:30Z

I have used this opportunity to refactor how task debug names are generated from class KernelName types and printed in debug outputs. Unnamed kernels now do not show "unnamed_kernel" as their name anymore and the task type is visualized as well.

There is now a new step_builder::name() function to conveniently set task debug names from a distributed_graph_generator_test_context.

psalz

Muy bien!

* Stringify the default kernel name to "" instead of "unnamed_kernel" * Kernel names from `<class KernelName>` are simplified before being inserted into `struct task` * Move Type name demangling to utils.cc * Add print_task_debug_label to show task type and optional label in debug messages

task_manager and distributed_graph_generator by default throw when detecting an uninitialized read or overlapping writes in order to catch invalid test code or internal bugs. In `runtime`, these conditions are reported through warning / error logs instead.

fknorr requested review from psalz and PeterTh November 2, 2023 13:01

fknorr self-assigned this Nov 2, 2023

fknorr added this to the 0.5.0 milestone Nov 2, 2023

PeterTh reviewed Nov 2, 2023

View reviewed changes

src/runtime.cc Outdated Show resolved Hide resolved

test/print_graph_tests.cc Outdated Show resolved Hide resolved

test/print_graph_tests.cc Outdated Show resolved Hide resolved

fknorr force-pushed the access-pattern-diagnostics branch 2 times, most recently from 1bc771c to 7cd1d97 Compare November 2, 2023 16:25

fknorr added a commit that referenced this pull request Nov 2, 2023

Update benchmark results for #224

8e9b8c1

fknorr added a commit that referenced this pull request Nov 3, 2023

[RM] Update benchmark results for #224

8871636

fknorr added a commit that referenced this pull request Nov 3, 2023

[RM] Update benchmark results for #224

6dbafbb

fknorr commented Nov 3, 2023

View reviewed changes

test/graph_gen_transfer_tests.cc Show resolved Hide resolved

psalz requested changes Nov 7, 2023

View reviewed changes

fknorr force-pushed the access-pattern-diagnostics branch from f43978e to 04ccd4d Compare November 8, 2023 11:19

github-actions bot reviewed Nov 8, 2023

View reviewed changes

test/dag_benchmarks.cc Outdated Show resolved Hide resolved

test/dag_benchmarks.cc Outdated Show resolved Hide resolved

fknorr commented Nov 8, 2023

View reviewed changes

CMakeLists.txt Show resolved Hide resolved

CMakeLists.txt Outdated Show resolved Hide resolved

test/dag_benchmarks.cc Outdated Show resolved Hide resolved

fknorr added a commit that referenced this pull request Nov 9, 2023

Update benchmark results for #224

2df606b

celerity deleted a comment from github-actions bot Nov 9, 2023

fknorr force-pushed the access-pattern-diagnostics branch from 2df606b to a6a49f6 Compare November 13, 2023 16:04

fknorr added a commit that referenced this pull request Nov 14, 2023

Update benchmark results for #224

5318e68

celerity deleted a comment from github-actions bot Nov 14, 2023

fknorr added a commit that referenced this pull request Nov 16, 2023

[RM] Update benchmark results for #224

7c31444

fknorr force-pushed the access-pattern-diagnostics branch from 5318e68 to 8d13007 Compare November 16, 2023 21:51

fknorr added a commit that referenced this pull request Nov 18, 2023

Update benchmark results for #224

e590163

fknorr requested review from PeterTh and psalz November 18, 2023 12:36

PeterTh approved these changes Nov 20, 2023

View reviewed changes

include/task.h Show resolved Hide resolved

include/utils.h Outdated Show resolved Hide resolved

src/distributed_graph_generator.cc Show resolved Hide resolved

test/print_graph_tests.cc Show resolved Hide resolved

github-actions bot reviewed Nov 28, 2023

View reviewed changes

src/distributed_graph_generator.cc Outdated Show resolved Hide resolved

fknorr force-pushed the access-pattern-diagnostics branch from d9c3a3b to 1df889a Compare November 29, 2023 13:22

github-actions bot reviewed Nov 29, 2023

View reviewed changes

test/utils_tests.cc Show resolved Hide resolved

fknorr added a commit that referenced this pull request Nov 29, 2023

Update benchmark results for #224

a785278

fknorr force-pushed the access-pattern-diagnostics branch from 1df889a to a785278 Compare November 29, 2023 15:49

fknorr added a commit that referenced this pull request Nov 29, 2023

Update benchmark results for #224

0ad5db9

fknorr force-pushed the access-pattern-diagnostics branch 3 times, most recently from 3f4a28d to b7b6812 Compare December 4, 2023 13:42

psalz approved these changes Dec 4, 2023

View reviewed changes

fknorr added 4 commits December 4, 2023 16:53

Update benchmark results for #224

19b2a41

Add utils::panic() which throws in tests and aborts in production

4b73a23

fknorr force-pushed the access-pattern-diagnostics branch from b7b6812 to 4b73a23 Compare December 4, 2023 15:53

fknorr merged commit f12ddf7 into master Dec 5, 2023
28 checks passed

fknorr added a commit that referenced this pull request Dec 5, 2023

Update benchmark results for #224

50c3686

fknorr deleted the access-pattern-diagnostics branch January 25, 2024 09:18

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Detect uninitialized reads and overlapping writes #224

Detect uninitialized reads and overlapping writes #224

fknorr commented Nov 2, 2023

PeterTh left a comment

github-actions bot commented Nov 2, 2023 •

edited by fknorr

Loading

github-actions bot commented Nov 3, 2023 •

edited by fknorr

Loading

github-actions bot commented Nov 3, 2023 •

edited by fknorr

Loading

fknorr commented Nov 3, 2023

psalz left a comment

github-actions bot left a comment

fknorr left a comment

github-actions bot commented Nov 14, 2023 •

edited by fknorr

Loading

github-actions bot commented Nov 18, 2023 •

edited by fknorr

Loading

fknorr commented Nov 18, 2023 •

edited

Loading

PeterTh left a comment

github-actions bot left a comment

github-actions bot commented Nov 29, 2023

github-actions bot left a comment

github-actions bot commented Nov 29, 2023

fknorr commented Nov 29, 2023 •

edited

Loading

psalz left a comment

Detect uninitialized reads and overlapping writes #224

Detect uninitialized reads and overlapping writes #224

Conversation

fknorr commented Nov 2, 2023

PeterTh left a comment

Choose a reason for hiding this comment

github-actions bot commented Nov 2, 2023 • edited by fknorr Loading

no optimizations

github-actions bot commented Nov 3, 2023 • edited by fknorr Loading

distributed overlapping-write detection

github-actions bot commented Nov 3, 2023 • edited by fknorr Loading

distributed overlapping-write detection + bounding-box uninitialized-read detection

fknorr commented Nov 3, 2023

psalz left a comment

Choose a reason for hiding this comment

github-actions bot left a comment

Choose a reason for hiding this comment

fknorr left a comment

Choose a reason for hiding this comment

github-actions bot commented Nov 14, 2023 • edited by fknorr Loading

disabling diagnostics by default in release builds

github-actions bot commented Nov 18, 2023 • edited by fknorr Loading

do not maintain initialized-region when diagnostics are disabled

fknorr commented Nov 18, 2023 • edited Loading

PeterTh left a comment

Choose a reason for hiding this comment

github-actions bot left a comment

Choose a reason for hiding this comment

github-actions bot commented Nov 29, 2023

github-actions bot left a comment

Choose a reason for hiding this comment

github-actions bot commented Nov 29, 2023

fknorr commented Nov 29, 2023 • edited Loading

psalz left a comment

Choose a reason for hiding this comment

github-actions bot commented Nov 2, 2023 •

edited by fknorr

Loading

github-actions bot commented Nov 3, 2023 •

edited by fknorr

Loading

github-actions bot commented Nov 3, 2023 •

edited by fknorr

Loading

github-actions bot commented Nov 14, 2023 •

edited by fknorr

Loading

github-actions bot commented Nov 18, 2023 •

edited by fknorr

Loading

fknorr commented Nov 18, 2023 •

edited

Loading

fknorr commented Nov 29, 2023 •

edited

Loading