Capture buffer and host-object data on synchronization points #94

Closed
fknorr wants to merge 15 commits from the captures branch

Conversation

@fknorr fknorr (Contributor) commented Feb 21, 2022

Currently, there is no native way to get results out of a Celerity buffer and back into the main thread at the end of a Celerity program. The current workaround is celerity::allow_by_ref and reference captures in host tasks, which is inelegant and error-prone.

SYCL offers host_accessor and explicit copy operations together with awaitable kernel events for that purpose. This is not a good fit for Celerity, since stalling the main thread is orders of magnitude more expensive in the distributed case.

This PR introduces Captures, a declarative API that allows requesting host objects and buffer subranges from distr_queue::slow_full_sync() and from runtime shutdown via the new distr_queue::drain() function. Specifying a capture adds the necessary dependencies and data transfers to the associated Epoch command and copies or moves the data out to the calling main thread once the epoch is reached, confining stalls to APIs which the user already expects to be slow.

Example

Drains the queue on program exit to receive a verification result in the main thread.

int main() {
    distr_queue q;
    host_object<bool> verification_passed;

    q.submit([=](handler &cgh) {
        side_effect verify{verification_passed, cgh};
        cgh.host_task(on_master_node, [=] {
            *verify = ...;
        });
    });

    return q.drain(capture{verification_passed}) ? 0 : 1;
}
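
In addition to capturing at shutdown, the same mechanism works at a synchronization point. The snippet below is an illustrative sketch only: it assumes a capture overload that takes a buffer plus a subrange (as described above) and that slow_full_sync() returns the captured host-side copy (the buffer_snapshot type discussed further down); the exact signatures may differ from the final diff.

// Illustrative sketch, not taken verbatim from this PR.
distr_queue q;
buffer<float, 1> results{range<1>{1024}};

// ... submit device kernels that write to `results` ...

// Capture a copy of the first 128 elements at the synchronization point; the
// queue remains usable afterwards, unlike after drain().
auto snapshot = q.slow_full_sync(capture{results, subrange<1>{{0}, {128}}});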
@fknorr fknorr changed the title from "Capture buffer and host-object data on syncrhonization points" to "Capture buffer and host-object data on synchronization points" on Feb 21, 2022
@fknorr fknorr force-pushed the captures branch 4 times, most recently from ff31946 to ba98498 on March 16, 2022 20:17
@fknorr fknorr force-pushed the captures branch 6 times, most recently from 00b9c79 to beb4245 on March 30, 2022 18:02
@fknorr fknorr marked this pull request as ready for review April 28, 2022 08:43
@PeterTh PeterTh (Contributor) left a comment
Some comments for now -- note that I haven't looked through the unit tests yet.

Also, I didn't add code comments for this, but I'm once again unhappy with the increasing signature bloat related to tasks ;)

class capture;

template <typename T, int Dims>
class buffer_data {
Contributor

Maybe we can think of a more descriptive name for this.
Maybe buffer_capture_data?

Contributor Author (fknorr)

How about buffer_snapshot?

Contributor Author (fknorr)

I have renamed the struct to buffer_snapshot for now.

include/capture.h (resolved)
include/distr_queue.h (outdated, resolved)
src/graph_generator.cc (outdated, resolved)
include/task_manager.h (outdated, resolved)
@github-actions github-actions bot left a comment

clang-tidy made some suggestions

test/test_utils.h (outdated, resolved)
@github-actions
clang-tidy review says "All clean, LGTM! 👍" (1 similar comment)

@fknorr fknorr (Contributor, Author) commented Aug 17, 2022

The CI failure on dpcpp:HEAD is due to a bug in Clang 15 and not related to this PR.

@fknorr fknorr requested a review from psalz August 17, 2022 10:56
@PeterTh PeterTh self-requested a review August 24, 2022 12:38
@psalz psalz (Member) left a comment

Thanks! I've added some notes - better late than never, right?

Some additional remarks:

  • As discussed in person, the std::in_place constructor of host_object currently forwards to the initializer-list constructor of T, if one exists, which is unexpected.
  • I don't fully understand why the capture object exists, or rather, why can't we pass host objects / buffers into drain directly? Is it only for specifying the buffer range?
  • I'm not sure about the naming of drain. To me this doesn't really imply that this is the last operation I'm allowed to do on a queue, essentially destroying it. It sounds more like a different kind of sync, imo.
  • It is also not obvious in my opinion that slow_full_sync produces a copy, while drain moves the object. Didn't an earlier version of this patch require host_object to be moved into drain?

I'm also still concerned that the semantic difference between host objects and buffers will be confusing to people (one is local to each worker, the other distributed). This difference existed before, but is now exacerbated by the fact that capturing a buffer returns the same data on all workers, whereas capturing a host object does (potentially) not.

queue.slow_full_sync(); // Wait for verification_passed to become available

return verification_passed ? EXIT_SUCCESS : EXIT_FAILURE;
auto mat_a_dump = queue.drain(celerity::experimental::capture{mat_a_buf});
Member

I realize you want to showcase buffer draining somewhere, but is this the right place? Before we did a distributed verification, now every node has to do the full matrix, which is not ideal imo.

The "syncing" example might be better suited for illustrating this!

Contributor Author (fknorr)

I've returned to distributed verification, but replaced the ref-capture / sync with a drain(capture(host_object)). I feel like the syncing example was made to demonstrate the old ref-capture + slow_full_sync workaround. Maybe we should drop this example altogether.

Member

I like it! Agreed on the syncing example.

include/distr_queue.h (two outdated threads, resolved)

if(tsk->get_epoch_action() == epoch_action::barrier) {
// Wait for the main thread to call resume_after_barrier. This ensures that no commands can be executed before all captures have been exfiltrated.
// It is not sufficient for the main thread to stop submitting work, since e.g. AWAIT PUSH commands do not depend on a task definition.
Member

What is the concern here? That an await push writes into a buffer while it is being exfiltrated [I don't think that can happen atm]? Or simply to avoid concurrent access to the buffer manager?

@fknorr fknorr (Contributor, Author) commented Sep 21, 2022

As soon as the master node clears the barrier, it can start submitting commands to workers. If such a command does not depend on a task definition (e.g. await push), the worker's executor will start processing it right away, leading to a race between the capture-read on that node and the await-push-write.

Member

Ahh okay I think I get it now. I think I was confused because I thought there should be an anti-dependency from an await push onto the exfiltrating epoch that prevents this race... And there is, but that's not enough because the exfiltration happens in the main thread, not the epoch worker job. So we need a way for that job to wait until the main thread is done exfiltrating. Correct?
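
For readers skimming the thread, the handshake described above could look roughly like the sketch below. This is an illustration only, not the PR's implementation: the barrier_gate class and all of its members are invented for this comment, and only resume_after_barrier is a name that actually appears in the diff.

#include <condition_variable>
#include <mutex>

// Hypothetical illustration of the barrier handshake (only resume_after_barrier
// is named in the actual diff; everything else here is invented for clarity).
class barrier_gate {
  public:
    // Executor side: on reaching a barrier epoch, block until the main thread
    // has finished exfiltrating all captured data.
    void await_main_thread() {
        std::unique_lock lock{m_mutex};
        m_cv.wait(lock, [this] { return m_released; });
        m_released = false; // re-arm for the next barrier epoch
    }

    // Main-thread side: called once all captures have been copied / moved out.
    void resume_after_barrier() {
        {
            std::lock_guard lock{m_mutex};
            m_released = true;
        }
        m_cv.notify_one();
    }

  private:
    std::mutex m_mutex;
    std::condition_variable m_cv;
    bool m_released = false;
};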

src/task_manager.cc (outdated, resolved)
include/task_manager.h (outdated, resolved)
include/runtime.h (resolved)
test/graph_generation_tests.cc (outdated, resolved)
test/graph_generation_tests.cc (resolved)
@fknorr fknorr force-pushed the captures branch 2 times, most recently from bd5d826 to 37f29e3 on September 21, 2022 14:36
@fknorr fknorr (Contributor, Author) commented Sep 21, 2022

> As discussed in person, the std::in_place constructor of host_object currently forwards to the initializer-list constructor of T, if one exists, which is unexpected.

I have replaced the "universal" initializer syntax with the ()-syntax for non-aggregate types in places related to captures / host objects.
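
To illustrate the difference (my own example, not code from this PR) with std::vector, whose initializer-list constructor shadows the (count, value) constructor under brace-initialization:

// Not from the PR: shows why forwarding in_place arguments with T{args...}
// is surprising when T has an initializer-list constructor.
std::vector<int> a{4, 2}; // initializer-list ctor -> {4, 2}
std::vector<int> b(4, 2); // (count, value) ctor   -> {2, 2, 2, 2}

// Consequently, host_object<std::vector<int>> v{std::in_place, 4, 2} would hold
// {4, 2} if the in_place arguments were forwarded with braces, but {2, 2, 2, 2}
// with the ()-forwarding now used for non-aggregate types.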

> I don't fully understand why the capture object exists, or rather, why can't we pass host objects / buffers into drain directly? Is it only for specifying the buffer range?

It is about the buffer subranges, and also to clarify the meaning of arguments to slow_full_sync in user code.

> I'm not sure about the naming of drain. To me this doesn't really imply that this is the last operation I'm allowed to do on a queue, essentially destroying it. It sounds more like a different kind of sync, imo.

IMO this is a pretty common term for "waiting for things to finish" (e.g. SLURM uses it to mark nodes that are not accepting more work in order to shut down).

> It is also not obvious in my opinion that slow_full_sync produces a copy, while drain moves the object. Didn't an earlier version of this patch require host_object to be moved into drain?

I agree that the distinction is subtle and can be confusing. It also does not serve the use case of capturing a non-copyable object at a slow_full_sync barrier. It might be worth investigating having the copy/move distinction on the capture level instead, e.g. by adding a capture_by_move wrapper (this consideration only applies to host objects anyway, since buffer data is always trivially copyable).

One thing that's "pretty" about the current solution is that a capture-on-drain does not introduce an additional shutdown-epoch after the capture-epoch. This is going to be pretty irrelevant from a performance perspective though.

> I'm also still concerned that the semantic difference between host objects and buffers will be confusing to people (one is local to each worker, the other distributed). This difference existed before, but is now exacerbated by the fact that capturing a buffer returns the same data on all workers, whereas capturing a host object does (potentially) not.

I still agree on this, although I would prefer to revisit host objects after merging this (we're talking about experimental features anyway).

@psalz psalz (Member) left a comment

> I still agree on this, although I would prefer to revisit host objects after merging this (we're talking about experimental features anyway).

Fine with me!

Thanks for addressing all remarks. Also I like the new buffer_snapshot name!

@PeterTh PeterTh (Contributor) commented Oct 11, 2022

Offline discussion results:

  • have a move_capture and a capture (copy) [bikeshed the name, @psalz] — a rough sketch of the intended usage follows below
  • get rid of drain entirely
  • in the future, implement fence, which does not require a full barrier / epoch
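
A rough sketch of how the agreed direction might look in user code. These names are placeholders from the discussion above, not a final API (this PR was eventually superseded by #151), and the exact signatures are assumptions:

// Hypothetical sketch only: the capture / move_capture names are still being
// bikeshedded, and slow_full_sync's return value is assumed, not final.
distr_queue q;
host_object<std::vector<float>> data;

// ... host tasks populate `data` ...

// capture{}: the host-side object is copied out; the queue-side instance stays valid.
auto copy = q.slow_full_sync(capture{data});

// move_capture{}: the object is moved out, e.g. for non-copyable types; the
// queue-side instance must not be accessed afterwards.
auto moved = q.slow_full_sync(move_capture{data});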

@fknorr fknorr (Contributor, Author) commented Dec 14, 2022

Superseded by #151.

@fknorr fknorr closed this Dec 14, 2022
@fknorr fknorr deleted the captures branch February 5, 2024 12:12