Capture buffer and host-object data on synchronization points #94

Closed
fknorr wants to merge 15 commits from the captures branch

Conversation

@fknorr fknorr (Contributor) commented Feb 21, 2022

Currently, there is no native way to get results out of a Celerity buffer and back into the main thread at the end of a Celerity program. The current workaround is celerity::allow_by_ref and reference captures in host tasks, which is inelegant and error-prone.

SYCL offers host_accessor and explicit copy operations together with awaitable kernel events for that purpose. This is not a good fit for Celerity, since stalling the main thread is orders of magnitude more expensive in the distributed case.

This PR introduces Captures, a declarative API that allows requesting host objects and buffer subranges from distr_queue::slow_full_sync() and from runtime shutdown via the new distr_queue::drain() function. Specifying a capture adds the necessary dependencies and data transfers to the associated Epoch command and copies or moves the data out to the calling main thread once the epoch is reached, confining stalls to APIs which the user already expects to be slow.

Example

Drains the queue on program exit to receive a verification result in the main thread.

int main() {
    distr_queue q;
    host_object<bool> verification_passed;

    q.submit([=](handler &cgh) {
        side_effect verify{verification_passed, cgh};
        cgh.host_task(on_master_node, [=] {
            *verify = ...;
        });
    });

    return q.drain(capture{verification_passed}) ? 0 : 1;
}
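
In addition to capturing at shutdown, the same mechanism works at a synchronization point. The snippet below is an illustrative sketch only: it assumes a capture overload that takes a buffer plus a subrange (as described above) and that slow_full_sync() returns the captured host-side copy (the buffer_snapshot type discussed further down); the exact signatures may differ from the final diff.

// Illustrative sketch, not taken verbatim from this PR.
distr_queue q;
buffer<float, 1> results{range<1>{1024}};

// ... submit device kernels that write to `results` ...

// Capture a copy of the first 128 elements at the synchronization point; the
// queue remains usable afterwards, unlike after drain().
auto snapshot = q.slow_full_sync(capture{results, subrange<1>{{0}, {128}}});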
@fknorr fknorr changed the title from "Capture buffer and host-object data on syncrhonization points" to "Capture buffer and host-object data on synchronization points" on Feb 21, 2022
@fknorr fknorr force-pushed the captures branch 4 times, most recently from ff31946 to ba98498 on March 16, 2022 20:17
@fknorr fknorr force-pushed the captures branch 6 times, most recently from 00b9c79 to beb4245 on March 30, 2022 18:02
@fknorr fknorr marked this pull request as ready for review April 28, 2022 08:43
@PeterTh PeterTh (Contributor) left a comment
Some comments for now -- note that I haven't looked through the unit tests yet.

Also, I didn't add code comments for this, but I'm once again unhappy with the increasing signature bloat related to tasks ;)

class capture;

template <typename T, int Dims>
class buffer_data {
Contributor

Maybe we can think of a more descriptive name for this.
Maybe buffer_capture_data?

Contributor Author (fknorr)

How about buffer_snapshot?

Contributor Author (fknorr)

I have renamed the struct to buffer_snapshot for now.

include/capture.h (resolved)
include/distr_queue.h (outdated, resolved)
src/graph_generator.cc (outdated, resolved)
include/task_manager.h (outdated, resolved)
@github-actions github-actions bot left a comment

clang-tidy made some suggestions

test/test_utils.h (outdated, resolved)
@github-actions
clang-tidy review says "All clean, LGTM! 👍" (1 similar comment)

@fknorr fknorr (Contributor, Author) commented Aug 17, 2022

The CI failure on dpcpp:HEAD is due to a bug in Clang 15 and not related to this PR.

@fknorr fknorr requested a review from psalz August 17, 2022 10:56
@PeterTh PeterTh self-requested a review August 24, 2022 12:38
@psalz psalz (Member) left a comment

Thanks! I've added some notes - better late than never, right?

Some additional remarks:

  • As discussed in person, the std::in_place constructor of host_object currently forwards to the initializer-list constructor of T, if one exists, which is unexpected.
  • I don't fully understand why the capture object exists, or rather, why can't we pass host objects / buffers into drain directly? Is it only for specifying the buffer range?
  • I'm not sure about the naming of drain. To me this doesn't really imply that this is the last operation I'm allowed to do on a queue, essentially destroying it. It sounds more like a different kind of sync, imo.
  • It is also not obvious in my opinion that slow_full_sync produces a copy, while drain moves the object. Didn't an earlier version of this patch require host_object to be moved into drain?

I'm also still concerned that the semantic difference between host objects and buffers will be confusing to people (one is local to each worker, the other distributed). This difference existed before, but is now exacerbated by the fact that capturing a buffer returns the same data on all workers, whereas capturing a host object does (potentially) not.

queue.slow_full_sync(); // Wait for verification_passed to become available

return verification_passed ? EXIT_SUCCESS : EXIT_FAILURE;
auto mat_a_dump = queue.drain(celerity::experimental::capture{mat_a_buf});
Member

I realize you want to showcase buffer draining somewhere, but is this the right place? Before we did a distributed verification, now every node has to do the full matrix, which is not ideal imo.

The "syncing" example might be better suited for illustrating this!

Contributor Author (fknorr)

I've returned to distributed verification, but replaced the ref-capture / sync with a drain(capture(host_object)). I feel like the syncing example was made to demonstrate the old ref-capture + slow_full_sync workaround. Maybe we should drop this example altogether.

Member

I like it! Agreed on the syncing example.

include/distr_queue.h (two outdated threads, resolved)

if(tsk->get_epoch_action() == epoch_action::barrier) {
// Wait for the main thread to call resume_after_barrier. This ensures that no commands can be executed before all captures have been exfiltrated.
// It is not sufficient for the main thread to stop submitting work, since e.g. AWAIT PUSH commands do not depend on a task definition.
Member

What is the concern here? That an await push writes into a buffer while it is being exfiltrated [I don't think that can happen atm]? Or simply to avoid concurrent access to the buffer manager?

@fknorr fknorr (Contributor, Author) commented Sep 21, 2022

As soon as the master node clears the barrier, it can start submitting commands to workers. If such a command does not depend on a task definition (e.g. await push), the worker's executor will start processing it right away, leading to a race between the capture-read on that node and the await-push-write.

Member

Ahh okay I think I get it now. I think I was confused because I thought there should be an anti-dependency from an await push onto the exfiltrating epoch that prevents this race... And there is, but that's not enough because the exfiltration happens in the main thread, not the epoch worker job. So we need a way for that job to wait until the main thread is done exfiltrating. Correct?
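
For readers skimming the thread, the handshake described above could look roughly like the sketch below. This is an illustration only, not the PR's implementation: the barrier_gate class and all of its members are invented for this comment, and only resume_after_barrier is a name that actually appears in the diff.

#include <condition_variable>
#include <mutex>

// Hypothetical illustration of the barrier handshake (only resume_after_barrier
// is named in the actual diff; everything else here is invented for clarity).
class barrier_gate {
  public:
    // Executor side: on reaching a barrier epoch, block until the main thread
    // has finished exfiltrating all captured data.
    void await_main_thread() {
        std::unique_lock lock{m_mutex};
        m_cv.wait(lock, [this] { return m_released; });
        m_released = false; // re-arm for the next barrier epoch
    }

    // Main-thread side: called once all captures have been copied / moved out.
    void resume_after_barrier() {
        {
            std::lock_guard lock{m_mutex};
            m_released = true;
        }
        m_cv.notify_one();
    }

  private:
    std::mutex m_mutex;
    std::condition_variable m_cv;
    bool m_released = false;
};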

src/task_manager.cc (outdated, resolved)
include/task_manager.h (outdated, resolved)
include/runtime.h (resolved)
test/graph_generation_tests.cc (outdated, resolved)
test/graph_generation_tests.cc (resolved)
@fknorr fknorr force-pushed the captures branch 2 times, most recently from bd5d826 to 37f29e3 on September 21, 2022 14:36
@fknorr fknorr (Contributor, Author) commented Sep 21, 2022

> As discussed in person, the std::in_place constructor of host_object currently forwards to the initializer-list constructor of T, if one exists, which is unexpected.

I have replaced the "universal" initializer syntax with the ()-syntax for non-aggregate types in places related to captures / host objects.
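
To illustrate the difference (my own example, not code from this PR) with std::vector, whose initializer-list constructor shadows the (count, value) constructor under brace-initialization:

// Not from the PR: shows why forwarding in_place arguments with T{args...}
// is surprising when T has an initializer-list constructor.
std::vector<int> a{4, 2}; // initializer-list ctor -> {4, 2}
std::vector<int> b(4, 2); // (count, value) ctor   -> {2, 2, 2, 2}

// Consequently, host_object<std::vector<int>> v{std::in_place, 4, 2} would hold
// {4, 2} if the in_place arguments were forwarded with braces, but {2, 2, 2, 2}
// with the ()-forwarding now used for non-aggregate types.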

> I don't fully understand why the capture object exists, or rather, why can't we pass host objects / buffers into drain directly? Is it only for specifying the buffer range?

It is about the buffer subranges, and also to clarify the meaning of arguments to slow_full_sync in user code.

> I'm not sure about the naming of drain. To me this doesn't really imply that this is the last operation I'm allowed to do on a queue, essentially destroying it. It sounds more like a different kind of sync, imo.

IMO this is a pretty common term for "waiting for things to finish" (e.g. SLURM uses it to mark nodes that are not accepting more work in order to shut down).

> It is also not obvious in my opinion that slow_full_sync produces a copy, while drain moves the object. Didn't an earlier version of this patch require host_object to be moved into drain?

I agree that the distinction is subtle and can be confusing. It also does not serve the use case of capturing a non-copyable object at a slow_full_sync barrier. It might be worth investigating having the copy/move distinction on the capture level instead, e.g. by adding a capture_by_move wrapper (this consideration only applies to host objects anyway, since buffer data is always trivially copyable).

One thing that's "pretty" about the current solution is that a capture-on-drain does not introduce an additional shutdown-epoch after the capture-epoch. This is going to be pretty irrelevant from a performance perspective though.

> I'm also still concerned that the semantic difference between host objects and buffers will be confusing to people (one is local to each worker, the other distributed). This difference existed before, but is now exacerbated by the fact that capturing a buffer returns the same data on all workers, whereas capturing a host object does (potentially) not.

I still agree on this, although I would prefer to revisit host objects after merging this (we're talking about experimental features anyway).

@psalz psalz (Member) left a comment

> I still agree on this, although I would prefer to revisit host objects after merging this (we're talking about experimental features anyway).

Fine with me!

Thanks for addressing all remarks. Also I like the new buffer_snapshot name!

@PeterTh PeterTh (Contributor) commented Oct 11, 2022

Offline discussion results:

  • have a move_capture and a capture (copy) [bikeshed the name, @psalz] — a rough sketch of the intended usage follows below
  • get rid of drain entirely
  • in the future, implement fence, which does not require a full barrier / epoch
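
A rough sketch of how the agreed direction might look in user code. These names are placeholders from the discussion above, not a final API (this PR was eventually superseded by #151), and the exact signatures are assumptions:

// Hypothetical sketch only: the capture / move_capture names are still being
// bikeshedded, and slow_full_sync's return value is assumed, not final.
distr_queue q;
host_object<std::vector<float>> data;

// ... host tasks populate `data` ...

// capture{}: the host-side object is copied out; the queue-side instance stays valid.
auto copy = q.slow_full_sync(capture{data});

// move_capture{}: the object is moved out, e.g. for non-copyable types; the
// queue-side instance must not be accessed afterwards.
auto moved = q.slow_full_sync(move_capture{data});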

@fknorr fknorr (Contributor, Author) commented Dec 14, 2022

Superseded by #151.

@fknorr fknorr closed this Dec 14, 2022
@fknorr fknorr deleted the captures branch February 5, 2024 12:12