Capture buffer and host-object data on synchronization points #94
Conversation
Force-pushed ff31946 to ba98498.
Force-pushed 00b9c79 to beb4245.
Some comments for now -- note that I haven't looked through the unit tests yet.
Also, I didn't add code comments for this, but I'm once again unhappy with the increasing signature bloat related to tasks ;)
include/capture.h (outdated)

class capture;

template <typename T, int Dims>
class buffer_data {
Maybe we can think of a more descriptive name for this. Maybe buffer_capture_data?
How about buffer_snapshot?
I have renamed the struct to buffer_snapshot for now.
clang-tidy made some suggestions
clang-tidy review says "All clean, LGTM! 👍"
The CI failure on
Thanks! I've added some notes - better late than never, right?
Some additional remarks:
- As discussed in person, the std::in_place constructor of host_object currently forwards to the initializer-list constructor of T, if one exists, which is unexpected.
- I don't fully understand why the capture object exists, or rather, why can't we pass host objects / buffers into drain directly? Is it only for specifying the buffer range?
- I'm not sure about the naming of drain. To me this doesn't really imply that this is the last operation I'm allowed to do on a queue, essentially destroying it. It sounds more like a different kind of sync, imo.
- It is also not obvious in my opinion that slow_full_sync produces a copy, while drain moves the object (see the sketch after this list). Didn't an earlier version of this patch require host_object to be moved into drain?
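To make that last point concrete, here is a minimal sketch of the copy-vs-move distinction, assuming the experimental API shape from this PR (host_object, capture, slow_full_sync, drain); the exact signatures and return types are not shown in this conversation and are assumed:

```cpp
#include <celerity.h>

#include <vector>

int main() {
	celerity::distr_queue queue;
	celerity::experimental::host_object<std::vector<float>> results{std::in_place};

	// ... host tasks fill the vector held by `results` ...

	// Capturing at a slow_full_sync: the queue keeps using the host object
	// afterwards, so the captured value has to be a copy. A non-copyable T
	// could not be captured here.
	std::vector<float> copied = queue.slow_full_sync(celerity::experimental::capture{results});

	// Capturing at drain: this is the last operation on the queue, so the
	// contained value can be moved out instead of copied.
	std::vector<float> moved = queue.drain(celerity::experimental::capture{results});
}
```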
I'm also still concerned that the semantic difference between host objects and buffers will be confusing to people (one is local to each worker, the other distributed). This difference existed before, but is now exacerbated by the fact that capturing a buffer returns the same data on all workers, whereas capturing a host object does (potentially) not.
examples/matmul/matmul.cc (outdated)

queue.slow_full_sync(); // Wait for verification_passed to become available

return verification_passed ? EXIT_SUCCESS : EXIT_FAILURE;
auto mat_a_dump = queue.drain(celerity::experimental::capture{mat_a_buf});
I realize you want to showcase buffer draining somewhere, but is this the right place? Before, we did a distributed verification; now every node has to verify the full matrix, which is not ideal imo.
The "syncing" example might be better suited for illustrating this!
I've returned to distributed verification, but replaced the ref-capture / sync with a drain(capture(host_object)). I feel like the syncing example was made to demonstrate the old ref-capture + slow_full_sync workaround. Maybe we should drop this example altogether.
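For context, a rough sketch of the two variants being compared, with the host-task bodies elided; celerity::allow_by_ref and the drain/capture calls come from this PR, while the host_object constructor and the surrounding code are assumptions:

```cpp
// Old workaround (what the syncing example demonstrates): a reference capture
// plus an explicit synchronization point before the result may be read.
bool verification_passed = true;
queue.submit(celerity::allow_by_ref, [&](celerity::handler& cgh) {
	// ... host task verifies the locally owned part of the result buffer and
	//     writes the outcome into the captured reference ...
});
queue.slow_full_sync(); // verification_passed is only safe to read after this

// Replacement used here: keep the per-node result in a host object and capture
// it when the queue is drained.
celerity::experimental::host_object<bool> passed_obj{true};
queue.submit([=](celerity::handler& cgh) {
	// ... host task verifies the local part and updates passed_obj ...
});
const bool passed = queue.drain(celerity::experimental::capture{passed_obj});
```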
I like it! Agreed on the syncing example.
if(tsk->get_epoch_action() == epoch_action::barrier) {
	// Wait for the main thread to call resume_after_barrier. This ensures that no commands can be executed before all captures have been exfiltrated.
	// It is not sufficient for the main thread to stop submitting work, since e.g. AWAIT PUSH commands do not depend on a task definition.
What is the concern here? That an await push writes into a buffer while it is being exfiltrated [I don't think that can happen atm]? Or simply to avoid concurrent access to the buffer manager?
As soon as the master node clears the barrier, it can start submitting commands to workers. If such a command does not depend on a task definition (e.g. await push), the worker's executor will start processing it right away, leading to a race between the capture-read on that node and the await-push-write.
Ahh okay I think I get it now. I think I was confused because I thought there should be an anti-dependency from an await push onto the exfiltrating epoch that prevents this race... And there is, but that's not enough because the exfiltration happens in the main thread, not the epoch worker job. So we need a way for that job to wait until the main thread is done exfiltrating. Correct?
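To summarize the handshake in code: a condensed, purely illustrative sketch of the pattern described above. Only epoch_action::barrier, get_epoch_action and resume_after_barrier appear in this PR; the stand-in types, member names and locking scheme below are invented for illustration:

```cpp
#include <condition_variable>
#include <mutex>

// Stand-ins for the real Celerity types referenced in the diff.
enum class epoch_action { none, barrier };
struct task {
	epoch_action get_epoch_action() const { return action; }
	epoch_action action = epoch_action::none;
};

class executor {
  public:
	// Called by the main thread once it has finished exfiltrating all captures.
	void resume_after_barrier() {
		{
			std::lock_guard lock{m_mutex};
			m_barrier_cleared = true;
		}
		m_resume.notify_one();
	}

	// Invoked by the executor thread when it reaches an epoch task.
	void handle_epoch(const task* tsk) {
		if(tsk->get_epoch_action() == epoch_action::barrier) {
			// Block the executor here: commands without a task definition
			// (e.g. AWAIT PUSH) could otherwise start writing into a buffer
			// that the main thread is still reading out.
			std::unique_lock lock{m_mutex};
			m_resume.wait(lock, [this] { return m_barrier_cleared; });
		}
	}

  private:
	std::mutex m_mutex;
	std::condition_variable m_resume;
	bool m_barrier_cleared = false;
};
```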
Force-pushed bd5d826 to 37f29e3.
I have replaced the "universal" initializer syntax with the
It is about the buffer subranges, and also to clarify the meaning of arguments to
IMO this is a pretty common term for "waiting for things to finish" (e.g. SLURM uses it to mark nodes that are not accepting more work in order to shut down).
I agree that the distinction is subtle and can be confusing. It also does not serve the use case of capturing a non-copyable object at a slow_full_sync.
One thing that's "pretty" about the current solution is that a capture-on-drain does not introduce an additional shutdown-epoch after the capture-epoch. This is going to be pretty irrelevant from a performance perspective, though.
I still agree on this, although I would prefer to revisit host objects after merging this (we're talking about experimental features anyway).
I still agree on this, although I would prefer to revisit host objects after merging this (we're talking about experimental features anyway).
Fine with me!
Thanks for addressing all remarks. Also, I like the new buffer_snapshot name!
Offline discussion results:
Superseded by #151.
Currently, there is no native way to get results out of a Celerity buffer and back into the main thread at the end of a Celerity program. The current workaround is celerity::allow_by_ref and reference captures in host tasks, which is inelegant and error-prone.

SYCL offers host_accessor and explicit copy operations together with awaitable kernel events for that purpose. This is not a good fit for Celerity, since stalling the main thread is orders of magnitude more expensive in the distributed case.

This PR introduces Captures, a declarative API that allows requesting host objects and buffer subranges from distr_queue::slow_full_sync() and from runtime shutdown via the new distr_queue::drain() function. Specifying a capture adds the necessary dependencies and data transfers to the associated epoch command and copies or moves the data out to the calling main thread once the epoch is reached, confining stalls to APIs which the user already expects to be slow.

Example
Drains the queue on program exit to receive a verification result in the main thread.
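The example itself is not reproduced in this excerpt; a minimal sketch of what such a program could look like, assuming the experimental capture/host_object/drain API described above (the host_object constructor and drain return type are assumptions):

```cpp
#include <celerity.h>

#include <cstdlib>

int main() {
	celerity::distr_queue queue;
	celerity::experimental::host_object<bool> verification_passed{true};

	// ... submit device kernels, then a host task that checks the local results
	//     and records the outcome in the host object ...

	// drain() is the final operation on the queue: the capture adds the required
	// dependencies to the shutdown epoch and moves the result to the main thread.
	const bool passed = queue.drain(celerity::experimental::capture{verification_passed});

	return passed ? EXIT_SUCCESS : EXIT_FAILURE;
}
```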