Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SYCL] Specify a modifiable graph as having device specific nodes #83

Merged
merged 7 commits into from
Mar 24, 2023
Merged
Changes from 3 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
104 changes: 76 additions & 28 deletions sycl/doc/extensions/proposed/sycl_ext_oneapi_graph.asciidoc
Original file line number Diff line number Diff line change
Expand Up @@ -139,9 +139,6 @@ Another reason for deferring a serialize/deserialize API (8) is that its scope
could extend from emitting the graph in a binary format, to emitting a
standardized IR format that enables further device specific graph optimizations.

Multi-device support (7) is something we are looking into introducing into
the extension, which may result in API changes.

=== Graph Building Mechanisms

This extension contains two different API mechanisms for constructing a graph
Expand Down Expand Up @@ -207,7 +204,7 @@ Table {counter: tableNumber}. Terminology.
(edges), represented by the `command_graph` class.

| Node
| A command, which can have different attributes.
| A command, which can have different attributes, targeting a specific device.

| Edge
| Dependency between commands as a happens-before relationship.
Expand Down Expand Up @@ -253,15 +250,15 @@ Table {counter: tableNumber}. Recorded Graph Definition.
| Concept | Description

| Node
| Nodes in a queue recorded graph represent each of the command group
submissions of the program. Each submission encompasses either one or both of
a.) some data movement, b.) a single asynchronous kernel launch. Nodes cannot
define forward edges, only backwards. That is, kernels can only create
dependencies on command-groups that have already been submitted. This means that
transparently a node can depend on a previously recorded graph (sub-graph),
which works by creating edges to the individual nodes in the old graph. Explicit
memory operations without kernels, such as a memory copy, are still classed as
nodes under this definition, as the
| A node in a queue recorded graph represents a command group submission to the
device associated with the queue begin recorded. Each submission encompasses
either one or both of a.) some data movement, b.) a single asynchronous kernel
launch. Nodes cannot define forward edges, only backwards. That is, kernels can
only create dependencies on command-groups that have already been submitted.
This means that transparently a node can depend on a previously recorded graph
(sub-graph), which works by creating edges to the individual nodes in the old
graph. Explicit memory operations without kernels, such as a memory copy, are
still classed as nodes under this definition, as the
{explicit-memory-ops}[SYCL 2020 specification states] that these can be seen as
specialized kernels executing on the device.

Expand Down Expand Up @@ -339,7 +336,7 @@ public:
node add(const property_list& propList = {});
EwanC marked this conversation as resolved.
Show resolved Hide resolved

template<typename T>
node add(T cgf, const property_list& propList = {});
node add(const device& syclDevice, T cgf, const property_list& propList = {});

void make_edge(node src, node dest);
};
Expand Down Expand Up @@ -418,7 +415,8 @@ This extension adds a new `command_graph` object which follows the
{crs}[common reference semantics] of other SYCL runtime objects.

A `command_graph` represents a directed acyclic graph of nodes, where each node
represents a single command or a sub-graph. The execution of a graph completes
represents a single command for a specific device or a sub-graph. A graph may be
made up of nodes targeting different devices. The execution of a graph completes
when all its nodes have completed.

A `command_graph` is built up by either recording queue submissions or
Expand Down Expand Up @@ -461,7 +459,8 @@ using the `command_graph::update()` method. This takes a graph in the
modifiable state and updates the executable graph to use the node input &
outputs of the modifiable graph, a technique called _Whole Graph Update_. The
modifiable graph must have the same topology as the graph originally used to
create the executable graphs, with the nodes added in the same order.
create the executable graphs, with the nodes targeting the same devices and
added in the same order.

==== Graph Member Functions

Expand Down Expand Up @@ -530,7 +529,7 @@ Exceptions:
----
using namespace ext::oneapi::experimental;
template<typename T>
node add(T cgf, const property_list& propList = {});
node add(const device& syclDevice, T cgf, const property_list& propList = {});
----
|This function adds a command group function object to a graph. The function
object statically contains a group of commands, of which a single command is
Expand All @@ -546,6 +545,13 @@ Preconditions:

Parameters:

* `syclDevice` - Device to process `cgf` with to create node command-group.
Commands are not required to execute on this device if the behaviour is
consistent with how they would not execute on the device associated with a
`sycl::queue` during regular queue submission. In particular host tasks,
sub-graphs with nodes targeting other devices, and memory copies may not
execute on `syclDevice`.

* `cgf` - Command group function object to be added as a node.

* `propList` - Zero or more properties can be provided to the constructed node
Expand Down Expand Up @@ -625,6 +631,9 @@ Exceptions:
A cycle may be introduced to the graph via a call to `make_edge()` that
creates a forward dependency.

* Throws synchronously with error code `invalid` if the graph contains a
node which targets a device not present in `syclContext`.

|===

Table {counter: tableNumber}. Member functions of the `command_graph` class for queue recording.
Expand Down Expand Up @@ -939,6 +948,11 @@ Parameters:

* `graph` - Graph object to execute.

Exceptions:

* Throws synchronously with error code `invalid` if the handler is submitted
to a queue which doesn't have a SYCL context which matches the context of
the executable graph.
|===

=== Thread Safety
Expand Down Expand Up @@ -980,6 +994,10 @@ Synchronous exception errors codes are defined for all of
`command_graph::begin_recording()`, `command_graph::end_recording()` and
`command_graph::update()`.

Submitting an executable graph using `handler::ext_oneapi_graph()` to
a queue with a different SYCL context than that of the executable graph will
result in a synchronous exception.

When a queue is in recording mode asynchronous exceptions will not be
generated, as no device execution is occurring. Synchronous errors specified as
being thrown in the default queue executing state, will still be thrown when a
Expand Down Expand Up @@ -1063,7 +1081,7 @@ evaluated as normal during command graph execution.
[source,c++]
----
using namespace ext::oneapi::experimental;
auto node = graph.add([&](sycl::handler& cgh){
auto node = graph.add([&](device, sycl::handler& cgh){
EwanC marked this conversation as resolved.
Show resolved Hide resolved
// Host code here is evaluated during the call to add()
cgh.host_task([=](){
// Code here is evaluated as part of executing the command graph node
Expand Down Expand Up @@ -1109,6 +1127,41 @@ are the advantage of USM.

Design 3) needs specific backend support for deferred allocation.

=== Multi-device Graph

A user can compose a graph with nodes targeting different devices. This allows
the benefits of defining an execution graph ahead of submission to be extended
to multi-device platforms. Without this capability a user would have to submit
individual single-device graphs and use events for dependencies, which is a
usage model this extension is aiming to optimize. Automatic load balancing of
commands across devices is not a problem this extension currently aims to solve,
it is the responsibility of the user to decide the device each command will be
be processed for, not the SYCL runtime.

A modifiable state `command_graph` contains nodes targeting specific devices,
rather than being a device agnostic representation only tied to devices on
finalization. This allows the implementation to process nodes which require
device information when the command group function is evaluated. For example,
a SYCL reduction implementation may desire the work-group/sub-group size, which
is normally gathered by the runtime from the device associated with the queue.

A drawback of this design is that it is less convenient for the use-case where
a user would like to run the same graph on N devices. Rather than finalizing a
single modifiable graph N times for N devices in this scenario, the user now has
to record N modifiable graphs and then as finalize each of them. If this use
case does become a usability issue, we could provide a specific API to support
it. For example an update API for the modifiable `command_graph`, similar to
EwanC marked this conversation as resolved.
Show resolved Hide resolved
parameters, so that the same modifiable graph can be updated to new devices then
re-finalized. There may need to be limitations on what devices a graph can be
updated to however, as device specific processing might not be possible to roll
back or defer. As such, it may be only possible to update to identical physical
devices, rather than re-target a GPU constructed graph to a CPU.

The queue an executable graph is submitted to needs to have the same context as
the executable graph, however execution of a multi-device graph is not limited
to the device associated to the queue. Instead, the queue is used as the
mechanism to submit the graph and express execution dependencies.

== Examples

[NOTE]
Expand Down Expand Up @@ -1138,6 +1191,7 @@ int main() {
float gamma = 3.0f;

sycl::queue q;
sycl::device device = q.get_device();

sycl_ext::command_graph g;

Expand All @@ -1155,7 +1209,7 @@ int main() {
// c

/* init data on the device */
auto node_i = g.add([&](sycl::handler& h) {
auto node_i = g.add(device, [&](sycl::handler& h) {
h.parallel_for(n, [=](sycl::id<1> it){
const size_t i = it[0];
x[i] = 1.0f;
Expand All @@ -1164,21 +1218,21 @@ int main() {
});
});

auto node_a = g.add([&](sycl::handler& h) {
auto node_a = g.add(device, [&](sycl::handler& h) {
h.parallel_for(sycl::range<1>{n}, [=](sycl::id<1> it) {
const size_t i = it[0];
x[i] = alpha * x[i] + beta * y[i];
});
}, { sycl_ext::property::node::depends_on(node_i)});

auto node_b = g.add([&](sycl::handler& h) {
auto node_b = g.add(device, [&](sycl::handler& h) {
h.parallel_for(sycl::range<1>{n}, [=](sycl::id<1> it) {
const size_t i = it[0];
z[i] = gamma * z[i] + beta * y[i];
});
}, { sycl_ext::property::node::depends_on(node_i)});

auto node_c = g.add(
auto node_c = g.add(device,
[&](sycl::handler& h) {
h.parallel_for(sycl::range<1>{n},
sycl::reduction(dotp, 0.0f, std::plus()),
Expand Down Expand Up @@ -1292,12 +1346,6 @@ submitted in its entirety for execution via

== Issues

=== Multi Device Graph

Allow an executable graph to contain nodes targeting different devices.

**Outcome:** Under consideration

=== Memory Allocation API

We would like to provide an API that allows graph scope memory to be
Expand Down