Skip to content

Commit

Permalink
[SYCL] Specify a modifiable graph as having device specific nodes
Browse files Browse the repository at this point in the history
Modifies the API and definition of a command_graph to define a modifiable command_graph
as having nodes targeting specific devices, rather than being device-agnostic. This change
will allow a multi-device graph in the future by only removing some error conditions.

To enable this, the explicit API has a `syclDevice` parameter introduced to `add()` for use as
the device to process `cgf`.

See #7 for discussion.


Co-authored-by: Pablo Reble <pablo.reble@intel.com>
  • Loading branch information
EwanC and reble authored Mar 24, 2023
1 parent 28acfa8 commit 6192647
Showing 1 changed file with 99 additions and 25 deletions.
124 changes: 99 additions & 25 deletions sycl/doc/extensions/proposed/sycl_ext_oneapi_graph.asciidoc
Original file line number Diff line number Diff line change
Expand Up @@ -139,8 +139,11 @@ Another reason for deferring a serialize/deserialize API (8) is that its scope
could extend from emitting the graph in a binary format, to emitting a
standardized IR format that enables further device specific graph optimizations.

Multi-device support (7) is something we are looking into introducing into
the extension, which may result in API changes.
Multi-device support (7) is something that has been designed for in this
extension, with the definition of a graph node being device specific. However,
the ability for a user to define a single graph with nodes targeting different
devices is still disallowed until such a time as this feature can be backed up
by implementation coverage.

=== Graph Building Mechanisms

Expand Down Expand Up @@ -207,7 +210,7 @@ Table {counter: tableNumber}. Terminology.
(edges), represented by the `command_graph` class.

| Node
| A command, which can have different attributes.
| A command, which can have different attributes, targeting a specific device.

| Edge
| Dependency between commands as a happens-before relationship.
Expand Down Expand Up @@ -253,15 +256,15 @@ Table {counter: tableNumber}. Recorded Graph Definition.
| Concept | Description

| Node
| Nodes in a queue recorded graph represent each of the command group
submissions of the program. Each submission encompasses either one or both of
a.) some data movement, b.) a single asynchronous kernel launch. Nodes cannot
define forward edges, only backwards. That is, kernels can only create
dependencies on command-groups that have already been submitted. This means that
transparently a node can depend on a previously recorded graph (sub-graph),
which works by creating edges to the individual nodes in the old graph. Explicit
memory operations without kernels, such as a memory copy, are still classed as
nodes under this definition, as the
| A node in a queue recorded graph represents a command group submission to the
device associated with the queue begin recorded. Each submission encompasses
either one or both of a.) some data movement, b.) a single asynchronous kernel
launch. Nodes cannot define forward edges, only backwards. That is, kernels can
only create dependencies on command-groups that have already been submitted.
This means that transparently a node can depend on a previously recorded graph
(sub-graph), which works by creating edges to the individual nodes in the old
graph. Explicit memory operations without kernels, such as a memory copy, are
still classed as nodes under this definition, as the
{explicit-memory-ops}[SYCL 2020 specification states] that these can be seen as
specialized kernels executing on the device.

Expand Down Expand Up @@ -339,7 +342,7 @@ public:
node add(const property_list& propList = {});
template<typename T>
node add(T cgf, const property_list& propList = {});
node add(const device& syclDevice, T cgf, const property_list& propList = {});
void make_edge(node src, node dest);
};
Expand Down Expand Up @@ -418,7 +421,8 @@ This extension adds a new `command_graph` object which follows the
{crs}[common reference semantics] of other SYCL runtime objects.

A `command_graph` represents a directed acyclic graph of nodes, where each node
represents a single command or a sub-graph. The execution of a graph completes
represents a single command for a specific device or a sub-graph. A graph may be
made up of nodes targeting different devices. The execution of a graph completes
when all its nodes have completed.

A `command_graph` is built up by either recording queue submissions or
Expand Down Expand Up @@ -461,7 +465,8 @@ using the `command_graph::update()` method. This takes a graph in the
modifiable state and updates the executable graph to use the node input &
outputs of the modifiable graph, a technique called _Whole Graph Update_. The
modifiable graph must have the same topology as the graph originally used to
create the executable graphs, with the nodes added in the same order.
create the executable graphs, with the nodes targeting the same devices and
added in the same order.

==== Graph Member Functions

Expand Down Expand Up @@ -530,7 +535,7 @@ Exceptions:
----
using namespace ext::oneapi::experimental;
template<typename T>
node add(T cgf, const property_list& propList = {});
node add(const device& syclDevice, T cgf, const property_list& propList = {});
----
|This function adds a command group function object to a graph. The function
object statically contains a group of commands, of which a single command is
Expand All @@ -546,6 +551,13 @@ Preconditions:

Parameters:

* `syclDevice` - Device to process `cgf` with to create node command-group.
Commands are not required to execute on this device if the behaviour is
consistent with how they would not execute on the device associated with a
`sycl::queue` during regular queue submission. In particular host tasks,
sub-graphs with nodes targeting other devices, and memory copies may not
execute on `syclDevice`.

* `cgf` - Command group function object to be added as a node.

* `propList` - Zero or more properties can be provided to the constructed node
Expand All @@ -558,6 +570,9 @@ Exceptions:
* Throws synchronously with error code `invalid` if a queue is recording
commands to the graph.

* Throws synchronously with error code `invalid` if `syclDevice` is a different
device from the device targeted by the existing nodes of the graph.

|
[source,c++]
----
Expand Down Expand Up @@ -625,6 +640,9 @@ Exceptions:
A cycle may be introduced to the graph via a call to `make_edge()` that
creates a forward dependency.

* Throws synchronously with error code `invalid` if the graph contains a
node which targets a device not present in `syclContext`.

|===

Table {counter: tableNumber}. Member functions of the `command_graph` class for queue recording.
Expand Down Expand Up @@ -656,6 +674,10 @@ Exceptions:
* Throws synchronously with error code `invalid` if `recordingQueue` is
already recording to a different graph.

* Throws synchronously with error code `invalid` if `recordingQueue` is
associated with a device which is different from the device already targeted
by the nodes of the graph.

|
[source, c++]
----
Expand Down Expand Up @@ -939,6 +961,11 @@ Parameters:

* `graph` - Graph object to execute.

Exceptions:

* Throws synchronously with error code `invalid` if the handler is submitted
to a queue which doesn't have a SYCL context which matches the context of
the executable graph.
|===

=== Thread Safety
Expand Down Expand Up @@ -977,13 +1004,20 @@ Errors are reported through exceptions, as usual in the SYCL API. For new APIs,
submitting a graph for execution can generate unspecified asynchronous errors,
while `command_graph::finalize()` may throw unspecified synchronous exceptions.
Synchronous exception errors codes are defined for all of
`command_graph::begin_recording()`, `command_graph::end_recording()` and
`command_graph::update()`.
`command_graph::add()`, `command_graph::make_edge()`, `command_graph::update()`,
`command_graph::begin_recording()`, and `command_graph::end_recording()`.

Submitting an executable graph using `handler::ext_oneapi_graph()` to
a queue with a different SYCL context than that of the executable graph will
result in a synchronous exception.

When a queue is in recording mode asynchronous exceptions will not be
generated, as no device execution is occurring. Synchronous errors specified as
being thrown in the default queue executing state, will still be thrown when a
queue is in the recording state.
queue is in the recording state. A synchronous error with error code `invalid`
will also be thrown if a queue in recording mode tries to record a command to a
graph which already has nodes that target a device different from the device
associated with the recording queue.

The `command_graph::begin_recording` and `command_graph::end_recording`
entry-points return a `bool` value informing the user whether a related queue
Expand Down Expand Up @@ -1063,7 +1097,7 @@ evaluated as normal during command graph execution.
[source,c++]
----
using namespace ext::oneapi::experimental;
auto node = graph.add([&](sycl::handler& cgh){
auto node = graph.add(device, [&](sycl::handler& cgh){
// Host code here is evaluated during the call to add()
cgh.host_task([=](){
// Code here is evaluated as part of executing the command graph node
Expand Down Expand Up @@ -1109,6 +1143,42 @@ are the advantage of USM.

Design 3) needs specific backend support for deferred allocation.

=== Device Specific Graph

A modifiable state `command_graph` contains nodes targeting specific devices,
rather than being a device agnostic representation only tied to devices on
finalization. This allows the implementation to process nodes which require
device information when the command group function is evaluated. For example,
a SYCL reduction implementation may desire the work-group/sub-group size, which
is normally gathered by the runtime from the device associated with the queue.

This design also enables the future capability for a user to compose a graph
with nodes targeting different devices, allowing the benefits of defining an
execution graph ahead of submission to be extended to multi-device platforms.
Without this capability a user currently has to submit individual single-device
graphs and use events for dependencies, which is a usage model this extension is
aiming to optimize. Automatic load balancing of commands across devices is not a
problem this extension currently aims to solve, it is the responsibility of the
user to decide the device each command will be processed for, not the SYCL
runtime.

A drawback of this design is that it is less convenient for the use-case where
a user would like to run the same graph on N devices. Rather than finalizing a
single modifiable graph N times for N devices in this scenario, the user now has
to record N modifiable graphs and then as finalize each of them. If this use
case does become a usability issue, we could provide a specific API to support
it. For example, an update API for the modifiable `command_graph`, similar to
parameters, so that the same modifiable graph can be updated to new devices then
re-finalized. There may need to be limitations on what devices a graph can be
updated to however, as device specific processing might not be possible to roll
back or defer. As such, it may be only possible to update to identical physical
devices, rather than re-target a GPU constructed graph to a CPU.

The queue an executable graph is submitted to needs to have the same context as
the executable graph, however execution of a multi-device graph is not limited
to the device associated to the queue. Instead, the queue is used as the
mechanism to submit the graph and express execution dependencies.

== Examples

[NOTE]
Expand Down Expand Up @@ -1138,6 +1208,7 @@ int main() {
float gamma = 3.0f;
sycl::queue q;
sycl::device device = q.get_device();
sycl_ext::command_graph g;
Expand All @@ -1155,7 +1226,7 @@ int main() {
// c
/* init data on the device */
auto node_i = g.add([&](sycl::handler& h) {
auto node_i = g.add(device, [&](sycl::handler& h) {
h.parallel_for(n, [=](sycl::id<1> it){
const size_t i = it[0];
x[i] = 1.0f;
Expand All @@ -1164,21 +1235,21 @@ int main() {
});
});
auto node_a = g.add([&](sycl::handler& h) {
auto node_a = g.add(device, [&](sycl::handler& h) {
h.parallel_for(sycl::range<1>{n}, [=](sycl::id<1> it) {
const size_t i = it[0];
x[i] = alpha * x[i] + beta * y[i];
});
}, { sycl_ext::property::node::depends_on(node_i)});
auto node_b = g.add([&](sycl::handler& h) {
auto node_b = g.add(device, [&](sycl::handler& h) {
h.parallel_for(sycl::range<1>{n}, [=](sycl::id<1> it) {
const size_t i = it[0];
z[i] = gamma * z[i] + beta * y[i];
});
}, { sycl_ext::property::node::depends_on(node_i)});
auto node_c = g.add(
auto node_c = g.add(device,
[&](sycl::handler& h) {
h.parallel_for(sycl::range<1>{n},
sycl::reduction(dotp, 0.0f, std::plus()),
Expand Down Expand Up @@ -1296,7 +1367,10 @@ submitted in its entirety for execution via

Allow an executable graph to contain nodes targeting different devices.

**Outcome:** Under consideration
**Outcome:** This feature has been designed for with the definition of a graph
node being device specific. However, the ability for a user to define a single
graph with nodes targeting different devices is still disallowed until such a
time as this feature can be backed up by implementation coverage.

=== Memory Allocation API

Expand Down

0 comments on commit 6192647

Please sign in to comment.