From 61926472cacd0788e62bc61ce5afb3fa4f091f38 Mon Sep 17 00:00:00 2001 From: Ewan Crawford Date: Fri, 24 Mar 2023 10:23:05 +0000 Subject: [PATCH] [SYCL] Specify a modifiable graph as having device specific nodes Modifies the API and definition of a command_graph to define a modifiable command_graph as having nodes targeting specific devices, rather than being device-agnostic. This change will allow a multi-device graph in the future by only removing some error conditions. To enable this, the explicit API has a `syclDevice` parameter introduced to `add()` for use as the device to process `cgf`. See https://github.com/reble/llvm/issues/7 for discussion. Co-authored-by: Pablo Reble --- .../proposed/sycl_ext_oneapi_graph.asciidoc | 124 ++++++++++++++---- 1 file changed, 99 insertions(+), 25 deletions(-) diff --git a/sycl/doc/extensions/proposed/sycl_ext_oneapi_graph.asciidoc b/sycl/doc/extensions/proposed/sycl_ext_oneapi_graph.asciidoc index aec359ec925ec..11906e10ab2aa 100644 --- a/sycl/doc/extensions/proposed/sycl_ext_oneapi_graph.asciidoc +++ b/sycl/doc/extensions/proposed/sycl_ext_oneapi_graph.asciidoc @@ -139,8 +139,11 @@ Another reason for deferring a serialize/deserialize API (8) is that its scope could extend from emitting the graph in a binary format, to emitting a standardized IR format that enables further device specific graph optimizations. -Multi-device support (7) is something we are looking into introducing into -the extension, which may result in API changes. +Multi-device support (7) is something that has been designed for in this +extension, with the definition of a graph node being device specific. However, +the ability for a user to define a single graph with nodes targeting different +devices is still disallowed until such a time as this feature can be backed up +by implementation coverage. === Graph Building Mechanisms @@ -207,7 +210,7 @@ Table {counter: tableNumber}. Terminology. (edges), represented by the `command_graph` class. | Node -| A command, which can have different attributes. +| A command, which can have different attributes, targeting a specific device. | Edge | Dependency between commands as a happens-before relationship. @@ -253,15 +256,15 @@ Table {counter: tableNumber}. Recorded Graph Definition. | Concept | Description | Node -| Nodes in a queue recorded graph represent each of the command group -submissions of the program. Each submission encompasses either one or both of -a.) some data movement, b.) a single asynchronous kernel launch. Nodes cannot -define forward edges, only backwards. That is, kernels can only create -dependencies on command-groups that have already been submitted. This means that -transparently a node can depend on a previously recorded graph (sub-graph), -which works by creating edges to the individual nodes in the old graph. Explicit -memory operations without kernels, such as a memory copy, are still classed as -nodes under this definition, as the +| A node in a queue recorded graph represents a command group submission to the +device associated with the queue begin recorded. Each submission encompasses +either one or both of a.) some data movement, b.) a single asynchronous kernel +launch. Nodes cannot define forward edges, only backwards. That is, kernels can +only create dependencies on command-groups that have already been submitted. +This means that transparently a node can depend on a previously recorded graph +(sub-graph), which works by creating edges to the individual nodes in the old +graph. Explicit memory operations without kernels, such as a memory copy, are +still classed as nodes under this definition, as the {explicit-memory-ops}[SYCL 2020 specification states] that these can be seen as specialized kernels executing on the device. @@ -339,7 +342,7 @@ public: node add(const property_list& propList = {}); template - node add(T cgf, const property_list& propList = {}); + node add(const device& syclDevice, T cgf, const property_list& propList = {}); void make_edge(node src, node dest); }; @@ -418,7 +421,8 @@ This extension adds a new `command_graph` object which follows the {crs}[common reference semantics] of other SYCL runtime objects. A `command_graph` represents a directed acyclic graph of nodes, where each node -represents a single command or a sub-graph. The execution of a graph completes +represents a single command for a specific device or a sub-graph. A graph may be +made up of nodes targeting different devices. The execution of a graph completes when all its nodes have completed. A `command_graph` is built up by either recording queue submissions or @@ -461,7 +465,8 @@ using the `command_graph::update()` method. This takes a graph in the modifiable state and updates the executable graph to use the node input & outputs of the modifiable graph, a technique called _Whole Graph Update_. The modifiable graph must have the same topology as the graph originally used to -create the executable graphs, with the nodes added in the same order. +create the executable graphs, with the nodes targeting the same devices and +added in the same order. ==== Graph Member Functions @@ -530,7 +535,7 @@ Exceptions: ---- using namespace ext::oneapi::experimental; template -node add(T cgf, const property_list& propList = {}); +node add(const device& syclDevice, T cgf, const property_list& propList = {}); ---- |This function adds a command group function object to a graph. The function object statically contains a group of commands, of which a single command is @@ -546,6 +551,13 @@ Preconditions: Parameters: +* `syclDevice` - Device to process `cgf` with to create node command-group. + Commands are not required to execute on this device if the behaviour is + consistent with how they would not execute on the device associated with a + `sycl::queue` during regular queue submission. In particular host tasks, + sub-graphs with nodes targeting other devices, and memory copies may not + execute on `syclDevice`. + * `cgf` - Command group function object to be added as a node. * `propList` - Zero or more properties can be provided to the constructed node @@ -558,6 +570,9 @@ Exceptions: * Throws synchronously with error code `invalid` if a queue is recording commands to the graph. +* Throws synchronously with error code `invalid` if `syclDevice` is a different + device from the device targeted by the existing nodes of the graph. + | [source,c++] ---- @@ -625,6 +640,9 @@ Exceptions: A cycle may be introduced to the graph via a call to `make_edge()` that creates a forward dependency. +* Throws synchronously with error code `invalid` if the graph contains a + node which targets a device not present in `syclContext`. + |=== Table {counter: tableNumber}. Member functions of the `command_graph` class for queue recording. @@ -656,6 +674,10 @@ Exceptions: * Throws synchronously with error code `invalid` if `recordingQueue` is already recording to a different graph. +* Throws synchronously with error code `invalid` if `recordingQueue` is + associated with a device which is different from the device already targeted + by the nodes of the graph. + | [source, c++] ---- @@ -939,6 +961,11 @@ Parameters: * `graph` - Graph object to execute. +Exceptions: + +* Throws synchronously with error code `invalid` if the handler is submitted + to a queue which doesn't have a SYCL context which matches the context of + the executable graph. |=== === Thread Safety @@ -977,13 +1004,20 @@ Errors are reported through exceptions, as usual in the SYCL API. For new APIs, submitting a graph for execution can generate unspecified asynchronous errors, while `command_graph::finalize()` may throw unspecified synchronous exceptions. Synchronous exception errors codes are defined for all of -`command_graph::begin_recording()`, `command_graph::end_recording()` and -`command_graph::update()`. +`command_graph::add()`, `command_graph::make_edge()`, `command_graph::update()`, +`command_graph::begin_recording()`, and `command_graph::end_recording()`. + +Submitting an executable graph using `handler::ext_oneapi_graph()` to +a queue with a different SYCL context than that of the executable graph will +result in a synchronous exception. When a queue is in recording mode asynchronous exceptions will not be generated, as no device execution is occurring. Synchronous errors specified as being thrown in the default queue executing state, will still be thrown when a -queue is in the recording state. +queue is in the recording state. A synchronous error with error code `invalid` +will also be thrown if a queue in recording mode tries to record a command to a +graph which already has nodes that target a device different from the device +associated with the recording queue. The `command_graph::begin_recording` and `command_graph::end_recording` entry-points return a `bool` value informing the user whether a related queue @@ -1063,7 +1097,7 @@ evaluated as normal during command graph execution. [source,c++] ---- using namespace ext::oneapi::experimental; -auto node = graph.add([&](sycl::handler& cgh){ +auto node = graph.add(device, [&](sycl::handler& cgh){ // Host code here is evaluated during the call to add() cgh.host_task([=](){ // Code here is evaluated as part of executing the command graph node @@ -1109,6 +1143,42 @@ are the advantage of USM. Design 3) needs specific backend support for deferred allocation. +=== Device Specific Graph + +A modifiable state `command_graph` contains nodes targeting specific devices, +rather than being a device agnostic representation only tied to devices on +finalization. This allows the implementation to process nodes which require +device information when the command group function is evaluated. For example, +a SYCL reduction implementation may desire the work-group/sub-group size, which +is normally gathered by the runtime from the device associated with the queue. + +This design also enables the future capability for a user to compose a graph +with nodes targeting different devices, allowing the benefits of defining an +execution graph ahead of submission to be extended to multi-device platforms. +Without this capability a user currently has to submit individual single-device +graphs and use events for dependencies, which is a usage model this extension is +aiming to optimize. Automatic load balancing of commands across devices is not a +problem this extension currently aims to solve, it is the responsibility of the +user to decide the device each command will be processed for, not the SYCL +runtime. + +A drawback of this design is that it is less convenient for the use-case where +a user would like to run the same graph on N devices. Rather than finalizing a +single modifiable graph N times for N devices in this scenario, the user now has +to record N modifiable graphs and then as finalize each of them. If this use +case does become a usability issue, we could provide a specific API to support +it. For example, an update API for the modifiable `command_graph`, similar to +parameters, so that the same modifiable graph can be updated to new devices then +re-finalized. There may need to be limitations on what devices a graph can be +updated to however, as device specific processing might not be possible to roll +back or defer. As such, it may be only possible to update to identical physical +devices, rather than re-target a GPU constructed graph to a CPU. + +The queue an executable graph is submitted to needs to have the same context as +the executable graph, however execution of a multi-device graph is not limited +to the device associated to the queue. Instead, the queue is used as the +mechanism to submit the graph and express execution dependencies. + == Examples [NOTE] @@ -1138,6 +1208,7 @@ int main() { float gamma = 3.0f; sycl::queue q; + sycl::device device = q.get_device(); sycl_ext::command_graph g; @@ -1155,7 +1226,7 @@ int main() { // c /* init data on the device */ - auto node_i = g.add([&](sycl::handler& h) { + auto node_i = g.add(device, [&](sycl::handler& h) { h.parallel_for(n, [=](sycl::id<1> it){ const size_t i = it[0]; x[i] = 1.0f; @@ -1164,21 +1235,21 @@ int main() { }); }); - auto node_a = g.add([&](sycl::handler& h) { + auto node_a = g.add(device, [&](sycl::handler& h) { h.parallel_for(sycl::range<1>{n}, [=](sycl::id<1> it) { const size_t i = it[0]; x[i] = alpha * x[i] + beta * y[i]; }); }, { sycl_ext::property::node::depends_on(node_i)}); - auto node_b = g.add([&](sycl::handler& h) { + auto node_b = g.add(device, [&](sycl::handler& h) { h.parallel_for(sycl::range<1>{n}, [=](sycl::id<1> it) { const size_t i = it[0]; z[i] = gamma * z[i] + beta * y[i]; }); }, { sycl_ext::property::node::depends_on(node_i)}); - auto node_c = g.add( + auto node_c = g.add(device, [&](sycl::handler& h) { h.parallel_for(sycl::range<1>{n}, sycl::reduction(dotp, 0.0f, std::plus()), @@ -1296,7 +1367,10 @@ submitted in its entirety for execution via Allow an executable graph to contain nodes targeting different devices. -**Outcome:** Under consideration +**Outcome:** This feature has been designed for with the definition of a graph +node being device specific. However, the ability for a user to define a single +graph with nodes targeting different devices is still disallowed until such a +time as this feature can be backed up by implementation coverage. === Memory Allocation API