Consider removing MLActivation parameters used for op fusion #658

Closed
a-sully opened this issue Apr 27, 2024 · 6 comments · Fixed by #664
@a-sully
Contributor

a-sully commented Apr 27, 2024

MLActivation currently has two uses:

  1. With operators like batchNormalization and conv2d, as a hint that this activation may be fusable with the given operator
  2. With recurrent operators like lstm and gru, as activations applied to gates within the given recurrent unit
| Operator | Recurrent? | Purpose |
| --- | --- | --- |
| batchNormalization | No | Maybe fusable |
| conv2d | No | Maybe fusable |
| convTranspose2d | No | Maybe fusable |
| gru | Yes | Applied to gate |
| gruCell | Yes | Applied to gate |
| lstm | Yes | Applied to gate |
| lstmCell | Yes | Applied to gate |
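
For concreteness, here is a minimal sketch of the two usages, written against the WebNN IDL as it stood when this issue was filed (the `activation`/`activations` option names and the no-argument activation builders are from that draft; the shapes and the `context`, `filterData`, `x`, `weight`, `recurrentWeight`, `steps`, and `hiddenSize` values are illustrative placeholders):

```js
// Sketch only: `context` is an MLContext, and the other free names below are
// placeholders rather than real data.
const builder = new MLGraphBuilder(context);

// Use (1): a fusion hint. conv2d takes an optional MLActivation.
const input = builder.input('input',
    {dataType: 'float32', dimensions: [1, 3, 224, 224]});
const filter = builder.constant(
    {dataType: 'float32', dimensions: [32, 3, 3, 3]}, filterData);
const conv = builder.conv2d(input, filter, {activation: builder.relu()});

// Use (2): activations applied to the gates inside a recurrent unit.
const outputs = builder.gru(x, weight, recurrentWeight, steps, hiddenSize,
    {activations: [builder.sigmoid(), builder.tanh()]});
```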

I have some thoughts on (2), but I'll leave that to another issue. Let's focus on (1) here :)

My claim is that there's no benefit to passing MLActivations as parameters to MLOperands for the purpose of op fusion. Here's why:

  1. False positives: For any given permutation of operator, activation, and backend, op fusion may not be possible (in fact, more often than not it isn't)
    • CoreML does not natively support fusing activations with any of these operators
    • DirectML only supports fusing some activations with some operators. There's an existing Chromium bug to un-fuse MLActivations from their respective MLOperand if the combo is not supported for the given version of DML
  2. False negatives: For any given operator which does not currently take an MLActivation in the WebNN spec, it may in fact be fusible with its input or output

What this means in practice is:

  • Implementations wrapping backends which can't fuse operators, or can't fuse some operators with some activations, must trivially break apart the MLOperand and its MLActivation into what's effectively just two MLOperands connected to each other (as Chromium's CoreML backend currently does):
       input
         |
      operator
         |
     activation
         |
       output
    
  • Implementations wrapping backends which can fuse operators must do an optimization pass anyway to fuse operators which do not have MLActivation parameters in WebNN (as Chromium's DML backend currently does)

Whether a given operator can be fused with a given activation is a very backend-specific quirk. Presumably we don't want to plumb a new MLActivation parameter through to the web for every operator which any backend decides it can now fuse with some activation! This seems best left as an implementation detail, handled either by the user agent (as described above) or by the framework (who knows how much op fusion is happening in Core ML under the hood?! 🤔)
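
To make the claim concrete (reusing the placeholder names from the sketch above, and still writing against the pre-#664 IDL), these two constructions express the same computation; the argument is that they should also perform the same, with any fusion left to the implementation rather than signalled through an MLActivation:

```js
// (a) Fusion-hint form, via the MLActivation parameter this issue proposes removing.
const fused = builder.conv2d(input, filter, {activation: builder.relu()});

// (b) Plain graph form: conv2d followed by relu as an ordinary operand,
// leaving any fusion to the backend's own optimization pass.
const unfused = builder.relu(builder.conv2d(input, filter));
```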

Thoughts?

chromium-wpt-export-bot pushed a commit to web-platform-tests/wpt that referenced this issue Apr 27, 2024
Note that ops are still being fused, unbeknownst to the web platform layer!

Spec discussions are happening here:
webmachinelearning/webnn#658
@fdwr
Collaborator

fdwr commented May 1, 2024

🤔 From reviewing so many code changes over the past year, these fields...

  • MLConv2dOptions::activation
  • MLConvTranspose2dOptions::activation
  • MLBatchNormalizationOptions::activation

...have indeed proven problematic in the following respects:

  • Front-end complexity - high-level WebNN clients need more complicated graph-node enumeration and mapping logic to fuse activation operators, and those fusions may or may not actually fuse under the hood to provide any perf advantage, depending on the particular backend used and the hardware driver that backend uses; meanwhile the fusion logic adds definite, repeated complexity to each caller compared to a more straightforward mapping between their operator currency and WebNN calls. Ideally there should be no performance difference (from the WebNN caller's POV) between calling {conv+relu} and {conv, relu}, as the backend is what knows best which ops are fusible and can fuse them when it calls the platform.
  • Back-end complexity - more backend platforms lack support for fused activations than have it, and if a platform doesn't support a combination, the backend must unfuse the pair and treat it as two separately issued operators anyway. Even within the same backend, different platform versions support different sets of activations; and a backend needing to support both fusion and unfusion internally is more burdensome than supporting only one direction - internal fusion of adjacent operators.
  • Test complexity - combinatorial test cases are needed for every base operator with every potential activation.
  • Spec validation issues - some operators (like softmax with the axis parameter) cannot be validated at creation time because it's not until they are combined with a base operator that the input shape is known.
  • Inconsistencies and incompleteness - some MLActivations are only valid with some operators: clamp, for example, was added as a stand-alone MLActivation usable with convolution but is invalid with lstm/lstmCell/gru/gruCell, which leads to user surprises and documentation that isn't straightforward. Conversely, some other operator combinations are actually fusible in the backend platform (like GEMM+softmax in DML) but are not expressible at the WebNN layer. Mingming can attest to the bugs and interesting considerations this has caused so far. Trying to come up with a complete "activation set" that satisfies this across all the platforms is untenable, given the divergences. There's also semantic awkwardness around what an "activation" function really is, as operators don't all fall into clear buckets - e.g. softmax is a normalization operator and clamp is an elementwise math operator, but both can also be activation functions (and that list will grow over time, something each backend can keep up with, relative to its underlying platform, better than the WebNN API can satisfy broadly).

This added complexity is in the service of performance, and that performance is worthwhile; but of the currently implemented backends, the one that benefits most from these fusions is Chromium's DirectML backend, and that backend already does a one-operator look-ahead anyway for simple adjacent fusions (like Conv+Relu), which means any WebNN caller can call either {conv, relu} or {conv+relu} and get the same performance benefit - no fusion logic required from the caller at the WebNN API level. So, I support removing the activation fields from MLConv2dOptions, MLConvTranspose2dOptions, and MLBatchNormalizationOptions, as the intended benefit still happens and it eases the implementation burden for the front end, back end, and conformance tests.
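
For reference, here is a minimal sketch of what such a one-operator look-ahead could look like. This is not the actual Chromium DirectML code; the node representation (`op`, `inputs`, `output`, `fusedActivation`) is hypothetical, and a real pass would also have to check that the intermediate operand isn't a graph output:

```js
// Hypothetical graph pass: walk nodes in topological order and, when a conv2d's
// only consumer is a relu, emit one fused node instead of two.
function fuseAdjacentActivations(nodes) {
  const consumed = new Set();
  const result = [];
  for (const node of nodes) {
    if (consumed.has(node)) continue;  // Already folded into an earlier node.
    const consumers = nodes.filter((n) => n.inputs.includes(node.output));
    if (node.op === 'conv2d' && consumers.length === 1 &&
        consumers[0].op === 'relu') {
      // Fold the relu into the conv2d and take over the relu's output binding.
      result.push({...node, fusedActivation: 'relu', output: consumers[0].output});
      consumed.add(consumers[0]);
    } else {
      result.push(node);
    }
  }
  return result;
}
```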

FYI @mingmingtasd.

(I also want to check with Chai and Rafael for their opinions...)

@a-sully
Contributor Author

a-sully commented May 1, 2024

Thanks for the support, @fdwr. I've marked #664 as ready for review!

@huningxin
Contributor

Thanks for the proposal @a-sully and the discussion @fdwr!

I think it is good to leave activation function fusion to the underlying graph optimizer (either in a Chromium backend, such as the DirectML backend, or in a native ML framework, such as XNNPACK's xnn_subgraph_fusion()). That would make the WebNN API clearer.

Previously, when WebNN was used as a target backend for frameworks such as ONNXRuntime Web and the TFLite Wasm runtime, WebNN was likely handed a convolution with a fused activation function, thanks to the framework's graph optimizer. A WebNN backend could leverage that and map it directly to a native convolution operator that supports a fused activation function. Now the WebNN backend needs to check whether the native framework fuses activation functions itself. If it doesn't, the WebNN backend should add an optimization pass and do the fusion itself in order to avoid a performance regression.

Would it be worth adding an implementation note?

@a-sully
Contributor Author

a-sully commented May 2, 2024

Thanks, @huningxin. I didn't mention anything about op fusion in #664 because there's already this text in the spec:

Before the execution, the computation graph that is used to compute one or more specified outputs needs to be compiled and optimized. The key purpose of the compilation step is to enable optimizations that span two or more operations, such as operation or loop fusion.

If you have more suggestions, please comment on that PR! :)

@huningxin
Contributor

huningxin commented May 3, 2024

@a-sully

I didn't mention anything about op fusion in #664 because there's already this text in the spec:

Before the execution, the computation graph that is used to compute one or more specified outputs needs to be compiled and optimized. The key purpose of the compilation step is to enable optimizations that span two or more operations, such as operation or loop fusion.

That text is good, but it is mostly about graph-compilation-time optimization, which is usually done by the native ML APIs. I think we won't want to miss the graph-construction-time optimization opportunity when it is natively supported, as the spec text mentioned for the previous MLActivation interface:

These activation function types are used to create other operations. One such use of this interface is for when an activation function is fused into another operation such as conv2d() or batchNormalization() during a graph construction session. Such fused activation functions can provide a significant performance improvement when supported natively by the underlying implementation. This is intended as an optimization opportunity for implementers.

The DirectML backend in Chromium exploits this op fusion opportunity when constructing the DirectML graph.

If you have more suggestions, please comment on that PR! :)

Will do!

@mingmingtasd
Contributor

  • Spec validation issues - some operators (like softmax with the axis parameter) cannot be validated at creation time because it's not until they are combined with a base operator that the input shape is known.

@fdwr Correct! For example, with softmax and its axis parameter: if the base operator's output shape is unknown, we can't fuse them.
