
Support CPU - WebAssembly scenario of the op level execution use case #156

Closed

huningxin opened this issue Mar 22, 2021 · 5 comments

Opening this issue to follow up on the operation-specific APIs discussion from the 3/18 WebML CG call. @pyu10055 @wchao1115 @anssiko @jbingham, please take a look.

Use case

This is one scenario of the framework op-level execution use case (more details can be found in the operation-specific API proposal). A JavaScript ML framework executes ops on the CPU device with WebAssembly. For compute-intensive ops, such as conv2d or matmul, the framework also wants to use the WebNN API to execute the op (as a single-op MLGraph) with ML-specific instructions, such as Vector Neural Network Instructions (VNNI), on the same CPU device.
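
For illustration, here is a minimal sketch of such a single-op MLGraph. It assumes the MLContextOptions device selection and the compute() entry point from earlier WebNN spec drafts; the exact method names and descriptor fields have changed across revisions, so treat this as a sketch of the flow rather than the current API:

// Sketch only: names follow earlier WebNN drafts and may not match the current spec.
const context = await navigator.ml.createContext({ deviceType: 'cpu' });
const builder = new MLGraphBuilder(context);

// A single-op graph wrapping one conv2d.
const input = builder.input('input', { dataType: 'float32', dimensions: [1, 1, 5, 5] });
const filter = builder.constant(
    { dataType: 'float32', dimensions: [1, 1, 3, 3] },
    new Float32Array(9).fill(1));
const output = builder.conv2d(input, filter);
const graph = await builder.build({ output });

// The framework keeps executing its other ops in WebAssembly and only calls
// into WebNN for the compute-intensive op.
const inputBuffer = new Float32Array(25).fill(1);
const outputBuffer = new Float32Array(9);
const results = await context.compute(graph, { input: inputBuffer }, { output: outputBuffer });
// results.outputs.output holds the conv2d result.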

Requirements

WebNN should allow frameworks to create an MLContext for the CPU device. This would avoid unnecessary data copying across devices when frameworks use WebAssembly on the CPU to execute the other ops.

WebNN should allow frameworks to control when the output data is available for access. This would avoid unnecessary tensor layout conversions between the native ML API and WebNN. Some background:

  • Some native ML APIs use hardware-dependent memory layouts for acceleration; for example, oneDNN uses different blocked memory layouts for better vectorization and cache reuse on different platforms.
  • The memory layout conversions are expensive.
  • Frameworks may use the WebNN API to execute multiple ops (via multiple single-op MLGraphs) without accessing the intermediate results between them.

For example, a user of TensorFlow.js may execute three conv2d ops but only access the output of the last one:

c = tf.conv2d(a, b);
e = tf.conv2d(c, d);
h = tf.conv2d(f, g);
output = await h.data();

A potential WebNN implementation would only need to do the memory layout conversion and put the data into an ArrayBufferView when h.data() is invoked.
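
Here is a rough sketch of that deferred-readback behavior. The class and method names are hypothetical (readBackToStandardLayout is not a WebNN or TF.js API, just a stand-in for the layout conversion and copy-out step):

class WebNNTensorHandle {
  constructor(nativeOutput) {
    this.nativeOutput = nativeOutput; // stays in the backend's native (possibly blocked) layout
    this.hostBuffer = null;           // standard-layout copy, created lazily
  }
  async data() {
    if (this.hostBuffer === null) {
      // Only here would the implementation convert the native layout into a
      // standard-layout ArrayBufferView and copy the data out.
      this.hostBuffer = await this.nativeOutput.readBackToStandardLayout();
    }
    return this.hostBuffer;
  }
}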

anssiko commented Mar 30, 2021

(Cross-linking @wchao1115’s comment on WebAssembly.Memory object #149 (comment).)

huningxin commented May 24, 2021

To better understand this use case, I recently experimented with implementing conv2d of the TF.js Wasm backend with the WebNN API. The implementation is in conv2d_impl.cc and the WebNN calls are guarded by USE_WEBNN_OP. With the prototype, I observed a good performance speedup (3X to 5X) in a tf.conv2d benchmark when offloading the compute to a native library (such as XNNPACK or oneDNN) via WebNN running on the CPU.

From the prototype, there are a few findings:

  1. The TF.js Wasm backend expects the input and output data of an op execution to be in the standard layout.
  2. The TF.js Wasm backend pre-allocates input and output buffers for an op execution.
  3. The TF.js Wasm backend executes an op synchronously.
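
A sketch of what these three findings imply for a WebNN-backed op in the Wasm backend. computeSync() is assumed here (a synchronous variant discussed in earlier WebNN drafts and in webnn-native), not necessarily the shipped API, and the buffer sizes are illustrative:

// Pre-allocated, standard-layout buffers owned by the backend (findings 1 and 2).
const inputBuffer = new Float32Array(1 * 1 * 5 * 5);
const outputBuffer = new Float32Array(1 * 1 * 3 * 3);

function conv2dViaWebNN(context, graph) {
  // The op must complete before returning (finding 3), hence a synchronous
  // compute entry point is assumed.
  context.computeSync(graph, { input: inputBuffer }, { output: outputBuffer });
  return outputBuffer;
}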

@pyu10055

pyu10055 commented

@huningxin That is correct. The TFJS Wasm backend is synchronous; the computationally heavy ops are executed with web workers for multi-threading. And TFJS can be run in a web worker to achieve asynchronous execution.
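
For completeness, a minimal sketch of that pattern: hosting TF.js with the Wasm backend in a dedicated worker so its synchronous op execution does not block the main thread. The worker file name and message shape are illustrative only:

// main thread
const worker = new Worker('tfjs-wasm-worker.js'); // illustrative file name
worker.postMessage({
  op: 'conv2d',
  input: new Float32Array(25).fill(1),
  filter: new Float32Array(9).fill(1),
});
worker.onmessage = (event) => {
  // The result arrives asynchronously even though the worker computed it synchronously.
  console.log('conv2d output', event.data);
};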

huningxin commented

@pyu10055, thanks for the clarification.

BTW, the webnn-native code to reproduce the conv2d performance results of #156 (comment) is up for review in webmachinelearning/webnn-native#10. Feel free to check it out.

anssiko commented Mar 3, 2023

While doing issue gardening, I noticed this issue had been fixed by #174.
