[ET-VK] Documentation for Vulkan Delegate #3113

Closed · wants to merge 2 commits

193 changes: 193 additions & 0 deletions backends/vulkan/README.md
@@ -0,0 +1,193 @@
# ExecuTorch Vulkan Delegate

The ExecuTorch Vulkan delegate is a native GPU delegate for ExecuTorch that is
built on top of the cross-platform Vulkan GPU API standard. It is primarily
designed to leverage the GPU to accelerate model inference on Android devices,
but can be used on any platform that supports an implementation of Vulkan:
laptops, servers, and edge devices.

> **Review comment (Contributor):** Is there a section describing the supported
> platforms for both AOT and runtime?

::::{note}
The Vulkan delegate is currently under active development, and its components
are subject to change.
::::

## What is Vulkan?

Vulkan is a low-level GPU API specification developed as a successor to OpenGL.
It is designed to offer developers more explicit control over the GPU than
previous specifications, in order to reduce overhead and maximize the
capabilities of modern graphics hardware.

Vulkan has been widely adopted among GPU vendors, and most modern GPUs (both
desktop and mobile) on the market support Vulkan. Vulkan is also included in
Android from Android 7.0 onwards.

**Note that Vulkan is a GPU API, not a GPU Math Library**. That is to say it
provides a way to execute compute and graphics operations on a GPU, but does not
come with a built-in library of performant compute kernels.

## The Vulkan Compute Library

The ExecuTorch Vulkan Delegate is a wrapper around a standalone runtime known as
the **Vulkan Compute Library**. The aim of the Vulkan Compute Library is to
provide GPU implementations of PyTorch operators via GLSL compute shaders that
are executed using Vulkan.

The Vulkan Compute Library is a fork/iteration of the [PyTorch Vulkan Backend](https://pytorch.org/tutorials/prototype/vulkan_workflow.html).
The core components of the PyTorch Vulkan backend were forked into ExecuTorch
and adapted for an AOT, graph-mode style of model inference (as opposed to
PyTorch's eager execution style of model inference).

The components of the Vulkan Compute Library are contained in the
`executorch/backends/vulkan/runtime/` directory. The core components are listed
and described below:

```
runtime/
├── api/ .................... Wrapper API around Vulkan to manage Vulkan objects
└── graph/ .................. ComputeGraph class which implements graph mode inference
    └── ops/ ................ Base directory for operator implementations
        ├── glsl/ ........... GLSL compute shaders
        │   ├── *.glsl
        │   └── conv2d.glsl
        └── impl/ ........... C++ code to dispatch GPU compute shaders
            ├── *.cpp
            └── Conv2d.cpp
```

## Features

The Vulkan delegate currently supports the following features:

* **Memory Planning**
  * Intermediate tensors whose lifetimes do not overlap will share memory allocations. This reduces the peak memory usage of model inference.
* **Capability Based Partitioning**:
  * A graph can be partially lowered to the Vulkan delegate via a partitioner, which will identify nodes (i.e. operators) that are supported by the Vulkan delegate and lower only supported subgraphs.
* **Support for upper-bound dynamic shapes**:
  * Tensors can change shape between inferences as long as their current shape is smaller than the bounds specified during lowering (see the sketch below).
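
To illustrate the upper-bound dynamic shapes feature, here is a minimal sketch
(not taken from this PR) of exporting a toy module with a bounded dynamic batch
dimension before lowering. The module, input shapes, and `Dim` bounds are
illustrative assumptions.

```python
# Minimal sketch: export with an upper-bound dynamic batch dimension.
# The module and the bounds below are illustrative assumptions.
import torch
from torch.export import Dim, export

class AddModule(torch.nn.Module):
    def forward(self, x, y):
        return x + y

sample_inputs = (torch.randn(4, 32), torch.randn(4, 32))

# The batch dimension may change between inferences, up to an upper bound of 8.
batch = Dim("batch", min=1, max=8)
exported_program = export(
    AddModule(),
    sample_inputs,
    dynamic_shapes={"x": {0: batch}, "y": {0: batch}},
)
```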

In addition to increasing operator coverage, the following features are
currently in development:

* **Quantization Support**
  * We are currently working on support for 8-bit dynamic quantization, with plans to extend to other quantization schemes in the future.
* **Memory Layout Management**
  * Memory layout is an important factor in optimizing performance. We plan to introduce graph passes that insert memory layout transitions throughout a graph to optimize memory-layout sensitive operators such as Convolution and Matrix Multiplication.
* **Selective Build**
  * We plan to make it possible to control build size by selecting which operators/shaders to include in the build.

## End to End Example

To further understand the features of the Vulkan Delegate and how to use it,
consider the following end-to-end example with MobileNet V2.

### Compile and lower a model to the Vulkan Delegate

Assuming ExecuTorch has been set up and installed, the following script can be
used to produce a lowered MobileNet V2 model as `vulkan_mobilenetv2.pte`.

```python
import torch
import torchvision.models as models

from torch.export import export, ExportedProgram
from torchvision.models.mobilenetv2 import MobileNet_V2_Weights
from executorch.backends.vulkan.partitioner.vulkan_partitioner import VulkanPartitioner
from executorch.exir import EdgeProgramManager, ExecutorchProgramManager, to_edge
from executorch.exir.backend.backend_api import to_backend

mobilenet_v2 = models.mobilenetv2.mobilenet_v2(weights=MobileNet_V2_Weights.DEFAULT).eval()
sample_inputs = (torch.randn(1, 3, 224, 224), )

exported_program: ExportedProgram = export(mobilenet_v2, sample_inputs)
edge: EdgeProgramManager = to_edge(exported_program)

# Lower the model to Vulkan backend
edge = edge.to_backend(VulkanPartitioner())

exec_prog = edge.to_executorch()

with open("vulkan_mobilenetv2.pte", "wb") as file:
exec_prog.write_to_file(file)
```

Like other ExecuTorch delegates, a model can be lowered to the Vulkan Delegate
using the `to_backend()` API. The Vulkan Delegate implements the
`VulkanPartitioner` class which identifies nodes (i.e. operators) in the graph
that are supported by the Vulkan delegate, and separates compatible sections of
the model to be executed on the GPU.

This means that a model can be lowered to the Vulkan delegate even if it
contains some unsupported operators; in that case, only the supported parts of
the graph will be executed on the GPU.
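
As a quick sanity check, the lowered program can be inspected to see how much
of the graph was actually delegated. This is a hedged sketch continuing from
the script above; the `get_delegation_info` import path is an assumption and
may differ between ExecuTorch releases.

```python
# Hedged sketch: inspect how much of the graph was delegated to Vulkan.
# The import path below is an assumption and may vary between releases.
from executorch.devtools.backend_debug import get_delegation_info

delegation_info = get_delegation_info(edge.exported_program().graph_module)
print(delegation_info.get_summary())

# Alternatively, print the graph and look for call_delegate nodes.
print(edge.exported_program().graph_module)
```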


::::{note}
The [Vulkan partitioner code](https://github.com/pytorch/executorch/blob/main/backends/vulkan/partitioner/vulkan_partitioner.py)
can be inspected to examine which ops are currently implemented in the Vulkan
delegate.
::::

### Build Vulkan Delegate libraries

The easiest way to build and test the Vulkan Delegate is to build for Android
and test on a local Android device. Android devices have built-in support for
Vulkan, and the Android NDK ships with a GLSL compiler, which is needed to
compile the Vulkan Compute Library's GLSL compute shaders.

The Vulkan Delegate libraries can be built by setting `-DEXECUTORCH_BUILD_VULKAN=ON`
when building with CMake.

First, make sure that you have the Android NDK installed; Android NDK r25c is
recommended. The Android SDK should also be installed so that you have access
to `adb`.

> **Review comment (Contributor):** Is `r25c` required, or just the version
> that was tested?
>
> **Author reply (Contributor):** Just recommended. I mention it to be
> consistent with Hansong's Android app demo. I don't think there is a hard
> NDK requirement except for being after when Vulkan was added (I think r17).
> I tested with r19 and it was able to build too.

```shell
# Recommended version is Android NDK r25c.
export ANDROID_NDK=<path_to_ndk>
# Select an appropriate Android ABI
export ANDROID_ABI=arm64-v8a
# All subsequent commands should be performed from ExecuTorch repo root
cd <path_to_executorch_root>
# Make sure adb works
adb --version
```

To build and install ExecuTorch libraries (for Android) with the Vulkan
Delegate:

```shell
# From executorch root directory
(rm -rf cmake-android-out && \
cmake . -DCMAKE_INSTALL_PREFIX=cmake-android-out \
-DCMAKE_TOOLCHAIN_FILE=$ANDROID_NDK/build/cmake/android.toolchain.cmake \
-DANDROID_ABI=$ANDROID_ABI \
-DEXECUTORCH_BUILD_VULKAN=ON \
-DPYTHON_EXECUTABLE=python \
-Bcmake-android-out && \
cmake --build cmake-android-out -j16 --target install)
```

### Run the Vulkan model on device

::::{note}
Since operator support is currently limited, only binary arithmetic operators
will run on the GPU. Expect inference to be slow as the majority of operators
are being executed via Portable operators.
::::

Now, the partially delegated model can be executed on your device's GPU!

```shell
# Build a model runner binary linked with the Vulkan delegate libs
cmake --build cmake-android-out --target vulkan_executor_runner -j32

# Push model to device
adb push vulkan_mobilenetv2.pte /data/local/tmp/vulkan_mobilenetv2.pte
# Push binary to device
adb push cmake-android-out/backends/vulkan/vulkan_executor_runner /data/local/tmp/runner_bin

# Run the model
adb shell /data/local/tmp/runner_bin --model_path /data/local/tmp/vulkan_mobilenetv2.pte
```
148 changes: 148 additions & 0 deletions backends/vulkan/docs/android_demo.md
@@ -0,0 +1,148 @@
# Building and Running ExecuTorch with the Vulkan Backend

The [ExecuTorch Vulkan Delegate](./native-delegates-executorch-vulkan-delegate.md)
is a native GPU delegate for ExecuTorch.

<!----This will show a grid card on the page----->
::::{grid} 2
:::{grid-item-card} What you will learn in this tutorial:
:class-card: card-content
* How to export the Stories 110M parameter model with partial GPU delegation
* How to execute the partially delegated model on Android
:::
:::{grid-item-card} Prerequisites:
:class-card: card-prerequisites
* Follow [**Setting up ExecuTorch**](./getting-started-setup.md)
* Follow [**Setting up the ExecuTorch LLaMA Android Demo App**](./llm/llama-demo-android.md)
:::
::::

## Prerequisites

Note that all the steps below should be performed from the ExecuTorch
repository root directory, and assume that you have gone through the steps of
setting up ExecuTorch.

You should also refer to the **Prerequisites** section of the [**Setting up the ExecuTorch LLaMA Android Demo App**](./llm/llama-demo-android.md)
tutorial in order to install the specified versions of the Android NDK and the
Android SDK.

```shell
# Recommended version is Android NDK r25c.
export ANDROID_NDK=<path_to_ndk>
# Select an appropriate Android ABI
export ANDROID_ABI=arm64-v8a
# All subsequent commands should be performed from ExecuTorch repo root
cd <path_to_executorch_root>
# Make sure adb works
adb --version
```

## Lowering the Stories 110M model to Vulkan

::::{note}
The resultant model will only be partially delegated to the Vulkan backend. In
particular, only binary arithmetic operators (`aten.add`, `aten.sub`,
`aten.mul`, `aten.div`) and the matrix multiplication operator (`aten.mm`) will
be executed on the GPU via the Vulkan delegate. The rest of the model will be
executed using Portable operators. This is because the Vulkan delegate is still
early in development and currently has limited operator coverage.
::::
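
To see this partial delegation in isolation, consider the following toy sketch
(an illustrative assumption, not part of the tutorial): a module built only
from the operators listed above is expected to lower fully to the Vulkan
delegate, while anything else falls back to Portable operators.

```python
# Illustrative sketch (not from the tutorial): a module that uses only ops
# listed as supported above, so it is expected to lower fully to Vulkan.
import torch
from torch.export import export
from executorch.backends.vulkan.partitioner.vulkan_partitioner import VulkanPartitioner
from executorch.exir import to_edge

class TinyMatmul(torch.nn.Module):
    def forward(self, x, w):
        # aten.mm, aten.mul, and aten.add are all on the supported list above
        return torch.mm(x, w) * 0.5 + 1.0

edge = to_edge(export(TinyMatmul(), (torch.randn(4, 8), torch.randn(8, 4))))
edge = edge.to_backend(VulkanPartitioner())
# Expect a single delegated subgraph in the printed graph
print(edge.exported_program().graph)
```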

First, download `stories110M.pt` and `tokenizer.model` from GitHub:

```shell
wget "https://huggingface.co/karpathy/tinyllamas/resolve/main/stories110M.pt"
wget "https://raw.githubusercontent.com/karpathy/llama2.c/master/tokenizer.model"
```

Next, create the params file:

```shell
echo '{"dim": 768, "multiple_of": 32, "n_heads": 12, "n_layers": 12, "norm_eps": 1e-05, "vocab_size": 32000}' > params.json
```

Then, create a tokenizer binary file:

```shell
python -m examples.models.llama2.tokenizer.tokenizer -t tokenizer.model -o tokenizer.bin
```

Finally, export the `stories110M.pt` file into an ExecuTorch program:

```shell
python -m examples.models.llama2.export_llama -c stories110M.pt -p params.json --vulkan
```

A `vulkan_llama2.pte` file should have been created as a result of the last step.

Push the tokenizer binary and `vulkan_llama2.pte` onto your Android device:

```shell
adb shell mkdir -p /data/local/tmp/llama/
adb push tokenizer.bin /data/local/tmp/llama/
adb push vulkan_llama2.pte /data/local/tmp/llama/
```

## Build and Run the LLaMA runner binary on Android

First, build and install ExecuTorch libraries, then build the LLaMA runner
binary using the Android NDK toolchain.

```shell
(rm -rf cmake-android-out && \
cmake . -DCMAKE_INSTALL_PREFIX=cmake-android-out \
-DCMAKE_TOOLCHAIN_FILE=$ANDROID_NDK/build/cmake/android.toolchain.cmake \
-DANDROID_ABI=$ANDROID_ABI \
-DEXECUTORCH_BUILD_EXTENSION_MODULE=ON \
-DEXECUTORCH_BUILD_EXTENSION_DATA_LOADER=ON \
-DEXECUTORCH_BUILD_VULKAN=ON \
-DEXECUTORCH_BUILD_OPTIMIZED=ON \
-DPYTHON_EXECUTABLE=python \
-Bcmake-android-out && \
cmake --build cmake-android-out -j16 --target install)

# Build LLaMA Runner library
(rm -rf cmake-android-out/examples/models/llama2 && \
cmake examples/models/llama2 \
-DCMAKE_TOOLCHAIN_FILE=$ANDROID_NDK/build/cmake/android.toolchain.cmake \
-DANDROID_ABI=$ANDROID_ABI \
-DCMAKE_INSTALL_PREFIX=cmake-android-out \
-DPYTHON_EXECUTABLE=python \
-Bcmake-android-out/examples/models/llama2 && \
cmake --build cmake-android-out/examples/models/llama2 -j16)
```

Finally, push and run the LLaMA runner binary on your Android device.

```shell
adb push cmake-android-out/examples/models/llama2/llama_main /data/local/tmp/llama_main

adb shell /data/local/tmp/llama_main \
--model_path=/data/local/tmp/llama/vulkan_llama2.pte \
--tokenizer_path=/data/local/tmp/llama/tokenizer.bin \
--prompt "hi" \
--temperature=0
```

Output similar to the following should be produced:

```
hippo named Hippy lived in a big pond. Hippy was a very happy hippo. He liked to play...
```

## Running with the LLaMA Android Demo App

It is also possible to run the partially delegated Vulkan model inside the LLaMA
Android demo app.

First, make some modifications to the Android app setup script to make sure that
the Vulkan backend is built when building and installing ExecuTorch libraries:

```shell
# Run from executorch root directory. You can also edit this in a code editor
sed -i 's/-DEXECUTORCH_BUILD_XNNPACK=ON/-DEXECUTORCH_BUILD_XNNPACK=ON -DEXECUTORCH_BUILD_VULKAN=ON/g' examples/demo-apps/android/LlamaDemo/setup.sh
```

Then, follow the instructions at [**Setting up the ExecuTorch LLaMA Android Demo App**](./llm/llama-demo-android.md)
to build and run the demo application on your Android device. Once the app
starts up, you can load and run the `vulkan_llama2.pte` model with the app.
1 change: 1 addition & 0 deletions docs/source/build-run-vulkan.md
@@ -0,0 +1 @@
```{include} ../../backends/vulkan/docs/android_demo.md
```
9 changes: 9 additions & 0 deletions docs/source/index.rst
Original file line number Diff line number Diff line change
@@ -100,6 +100,7 @@ Topics in this section will help you get started with ExecuTorch.
demo-apps-android
examples-end-to-end-to-lower-model-to-delegate
tutorial-xnnpack-delegate-lowering
build-run-vulkan
..
Alphabetical by backend name. Be sure to keep the same order in the
customcarditem entries below.
@@ -183,6 +184,7 @@ Topics in this section will help you get started with ExecuTorch.
:hidden:

native-delegates-executorch-xnnpack-delegate
native-delegates-executorch-vulkan-delegate
backend-delegates-integration
backend-delegates-dependencies

@@ -262,6 +264,13 @@ ExecuTorch tutorials.
:link: tutorial-xnnpack-delegate-lowering.html
:tags: Export,Backend,Delegation,Quantization,XNNPACK

.. customcarditem::
:header: Building and Running ExecuTorch with the Vulkan Backend
:card_description: A tutorial that walks you through the process of building ExecuTorch with the Vulkan Backend
:image: _static/img/generic-pytorch-logo.png
:link: build-run-vulkan.html
:tags: Export,Backend,Delegation,Vulkan

..
Alphabetical by backend name. Be sure to keep the same order in the Tutorials
toctree entry above.
1 change: 1 addition & 0 deletions docs/source/native-delegates-executorch-vulkan-delegate.md
@@ -0,0 +1 @@
```{include} ../../backends/vulkan/README.md
```