Ahead-of-time compilation
Scheduling and execution logic are compiled together
Project architecture
Support for advanced model features
Dynamic shapes, flow control, streaming, and more
Importing from ML frameworks
Designed for CPUs, GPUs, and other accelerators
First class support for many popular devices and APIs
Deployment configurations
Low overhead, pipelined execution
Efficient power and resource usage on server and edge devices
Benchmarking
Binary size as low as 30KB on embedded systems
Running on bare-metal
Debugging and profiling support
Profiling with Tracy
IREE supports importing from a variety of ML frameworks:
The IREE compiler tools run on Linux, Windows, and macOS and can generate efficient code for a variety of runtime platforms:
and architectures:
Support for hardware accelerators and APIs is also included:
IREE adopts a holistic approach towards ML model compilation: the IR produced contains both the scheduling logic, required to communicate data dependencies to low-level parallel pipelined hardware/API like Vulkan, and the execution logic, encoding dense computation on the hardware in the form of hardware/API-specific binaries like SPIR-V.
Using IREE involves the following general steps:
Import your model
Develop your program using one of the supported frameworks, then import into IREE
Select your deployment configuration
Identify your target platform, accelerator(s), and other constraints
Compile your model
Compile through IREE, picking settings based on your deployment configuration
Run your model
Use IREE's runtime components to execute your compiled model
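Concretely, a minimal end-to-end flow targeting the CPU can look like the following sketch, reusing the simple_abs sample that appears again later in this document (paths assume you are running from an IREE source checkout):

# Compile the imported program for the CPU backend.
iree-compile --iree-hal-target-backends=llvm-cpu \
    samples/models/simple_abs.mlir \
    -o /tmp/simple_abs_cpu.vmfb

# Run the exported function on the local CPU device using IREE's runtime tools.
iree-run-module --device=local-task \
    --module=/tmp/simple_abs_cpu.vmfb \
    --function=abs \
    --input="f32=-5"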
IREE supports importing models from a growing list of ML frameworks and model formats:
IREE provides a flexible set of tools for various deployment scenarios. Fully featured environments can use IREE for dynamic model deployments taking advantage of multi-threaded hardware, while embedded systems can bypass IREE's runtime entirely or interface with custom accelerators.
IREE supports the full set of these configurations using the same underlying technology.
Model compilation is performed ahead-of-time on a host machine for any combination of targets. The compilation process converts from layers and operators used by high level frameworks down into optimized native code and associated scheduling logic.
For example, compiling for GPU execution using Vulkan generates SPIR-V kernels and Vulkan API calls. For CPU execution, native code with static or dynamic linkage and the associated function calls are generated.
IREE offers a low level C API, as well as several sets of API bindings for compiling and running programs using various languages.
IREE is in the early stages of development and is not yet ready for broad adoption. We use both GitHub Projects and GitHub Milestones to track progress.
Pronounced \"eerie\" and often styled with the emoji\u00a0\u21a9
While IREE does offer binary distributions for its compiler tools and Python bindings, building from source is still useful when using IREE's runtime or when making changes to the compiler or import tools themselves.
Running on a platform like Android involves cross-compiling from a host platform (e.g. Linux) to a target platform (a specific Android version and system architecture):
You should already be able to build IREE from source on your host platform. Please make sure you have followed the getting started steps.
The Android Native Developer Kit (NDK) is needed to use native C/C++ code on Android. You can download it here, or, if you have installed Android Studio, you can follow this guide instead.
Note
Make sure the ANDROID_NDK environment variable is set after installing the NDK.
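For example, on Linux (the NDK version and install path below are only illustrative; use wherever your NDK is actually installed):

export ANDROID_NDK=$HOME/Android/Sdk/ndk/25.2.9519653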
ADB (the Android Debug Bridge) is also needed to communicate with Android devices from the command line. Install it following the official user guide.
Build and install on your host machine:
cmake -GNinja -B ../iree-build/ \\\n -DCMAKE_INSTALL_PREFIX=../iree-build/install \\\n -DCMAKE_BUILD_TYPE=RelWithDebInfo \\\n .\ncmake --build ../iree-build/ --target install\n
Build the runtime using the Android NDK toolchain:
cmake -GNinja -B ../iree-build-android/ \\\n -DCMAKE_TOOLCHAIN_FILE=\"${ANDROID_NDK?}/build/cmake/android.toolchain.cmake\" \\\n -DIREE_HOST_BIN_DIR=\"$PWD/../iree-build/install/bin\" \\\n -DANDROID_ABI=\"arm64-v8a\" \\\n -DANDROID_PLATFORM=\"android-29\" \\\n -DIREE_BUILD_COMPILER=OFF \\\n .\ncmake --build ../iree-build-android/\n
cmake -GNinja -B ../iree-build-android/ \\\n -DCMAKE_TOOLCHAIN_FILE=\"%ANDROID_NDK%/build/cmake/android.toolchain.cmake\" \\\n -DIREE_HOST_BIN_DIR=\"%CD%/../iree-build/install/bin\" \\\n -DANDROID_ABI=\"arm64-v8a\" \\\n -DANDROID_PLATFORM=\"android-29\" \\\n -DIREE_BUILD_COMPILER=OFF \\\n .\ncmake --build ../iree-build-android/\n
See the Android NDK CMake guide and Android Studio CMake guide for details on configuring CMake for Android.
The specific ANDROID_ABI and ANDROID_PLATFORM used should match your target device.
Make sure you enable developer options and USB debugging on your Android device and can see it when you run adb devices, then run all tests through ctest:
# Build test dependencies\ncmake --build ../iree-build-android/ --target iree-test-deps\n\n# Ensure that your Android device is visible\nadb devices\n\n# Run tests\nctest --test-dir ../iree-build-android/ --output-on-failure\n
This will automatically upload build artifacts to the connected Android device, run the tests, then report the status back to your host machine.
Invoke the host compiler tools to produce a bytecode module FlatBuffer:
../iree-build/install/bin/iree-compile \\\n --iree-hal-target-backends=vmvx \\\n samples/models/simple_abs.mlir \\\n -o /tmp/simple_abs_vmvx.vmfb\n
Push the Android runtime tools to the device, along with any FlatBuffer files:
adb push ../iree-build-android/tools/iree-run-module /data/local/tmp/\nadb shell chmod +x /data/local/tmp/iree-run-module\nadb push /tmp/simple_abs_vmvx.vmfb /data/local/tmp/\n
Run the tool:
adb shell /data/local/tmp/iree-run-module --device=local-task \\\n --module=/data/local/tmp/simple_abs_vmvx.vmfb \\\n --function=abs \\\n --input=\"f32=-5\"\n
IREE can be built from source using CMake. We recommend the Ninja CMake generator and the clang or MSVC C/C++ compilers.
IREE developers and CIs primarily use Ninja, clang, and MSVC. Other configurations (including the Makefile generator and gcc) are "best effort". Patches to improve support are always welcome.
Install a compiler/linker (typically the "clang" and "lld" packages)
Install CMake (typically the "cmake" package)
Install Ninja (typically the "ninja-build" package)
On Debian/Ubuntu:
sudo apt install cmake ninja-build clang lld\n
Install CMake
Install Ninja
If using Homebrew:
brew install cmake ninja\n
Install MSVC from Visual Studio or "Tools for Visual Studio" on the official downloads page
Install CMake from the official downloads page
Install Ninja from the official site
Initialize MSVC by running vcvarsall.bat to build on the command line. See the official documentation for details.
Use Git to clone the IREE repository and initialize its submodules:
git clone https://github.com/iree-org/iree.git\ncd iree\ngit submodule update --init\n
The most basic CMake workflow is:
# Configure\ncmake -G Ninja -B ../iree-build/ .\n\n# Build\ncmake --build ../iree-build/\n
Caution - slow builds
The compiler build is complex. You will want a powerful machine and should tune the settings following the next section. In 2023, we've seen builds take around 5-10 minutes on 64-core Linux machines.
Use case permitting, disabling the compiler build with -DIREE_BUILD_COMPILER=OFF will drastically simplify the build.
The configure step should be customized for your build environment. These settings can improve compile and link times substantially.
# Recommended development options using clang and lld:\ncmake -G Ninja -B ../iree-build/ -S . \\\n -DCMAKE_BUILD_TYPE=RelWithDebInfo \\\n -DIREE_ENABLE_ASSERTIONS=ON \\\n -DIREE_ENABLE_SPLIT_DWARF=ON \\\n -DIREE_ENABLE_THIN_ARCHIVES=ON \\\n -DCMAKE_C_COMPILER=clang \\\n -DCMAKE_CXX_COMPILER=clang++ \\\n -DIREE_ENABLE_LLD=ON\n
# Recommended development options using clang and lld:\ncmake -G Ninja -B ../iree-build/ -S . \\\n -DCMAKE_BUILD_TYPE=RelWithDebInfo \\\n -DIREE_ENABLE_ASSERTIONS=ON \\\n -DIREE_ENABLE_SPLIT_DWARF=ON \\\n -DCMAKE_C_COMPILER=clang \\\n -DCMAKE_CXX_COMPILER=clang++ \\\n -DIREE_ENABLE_LLD=ON\n
It is also possible to add -DIREE_ENABLE_THIN_ARCHIVES=ON if the CMAKE_AR variable is defined and points to the path of either the GNU binutils or LLVM ar program, overriding the default Apple ar.
# Recommended development options:\ncmake -G Ninja -B ../iree-build/ -S . \\\n -DCMAKE_BUILD_TYPE=RelWithDebInfo \\\n -DIREE_ENABLE_ASSERTIONS=ON\n
We recommend using the RelWithDebInfo build type by default for a good balance of debug info and performance. The Debug, Release, and MinSizeRel build types are useful in more specific cases. Note that several useful LLVM debugging features are only available in Debug builds. See the official CMake documentation for general details.
We recommend using ccache with CMake, especially when rebuilding the compiler. To use it, configure CMake with:
-DCMAKE_C_COMPILER_LAUNCHER=ccache -DCMAKE_CXX_COMPILER_LAUNCHER=ccache\n
See also our developer documentation for ccache.
By default, the CMake build includes the llvm-cpu, cuda, and vulkan-spirv compiler target backends and the local-task and vulkan runtime HAL drivers.
The default build does not include:
These can be changed via the IREE_ CMake options listed in the root CMakeLists.txt.
When using IREE within other projects, you can register compiler plugins and runtime HAL drivers. You can also bring your own copy of LLVM and some other tools. See the root CMakeLists.txt for details.
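As a sketch, a configuration that trims the default build down to just the CPU compiler backend and the local-task runtime driver could look like this; the IREE_ option names follow the convention above, but verify them against the root CMakeLists.txt:

# Illustrative trimmed configuration; see the root CMakeLists.txt for the
# authoritative list of IREE_ options.
cmake -G Ninja -B ../iree-build/ -S . \
    -DIREE_TARGET_BACKEND_DEFAULTS=OFF \
    -DIREE_TARGET_BACKEND_LLVM_CPU=ON \
    -DIREE_HAL_DRIVER_DEFAULTS=OFF \
    -DIREE_HAL_DRIVER_LOCAL_TASK=ON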
Tests are run via ctest. To build and run the core project tests:
# Build default targets\ncmake --build ../iree-build/\n\n# Run tests\nctest --test-dir ../iree-build/\n
Caution
This has two limitations:
To build and then run all tests:
# 1. Build default targets\ncmake --build ../iree-build/\n\n# 2. Build test dependencies\ncmake --build ../iree-build/ --target iree-test-deps\n\n# 3. Run tests\nctest --test-dir ../iree-build/\n\n\n# Or combine all steps using a utility target\ncmake --build ../iree-build --target iree-run-tests\n
To run only certain tests, we have a helper script that converts environment variables into ctest filters:
# Run default tests\n./build_tools/cmake/ctest_all.sh ../iree-build\n\n# Run tests, turning CUDA on and Vulkan off\nexport IREE_CUDA_DISABLE=0\nexport IREE_VULKAN_DISABLE=1\n./build_tools/cmake/ctest_all.sh ../iree-build\n
# Build\ncmake --build ../iree-build/\n\n# Run a standalone sample application\n../iree-build/runtime/src/iree/runtime/demo/hello_world_embedded\n# 4xf32=1 1.1 1.2 1.3\n# *\n# 4xf32=10 100 1000 10000\n# =\n# 4xf32=10 110 1200 13000\n\n# Try out the developer tools\nls ../iree-build/tools/\n../iree-build/tools/iree-compile --help\n../iree-build/tools/iree-run-module --help\n
Python packages can either be built from source or installed from our releases. See the Python bindings page for details about the bindings themselves.
You will need a recent Python installation >=3.9 (we aim to support non-eol Python versions).
Make sure your 'python' is what you expect:
Note that on multi-python systems, this may have a version suffix, and on many Linuxes where python2 and python3 can co-exist, you may also want to use python3.
which python\npython --version\n
Note that on multi-python systems, this may have a version suffix, and on macOS where python2 and python3 can co-exist, you may also want to use python3.
The Python launcher for Windows (py) can help manage versions.
which python\npython --version\npy --list-paths\n
We recommend using virtual environments to manage python packages, such as through venv (about, tutorial):
python -m venv .venv\nsource .venv/bin/activate\n
python -m venv .venv\n.venv\\Scripts\\activate.bat\n
When done, run deactivate.
# Upgrade PIP before installing other requirements\npython -m pip install --upgrade pip\n\n# Install IREE build requirements\npython -m pip install -r runtime/bindings/python/iree/runtime/build_requirements.txt\n
To build the Python bindings, configure CMake with the IREE_BUILD_PYTHON_BINDINGS option. We also recommend explicitly setting which Python executable to use with Python3_EXECUTABLE:
# Configure (including other options as discussed above)\ncmake -G Ninja -B ../iree-build/ \\\n -DIREE_BUILD_PYTHON_BINDINGS=ON \\\n -DPython3_EXECUTABLE=\"$(which python)\" \\\n .\n\n# Build\ncmake --build ../iree-build/\n
Extend your PYTHONPATH with IREE's bindings/python paths and try importing:
source ../iree-build/.env && export PYTHONPATH\n# The 'PYTHONPATH' environment variable should now contain\n# iree-build/compiler/bindings/python;iree-build/runtime/bindings/python\n\npython -c \"import iree.compiler; help(iree.compiler)\"\npython -c \"import iree.runtime; help(iree.runtime)\"\n
..\\iree-build\\.env.ps1 # or ..\\iree-build\\.env.bat\n# The 'PYTHONPATH' environment variable should now contain\n# iree-build/compiler/bindings/python;iree-build/runtime/bindings/python\n\npython -c \"import iree.compiler; help(iree.compiler)\"\npython -c \"import iree.runtime; help(iree.runtime)\"\n
Using IREE's ML framework importers requires a few extra steps:
# Install test requirements\npython -m pip install -r integrations/tensorflow/test/requirements.txt\n\n# Install pure Python packages (no build required)\npython -m pip install integrations/tensorflow/python_projects/iree_tf\npython -m pip install integrations/tensorflow/python_projects/iree_tflite\n\n# Then test the tools:\niree-import-tf --help\niree-import-tflite --help\n
Cross-compilation for iOS consists of the two steps below.
For cross-compilation, you need Xcode. It comes with the SDKs for iOS devices and the simulator, as well as the simctl tool for controlling the simulator from the command line.
On your host platform, you should already be able to build IREE from source. Please make sure you've gone through the steps in getting started.
Build and install on your macOS host:
cmake -S . -B ../iree-build/ -GNinja \\\n -DCMAKE_BUILD_TYPE=RelWithDebInfo \\\n -DCMAKE_INSTALL_PREFIX=../iree-build/install\n\ncmake --build ../iree-build/ --target install\n
Build the runtime for the iOS Simulator.
cmake -S . -B ../build-ios-sim -GNinja \\\n -DCMAKE_SYSTEM_NAME=iOS \\\n -DCMAKE_OSX_SYSROOT=$(xcodebuild -version -sdk iphonesimulator Path) \\\n -DCMAKE_OSX_ARCHITECTURES=arm64 \\\n -DCMAKE_SYSTEM_PROCESSOR=arm64 \\\n -DCMAKE_OSX_DEPLOYMENT_TARGET=11.0 \\\n -DCMAKE_IOS_INSTALL_COMBINED=YES \\\n -DIREE_HOST_BIN_DIR=\"$PWD/../iree-build/install/bin\" \\\n -DCMAKE_INSTALL_PREFIX=../build-ios-sim/install \\\n -DIREE_BUILD_COMPILER=OFF\n\ncmake --build ../build-ios-sim --config Release --target install\n
Alternatively, we can build the runtime for iOS devices by changing the value of the -DCMAKE_OSX_SYSROOT option to:
-DCMAKE_OSX_SYSROOT=$(xcodebuild -version -sdk iphoneos Path)\n
Run the IREE compiler on the host to generate a module.
We could test the generated module by running the macOS version of iree-run-module on the host.
../iree-build/install/bin/iree-run-module \\\n --module=/tmp/simple_abs_vmvx.vmfb \\\n --device=local-task \\\n --function=abs \\\n --input=\"f32=-5\"\n
To run it on the iOS simulator, we need to copy the vmfb file into the iree-run-module iOS app bundle.
cp /tmp/simple_abs_vmvx.vmfb \\\n ../build-ios-sim/install/bin/iree-run-module.app/\n
Open the iOS Simulator Manager on the host.
open -a Simulator\n
After creating and booting a simulator in this app, you can list it from the command-line.
xcrun simctl list devices | grep Booted\n
This is what should come out of the command:
iPhone 14 Pro (12341234-ABCD-ABCD-ABCD-123412341234) (Booted)\n
where iPhone 14 Pro is the device being simulated and 12341234-ABCD-ABCD-ABCD-123412341234 is the simulator's unique device ID (UDID).
Install the app iree-run-module on the simulator, given its UDID.
xcrun simctl install <UDID> ../build-ios-sim/install/bin/iree-run-module.app\n
Check the path to the installed bundle, where the simple_abs_vmvx.vmfb module should be found.
ls $(xcrun simctl get_app_container <UDID> dev.iree.iree-run-module)\n
The string dev.iree.iree-run-module is the bundle identifier of the iOS app. The CMake build process generates it and saves it in the property list (plist) file ../build-ios-sim/install/bin/iree-run-module.app/Info.plist.
Launch the iree-run-module app on the simulator to run the IREE module simple_abs_vmvx.vmfb.
xcrun simctl launch --console \\\n <UDID> \\\n dev.iree.runmodule \\\n --device=local-task \\\n --function=abs \\\n --input=\"f32=-5\" \\\n --module=$(xcrun simctl get_app_container <UDID> dev.iree.iree-run-module)/simple_abs_vmvx.vmfb\n
Running on a platform like RISC-V involves cross-compiling from a host platform (e.g. Linux) to a target platform (a specific RISC-V CPU architecture and operating system):
You'll need a RISC-V LLVM compilation toolchain and a RISC-V enabled QEMU emulator.
See instructions in the following links
The RISCV_TOOLCHAIN_ROOT environment variable needs to be set to the root directory of the installed GNU toolchain when building the RISC-V compiler target and the runtime library.
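For example (the install location below is only illustrative; point it at wherever your toolchain actually lives):

export RISCV_TOOLCHAIN_ROOT=$HOME/riscv/toolchain/clang/linux/RISCV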
Execute the following script to download the prebuilt RISC-V toolchain and QEMU from the IREE root directory:
./build_tools/riscv/riscv_bootstrap.sh\n
The prebuilt toolchain is built with an AlmaLinux release 8.8 Docker image. It requires glibc >= 2.28 on your host machine.
For RISC-V vector extension support, see the additional instructions.
cmake -GNinja -B ../iree-build/ \\\n -DCMAKE_C_COMPILER=clang \\\n -DCMAKE_CXX_COMPILER=clang++ \\\n -DCMAKE_INSTALL_PREFIX=../iree-build/install \\\n -DCMAKE_BUILD_TYPE=RelWithDebInfo \\\n .\ncmake --build ../iree-build/ --target install\n
The following instructions show how to build for a RISC-V 64-bit Linux machine. For other RISC-V targets, please refer to riscv.toolchain.cmake as a reference for how to set up the CMake configuration.
cmake -GNinja -B ../iree-build-riscv/ \\\n -DCMAKE_TOOLCHAIN_FILE=\"./build_tools/cmake/riscv.toolchain.cmake\" \\\n -DIREE_HOST_BIN_DIR=$(realpath ../iree-build/install/bin) \\\n -DRISCV_CPU=linux-riscv_64 \\\n -DIREE_BUILD_COMPILER=OFF \\\n -DRISCV_TOOLCHAIN_ROOT=${RISCV_TOOLCHAIN_ROOT} \\\n -DIREE_ENABLE_CPUINFO=OFF \\\n .\ncmake --build ../iree-build-riscv/\n
The following instructions are meant for the RISC-V 64-bit Linux target. For the bare-metal target, please refer to simple_embedding to see how to build an ML workload for a bare-metal machine.
Set the QEMU_BIN environment variable to the path of the qemu-riscv64 Linux emulator binary. If it was installed with riscv_bootstrap.sh, the default path is ${HOME}/riscv/qemu/linux/RISCV/bin/qemu-riscv64.
export QEMU_BIN=<path to qemu-riscv64 binary>\n
Run the RISC-V emulation:
${QEMU_BIN} \\\n -cpu rv64 \\\n -L ${RISCV_TOOLCHAIN_ROOT}/sysroot/ \\\n ../iree-build-riscv/tools/iree-run-module \\\n --device=local-task \\\n --module=/tmp/simple_abs_vmvx.vmfb \\\n --function=abs \\\n --input=f32=-5\n
The RISC-V vector extension allows SIMD code to run more efficiently. To enable the vector extension for the compiler toolchain and the emulator, build the tools from the following sources:
git://sourceware.org/git/binutils-gdb.git
The SIMD code can be generated following the IREE CPU flow with the following additional command-line flags:
tools/iree-compile \\\n --iree-hal-target-backends=llvm-cpu \\\n --iree-llvmcpu-target-triple=riscv64 \\\n --iree-llvmcpu-target-abi=lp64d \\\n --iree-llvmcpu-target-cpu-features=\"+m,+a,+f,+d,+zvl512b,+v\" \\\n --riscv-v-fixed-length-vector-lmul-max=8 \\\n iree_input.mlir -o mobilenet_cpu.vmfb\n
Then run on the RISC-V QEMU:
${QEMU_BIN} \\\n -cpu rv64,Zve64d=true,vlen=512,elen=64,vext_spec=v1.0 \\\n -L ${RISCV_TOOLCHAIN_ROOT}/sysroot/ \\\n ../iree-build-riscv/tools/iree-run-module \\\n --device=local-task \\\n --module=mobilenet_cpu.vmfb \\\n --function=predict \\\n --input=\"1x224x224x3xf32=0\"\n
Projects built by community members:
The SHARK and SRT projects offer highly tuned performance and user interfaces for running a large corpus of machine learning programs.
The SHARK-Turbine project provides tools for bridging between PyTorch and IREE.
The IREE Bare-Metal Arm Sample shows how to build IREE with the Arm GNU Toolchain for bare-metal Arm targets using the open-source firmware libraries CMSIS and libopencm3.
The IREE C++ Template shows one way to integrate IREE's runtime into a project with CMake.
Official repositories:
iree-jax is home to IREE's AOT support for JAX programs.
iree-experimental includes various samples and prototypes built with IREE.
iree-llvm-sandbox contains experimental work by the IREE team closely related to LLVM and MLIR, usually with the aim of contributing back to those upstream projects.
Updates from the IREE team
IREE is being designed with re-targetability as a core goal: it should be possible to use IREE to target a broad spectrum of power regimes, from embedded systems to distributed clusters; and it should be possible to extend IREE to target new back-ends without having to reinvent the wheel each time.
To explore this, we recently branched out from our initial focus on low-latency mobile deployments with a goal of using IREE to target data center workloads on Nvidia CUDA. This post describes how we quickly brought up a CUDA back-end for IREE and used it to train BERT, then shares some metrics and next steps.
IREE has a HAL API that abstracts all the targets behind a common interface. The first step to supporting a CUDA target was to map the HAL API onto CUDA. We use the CUDA driver API to reduce dependencies and be closer to the hardware. The HAL API is based on other GPU APIs like Vulkan and Metal, so it was a natural fit for CUDA. The HAL API exposes memory allocations, basic fill and memset commands, kernel dispatch, and general command buffer handling. The original implementation uses the CUDA graph API, as a graph maps naturally to command buffers. There is also an implementation using CUDA streams for comparison.
The HAL exposes an API that can be tested independently: even though we are not yet able to create CUDA kernels, we can test a large portion of the CUDA driver implementation using the HAL CTS tests. Those can be run to make sure a system has the required CUDA support.
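For example, on a build configured with CUDA enabled, the CTS tests can be filtered by name through ctest; the test-name pattern below is an assumption, so list the matching tests first to confirm it:

# List matching tests without running them, then run them.
ctest --test-dir ../iree-build/ -N -R cts
ctest --test-dir ../iree-build/ -R cts --output-on-failure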
LLVM has an open source backend that generates PTX, which we leverage: IREE can emit NVVM (the CUDA LLVM variant) and use LLVM's backend to generate PTX. The CUDA driver then does the "last mile compilation" at runtime to convert the PTX into the GPU's native ISA.
The IREE compiler pipeline starts from linalg with tensor operands. A large part of the compiler is independent of the target.
The linalg-on-tensors representation of the graph is broken up into dispatch regions that are processed by NVVM codegen. A simple implementation of the compiler is to run bufferization and convert linalg to standard dialects, followed by conversion to NVVM/LLVM. Most of those transformations can reuse upstream MLIR transformations and are shared with any other backend targeting LLVM IR. Leveraging the MLIR conversion to LLVM allows us to quickly go from a simple "hello world" to supporting full models.
IREE code generation is based on MLIR infrastructure so each step can easily be tested independently using the MLIR lit framework.
Kernels are encoded in a FlatBuffer containing the PTX code as well as the workgroup size to use for the dispatch. This allows serialization of the kernels in the IR; they are then deserialized by the HAL layer.
table CUDAExecutableDef {\n // A map of entry point ordinals to string names as used in the shader\n // library.\n entry_points:[string];\n\n // Block sizes for each entry point.\n block_sizes:[CUDABlockSizeDef];\n\n // PTX string of the module.\n ptx_image:string;\n}\n
Together those 3 steps are enough to provide most of the functionality and we can now successfully compile full models.
To reproduce running a simple op end to end through the CUDA backend, save the following MLIR in /tmp/add.mlir and then run the given commands:
func.func @add(%arg0: tensor<4xf32>, %arg1: tensor<4xf32>) -> tensor<4xf32> {\n %0 = tensor.empty() : tensor<4xf32>\n %1 = linalg.generic {\n indexing_maps = [\n affine_map<(d0) -> (d0)>, affine_map<(d0) -> (d0)>, affine_map<(d0) -> (d0)>], iterator_types = [\"parallel\"]}\n ins(%arg0, %arg1 : tensor<4xf32>, tensor<4xf32>)\n outs(%0 : tensor<4xf32>) {\n ^bb0(%in: f32, %in_0: f32, %out: f32):\n %2 = arith.addf %in, %in_0 : f32\n linalg.yield %2 : f32\n } -> tensor<4xf32>\n return %1 : tensor<4xf32>\n}\n
# First compile into a VM bytecode module.\n$ ../iree-build/tools/iree-compile \\\n --iree-hal-target-backends=cuda \\\n /tmp/add.mlir \\\n -o /tmp/add.vmfb\n\n# Run the module through CUDA HAL backend.\n$ ../iree-build/tools/iree-run-module \\\n --device=cuda \\\n --module=/tmp/add.vmfb \\\n --function=add \\\n --input=\"4xf32=[1 2 3 4]\" \\\n --input=\"4xf32=[2 2 2 2]\"\n\nEXEC @add\n4xf32=3 4 5 6\n
Now that we have enabled the functionality, we need to look at the performance. Once again we can leverage existing MLIR transformations to speed up the development work.
The first obvious step to getting efficient code on CUDA is to make sure we distribute the work on enough blocks and threads to fill up the GPU. At the time of bring-up, not all ops were being tiled and distributed in the common IREE layer. During dispatch region creation we apply tile-and-fuse, which distributes the work into a set of workgroups that are mapped to CUDA blocks.
At the beginning of the code generation we look at the dispatch region and decide on the tile size for a workgroup. For CUDA we also decide the number of threads per block. We will then have a pass tiling the ops in the dispatch region a second time to distribute the work onto threads within the block.
At this stage the IR looks like the following:
%8 = \"gpu.thread_id\"() {dimension = \"x\"} : () -> index\n %9 = affine.apply affine_map<()[s0] -> (s0 * 4)>()[%8]\n %10 = memref.subview %in0[%9] [4] [1] : memref<128xf32, affine_map<(d0)[s0] -> (d0 + s0)>> to memref<4xf32, affine_map<(d0)[s0] -> (d0 + s0)>>\n %11 = memref.subview %in1[%9] [4] [1] : memref<128xf32, affine_map<(d0)[s0] -> (d0 + s0)>> to memref<4xf32, affine_map<(d0)[s0] -> (d0 + s0)>>\n %12 = memref.subview %out[%9] [4] [1] : memref<128xf32, affine_map<(d0)[s0] -> (d0 + s0)>> to memref<4xf32, affine_map<(d0)[s0] -> (d0 + s0)>>\n linalg.generic {\n indexing_maps = [affine_map<(d0) -> (d0)>,\n affine_map<(d0) -> (d0)>,\n affine_map<(d0) -> (d0)>],\n iterator_types = [\"parallel\"]}\n ins(%10, %11 :\n memref<4xf32, affine_map<(d0)[s0] -> (d0 + s0)>>,\n memref<4xf32, affine_map<(d0)[s0] -> (d0 + s0)>>)\n outs(%12 : memref<4xf32, affine_map<(d0)[s0] -> (d0 + s0)>>) {\n ^bb0(%arg1: f32, %arg2: f32, %arg3: f32): // no predecessors\n %13 = addf %arg1, %arg2 : f32\n linalg.yield %13 : f32\n }\n
Even though GPUs execute most operations as scalar code, memory operations are optimized to access 128 bits of data per thread. Therefore it is critical to vectorize load/store operations. After tiling, we vectorize the IR to get vector reads/writes that map to load4/store4 instructions. This significantly improves the memory access pattern of the generated code.
This converts the previous IR to:
%8 = \"gpu.thread_id\"() {dimension = \"x\"} : () -> index\n %9 = affine.apply affine_map<()[s0] -> (s0 * 4)>()[%8]\n %10 = memref.subview %in0[%9] [4] [1] : memref<128xf32, affine_map<(d0)[s0] -> (d0 + s0)>> to memref<4xf32, affine_map<(d0)[s0] -> (d0 + s0)>>\n %11 = memref.subview %in1[%9] [4] [1] : memref<128xf32, affine_map<(d0)[s0] -> (d0 + s0)>> to memref<4xf32, affine_map<(d0)[s0] -> (d0 + s0)>>\n %12 = memref.subview %out[%9] [4] [1] : memref<128xf32, affine_map<(d0)[s0] -> (d0 + s0)>> to memref<4xf32, affine_map<(d0)[s0] -> (d0 + s0)>>\n %13 = vector.transfer_read %10[%c0], %cst {in_bounds = [true]} : memref<4xf32, affine_map<(d0)[s0] -> (d0 + s0)>>, vector<4xf32>\n %14 = vector.transfer_read %11[%c0], %cst {in_bounds = [true]} : memref<4xf32, affine_map<(d0)[s0] -> (d0 + s0)>>, vector<4xf32>\n %15 = addf %13, %14 : vector<4xf32>\n vector.transfer_write %15, %12[%c0] {in_bounds = [true]} : vector<4xf32>, memref<4xf32, affine_map<(d0)[s0] -> (d0 + s0)>>\n
Nvidia GPUs have fast shared memory that needs to be leveraged to optimize cases where we may be memory bound and have the potential to reuse memory reads.
For operations like GEMM, using shared memory gives us a significant speedup. We leverage memory promotion, vector distribution, and software pipelining transformations from MLIR to generate efficient copies from global to shared memory that can be interleaved with the compute work.
Those different transformations compose to this flow:
The full step-by-step dump of a linalg.matmul operation can be found here.
We compare the performance of a single GEMM operation against the highly optimized cuBLAS library using the mmperf framework.
The graph can be reproduced following the instructions in mmperf.
Nod.ai has contributed an experimental HAL module for ROCm that allows us to reuse the compiler parts to support ROCm; more support is going to be added in the future.
Several performance improvements are still in progress, including optimizing the runtime allocator to reduce the host-side overhead and tuning tile sizes based on profiling.
Several models are running and we will publish more detailed benchmark results in the near future.
This tutorial is simultaneously about IREE, MLIR, and specifically the MLIR Linalg dialect.
MLIR is a programming language, but MLIR in itself is almost just an empty shell. What it really provides is a framework for defining MLIR dialects, which are where the features come from.
The \"IR\" part of the MLIR name stands for \"intermediate representation\". It means that MLIR is meant to be primarily for compiler-internal representations of code. But MLIR is actually fairly nice for humans to work with, and it's not hard to hand-author some MLIR programs from scratch. That is exactly the topic of this tutorial.
The \"ML\" part of the MLIR name stands for \"multi-level\" (not machine learning!). It means that MLIR allows for multiple dialects to be freely mixed in the same MLIR programs. Each dialect can define operations, types and attributes, and each single MLIR statement can mix ops, types and attributes coming from different dialects.
Linalg is an MLIR dialect that essentially consists of a single op, linalg.generic, with most other ops in this dialect being just convenience aliases for special cases of linalg.generic. So, to describe the Linalg dialect is essentially to describe linalg.generic.
The point of this is that this single op, linalg.generic, is:
These traits make the Linalg dialect an ideal "middle-end" IR for a machine learning compiler.
IREE is an MLIR compiler and runtime that can lower MLIR programs through successive, ever lower-level dialects, ultimately producing machine code for various CPU, GPU and other hardware targets. Check out the Developer overview docs and the ML frameworks docs.
Front-ends can ingest source programs from various machine-learning frameworks into the MLIR Linalg dialect. Boundaries are in flux, but it is a good enough mental model to think of anything up to Linalg as "front-end". For example, for ingesting PyTorch programs, the front-end is torch-mlir, and end-users are encouraged to use iree-turbine, which integrates IREE, torch-mlir and PyTorch.
This tutorial is only concerned with the Linalg dialect, and we are going to learn to hand-author some Linalg programs. The point of the above tangent about front-ends is to make it clear that no matter which way you feed a program into IREE, it will internally be rewritten into a Linalg program, because that really is the intermediate representation in this compiler.
IREE builds can be downloaded or installed as Python packages or built from sources.
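If you just want to follow along with the examples, installing the Python packages is the quickest path (package names as published on PyPI at the time of writing):

python -m pip install iree-compiler iree-runtime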
Before we start: there is also an official Linalg tutorial. It takes a different approach compared to the present tutorial, so the two are complementary.
Here is our first Linalg function. The scalar type used in this program, f32, is 32-bit floating-point.
Notice some elements of MLIR syntax:
The % prefix denotes a value, such as %result.
The @ prefix denotes a function symbol, such as @foo.
The ^ prefix denotes a block, such as ^bb0.
The # prefix denotes an attribute alias, such as #map_1d_identity.
The letter x is the delimiter in shapes, such as 10xf32.
Op names take the form dialect.name: tensor.empty is the op named empty in the tensor dialect, and func.func is the op named func in the func dialect.
// The 1D identity map, used below.\n#map_1d_identity = affine_map<(m) -> (m)>\n\n// Define a function @foo taking two tensor arguments `%lhs` and `%rhs` and returning a tensor.\nfunc.func @foo(\n %lhs : tensor<10xf32>,\n %rhs : tensor<10xf32>\n ) -> tensor<10xf32> {\n // A constant used below.\n %c0f32 = arith.constant 0.0 : f32\n // Create a result \"init value\". Think of it as an abstract \"allocation\",\n // creating a tensor but not giving its elements any particular value. It would be\n // undefined behavior to read any element from this tensor.\n %result_empty = tensor.empty() : tensor<10xf32>\n\n // Perform the computation. The following is all a single linalg.generic op.\n\n %result = linalg.generic {\n // This {...} section is the \"attributes\" - some compile-time settings for this op.\n indexing_maps=[\n // Indexing maps for the parameters listed in `ins(...)`\n #map_1d_identity,\n #map_1d_identity,\n // Indexing maps for the parameters listed in `outs(...)`\n #map_1d_identity\n ],\n // There is one tensor dimension, and it's a parallel-iteration dimension,\n // meaning that it occurs also as a result tensor dimension. The alternative\n // would be \"reduction\", for dimensions that do not occur in the result tensor.\n iterator_types=[\"parallel\"]\n } // End of the attributes for this linalg.generic. Next come the parameters:\n // `ins` is where we pass regular input-parameters\n ins(%lhs, %rhs : tensor<10xf32>, tensor<10xf32>)\n // `outs` is where we pass the \"outputs\", but that term has a subtle meaning\n // in linalg. Here we are passing a tensor.empty, meaning just a placeholder\n // for the output with no preexisting element values. In other examples with\n // an accumulator, this is where the accumulator would be passed.\n outs(%result_empty : tensor<10xf32>)\n // End of parameters. The next {...} part is the \"code block\".\n {\n // bb0 is a code block taking one scalar from each input tensor as argument, and\n // computing and \"yielding\" (ie returning) the corresponding output tensor element.\n ^bb0(%lhs_entry : f32, %rhs_entry : f32, %unused_result_entry : f32):\n %add = arith.addf %lhs_entry, %rhs_entry : f32\n linalg.yield %add : f32\n } // End of the basic block. Finally, we describe the return type.\n -> tensor<10xf32>\n\n // End of the linalg.generic op.\n\n // Return the function's return value.\n return %result : tensor<10xf32>\n}\n
Compile it like this:
iree-compile --iree-hal-target-backends=llvm-cpu prog.mlir -o /tmp/prog.vmfb\n
These are just minimalist iree-compile flags for running on CPU without trying to maximize performance:
--iree-hal-target-backends= selects the compilation target backend (here llvm-cpu); at runtime, the matching --device= flag selects the device to run on.
--iree-llvmcpu-target-triple= selects the CPU target triple.
--iree-llvmcpu-target-cpu= selects a specific CPU, e.g. --iree-llvmcpu-target-cpu=znver4, and --iree-llvmcpu-target-cpu-features= selects individual CPU features.
--iree-llvmcpu-target-cpu=host targets the CPU of the machine the compiler is running on.
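For example, a compile command that targets the host CPU simply adds these flags to the earlier invocation:

iree-compile --iree-hal-target-backends=llvm-cpu \
    --iree-llvmcpu-target-cpu=host \
    prog.mlir -o /tmp/prog.vmfb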
Run it like this:
$ iree-run-module --module=/tmp/prog.vmfb \\\n --input=10xf32=[0,1,2,3,4,5,6,7,8,9] \\\n --input=10xf32=[90,80,70,60,50,40,30,20,10,0]\n\nEXEC @foo\nresult[0]: hal.buffer_view\n10xf32=90 81 72 63 54 45 36 27 18 9\n
Here, each --input parameter specifies one input. First its shape and element type, 10xf32, then the example array elements in [...] brackets. The output of iree-run-module above shows the contents of the result.
While we are going to mostly focus on static shapes for simplicity in the rest of this tutorial, let us give one dynamic-shape example to at least show that that's not a problem. Here is the dynamic-shape equivalent of the previous example.
#map_1d_identity = affine_map<(m) -> (m)>\n\nfunc.func @foo(\n %lhs : tensor<?xf32>,\n %rhs : tensor<?xf32>\n ) -> tensor<?xf32> {\n %c0f32 = arith.constant 0.0 : f32\n %c0 = arith.constant 0 : index\n %size = tensor.dim %lhs, %c0 : tensor<?xf32>\n %result_empty = tensor.empty(%size) : tensor<?xf32>\n\n %result = linalg.generic {\n indexing_maps=[\n // Indexing maps for the parameters listed in `ins(...)`\n #map_1d_identity,\n #map_1d_identity,\n // Indexing maps for the parameters listed in `outs(...)`\n #map_1d_identity\n ],\n iterator_types=[\"parallel\"]\n } ins(%lhs, %rhs : tensor<?xf32>, tensor<?xf32>)\n outs(%result_empty : tensor<?xf32>)\n {\n ^bb0(%lhs_entry : f32, %rhs_entry : f32, %unused_result_entry : f32):\n %add = arith.addf %lhs_entry, %rhs_entry : f32\n linalg.yield %add : f32\n }\n -> tensor<?xf32>\n\n return %result : tensor<?xf32>\n}\n
This program can be compiled and run exactly like the previous one, except that now the iree-run-module command may specify inputs of arbitrary length. The only requirement is that both inputs have the same length; otherwise the linalg.generic will have undefined behavior.
$ iree-compile --iree-hal-target-backends=llvm-cpu prog.mlir -o /tmp/prog.vmfb\n$ iree-run-module --module=/tmp/prog.vmfb \\\n --input=10xf32=[0,1,2,3,4,5,6,7,8,9] \\\n --input=10xf32=[90,80,70,60,50,40,30,20,10,0]\n\nEXEC @foo\nresult[0]: hal.buffer_view\n10xf32=90 81 72 63 54 45 36 27 18 9\n
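Since the shapes are dynamic, the same compiled module also accepts inputs of a different length, for example (the result comment is what element-wise addition should produce):

iree-run-module --module=/tmp/prog.vmfb \
    --input=4xf32=[1,2,3,4] \
    --input=4xf32=[10,20,30,40]
# Expected result: 4xf32=11 22 33 44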
outs
Here is a more concise variant achieving the same result in fewer lines of code, giving us a first taste of what the outs(...) parameter list can do. We didn't want to show it first, because it's less idiomatic. outs will only become really necessary (and idiomatic) when we look at reduction iterators. In the previous examples, we had only passed a tensor.empty placeholder for outs. This new example shows that we can actually pass there any of the inputs that are shaped like the result.
#map_1d_identity = affine_map<(m) -> (m)>\n\nfunc.func @foo(\n %lhs : tensor<10xf32>,\n %rhs : tensor<10xf32>\n ) -> tensor<10xf32> {\n\n %result = linalg.generic {\n indexing_maps=[\n // Indexing maps for the parameters listed in `ins(...)`\n #map_1d_identity,\n // Indexing maps for the parameters listed in `outs(...)`\n #map_1d_identity\n ],\n iterator_types=[\"parallel\"]\n } ins(%lhs : tensor<10xf32>)\n outs(%rhs : tensor<10xf32>)\n {\n ^bb0(%lhs_entry : f32, %rhs_entry : f32):\n %add = arith.addf %lhs_entry, %rhs_entry : f32\n linalg.yield %add : f32\n }\n -> tensor<10xf32>\n\n return %result : tensor<10xf32>\n}\n
This function takes a 1D array of floats and returns their sum. tensor<f32> is a 0-dimensional tensor type. We could as well extract the single f32 element and return that, but we wanted to make this example as simple as possible.
What's subtle here is how the bb0 block in the linalg.generic now actively uses the %result_entry as an operand to arith.addf, yielding the result of this addition on every iteration. Implicitly, this stores the result of that addition to the destination, from where it is re-loaded on the next iteration again as %result_entry. So the SSA value %result_entry has a different value on each iteration.
Because the values from the outs parameter are now actually used, we can't directly pass the tensor.empty there, since its elements are uninitialized. We have to initialize the result entries as zeroes, which is achieved by the linalg.fill.
#map_1d_identity = affine_map<(m) -> (m)>\n#map_1d_proj_0d = affine_map<(m) -> ()>\n\nfunc.func @foo(\n %input : tensor<10xf32>) -> tensor<f32> {\n %result_empty = tensor.empty() : tensor<f32>\n %cst_0 = arith.constant 0.0 : f32\n %result_init = linalg.fill ins(%cst_0 : f32) outs(%result_empty : tensor<f32>) -> tensor<f32>\n %result = linalg.generic {\n indexing_maps=[\n // Indexing maps for the parameters listed in `ins(...)`\n #map_1d_identity,\n // Indexing maps for the parameters listed in `outs(...)`\n #map_1d_proj_0d\n ],\n iterator_types=[\"reduction\"]\n } ins(%input : tensor<10xf32>)\n outs(%result_init : tensor<f32>)\n {\n ^bb0(%input_entry : f32, %result_entry : f32):\n %add = arith.addf %input_entry, %result_entry : f32\n linalg.yield %add : f32\n }\n -> tensor<f32>\n\n return %result : tensor<f32>\n}\n
$ iree-compile --iree-hal-target-backends=llvm-cpu prog.mlir -o /tmp/prog.vmfb\n$ iree-run-module --module=/tmp/prog.vmfb --input=10xf32=[0,1,2,3,4,5,6,7,8,9]\n\nEXEC @foo\nresult[0]: hal.buffer_view\nf32=45\n
parallel
This is our first 2D example, so for the first time we have to start explaining how the iterator_types are enumerated, and we start seeing some more interesting examples of affine_map.
#map_2d_identity = affine_map<(m, n) -> (m, n)>\n#map_2d_proj_first = affine_map<(m, n) -> (m)>\n\nfunc.func @foo(\n %input : tensor<3x5xf32>) -> tensor<3xf32> {\n %result_empty = tensor.empty() : tensor<3xf32>\n %cst_0 = arith.constant 0.0 : f32\n %result_init = linalg.fill ins(%cst_0 : f32) outs(%result_empty : tensor<3xf32>) -> tensor<3xf32>\n %result = linalg.generic {\n indexing_maps=[\n // Indexing maps for the parameters listed in `ins(...)`\n #map_2d_identity,\n // Indexing maps for the parameters listed in `outs(...)`\n #map_2d_proj_first\n ],\n iterator_types=[\n // Rule: the i-th iterator_type corresponds to the i-th coordinate in the\n // source space of the affine_maps defined above, (m, n). So:\n \"parallel\", // This refers to the `m` coordinate in the affine-maps.\n // This is the coordinate that is preserved in the result,\n // see the map_2d_proj_first map given above.\n \"reduction\" // This refers to the `n` coordinate in the affine-maps.\n // This is the coordinate that is dropped by the map_2d_proj_first\n // given above and thus not present in the 1D result.\n ]\n } ins(%input : tensor<3x5xf32>)\n outs(%result_init : tensor<3xf32>)\n {\n ^bb0(%input_entry : f32, %result_entry : f32):\n %add = arith.addf %input_entry, %result_entry : f32\n linalg.yield %add : f32\n }\n -> tensor<3xf32>\n\n return %result : tensor<3xf32>\n}\n
$ iree-compile --iree-hal-target-backends=llvm-cpu prog.mlir -o /tmp/prog.vmfb\n$ iree-run-module --module=/tmp/prog.vmfb \\\n --input=3x5xf32=[[0,1,2,3,4],[5,6,7,8,9],[10,11,12,13,14]]\n\nEXEC @foo\nresult[0]: hal.buffer_view\n3xf32=10 35 60\n
linalg.matmul
We are now ready to see how to express matrix multiplication as a linalg.generic. But actually, rather than just writing that by hand, we are going to let Linalg do it for us. Indeed, in addition to linalg.generic, Linalg contains a number of "named ops", which are essentially just shorthand notation for special cases of linalg.generic. One of them is linalg.matmul, which does matrix multiplication accumulating into an existing accumulator. Here is a simple function performing a matrix-multiplication-with-accumulation using linalg.matmul. Also in this example, we use dynamic shapes (the ? in the shapes; see the above section where we encountered that), but we could just as well use static shapes.
func.func @foo(%lhs: tensor<?x?xf32>, %rhs: tensor<?x?xf32>, %acc: tensor<?x?xf32>) -> tensor<?x?xf32> {\n %result = linalg.matmul\n ins(%lhs, %rhs: tensor<?x?xf32>, tensor<?x?xf32>)\n outs(%acc: tensor<?x?xf32>)\n -> tensor<?x?xf32>\n return %result: tensor<?x?xf32>\n}\n
$ iree-compile --iree-hal-target-backends=llvm-cpu prog.mlir -o /tmp/prog.vmfb\n$ iree-run-module --module=/tmp/prog.vmfb \\\n --input=2x2xf32=[[1,2][3,4]] \\\n --input=2x2xf32=[[1,4][3,2]] \\\n --input=2x2xf32=[[0,0][0,0]]\n\nEXEC @matmul_dynamic\nresult[0]: hal.buffer_view\n2x2xf32=[7 8][15 20]\n
Now we encounter another IREE tool: iree-opt. Unlike iree-compile, which compiles an MLIR program all the way down to a .vmfb that's ready to run on the target device, iree-opt only applies selected transformations.
We run:
iree-opt --linalg-generalize-named-ops prog.mlir\n
And that prints:
#map = affine_map<(d0, d1, d2) -> (d0, d2)>\n#map1 = affine_map<(d0, d1, d2) -> (d2, d1)>\n#map2 = affine_map<(d0, d1, d2) -> (d0, d1)>\nmodule {\n func.func @foo(%arg0: tensor<?x?xf32>, %arg1: tensor<?x?xf32>, %arg2: tensor<?x?xf32>) -> tensor<?x?xf32> {\n %0 = linalg.generic {indexing_maps = [#map, #map1, #map2], iterator_types = [\"parallel\", \"parallel\", \"reduction\"]} ins(%arg0, %arg1 : tensor<?x?xf32>, tensor<?x?xf32>) outs(%arg2 : tensor<?x?xf32>) {\n ^bb0(%in: f32, %in_0: f32, %out: f32):\n %1 = arith.mulf %in, %in_0 : f32\n %2 = arith.addf %out, %1 : f32\n linalg.yield %2 : f32\n } -> tensor<?x?xf32>\n return %0 : tensor<?x?xf32>\n }\n}\n
So that's the linalg.generic implementing matrix multiplication, equivalent to the above linalg.matmul form. We can compile and run it like the above program and it will produce exactly the same result.
Here the 3 listed iterator_types, [\"parallel\", \"parallel\", \"reduction\"], correspond to the 3 listed coordinates in the affine_map's, (d0, d1, d2). So, d0 and d1 are parallel dimensions and d2 is the reduction dimension. That's why the first two affine_map's results involve d2 (they are respectively for the LHS %arg0 and RHS %arg1) and the last affine_map's result only involves the parallel d0 and d1, as it refers to the result matrix.
[\"parallel\", \"parallel\", \"reduction\"]
(d0, d1, d2)
d0
d1
d2
%arg0
%arg1
Some current IREE compiler optimizations only trigger on named ops like linalg.matmul, not on the equivalent linalg.generic form. Think of that as a non-essential current limitation; the intent is to overcome it over time, but in the near term do use linalg.matmul when performance matters.
MLIR defines integer types for absolutely any bit-width, including non-power-of-two bit-widths, and in three signedness flavors:
si for signed integer types,
ui for unsigned integer types,
i for sign-less integer types.
So for instance, si16 is the 16-bit signed integer type, ui24 is the 24-bit unsigned integer type, and i8 is the sign-less 8-bit integer type.
Now here is a very important principle of how the MLIR dialects that are relevant to us in IREE operate:
Only use sign-less types. Always encode signedness in operations, not in types.
For example, here is how we perform a matrix multiplication where the LHS is signed 8-bit integers, the RHS is unsigned 8-bit integers, and the accumulator is signed 32-bit integers. Notice how the fact that LHS is signed and the RHS is unsigned is encoded only in the implementation of the linalg.generic basic block, where the LHS and RHS entries are extended, respectively as signed (arith.extsi) and unsigned (arith.extui):
#map = affine_map<(d0, d1, d2) -> (d0, d2)>\n#map1 = affine_map<(d0, d1, d2) -> (d2, d1)>\n#map2 = affine_map<(d0, d1, d2) -> (d0, d1)>\nmodule {\n func.func @foo(%lhs: tensor<?x?xi8>, %rhs: tensor<?x?xi8>, %acc: tensor<?x?xi32>) -> tensor<?x?xi32> {\n %result = linalg.generic\n {indexing_maps = [#map, #map1, #map2],\n iterator_types = [\"parallel\", \"parallel\", \"reduction\"]}\n ins(%lhs, %rhs : tensor<?x?xi8>, tensor<?x?xi8>)\n outs(%acc : tensor<?x?xi32>) {\n ^bb0(%lhs_entry: i8, %rhs_entry: i8, %acc_entry: i32):\n %lhs_extended = arith.extsi %lhs_entry : i8 to i32\n %rhs_extended = arith.extui %rhs_entry : i8 to i32\n %mul = arith.muli %lhs_extended, %rhs_extended : i32\n %add = arith.addi %acc_entry, %mul : i32\n linalg.yield %add : i32\n } -> tensor<?x?xi32>\n return %result : tensor<?x?xi32>\n }\n}\n
$ iree-compile --iree-hal-target-backends=llvm-cpu prog.mlir -o /tmp/prog.vmfb\n$ iree-run-module --module=/tmp/prog.vmfb \\\n --input=2x2xi8=[[-1,-2][-3,-4]] \\\n --input=2x2xi8=[[1,4][3,2]] \\\n --input=2x2xi32=[[0,0][0,0]]\n\nEXEC @foo\nresult[0]: hal.buffer_view\n2x2xi32=[-7 -8][-15 -20]\n
A current runtime limitation, https://github.com/iree-org/iree/issues/16241, prevents passing sub-byte-bit-width integers on the iree-run-module command line.
Source file: matmul.mlir:
func.func @matmul_dynamic(%lhs: tensor<?x?xf32>, %rhs: tensor<?x?xf32>, %acc: tensor<?x?xf32>) -> tensor<?x?xf32> {\n %result = linalg.matmul ins(%lhs, %rhs: tensor<?x?xf32>, tensor<?x?xf32>) outs(%acc: tensor<?x?xf32>) -> tensor<?x?xf32>\n return %result: tensor<?x?xf32>\n}\n
Basic compilation command line:
$ iree-compile matmul.mlir -o /tmp/matmul.vmfb \\\n --iree-hal-target-backends=llvm-cpu \\\n --iree-llvmcpu-target-cpu=znver4 \\\n --iree-llvmcpu-enable-ukernels=all\n
This creates an IREE bytecode module:
$ ls -l /tmp/matmul.vmfb\n\n-rw-rw-r-- 1 2884 Jan 22 10:37 /tmp/matmul.vmfb\n
The above .vmfb is the only thing that's needed to run this matmul on the target device. But to understand microkernels, we are now going to generate additional intermediate files.
Additional iree-compile flags to save intermediate files (IR, assembly, object code):
--iree-hal-dump-executable-intermediates-to=/tmp/matmul --x86-asm-syntax=intel\n
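Appended to the basic compilation command above, the full invocation becomes:

iree-compile matmul.mlir -o /tmp/matmul.vmfb \
    --iree-hal-target-backends=llvm-cpu \
    --iree-llvmcpu-target-cpu=znver4 \
    --iree-llvmcpu-enable-ukernels=all \
    --iree-hal-dump-executable-intermediates-to=/tmp/matmul \
    --x86-asm-syntax=intel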
This saves LLVM IR in binary serialization ("bitcode", filename extension .bc). To read it, we need to "disassemble" it using llvm-dis to obtain textual IR (filename extension .ll).
llvm-dis /tmp/matmul/*.bc\n
Intermediate files:
35196 /tmp/matmul/module_matmul_linked_llvm_cpu_embedded_elf_x86_64.codegen.bc\n 251597 /tmp/matmul/module_matmul_linked_llvm_cpu_embedded_elf_x86_64.codegen.ll\n 181740 /tmp/matmul/module_matmul_linked_llvm_cpu_embedded_elf_x86_64.linked.bc\n1396190 /tmp/matmul/module_matmul_linked_llvm_cpu_embedded_elf_x86_64.linked.ll\n 32096 /tmp/matmul/module_matmul_linked_llvm_cpu_embedded_elf_x86_64.o\n 34504 /tmp/matmul/module_matmul_linked_llvm_cpu_embedded_elf_x86_64.optimized.bc\n 184981 /tmp/matmul/module_matmul_linked_llvm_cpu_embedded_elf_x86_64.optimized.ll\n 82016 /tmp/matmul/module_matmul_linked_llvm_cpu_embedded_elf_x86_64.s\n
Another important iree-compile flag: --mlir-print-ir-after-all records the IR after each pass. We save that (stderr) output to a file, ir.log, by appending to the iree-compile command line:
--mlir-print-ir-after-all 2>/tmp/matmul/ir.log\n
This graph shows the transformations from the source matmul.mlir to the final matmul.vmfb with the various intermediates met in the previous section:
graph TD;\nmatmulontensors-- CPUMaterializeEncoding -->mmt4dontensors;\nmmt4dontensors-- CPULowerToUKernels -->ukernelontensors;\nukernelontensors-- IREEComprehensiveBufferize -->ukernelonmemref;\nukernelonmemref-- LowerUKernelOpsToCalls -->ukernelcall;\nukernelcall-- ConvertToLLVM -->codegenll;\ncodegenll-->bitcodelinking;\ngenericsource-- clang -emit-llvm --> genericbitcode -- llvm-link --> ukernelbitcode;\narchsource -- clang -emit-llvm --> archbitcode -- llvm-link --> ukernelbitcode;\nukernelbitcode-->ukernelbitcodeembedded;\nukernelbitcodeembedded-->bitcodelinking;\nbitcodelinking-->linkedll;\nlinkedll -- IR optimization --> optimizedll;\noptimizedll -- LLVM x86 backend --> asm -- LLVM assembler --> object -- iree-compile output --> vmfb;\nmatmulontensors[\"linalg.matmul on tensors\"];\nmmt4dontensors[\"linalg.mmt4d on tensors\"];\nukernelontensors[\"ukernel.generic on tensors\"];\nukernelonmemref[\"ukernel.generic on memrefs\"];\nukernelcall[\"call to ukernel entry point\"];\ncodegenll[\"module_matmul_...codegen.ll\"];\nlinkedll[\"module_matmul_...linked.ll\"];\noptimizedll[\"module_matmul_...optimized.ll\"];\ngenericsource[\"generic source code\nmmt4d.c\"]\narchsource[\"architecture-specific source code\nmmt4d_x86_64_avx512_base.c\"]\ngenericbitcode[\"generic code as bitcode\nukernel_bitcode_generic_x86_64.bc\"]\narchbitcode[\"architecture-specific code as bitcode\nukernel_bitcode_arch_x86_64_avx512_base.bc\"]\nukernelbitcode[\"linked bitcode\nukernel_bitcode_x86_64.bc\"];\nukernelbitcodeembedded[\"microkernel bitcode embedded as\nstatic data in iree-compile\"];\nbitcodelinking[\"llvm::Linker::LinkInModule\"];\nasm[\"x86 asm, module_matmul_...s\"];\nobject[\"x86 ELF, module_matmul_...o\"];\nvmfb[\"matmul.vmfb\"];\n\nsubgraph Part1[\"Part 1: MLIR code generation\"]\n matmulontensors\n mmt4dontensors\n ukernelontensors\n ukernelonmemref\n ukernelcall\n codegenll\nend\n\nsubgraph Part2[\"Part 2: Microkernels compilation (part of the IREE build)\"]\n genericsource\n archsource\n genericbitcode\n archbitcode\n ukernelbitcode\n ukernelbitcodeembedded\nend\n\nsubgraph Part3[\"Part 3: Linking with microkernels, optimizing, producing object code\"]\n bitcodelinking\n linkedll\n optimizedll\n asm\n object\n vmfb\nend\n\nstyle Part1 stroke:#FDD835,stroke-width:2px\nstyle Part2 stroke:#039BE5,stroke-width:2px\nstyle Part3 stroke:#43A047,stroke-width:2px
Some initial boilerplate happens around our linalg.matmul before anything interesting happens to it:
➤ Appendix: IR dump after WrapEntryPointsPass
Next, the first interesting thing is the CPUMaterializeEncoding pass, where the linalg.matmul gets rewritten into a linalg.mmt4d which is a matmul with a tiled data layout. This is where we start specializing to the target ISA feature set, AVX-512, favoring a 16x16 tile size for this float32 matmul.
➤ Appendix: IR Dump After CPUMaterializeEncoding
The idea is that linalg.mmt4d is what we will have a microkernel for, below. There is no need to have microkernels for anything but the target-optimal tiled layout, so we don't bother carrying a microkernel for linalg.matmul itself. The matrix layout transformation, bringing matrix data into this tiled layout, is also out of the scope of this linalg.mmt4d and hence of the mmt4d microkernel: we can rely on generic code-generation to take care of these byte-permutations, which is our preference as we aim to let that fuse into producers/consumers.
Next comes the rewrite of linalg.mmt4d into a microkernel op, done by the CPULowerToUKernels pass. Here is the TableGen definition of the generic microkernel op we're going to generate:
TableGen definition of ukernel.generic
C++ compiler code for CPULowerToUKernels
➤ Appendix: IR Dump After CPULowerToUKernels
Notice that this IR is still working on tensor values, not on memref values.
Next, bufferization takes place: tensor values become memrefs.
➤ Appendix: IR Dump After IREEComprehensiveBufferize
Next, the LowerUKernelOpsToCalls runs, rewriting ukernel.generic ops into function calls.
➤ Appendix: IR Dump After LowerUKernelOpsToCalls
Finally, this gets lowered to the MLIR LLVM dialect, in preparation for outputting plain LLVM IR.
➤ Appendix: IR Dump After ConvertToLLVM
The above gets converted to plain LLVM IR and that's our first intermediate file, module_matmul_linked_llvm_cpu_embedded_elf_x86_64.codegen.bc, which llvm-dis helps disassemble into a textual IR file (.ll).
➤ Appendix: Intermediate file: ...codegen.bc, disassembled to ...codegen.ll
The above IR references an external symbol iree_uk_mmt4d for the microkernel that it calls, so it now needs to be linked against the ukernels bitcode.
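One quick way to see this reference, assuming the intermediate files generated earlier, is to grep the disassembled IR for the symbol:

# The codegen module should contain a declaration of and call to the
# microkernel entry point, but not its definition.
grep iree_uk_mmt4d /tmp/matmul/module_matmul_linked_llvm_cpu_embedded_elf_x86_64.codegen.ll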
Microkernels are:
Plain C functions compiled to LLVM bitcode with clang -emit-llvm.
Built freestanding (-ffreestanding), without depending on a C standard library or runtime.
Restricted in what they may #include.
C source code for the iree_uk_mmt4d microkernel entry point
This calls an architecture-specific function that returns a function pointer to the optimized inner-loop implementation to use for the given data types and SIMD ISA features, and then uses that in a generic outer-loop implementation.
So the really interesting part is the implementation of the inner-loop function that we got a function pointer to. For example, here is the one used in our example where the element type is f32 and the target has AVX-512.
A custom CMake function, iree_bitcode_library, wraps clang to compile these C source files with special flags to obtain freestanding bitcode.
Likewise, a custom CMake function, iree_link_bitcode, wraps llvm-link to link bitcode files.
These are used during the IREE compiler build (as a dependency of iree-compile) to build microkernels as bitcode for all supported target architectures, generating one bitcode file for each architecture in the build directory:
~/iree-build$ ls ./runtime/src/iree/builtins/ukernel/ukernel_bitcode_*.bc | grep -v generic
./runtime/src/iree/builtins/ukernel/ukernel_bitcode_arm_32.bc
./runtime/src/iree/builtins/ukernel/ukernel_bitcode_arm_64.bc
./runtime/src/iree/builtins/ukernel/ukernel_bitcode_riscv_32.bc
./runtime/src/iree/builtins/ukernel/ukernel_bitcode_riscv_64.bc
./runtime/src/iree/builtins/ukernel/ukernel_bitcode_x86_64.bc
These files are then embedded as static data within iree-compile, so that iree-compile stays self-contained.
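Concretely, "embedded as static data" means that each .bc file ends up as a byte array baked into the iree-compile binary, roughly along these lines (a hypothetical illustration; IREE generates such tables with its own build rules):

// Hypothetical embedding of a bitcode file as static data; the array holds the
// raw bytes of ukernel_bitcode_x86_64.bc (starting with the LLVM 'BC' magic).
static const unsigned char ukernel_bitcode_x86_64[] = {
    0x42, 0x43, 0xc0, 0xde, /* ... remaining bytes of the .bc file ... */
};
static const unsigned int ukernel_bitcode_x86_64_size =
    sizeof(ukernel_bitcode_x86_64);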
Here are some samples of ukernel bitcode if you are curious what it looks like:
➤ Appendix: embedded microkernel bitcode: iree_uk_mmt4d ukernel entry point
➤ Appendix: embedded microkernel bitcode: inner-loop tile function
The previous two sections covered, respectively, the compilation of the MLIR module and the compilation of microkernels, producing two separate bitcode modules. Now we turn to how these two bitcode modules are linked together.
After code generation, iree-compile loads microkernel bitcode: https://github.com/iree-org/iree/blob/c437add6a3b1e3e873cec95505d37c4938fee74f/compiler/src/iree/compiler/Dialect/HAL/Target/LLVMCPU/LLVMCPUTarget.cpp#L490
It is worth zooming into that loadUKernelBitcode function because, in addition to loading the bitcode, it does one important thing: it adds the alwaysinline attribute to every function. As we will see just below, always inlining microkernels is key to achieving perfect results, with no downsides compared to a pure code-generation approach. https://github.com/iree-org/iree/blob/c437add6a3b1e3e873cec95505d37c4938fee74f/compiler/src/iree/compiler/Dialect/HAL/Target/LLVMCPU/Builtins/UKernel.cpp#L36-L62
The loaded bitcode is then linked into the current module: https://github.com/iree-org/iree/blob/c437add6a3b1e3e873cec95505d37c4938fee74f/compiler/src/iree/compiler/Dialect/HAL/Target/LLVMCPU/LLVMCPUTarget.cpp#L499
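For a feel of what these two steps amount to, here is a rough sketch using the LLVM-C API; IREE's actual implementation is C++, in the files linked above, and handles more details:

// Sketch only: mark every defined function in the ukernel module alwaysinline,
// then link that module into the codegen module. Error handling omitted.
#include <llvm-c/Core.h>
#include <llvm-c/Linker.h>
#include <string.h>

int link_ukernels_into_codegen(LLVMContextRef ctx, LLVMModuleRef codegen_module,
                               LLVMModuleRef ukernel_module) {
  unsigned kind =
      LLVMGetEnumAttributeKindForName("alwaysinline", strlen("alwaysinline"));
  LLVMAttributeRef always_inline = LLVMCreateEnumAttribute(ctx, kind, 0);
  for (LLVMValueRef fn = LLVMGetFirstFunction(ukernel_module); fn;
       fn = LLVMGetNextFunction(fn)) {
    if (!LLVMIsDeclaration(fn)) {
      LLVMAddAttributeAtIndex(fn, LLVMAttributeFunctionIndex, always_inline);
    }
  }
  // LLVMLinkModules2 consumes the source module and returns nonzero on error.
  return LLVMLinkModules2(codegen_module, ukernel_module);
}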
The linked IR so far is not very interesting, as it is still essentially just the concatenation of the above-discussed codegen and microkernel bitcode (except now with alwaysinline attributes). If you are curious, it is dumped as the ...linked.bc file.
Where it gets interesting is that immediately after that, we run LLVM IR optimization passes, which can be thought of as a form of link-time optimization (LTO): https://github.com/iree-org/iree/blob/c437add6a3b1e3e873cec95505d37c4938fee74f/compiler/src/iree/compiler/Dialect/HAL/Target/LLVMCPU/LLVMCPUTarget.cpp#L527
At this point, all the microkernel code gets inlined into the dispatch function, the correct AVX-512-optimized tile function is selected and inlined, and everything else is DCE'd. That's how the user pays no cost for what they don't use: not only for the microkernel entry points that they don't call, but also for all the unused code paths within each microkernel.
➤ Appendix: Intermediate file: ...optimized.bc, disassembled to ...optimized.ll
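A toy example (not IREE code) shows why the alwaysinline attribute set earlier makes this work: once everything is inlined into a context where the dispatch condition is a compile-time constant, LLVM folds the branch and dead-code-eliminates every tile function that is never reached.

// Toy illustration of the inline-then-DCE effect.
__attribute__((always_inline)) static inline void tile_generic(float *acc) {
  (void)acc;  // placeholder generic tile implementation
}
__attribute__((always_inline)) static inline void tile_avx512(float *acc) {
  (void)acc;  // placeholder AVX-512 tile implementation
}
__attribute__((always_inline)) static inline void mmt4d_toy(float *acc,
                                                            int has_avx512) {
  if (has_avx512) tile_avx512(acc); else tile_generic(acc);
}

void dispatch_toy(float *acc) {
  // With the constant visible after inlining, the branch folds away and
  // tile_generic() becomes unreferenced, so it is removed from the module.
  mmt4d_toy(acc, /*has_avx512=*/1);
}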
This then goes to the LLVM x86 backend, which produces x86 assembly.
➤ Appendix: x86 assembly
// -----// IR Dump After mlir::iree_compiler::IREE::ABI::WrapEntryPointsPass (iree-abi-wrap-entry-points) //----- //\n[...]\n// -----// IR Dump After Inliner (inline) //----- //\n#executable_target_embedded_elf_x86_64_ = #hal.executable.target<\"llvm-cpu\", \"embedded-elf-x86_64\", {cpu = \"znver4\", cpu_features = \"+mmx,+popcnt,+sse,+sse2,+sse3,+ssse3,+sse4.1,+sse4.2,+avx,+avx2,+sse4a,+fma,+avx512f,+bmi,+bmi2,+aes,+pclmul,+avx512vl,+avx512bw,+avx512dq,+avx512cd,+avx512vbmi,+avx512ifma,+avx512vpopcntdq,+avx512vbmi2,+gfni,+vpclmulqdq,+avx512vnni,+avx512bitalg,+avx512bf16,+adx,+clflushopt,+clwb,+clzero,+cx16,+cx8,+crc32,+f16c,+fsgsbase,+fxsr,+invpcid,+lzcnt,+movbe,+mwaitx,+pku,+prfchw,+rdpid,+rdpru,+rdrnd,+rdseed,+sahf,+sha,+shstk,+vaes,+wbnoinvd,+x87,+xsave,+xsavec,+xsaveopt,+xsaves,+evex512\", data_layout = \"e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-i128:128-f80:128-n8:16:32:64-S128\", native_vector_size = 64 : index, target_triple = \"x86_64-unknown-unknown-eabi-elf\", ukernels = \"all\"}>\n#device_target_llvm_cpu = #hal.device.target<\"llvm-cpu\", {executable_targets = [#executable_target_embedded_elf_x86_64_]}>\nmodule attributes {hal.device.targets = [#device_target_llvm_cpu]} {\n func.func @matmul_dynamic(%arg0: !hal.buffer_view, %arg1: !hal.buffer_view, %arg2: !hal.buffer_view) -> !hal.buffer_view attributes {iree.abi.stub, iree.reflection = {iree.abi.declaration = \"sync func @matmul_dynamic(%input0: tensor<?x?xf32>, %input1: tensor<?x?xf32>, %input2: tensor<?x?xf32>) -> (%output0: tensor<?x?xf32>)\"}} {\n %0 = hal.buffer_view.dim<%arg0 : !hal.buffer_view>[0] : index\n %1 = hal.buffer_view.dim<%arg0 : !hal.buffer_view>[1] : index\n %2 = hal.tensor.import %arg0 \"input0\" : !hal.buffer_view -> tensor<?x?xf32>{%0, %1}\n %3 = hal.buffer_view.dim<%arg1 : !hal.buffer_view>[0] : index\n %4 = hal.buffer_view.dim<%arg1 : !hal.buffer_view>[1] : index\n %5 = hal.tensor.import %arg1 \"input1\" : !hal.buffer_view -> tensor<?x?xf32>{%3, %4}\n %6 = hal.buffer_view.dim<%arg2 : !hal.buffer_view>[0] : index\n %7 = hal.buffer_view.dim<%arg2 : !hal.buffer_view>[1] : index\n %8 = hal.tensor.import %arg2 \"input2\" : !hal.buffer_view -> tensor<?x?xf32>{%6, %7}\n %9 = linalg.matmul ins(%2, %5 : tensor<?x?xf32>, tensor<?x?xf32>) outs(%8 : tensor<?x?xf32>) -> tensor<?x?xf32>\n %10 = hal.tensor.export %9 \"output0\" : tensor<?x?xf32>{%6, %7} -> !hal.buffer_view\n return %10 : !hal.buffer_view\n }\n}\n
// -----// IR Dump After CPUMaterializeEncoding (iree-codegen-cpu-materialize-encoding) //----- //\n[...]\n// -----// IR Dump After Canonicalizer (canonicalize) //----- //\n[...]\n// -----// IR Dump After CSE (cse) //----- //\n#executable_target_embedded_elf_x86_64_ = #hal.executable.target<\"llvm-cpu\", \"embedded-elf-x86_64\", {cpu = \"znver4\", cpu_features = \"+mmx,+popcnt,+sse,+sse2,+sse3,+ssse3,+sse4.1,+sse4.2,+avx,+avx2,+sse4a,+fma,+avx512f,+bmi,+bmi2,+aes,+pclmul,+avx512vl,+avx512bw,+avx512dq,+avx512cd,+avx512vbmi,+avx512ifma,+avx512vpopcntdq,+avx512vbmi2,+gfni,+vpclmulqdq,+avx512vnni,+avx512bitalg,+avx512bf16,+adx,+clflushopt,+clwb,+clzero,+cx16,+cx8,+crc32,+f16c,+fsgsbase,+fxsr,+invpcid,+lzcnt,+movbe,+mwaitx,+pku,+prfchw,+rdpid,+rdpru,+rdrnd,+rdseed,+sahf,+sha,+shstk,+vaes,+wbnoinvd,+x87,+xsave,+xsavec,+xsaveopt,+xsaves,+evex512\", data_layout = \"e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-i128:128-f80:128-n8:16:32:64-S128\", native_vector_size = 64 : index, target_triple = \"x86_64-unknown-unknown-eabi-elf\", ukernels = \"all\"}>\n#map = affine_map<()[s0] -> (s0 ceildiv 16)>\n#device_target_llvm_cpu = #hal.device.target<\"llvm-cpu\", {executable_targets = [#executable_target_embedded_elf_x86_64_]}>\nmodule attributes {hal.device.targets = [#device_target_llvm_cpu]} {\n func.func @matmul_dynamic(%arg0: !hal.buffer_view, %arg1: !hal.buffer_view, %arg2: !hal.buffer_view) -> !hal.buffer_view attributes {iree.abi.stub, iree.reflection = {iree.abi.declaration = \"sync func @matmul_dynamic(%input0: tensor<?x?xf32>, %input1: tensor<?x?xf32>, %input2: tensor<?x?xf32>) -> (%output0: tensor<?x?xf32>)\"}} {\n %cst = arith.constant 0.000000e+00 : f32\n %0 = hal.buffer_view.dim<%arg0 : !hal.buffer_view>[0] : index\n %1 = hal.buffer_view.dim<%arg0 : !hal.buffer_view>[1] : index\n %2 = hal.tensor.import %arg0 \"input0\" : !hal.buffer_view -> tensor<?x?xf32>{%0, %1}\n %3 = hal.buffer_view.dim<%arg1 : !hal.buffer_view>[0] : index\n %4 = hal.buffer_view.dim<%arg1 : !hal.buffer_view>[1] : index\n %5 = hal.tensor.import %arg1 \"input1\" : !hal.buffer_view -> tensor<?x?xf32>{%3, %4}\n %6 = hal.buffer_view.dim<%arg2 : !hal.buffer_view>[0] : index\n %7 = hal.buffer_view.dim<%arg2 : !hal.buffer_view>[1] : index\n %8 = hal.tensor.import %arg2 \"input2\" : !hal.buffer_view -> tensor<?x?xf32>{%6, %7}\n %9 = affine.apply #map()[%0]\n %10 = tensor.empty(%9, %1) : tensor<?x?x16x1xf32>\n %pack = tensor.pack %2 padding_value(%cst : f32) outer_dims_perm = [0, 1] inner_dims_pos = [0, 1] inner_tiles = [16, 1] into %10 : tensor<?x?xf32> -> tensor<?x?x16x1xf32>\n %11 = affine.apply #map()[%4]\n %12 = tensor.empty(%11, %3) : tensor<?x?x16x1xf32>\n %pack_0 = tensor.pack %5 padding_value(%cst : f32) outer_dims_perm = [1, 0] inner_dims_pos = [1, 0] inner_tiles = [16, 1] into %12 : tensor<?x?xf32> -> tensor<?x?x16x1xf32>\n %13 = affine.apply #map()[%6]\n %14 = affine.apply #map()[%7]\n %15 = tensor.empty(%13, %14) : tensor<?x?x16x16xf32>\n %pack_1 = tensor.pack %8 padding_value(%cst : f32) outer_dims_perm = [0, 1] inner_dims_pos = [0, 1] inner_tiles = [16, 16] into %15 : tensor<?x?xf32> -> tensor<?x?x16x16xf32>\n %16 = linalg.mmt4d ins(%pack, %pack_0 : tensor<?x?x16x1xf32>, tensor<?x?x16x1xf32>) outs(%pack_1 : tensor<?x?x16x16xf32>) -> tensor<?x?x16x16xf32>\n %17 = tensor.empty(%6, %7) : tensor<?x?xf32>\n %unpack = tensor.unpack %16 outer_dims_perm = [0, 1] inner_dims_pos = [0, 1] inner_tiles = [16, 16] into %17 : tensor<?x?x16x16xf32> -> tensor<?x?xf32>\n %18 = hal.tensor.export %unpack \"output0\" : 
tensor<?x?xf32>{%6, %7} -> !hal.buffer_view\n return %18 : !hal.buffer_view\n }\n}\n
// -----// IR Dump After CPULowerToUKernels (iree-codegen-cpu-lower-to-ukernels) //----- //\nmodule {\n func.func @matmul_dynamic_dispatch_3_mmt4d_DxDxDx16x16x1_f32() {\n %c1281_i32 = arith.constant 1281 : i32\n %c1_i32 = arith.constant 1 : i32\n %c16_i32 = arith.constant 16 : i32\n %c1 = arith.constant 1 : index\n %c0 = arith.constant 0 : index\n %c32_i64 = arith.constant 32 : i64\n %0 = hal.interface.constant.load[0] : i32\n %1 = hal.interface.constant.load[1] : i32\n %2 = hal.interface.constant.load[2] : i32\n %3 = hal.interface.constant.load[3] : i32\n %4 = hal.interface.constant.load[4] : i32\n %5 = hal.interface.constant.load[5] : i32\n %6 = hal.interface.constant.load[6] : i32\n %7 = hal.interface.constant.load[7] : i32\n %8 = hal.interface.constant.load[8] : i32\n %9 = hal.interface.constant.load[9] : i32\n %10 = hal.interface.constant.load[10] : i32\n %11 = hal.interface.constant.load[11] : i32\n %12 = hal.interface.constant.load[12] : i32\n %13 = hal.interface.constant.load[13] : i32\n %14 = hal.interface.constant.load[14] : i32\n %15 = hal.interface.constant.load[15] : i32\n %16 = arith.extui %0 : i32 to i64\n %17 = arith.extui %1 : i32 to i64\n %18 = arith.shli %17, %c32_i64 : i64\n %19 = arith.ori %16, %18 : i64\n %20 = arith.index_castui %19 : i64 to index\n %21 = arith.extui %2 : i32 to i64\n %22 = arith.extui %3 : i32 to i64\n %23 = arith.shli %22, %c32_i64 : i64\n %24 = arith.ori %21, %23 : i64\n %25 = arith.index_castui %24 : i64 to index\n %26 = arith.extui %4 : i32 to i64\n %27 = arith.extui %5 : i32 to i64\n %28 = arith.shli %27, %c32_i64 : i64\n %29 = arith.ori %26, %28 : i64\n %30 = arith.index_castui %29 : i64 to index\n %31 = arith.extui %6 : i32 to i64\n %32 = arith.extui %7 : i32 to i64\n %33 = arith.shli %32, %c32_i64 : i64\n %34 = arith.ori %31, %33 : i64\n %35 = arith.index_castui %34 : i64 to index\n %36 = arith.extui %8 : i32 to i64\n %37 = arith.extui %9 : i32 to i64\n %38 = arith.shli %37, %c32_i64 : i64\n %39 = arith.ori %36, %38 : i64\n %40 = arith.index_castui %39 : i64 to index\n %41 = arith.extui %10 : i32 to i64\n %42 = arith.extui %11 : i32 to i64\n %43 = arith.shli %42, %c32_i64 : i64\n %44 = arith.ori %41, %43 : i64\n %45 = arith.index_castui %44 : i64 to index\n %46 = arith.extui %12 : i32 to i64\n %47 = arith.extui %13 : i32 to i64\n %48 = arith.shli %47, %c32_i64 : i64\n %49 = arith.ori %46, %48 : i64\n %50 = arith.index_castui %49 : i64 to index\n %51 = arith.extui %14 : i32 to i64\n %52 = arith.extui %15 : i32 to i64\n %53 = arith.shli %52, %c32_i64 : i64\n %54 = arith.ori %51, %53 : i64\n %55 = arith.index_castui %54 : i64 to index\n %56 = hal.interface.binding.subspan set(0) binding(0) type(storage_buffer) alignment(64) offset(%c0) flags(ReadOnly) : !flow.dispatch.tensor<readonly:tensor<?x?x16x1xf32>>{%30, %35}\n %57 = hal.interface.binding.subspan set(0) binding(0) type(storage_buffer) alignment(64) offset(%20) flags(ReadOnly) : !flow.dispatch.tensor<readonly:tensor<?x?x16x1xf32>>{%40, %45}\n %58 = hal.interface.binding.subspan set(0) binding(1) type(storage_buffer) alignment(64) offset(%25) : !flow.dispatch.tensor<readwrite:tensor<?x?x16x16xf32>>{%50, %55}\n %workgroup_id_x = hal.interface.workgroup.id[0] : index\n %workgroup_count_x = hal.interface.workgroup.count[0] : index\n %workgroup_id_y = hal.interface.workgroup.id[1] : index\n %workgroup_count_y = hal.interface.workgroup.count[1] : index\n scf.for %arg0 = %workgroup_id_y to %30 step %workgroup_count_y {\n scf.for %arg1 = %workgroup_id_x to %40 step %workgroup_count_x {\n %59 = 
flow.dispatch.tensor.load %56, offsets = [%arg0, 0, 0, 0], sizes = [1, %35, 16, 1], strides = [1, 1, 1, 1] : !flow.dispatch.tensor<readonly:tensor<?x?x16x1xf32>>{%30, %35} -> tensor<1x?x16x1xf32>\n %60 = flow.dispatch.tensor.load %57, offsets = [%arg1, 0, 0, 0], sizes = [1, %35, 16, 1], strides = [1, 1, 1, 1] : !flow.dispatch.tensor<readonly:tensor<?x?x16x1xf32>>{%40, %45} -> tensor<1x?x16x1xf32>\n %61 = flow.dispatch.tensor.load %58, offsets = [%arg0, %arg1, 0, 0], sizes = [1, 1, 16, 16], strides = [1, 1, 1, 1] : !flow.dispatch.tensor<readwrite:tensor<?x?x16x16xf32>>{%50, %55} -> tensor<1x1x16x16xf32>\n %dim = tensor.dim %60, %c1 : tensor<1x?x16x1xf32>\n %62 = iree_codegen.ukernel.generic \"iree_uk_mmt4d\" ins(%59, %60 : tensor<1x?x16x1xf32>, tensor<1x?x16x1xf32>) outs(%61 : tensor<1x1x16x16xf32>) (%c1, %c1, %dim, %c16_i32, %c16_i32, %c1_i32, %c1281_i32 : index, index, index, i32, i32, i32, i32) fn_def_attrs {hal.import.bitcode = true, hal.import.cconv = 1 : i32, hal.import.fields = [\"processor_data\"]} strided_outer_dims(1) -> tensor<1x1x16x16xf32>\n flow.dispatch.tensor.store %62, %58, offsets = [%arg0, %arg1, 0, 0], sizes = [1, 1, 16, 16], strides = [1, 1, 1, 1] : tensor<1x1x16x16xf32> -> !flow.dispatch.tensor<readwrite:tensor<?x?x16x16xf32>>{%50, %55}\n }\n }\n return\n }\n}\n
// -----// IR Dump After IREEComprehensiveBufferize (iree-codegen-iree-comprehensive-bufferize) //----- //\n[...]\n// -----// IR Dump After EmptyTensorToAllocTensor (empty-tensor-to-alloc-tensor) //----- //\n[...]\n// -----// IR Dump After ResolveShapedTypeResultDims (resolve-shaped-type-result-dims) //----- //\n[...]\n// -----// IR Dump After Canonicalizer (canonicalize) //----- //\n[...]\n// -----// IR Dump After CSE (cse) //----- //\n[...]\n// -----// IR Dump After CleanupBufferAllocView (iree-codegen-cleanup-buffer-alloc-view) //----- //\nfunc.func @matmul_dynamic_dispatch_3_mmt4d_DxDxDx16x16x1_f32() {\n %c1281_i32 = arith.constant 1281 : i32\n %c1_i32 = arith.constant 1 : i32\n %c16_i32 = arith.constant 16 : i32\n %c1 = arith.constant 1 : index\n %c0 = arith.constant 0 : index\n %c32_i64 = arith.constant 32 : i64\n %0 = hal.interface.constant.load[0] : i32\n %1 = hal.interface.constant.load[1] : i32\n %2 = hal.interface.constant.load[2] : i32\n %3 = hal.interface.constant.load[3] : i32\n %4 = hal.interface.constant.load[4] : i32\n %5 = hal.interface.constant.load[5] : i32\n %6 = hal.interface.constant.load[6] : i32\n %7 = hal.interface.constant.load[7] : i32\n %8 = hal.interface.constant.load[8] : i32\n %9 = hal.interface.constant.load[9] : i32\n %10 = hal.interface.constant.load[10] : i32\n %11 = hal.interface.constant.load[11] : i32\n %12 = hal.interface.constant.load[12] : i32\n %13 = hal.interface.constant.load[13] : i32\n %14 = hal.interface.constant.load[14] : i32\n %15 = hal.interface.constant.load[15] : i32\n %16 = arith.extui %0 : i32 to i64\n %17 = arith.extui %1 : i32 to i64\n %18 = arith.shli %17, %c32_i64 : i64\n %19 = arith.ori %16, %18 : i64\n %20 = arith.index_castui %19 : i64 to index\n %21 = arith.extui %2 : i32 to i64\n %22 = arith.extui %3 : i32 to i64\n %23 = arith.shli %22, %c32_i64 : i64\n %24 = arith.ori %21, %23 : i64\n %25 = arith.index_castui %24 : i64 to index\n %26 = arith.extui %4 : i32 to i64\n %27 = arith.extui %5 : i32 to i64\n %28 = arith.shli %27, %c32_i64 : i64\n %29 = arith.ori %26, %28 : i64\n %30 = arith.index_castui %29 : i64 to index\n %31 = arith.extui %6 : i32 to i64\n %32 = arith.extui %7 : i32 to i64\n %33 = arith.shli %32, %c32_i64 : i64\n %34 = arith.ori %31, %33 : i64\n %35 = arith.index_castui %34 : i64 to index\n %36 = arith.extui %8 : i32 to i64\n %37 = arith.extui %9 : i32 to i64\n %38 = arith.shli %37, %c32_i64 : i64\n %39 = arith.ori %36, %38 : i64\n %40 = arith.index_castui %39 : i64 to index\n %41 = arith.extui %10 : i32 to i64\n %42 = arith.extui %11 : i32 to i64\n %43 = arith.shli %42, %c32_i64 : i64\n %44 = arith.ori %41, %43 : i64\n %45 = arith.index_castui %44 : i64 to index\n %46 = arith.extui %12 : i32 to i64\n %47 = arith.extui %13 : i32 to i64\n %48 = arith.shli %47, %c32_i64 : i64\n %49 = arith.ori %46, %48 : i64\n %50 = arith.index_castui %49 : i64 to index\n %51 = arith.extui %14 : i32 to i64\n %52 = arith.extui %15 : i32 to i64\n %53 = arith.shli %52, %c32_i64 : i64\n %54 = arith.ori %51, %53 : i64\n %55 = arith.index_castui %54 : i64 to index\n %56 = hal.interface.binding.subspan set(0) binding(0) type(storage_buffer) alignment(64) offset(%c0) flags(ReadOnly) : memref<?x?x16x1xf32, #hal.descriptor_type<storage_buffer>>{%30, %35}\n memref.assume_alignment %56, 64 : memref<?x?x16x1xf32, #hal.descriptor_type<storage_buffer>>\n %57 = hal.interface.binding.subspan set(0) binding(0) type(storage_buffer) alignment(64) offset(%20) flags(ReadOnly) : memref<?x?x16x1xf32, strided<[?, 16, 1, 1], offset: ?>, 
#hal.descriptor_type<storage_buffer>>{%40, %45}\n memref.assume_alignment %57, 1 : memref<?x?x16x1xf32, strided<[?, 16, 1, 1], offset: ?>, #hal.descriptor_type<storage_buffer>>\n %58 = hal.interface.binding.subspan set(0) binding(1) type(storage_buffer) alignment(64) offset(%25) : memref<?x?x16x16xf32, strided<[?, 256, 16, 1], offset: ?>, #hal.descriptor_type<storage_buffer>>{%50, %55}\n memref.assume_alignment %58, 1 : memref<?x?x16x16xf32, strided<[?, 256, 16, 1], offset: ?>, #hal.descriptor_type<storage_buffer>>\n %workgroup_id_x = hal.interface.workgroup.id[0] : index\n %workgroup_count_x = hal.interface.workgroup.count[0] : index\n %workgroup_id_y = hal.interface.workgroup.id[1] : index\n %workgroup_count_y = hal.interface.workgroup.count[1] : index\n scf.for %arg0 = %workgroup_id_y to %30 step %workgroup_count_y {\n %subview = memref.subview %56[%arg0, 0, 0, 0] [1, %35, 16, 1] [1, 1, 1, 1] : memref<?x?x16x1xf32, #hal.descriptor_type<storage_buffer>> to memref<1x?x16x1xf32, strided<[?, 16, 1, 1], offset: ?>, #hal.descriptor_type<storage_buffer>>\n scf.for %arg1 = %workgroup_id_x to %40 step %workgroup_count_x {\n %subview_0 = memref.subview %57[%arg1, 0, 0, 0] [1, %35, 16, 1] [1, 1, 1, 1] : memref<?x?x16x1xf32, strided<[?, 16, 1, 1], offset: ?>, #hal.descriptor_type<storage_buffer>> to memref<1x?x16x1xf32, strided<[?, 16, 1, 1], offset: ?>, #hal.descriptor_type<storage_buffer>>\n %subview_1 = memref.subview %58[%arg0, %arg1, 0, 0] [1, 1, 16, 16] [1, 1, 1, 1] : memref<?x?x16x16xf32, strided<[?, 256, 16, 1], offset: ?>, #hal.descriptor_type<storage_buffer>> to memref<1x1x16x16xf32, strided<[?, 256, 16, 1], offset: ?>, #hal.descriptor_type<storage_buffer>>\n iree_codegen.ukernel.generic \"iree_uk_mmt4d\" ins(%subview, %subview_0 : memref<1x?x16x1xf32, strided<[?, 16, 1, 1], offset: ?>, #hal.descriptor_type<storage_buffer>>, memref<1x?x16x1xf32, strided<[?, 16, 1, 1], offset: ?>, #hal.descriptor_type<storage_buffer>>) outs(%subview_1 : memref<1x1x16x16xf32, strided<[?, 256, 16, 1], offset: ?>, #hal.descriptor_type<storage_buffer>>) (%c1, %c1, %35, %c16_i32, %c16_i32, %c1_i32, %c1281_i32 : index, index, index, i32, i32, i32, i32) fn_def_attrs {hal.import.bitcode = true, hal.import.cconv = 1 : i32, hal.import.fields = [\"processor_data\"]} strided_outer_dims(1)\n }\n }\n return\n}\n
// -----// IR Dump After LowerUKernelOpsToCalls (iree-codegen-lower-ukernel-ops-to-calls) //----- //\nmodule {\n func.func private @iree_uk_mmt4d(memref<f32>, index, index, memref<f32>, index, index, memref<f32>, index, index, index, index, index, i32, i32, i32, i32) attributes {hal.import.bitcode = true, hal.import.cconv = 1 : i32, hal.import.fields = [\"processor_data\"], llvm.bareptr = true}\n func.func @matmul_dynamic_dispatch_3_mmt4d_DxDxDx16x16x1_f32() {\n %c1281_i32 = arith.constant 1281 : i32\n %c1_i32 = arith.constant 1 : i32\n %c16_i32 = arith.constant 16 : i32\n %c1 = arith.constant 1 : index\n %c0 = arith.constant 0 : index\n %c32_i64 = arith.constant 32 : i64\n %0 = hal.interface.constant.load[0] : i32\n %1 = hal.interface.constant.load[1] : i32\n %2 = hal.interface.constant.load[2] : i32\n %3 = hal.interface.constant.load[3] : i32\n %4 = hal.interface.constant.load[4] : i32\n %5 = hal.interface.constant.load[5] : i32\n %6 = hal.interface.constant.load[6] : i32\n %7 = hal.interface.constant.load[7] : i32\n %8 = hal.interface.constant.load[8] : i32\n %9 = hal.interface.constant.load[9] : i32\n %10 = hal.interface.constant.load[10] : i32\n %11 = hal.interface.constant.load[11] : i32\n %12 = hal.interface.constant.load[12] : i32\n %13 = hal.interface.constant.load[13] : i32\n %14 = hal.interface.constant.load[14] : i32\n %15 = hal.interface.constant.load[15] : i32\n %16 = arith.extui %0 : i32 to i64\n %17 = arith.extui %1 : i32 to i64\n %18 = arith.shli %17, %c32_i64 : i64\n %19 = arith.ori %16, %18 : i64\n %20 = arith.index_castui %19 : i64 to index\n %21 = arith.extui %2 : i32 to i64\n %22 = arith.extui %3 : i32 to i64\n %23 = arith.shli %22, %c32_i64 : i64\n %24 = arith.ori %21, %23 : i64\n %25 = arith.index_castui %24 : i64 to index\n %26 = arith.extui %4 : i32 to i64\n %27 = arith.extui %5 : i32 to i64\n %28 = arith.shli %27, %c32_i64 : i64\n %29 = arith.ori %26, %28 : i64\n %30 = arith.index_castui %29 : i64 to index\n %31 = arith.extui %6 : i32 to i64\n %32 = arith.extui %7 : i32 to i64\n %33 = arith.shli %32, %c32_i64 : i64\n %34 = arith.ori %31, %33 : i64\n %35 = arith.index_castui %34 : i64 to index\n %36 = arith.extui %8 : i32 to i64\n %37 = arith.extui %9 : i32 to i64\n %38 = arith.shli %37, %c32_i64 : i64\n %39 = arith.ori %36, %38 : i64\n %40 = arith.index_castui %39 : i64 to index\n %41 = arith.extui %10 : i32 to i64\n %42 = arith.extui %11 : i32 to i64\n %43 = arith.shli %42, %c32_i64 : i64\n %44 = arith.ori %41, %43 : i64\n %45 = arith.index_castui %44 : i64 to index\n %46 = arith.extui %12 : i32 to i64\n %47 = arith.extui %13 : i32 to i64\n %48 = arith.shli %47, %c32_i64 : i64\n %49 = arith.ori %46, %48 : i64\n %50 = arith.index_castui %49 : i64 to index\n %51 = arith.extui %14 : i32 to i64\n %52 = arith.extui %15 : i32 to i64\n %53 = arith.shli %52, %c32_i64 : i64\n %54 = arith.ori %51, %53 : i64\n %55 = arith.index_castui %54 : i64 to index\n %56 = hal.interface.binding.subspan set(0) binding(0) type(storage_buffer) alignment(64) offset(%c0) flags(ReadOnly) : memref<?x?x16x1xf32>{%30, %35}\n memref.assume_alignment %56, 64 : memref<?x?x16x1xf32>\n %57 = hal.interface.binding.subspan set(0) binding(0) type(storage_buffer) alignment(64) offset(%20) flags(ReadOnly) : memref<?x?x16x1xf32, strided<[?, 16, 1, 1], offset: ?>>{%40, %45}\n memref.assume_alignment %57, 1 : memref<?x?x16x1xf32, strided<[?, 16, 1, 1], offset: ?>>\n %58 = hal.interface.binding.subspan set(0) binding(1) type(storage_buffer) alignment(64) offset(%25) : memref<?x?x16x16xf32, strided<[?, 256, 
16, 1], offset: ?>>{%50, %55}\n memref.assume_alignment %58, 1 : memref<?x?x16x16xf32, strided<[?, 256, 16, 1], offset: ?>>\n %workgroup_id_x = hal.interface.workgroup.id[0] : index\n %workgroup_count_x = hal.interface.workgroup.count[0] : index\n %workgroup_id_y = hal.interface.workgroup.id[1] : index\n %workgroup_count_y = hal.interface.workgroup.count[1] : index\n scf.for %arg0 = %workgroup_id_y to %30 step %workgroup_count_y {\n %subview = memref.subview %56[%arg0, 0, 0, 0] [1, %35, 16, 1] [1, 1, 1, 1] : memref<?x?x16x1xf32> to memref<1x?x16x1xf32, strided<[?, 16, 1, 1], offset: ?>>\n scf.for %arg1 = %workgroup_id_x to %40 step %workgroup_count_x {\n %subview_0 = memref.subview %57[%arg1, 0, 0, 0] [1, %35, 16, 1] [1, 1, 1, 1] : memref<?x?x16x1xf32, strided<[?, 16, 1, 1], offset: ?>> to memref<1x?x16x1xf32, strided<[?, 16, 1, 1], offset: ?>>\n %subview_1 = memref.subview %58[%arg0, %arg1, 0, 0] [1, 1, 16, 16] [1, 1, 1, 1] : memref<?x?x16x16xf32, strided<[?, 256, 16, 1], offset: ?>> to memref<1x1x16x16xf32, strided<[?, 256, 16, 1], offset: ?>>\n %base_buffer, %offset, %sizes:4, %strides:4 = memref.extract_strided_metadata %subview : memref<1x?x16x1xf32, strided<[?, 16, 1, 1], offset: ?>> -> memref<f32>, index, index, index, index, index, index, index, index, index\n %base_buffer_2, %offset_3, %sizes_4:4, %strides_5:4 = memref.extract_strided_metadata %subview_0 : memref<1x?x16x1xf32, strided<[?, 16, 1, 1], offset: ?>> -> memref<f32>, index, index, index, index, index, index, index, index, index\n %base_buffer_6, %offset_7, %sizes_8:4, %strides_9:4 = memref.extract_strided_metadata %subview_1 : memref<1x1x16x16xf32, strided<[?, 256, 16, 1], offset: ?>> -> memref<f32>, index, index, index, index, index, index, index, index, index\n func.call @iree_uk_mmt4d(%base_buffer, %offset, %strides#0, %base_buffer_2, %offset_3, %strides_5#0, %base_buffer_6, %offset_7, %strides_9#0, %c1, %c1, %35, %c16_i32, %c16_i32, %c1_i32, %c1281_i32) : (memref<f32>, index, index, memref<f32>, index, index, memref<f32>, index, index, index, index, index, i32, i32, i32, i32) -> ()\n }\n }\n return\n }\n}\n
// -----// IR Dump After ConvertToLLVM (iree-convert-to-llvm) //----- //\nmodule attributes {llvm.data_layout = \"e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-i128:128-f80:128-n8:16:32:64-S128\", llvm.target_triple = \"x86_64-unknown-unknown-eabi-elf\"} {\n llvm.func @iree_uk_mmt4d(!llvm.ptr) attributes {hal.import.bitcode = true, hal.import.cconv = 1 : i32, hal.import.fields = [\"processor_data\"], llvm.bareptr = true}\n llvm.func @matmul_dynamic_dispatch_3_mmt4d_DxDxDx16x16x1_f32(%arg0: !llvm.ptr {llvm.align = 16 : i64, llvm.noalias}, %arg1: !llvm.ptr {llvm.align = 16 : i64, llvm.noalias}, %arg2: !llvm.ptr {llvm.align = 16 : i64, llvm.noalias}) -> i32 {\n %0 = llvm.mlir.constant(4293970975 : i64) : i64\n %1 = llvm.mlir.constant(8 : i64) : i64\n %2 = llvm.mlir.constant(0 : i32) : i32\n %3 = llvm.mlir.constant(256 : index) : i64\n %4 = llvm.mlir.constant(-1 : index) : i64\n %5 = llvm.mlir.constant(4 : index) : i64\n %6 = llvm.mlir.constant(16 : index) : i64\n %7 = llvm.mlir.constant(0 : index) : i64\n %8 = llvm.mlir.constant(1281 : i32) : i32\n %9 = llvm.mlir.constant(1 : i32) : i32\n %10 = llvm.mlir.constant(16 : i32) : i32\n %11 = llvm.mlir.constant(1 : index) : i64\n %12 = llvm.mlir.constant(32 : i64) : i64\n %13 = llvm.load %arg1 : !llvm.ptr -> !llvm.struct<\"iree_hal_executable_dispatch_state_v0_t\", (i32, i32, i16, i16, i32, i32, i16, i8, i8, ptr, ptr, ptr)>\n %14 = llvm.extractvalue %13[9] : !llvm.struct<\"iree_hal_executable_dispatch_state_v0_t\", (i32, i32, i16, i16, i32, i32, i16, i8, i8, ptr, ptr, ptr)>\n %15 = llvm.load %14 : !llvm.ptr -> i32\n %16 = llvm.load %arg1 : !llvm.ptr -> !llvm.struct<\"iree_hal_executable_dispatch_state_v0_t\", (i32, i32, i16, i16, i32, i32, i16, i8, i8, ptr, ptr, ptr)>\n %17 = llvm.extractvalue %16[9] : !llvm.struct<\"iree_hal_executable_dispatch_state_v0_t\", (i32, i32, i16, i16, i32, i32, i16, i8, i8, ptr, ptr, ptr)>\n %18 = llvm.getelementptr %17[1] : (!llvm.ptr) -> !llvm.ptr, i32\n %19 = llvm.load %18 : !llvm.ptr -> i32\n %20 = llvm.load %arg1 : !llvm.ptr -> !llvm.struct<\"iree_hal_executable_dispatch_state_v0_t\", (i32, i32, i16, i16, i32, i32, i16, i8, i8, ptr, ptr, ptr)>\n %21 = llvm.extractvalue %20[9] : !llvm.struct<\"iree_hal_executable_dispatch_state_v0_t\", (i32, i32, i16, i16, i32, i32, i16, i8, i8, ptr, ptr, ptr)>\n %22 = llvm.getelementptr %21[2] : (!llvm.ptr) -> !llvm.ptr, i32\n %23 = llvm.load %22 : !llvm.ptr -> i32\n %24 = llvm.load %arg1 : !llvm.ptr -> !llvm.struct<\"iree_hal_executable_dispatch_state_v0_t\", (i32, i32, i16, i16, i32, i32, i16, i8, i8, ptr, ptr, ptr)>\n %25 = llvm.extractvalue %24[9] : !llvm.struct<\"iree_hal_executable_dispatch_state_v0_t\", (i32, i32, i16, i16, i32, i32, i16, i8, i8, ptr, ptr, ptr)>\n %26 = llvm.getelementptr %25[3] : (!llvm.ptr) -> !llvm.ptr, i32\n %27 = llvm.load %26 : !llvm.ptr -> i32\n %28 = llvm.load %arg1 : !llvm.ptr -> !llvm.struct<\"iree_hal_executable_dispatch_state_v0_t\", (i32, i32, i16, i16, i32, i32, i16, i8, i8, ptr, ptr, ptr)>\n %29 = llvm.extractvalue %28[9] : !llvm.struct<\"iree_hal_executable_dispatch_state_v0_t\", (i32, i32, i16, i16, i32, i32, i16, i8, i8, ptr, ptr, ptr)>\n %30 = llvm.getelementptr %29[4] : (!llvm.ptr) -> !llvm.ptr, i32\n %31 = llvm.load %30 : !llvm.ptr -> i32\n %32 = llvm.load %arg1 : !llvm.ptr -> !llvm.struct<\"iree_hal_executable_dispatch_state_v0_t\", (i32, i32, i16, i16, i32, i32, i16, i8, i8, ptr, ptr, ptr)>\n %33 = llvm.extractvalue %32[9] : !llvm.struct<\"iree_hal_executable_dispatch_state_v0_t\", (i32, i32, i16, i16, i32, i32, i16, i8, i8, ptr, ptr, 
ptr)>\n %34 = llvm.getelementptr %33[5] : (!llvm.ptr) -> !llvm.ptr, i32\n %35 = llvm.load %34 : !llvm.ptr -> i32\n %36 = llvm.load %arg1 : !llvm.ptr -> !llvm.struct<\"iree_hal_executable_dispatch_state_v0_t\", (i32, i32, i16, i16, i32, i32, i16, i8, i8, ptr, ptr, ptr)>\n %37 = llvm.extractvalue %36[9] : !llvm.struct<\"iree_hal_executable_dispatch_state_v0_t\", (i32, i32, i16, i16, i32, i32, i16, i8, i8, ptr, ptr, ptr)>\n %38 = llvm.getelementptr %37[6] : (!llvm.ptr) -> !llvm.ptr, i32\n %39 = llvm.load %38 : !llvm.ptr -> i32\n %40 = llvm.load %arg1 : !llvm.ptr -> !llvm.struct<\"iree_hal_executable_dispatch_state_v0_t\", (i32, i32, i16, i16, i32, i32, i16, i8, i8, ptr, ptr, ptr)>\n %41 = llvm.extractvalue %40[9] : !llvm.struct<\"iree_hal_executable_dispatch_state_v0_t\", (i32, i32, i16, i16, i32, i32, i16, i8, i8, ptr, ptr, ptr)>\n %42 = llvm.getelementptr %41[7] : (!llvm.ptr) -> !llvm.ptr, i32\n %43 = llvm.load %42 : !llvm.ptr -> i32\n %44 = llvm.load %arg1 : !llvm.ptr -> !llvm.struct<\"iree_hal_executable_dispatch_state_v0_t\", (i32, i32, i16, i16, i32, i32, i16, i8, i8, ptr, ptr, ptr)>\n %45 = llvm.extractvalue %44[9] : !llvm.struct<\"iree_hal_executable_dispatch_state_v0_t\", (i32, i32, i16, i16, i32, i32, i16, i8, i8, ptr, ptr, ptr)>\n %46 = llvm.getelementptr %45[8] : (!llvm.ptr) -> !llvm.ptr, i32\n %47 = llvm.load %46 : !llvm.ptr -> i32\n %48 = llvm.load %arg1 : !llvm.ptr -> !llvm.struct<\"iree_hal_executable_dispatch_state_v0_t\", (i32, i32, i16, i16, i32, i32, i16, i8, i8, ptr, ptr, ptr)>\n %49 = llvm.extractvalue %48[9] : !llvm.struct<\"iree_hal_executable_dispatch_state_v0_t\", (i32, i32, i16, i16, i32, i32, i16, i8, i8, ptr, ptr, ptr)>\n %50 = llvm.getelementptr %49[9] : (!llvm.ptr) -> !llvm.ptr, i32\n %51 = llvm.load %50 : !llvm.ptr -> i32\n %52 = llvm.load %arg1 : !llvm.ptr -> !llvm.struct<\"iree_hal_executable_dispatch_state_v0_t\", (i32, i32, i16, i16, i32, i32, i16, i8, i8, ptr, ptr, ptr)>\n %53 = llvm.extractvalue %52[9] : !llvm.struct<\"iree_hal_executable_dispatch_state_v0_t\", (i32, i32, i16, i16, i32, i32, i16, i8, i8, ptr, ptr, ptr)>\n %54 = llvm.getelementptr %53[10] : (!llvm.ptr) -> !llvm.ptr, i32\n %55 = llvm.load %54 : !llvm.ptr -> i32\n %56 = llvm.load %arg1 : !llvm.ptr -> !llvm.struct<\"iree_hal_executable_dispatch_state_v0_t\", (i32, i32, i16, i16, i32, i32, i16, i8, i8, ptr, ptr, ptr)>\n %57 = llvm.extractvalue %56[9] : !llvm.struct<\"iree_hal_executable_dispatch_state_v0_t\", (i32, i32, i16, i16, i32, i32, i16, i8, i8, ptr, ptr, ptr)>\n %58 = llvm.getelementptr %57[11] : (!llvm.ptr) -> !llvm.ptr, i32\n %59 = llvm.load %58 : !llvm.ptr -> i32\n %60 = llvm.load %arg1 : !llvm.ptr -> !llvm.struct<\"iree_hal_executable_dispatch_state_v0_t\", (i32, i32, i16, i16, i32, i32, i16, i8, i8, ptr, ptr, ptr)>\n %61 = llvm.extractvalue %60[9] : !llvm.struct<\"iree_hal_executable_dispatch_state_v0_t\", (i32, i32, i16, i16, i32, i32, i16, i8, i8, ptr, ptr, ptr)>\n %62 = llvm.getelementptr %61[14] : (!llvm.ptr) -> !llvm.ptr, i32\n %63 = llvm.load %62 : !llvm.ptr -> i32\n %64 = llvm.load %arg1 : !llvm.ptr -> !llvm.struct<\"iree_hal_executable_dispatch_state_v0_t\", (i32, i32, i16, i16, i32, i32, i16, i8, i8, ptr, ptr, ptr)>\n %65 = llvm.extractvalue %64[9] : !llvm.struct<\"iree_hal_executable_dispatch_state_v0_t\", (i32, i32, i16, i16, i32, i32, i16, i8, i8, ptr, ptr, ptr)>\n %66 = llvm.getelementptr %65[15] : (!llvm.ptr) -> !llvm.ptr, i32\n %67 = llvm.load %66 : !llvm.ptr -> i32\n %68 = llvm.zext %15 : i32 to i64\n %69 = llvm.zext %19 : i32 to i64\n %70 = llvm.shl %69, %12 : 
i64\n %71 = llvm.or %68, %70 : i64\n %72 = llvm.zext %23 : i32 to i64\n %73 = llvm.zext %27 : i32 to i64\n %74 = llvm.shl %73, %12 : i64\n %75 = llvm.or %72, %74 : i64\n %76 = llvm.zext %31 : i32 to i64\n %77 = llvm.zext %35 : i32 to i64\n %78 = llvm.shl %77, %12 : i64\n %79 = llvm.or %76, %78 : i64\n %80 = llvm.zext %39 : i32 to i64\n %81 = llvm.zext %43 : i32 to i64\n %82 = llvm.shl %81, %12 : i64\n %83 = llvm.or %80, %82 : i64\n %84 = llvm.zext %47 : i32 to i64\n %85 = llvm.zext %51 : i32 to i64\n %86 = llvm.shl %85, %12 : i64\n %87 = llvm.or %84, %86 : i64\n %88 = llvm.zext %55 : i32 to i64\n %89 = llvm.zext %59 : i32 to i64\n %90 = llvm.shl %89, %12 : i64\n %91 = llvm.or %88, %90 : i64\n %92 = llvm.zext %63 : i32 to i64\n %93 = llvm.zext %67 : i32 to i64\n %94 = llvm.shl %93, %12 : i64\n %95 = llvm.or %92, %94 : i64\n %96 = llvm.mul %83, %6 : i64\n %97 = llvm.load %arg1 : !llvm.ptr -> !llvm.struct<\"iree_hal_executable_dispatch_state_v0_t\", (i32, i32, i16, i16, i32, i32, i16, i8, i8, ptr, ptr, ptr)>\n %98 = llvm.extractvalue %97[10] : !llvm.struct<\"iree_hal_executable_dispatch_state_v0_t\", (i32, i32, i16, i16, i32, i32, i16, i8, i8, ptr, ptr, ptr)>\n %99 = llvm.load %98 : !llvm.ptr -> !llvm.ptr\n %100 = llvm.mul %91, %6 : i64\n %101 = llvm.load %arg1 : !llvm.ptr -> !llvm.struct<\"iree_hal_executable_dispatch_state_v0_t\", (i32, i32, i16, i16, i32, i32, i16, i8, i8, ptr, ptr, ptr)>\n %102 = llvm.extractvalue %101[10] : !llvm.struct<\"iree_hal_executable_dispatch_state_v0_t\", (i32, i32, i16, i16, i32, i32, i16, i8, i8, ptr, ptr, ptr)>\n %103 = llvm.load %102 : !llvm.ptr -> !llvm.ptr\n %104 = llvm.mul %95, %3 : i64\n %105 = llvm.load %arg1 : !llvm.ptr -> !llvm.struct<\"iree_hal_executable_dispatch_state_v0_t\", (i32, i32, i16, i16, i32, i32, i16, i8, i8, ptr, ptr, ptr)>\n %106 = llvm.extractvalue %105[10] : !llvm.struct<\"iree_hal_executable_dispatch_state_v0_t\", (i32, i32, i16, i16, i32, i32, i16, i8, i8, ptr, ptr, ptr)>\n %107 = llvm.getelementptr %106[1] : (!llvm.ptr) -> !llvm.ptr, !llvm.ptr\n %108 = llvm.load %107 : !llvm.ptr -> !llvm.ptr\n %109 = llvm.load %arg2 : !llvm.ptr -> !llvm.struct<\"iree_hal_executable_workgroup_state_v0_t\", (i32, i32, i16, i16, i32, ptr, i32)>\n %110 = llvm.extractvalue %109[0] : !llvm.struct<\"iree_hal_executable_workgroup_state_v0_t\", (i32, i32, i16, i16, i32, ptr, i32)>\n %111 = llvm.zext %110 : i32 to i64\n %112 = llvm.load %arg1 : !llvm.ptr -> !llvm.struct<\"iree_hal_executable_dispatch_state_v0_t\", (i32, i32, i16, i16, i32, i32, i16, i8, i8, ptr, ptr, ptr)>\n %113 = llvm.extractvalue %112[4] : !llvm.struct<\"iree_hal_executable_dispatch_state_v0_t\", (i32, i32, i16, i16, i32, i32, i16, i8, i8, ptr, ptr, ptr)>\n %114 = llvm.zext %113 : i32 to i64\n %115 = llvm.load %arg2 : !llvm.ptr -> !llvm.struct<\"iree_hal_executable_workgroup_state_v0_t\", (i32, i32, i16, i16, i32, ptr, i32)>\n %116 = llvm.extractvalue %115[1] : !llvm.struct<\"iree_hal_executable_workgroup_state_v0_t\", (i32, i32, i16, i16, i32, ptr, i32)>\n %117 = llvm.zext %116 : i32 to i64\n %118 = llvm.load %arg1 : !llvm.ptr -> !llvm.struct<\"iree_hal_executable_dispatch_state_v0_t\", (i32, i32, i16, i16, i32, i32, i16, i8, i8, ptr, ptr, ptr)>\n %119 = llvm.extractvalue %118[5] : !llvm.struct<\"iree_hal_executable_dispatch_state_v0_t\", (i32, i32, i16, i16, i32, i32, i16, i8, i8, ptr, ptr, ptr)>\n %120 = llvm.zext %119 : i32 to i64\n llvm.br ^bb1(%117 : i64)\n ^bb1(%121: i64): // 2 preds: ^bb0, ^bb4\n %122 = llvm.icmp \"slt\" %121, %79 : i64\n llvm.cond_br %122, ^bb2(%111 : i64), 
^bb5\n ^bb2(%123: i64): // 2 preds: ^bb1, ^bb3\n %124 = llvm.icmp \"slt\" %123, %87 : i64\n llvm.cond_br %124, ^bb3, ^bb4\n ^bb3: // pred: ^bb2\n %125 = llvm.mul %83, %6 : i64\n %126 = llvm.mul %121, %125 : i64\n %127 = llvm.icmp \"slt\" %71, %7 : i64\n %128 = llvm.sub %4, %71 : i64\n %129 = llvm.select %127, %128, %71 : i1, i64\n %130 = llvm.sdiv %129, %5 : i64\n %131 = llvm.sub %4, %130 : i64\n %132 = llvm.select %127, %131, %130 : i1, i64\n %133 = llvm.mul %91, %6 : i64\n %134 = llvm.mul %123, %133 : i64\n %135 = llvm.add %132, %134 : i64\n %136 = llvm.mul %123, %3 : i64\n %137 = llvm.icmp \"slt\" %75, %7 : i64\n %138 = llvm.sub %4, %75 : i64\n %139 = llvm.select %137, %138, %75 : i1, i64\n %140 = llvm.sdiv %139, %5 : i64\n %141 = llvm.sub %4, %140 : i64\n %142 = llvm.select %137, %141, %140 : i1, i64\n %143 = llvm.add %136, %142 : i64\n %144 = llvm.mul %95, %3 : i64\n %145 = llvm.mul %121, %144 : i64\n %146 = llvm.add %143, %145 : i64\n %147 = llvm.getelementptr inbounds %arg0[4] : (!llvm.ptr) -> !llvm.ptr, !llvm.ptr\n %148 = llvm.alloca %1 x i64 {alignment = 8 : i64} : (i64) -> !llvm.ptr\n %149 = llvm.load %147 : !llvm.ptr -> i64\n %150 = llvm.or %149, %0 : i64\n llvm.store %150, %148 : i64, !llvm.ptr\n %151 = llvm.getelementptr inbounds %147[1] : (!llvm.ptr) -> !llvm.ptr, i64\n %152 = llvm.load %151 : !llvm.ptr -> i64\n %153 = llvm.getelementptr inbounds %148[1] : (!llvm.ptr) -> !llvm.ptr, i64\n llvm.store %152, %153 : i64, !llvm.ptr\n %154 = llvm.getelementptr inbounds %147[2] : (!llvm.ptr) -> !llvm.ptr, i64\n %155 = llvm.load %154 : !llvm.ptr -> i64\n %156 = llvm.getelementptr inbounds %148[2] : (!llvm.ptr) -> !llvm.ptr, i64\n llvm.store %155, %156 : i64, !llvm.ptr\n %157 = llvm.getelementptr inbounds %147[3] : (!llvm.ptr) -> !llvm.ptr, i64\n %158 = llvm.load %157 : !llvm.ptr -> i64\n %159 = llvm.getelementptr inbounds %148[3] : (!llvm.ptr) -> !llvm.ptr, i64\n llvm.store %158, %159 : i64, !llvm.ptr\n %160 = llvm.getelementptr inbounds %147[4] : (!llvm.ptr) -> !llvm.ptr, i64\n %161 = llvm.load %160 : !llvm.ptr -> i64\n %162 = llvm.getelementptr inbounds %148[4] : (!llvm.ptr) -> !llvm.ptr, i64\n llvm.store %161, %162 : i64, !llvm.ptr\n %163 = llvm.getelementptr inbounds %147[5] : (!llvm.ptr) -> !llvm.ptr, i64\n %164 = llvm.load %163 : !llvm.ptr -> i64\n %165 = llvm.getelementptr inbounds %148[5] : (!llvm.ptr) -> !llvm.ptr, i64\n llvm.store %164, %165 : i64, !llvm.ptr\n %166 = llvm.getelementptr inbounds %147[6] : (!llvm.ptr) -> !llvm.ptr, i64\n %167 = llvm.load %166 : !llvm.ptr -> i64\n %168 = llvm.getelementptr inbounds %148[6] : (!llvm.ptr) -> !llvm.ptr, i64\n llvm.store %167, %168 : i64, !llvm.ptr\n %169 = llvm.getelementptr inbounds %147[7] : (!llvm.ptr) -> !llvm.ptr, i64\n %170 = llvm.load %169 : !llvm.ptr -> i64\n %171 = llvm.getelementptr inbounds %148[7] : (!llvm.ptr) -> !llvm.ptr, i64\n llvm.store %170, %171 : i64, !llvm.ptr\n %172 = llvm.alloca %11 x !llvm.struct<(ptr, i64, i64, ptr, i64, i64, ptr, i64, i64, i64, i64, i64, i32, i32, i32, i32, ptr)> : (i64) -> !llvm.ptr\n %173 = llvm.mlir.undef : !llvm.struct<(ptr, i64, i64, ptr, i64, i64, ptr, i64, i64, i64, i64, i64, i32, i32, i32, i32, ptr)>\n %174 = llvm.insertvalue %99, %173[0] : !llvm.struct<(ptr, i64, i64, ptr, i64, i64, ptr, i64, i64, i64, i64, i64, i32, i32, i32, i32, ptr)>\n %175 = llvm.insertvalue %126, %174[1] : !llvm.struct<(ptr, i64, i64, ptr, i64, i64, ptr, i64, i64, i64, i64, i64, i32, i32, i32, i32, ptr)>\n %176 = llvm.insertvalue %96, %175[2] : !llvm.struct<(ptr, i64, i64, ptr, i64, i64, ptr, i64, i64, i64, 
i64, i64, i32, i32, i32, i32, ptr)>\n %177 = llvm.insertvalue %103, %176[3] : !llvm.struct<(ptr, i64, i64, ptr, i64, i64, ptr, i64, i64, i64, i64, i64, i32, i32, i32, i32, ptr)>\n %178 = llvm.insertvalue %135, %177[4] : !llvm.struct<(ptr, i64, i64, ptr, i64, i64, ptr, i64, i64, i64, i64, i64, i32, i32, i32, i32, ptr)>\n %179 = llvm.insertvalue %100, %178[5] : !llvm.struct<(ptr, i64, i64, ptr, i64, i64, ptr, i64, i64, i64, i64, i64, i32, i32, i32, i32, ptr)>\n %180 = llvm.insertvalue %108, %179[6] : !llvm.struct<(ptr, i64, i64, ptr, i64, i64, ptr, i64, i64, i64, i64, i64, i32, i32, i32, i32, ptr)>\n %181 = llvm.insertvalue %146, %180[7] : !llvm.struct<(ptr, i64, i64, ptr, i64, i64, ptr, i64, i64, i64, i64, i64, i32, i32, i32, i32, ptr)>\n %182 = llvm.insertvalue %104, %181[8] : !llvm.struct<(ptr, i64, i64, ptr, i64, i64, ptr, i64, i64, i64, i64, i64, i32, i32, i32, i32, ptr)>\n %183 = llvm.insertvalue %11, %182[9] : !llvm.struct<(ptr, i64, i64, ptr, i64, i64, ptr, i64, i64, i64, i64, i64, i32, i32, i32, i32, ptr)>\n %184 = llvm.insertvalue %11, %183[10] : !llvm.struct<(ptr, i64, i64, ptr, i64, i64, ptr, i64, i64, i64, i64, i64, i32, i32, i32, i32, ptr)>\n %185 = llvm.insertvalue %83, %184[11] : !llvm.struct<(ptr, i64, i64, ptr, i64, i64, ptr, i64, i64, i64, i64, i64, i32, i32, i32, i32, ptr)>\n %186 = llvm.insertvalue %10, %185[12] : !llvm.struct<(ptr, i64, i64, ptr, i64, i64, ptr, i64, i64, i64, i64, i64, i32, i32, i32, i32, ptr)>\n %187 = llvm.insertvalue %10, %186[13] : !llvm.struct<(ptr, i64, i64, ptr, i64, i64, ptr, i64, i64, i64, i64, i64, i32, i32, i32, i32, ptr)>\n %188 = llvm.insertvalue %9, %187[14] : !llvm.struct<(ptr, i64, i64, ptr, i64, i64, ptr, i64, i64, i64, i64, i64, i32, i32, i32, i32, ptr)>\n %189 = llvm.insertvalue %8, %188[15] : !llvm.struct<(ptr, i64, i64, ptr, i64, i64, ptr, i64, i64, i64, i64, i64, i32, i32, i32, i32, ptr)>\n %190 = llvm.insertvalue %148, %189[16] : !llvm.struct<(ptr, i64, i64, ptr, i64, i64, ptr, i64, i64, i64, i64, i64, i32, i32, i32, i32, ptr)>\n llvm.store %190, %172 : !llvm.struct<(ptr, i64, i64, ptr, i64, i64, ptr, i64, i64, i64, i64, i64, i32, i32, i32, i32, ptr)>, !llvm.ptr\n llvm.call @iree_uk_mmt4d(%172) : (!llvm.ptr) -> ()\n %191 = llvm.add %123, %114 : i64\n llvm.br ^bb2(%191 : i64)\n ^bb4: // pred: ^bb2\n %192 = llvm.add %121, %120 : i64\n llvm.br ^bb1(%192 : i64)\n ^bb5: // pred: ^bb1\n llvm.return %2 : i32\n }\n}\n
define internal i32 @matmul_dynamic_dispatch_3_mmt4d_DxDxDx16x16x1_f32(ptr noalias nonnull align 16 %0, ptr noalias nonnull align 16 %1, ptr noalias nonnull align 16 %2) #0 !dbg !90 {\n %4 = load %iree_hal_executable_dispatch_state_v0_t.7, ptr %1, align 8, !dbg !91\n %5 = extractvalue %iree_hal_executable_dispatch_state_v0_t.7 %4, 9, !dbg !91\n %6 = load i32, ptr %5, align 4, !dbg !91\n %7 = getelementptr i32, ptr %5, i32 1, !dbg !91\n %8 = load i32, ptr %7, align 4, !dbg !91\n %9 = getelementptr i32, ptr %5, i32 2, !dbg !91\n %10 = load i32, ptr %9, align 4, !dbg !91\n %11 = getelementptr i32, ptr %5, i32 3, !dbg !91\n %12 = load i32, ptr %11, align 4, !dbg !91\n %13 = getelementptr i32, ptr %5, i32 4, !dbg !91\n %14 = load i32, ptr %13, align 4, !dbg !91\n %15 = getelementptr i32, ptr %5, i32 5, !dbg !91\n %16 = load i32, ptr %15, align 4, !dbg !91\n %17 = getelementptr i32, ptr %5, i32 6, !dbg !91\n %18 = load i32, ptr %17, align 4, !dbg !91\n %19 = getelementptr i32, ptr %5, i32 7, !dbg !91\n %20 = load i32, ptr %19, align 4, !dbg !91\n %21 = getelementptr i32, ptr %5, i32 8, !dbg !91\n %22 = load i32, ptr %21, align 4, !dbg !91\n %23 = getelementptr i32, ptr %5, i32 9, !dbg !91\n %24 = load i32, ptr %23, align 4, !dbg !91\n %25 = getelementptr i32, ptr %5, i32 10, !dbg !91\n %26 = load i32, ptr %25, align 4, !dbg !91\n %27 = getelementptr i32, ptr %5, i32 11, !dbg !91\n %28 = load i32, ptr %27, align 4, !dbg !91\n %29 = getelementptr i32, ptr %5, i32 14, !dbg !91\n %30 = load i32, ptr %29, align 4, !dbg !91\n %31 = getelementptr i32, ptr %5, i32 15, !dbg !91\n %32 = load i32, ptr %31, align 4, !dbg !91\n %33 = zext i32 %6 to i64, !dbg !91\n %34 = zext i32 %8 to i64, !dbg !91\n %35 = shl i64 %34, 32, !dbg !91\n %36 = or i64 %33, %35, !dbg !91\n %37 = zext i32 %10 to i64, !dbg !91\n %38 = zext i32 %12 to i64, !dbg !91\n %39 = shl i64 %38, 32, !dbg !91\n %40 = or i64 %37, %39, !dbg !91\n %41 = zext i32 %14 to i64, !dbg !91\n %42 = zext i32 %16 to i64, !dbg !91\n %43 = shl i64 %42, 32, !dbg !91\n %44 = or i64 %41, %43, !dbg !91\n %45 = zext i32 %18 to i64, !dbg !91\n %46 = zext i32 %20 to i64, !dbg !91\n %47 = shl i64 %46, 32, !dbg !91\n %48 = or i64 %45, %47, !dbg !91\n %49 = zext i32 %22 to i64, !dbg !91\n %50 = zext i32 %24 to i64, !dbg !91\n %51 = shl i64 %50, 32, !dbg !91\n %52 = or i64 %49, %51, !dbg !91\n %53 = zext i32 %26 to i64, !dbg !91\n %54 = zext i32 %28 to i64, !dbg !91\n %55 = shl i64 %54, 32, !dbg !91\n %56 = or i64 %53, %55, !dbg !91\n %57 = zext i32 %30 to i64, !dbg !91\n %58 = zext i32 %32 to i64, !dbg !91\n %59 = shl i64 %58, 32, !dbg !91\n %60 = or i64 %57, %59, !dbg !91\n %61 = mul i64 %48, 16, !dbg !91\n %62 = extractvalue %iree_hal_executable_dispatch_state_v0_t.7 %4, 10, !dbg !91\n %63 = load ptr, ptr %62, align 8, !dbg !91\n %64 = mul i64 %56, 16, !dbg !91\n %65 = mul i64 %60, 256, !dbg !91\n %66 = getelementptr ptr, ptr %62, i32 1, !dbg !91\n %67 = load ptr, ptr %66, align 8, !dbg !91\n %68 = load %iree_hal_executable_workgroup_state_v0_t.8, ptr %2, align 8, !dbg !91\n %69 = extractvalue %iree_hal_executable_workgroup_state_v0_t.8 %68, 0, !dbg !91\n %70 = zext i32 %69 to i64, !dbg !91\n %71 = extractvalue %iree_hal_executable_dispatch_state_v0_t.7 %4, 4, !dbg !91\n %72 = zext i32 %71 to i64, !dbg !91\n %73 = extractvalue %iree_hal_executable_workgroup_state_v0_t.8 %68, 1, !dbg !91\n %74 = zext i32 %73 to i64, !dbg !91\n %75 = extractvalue %iree_hal_executable_dispatch_state_v0_t.7 %4, 5, !dbg !91\n %76 = zext i32 %75 to i64, !dbg !91\n br label %77, !dbg 
!91\n\n77: ; preds = %147, %3\n %78 = phi i64 [ %148, %147 ], [ %74, %3 ]\n %79 = icmp slt i64 %78, %44, !dbg !91\n br i1 %79, label %80, label %149, !dbg !91\n\n80: ; preds = %83, %77\n %81 = phi i64 [ %146, %83 ], [ %70, %77 ]\n %82 = icmp slt i64 %81, %52, !dbg !91\n br i1 %82, label %83, label %147, !dbg !91\n\n83: ; preds = %80\n %84 = mul i64 %78, %61, !dbg !91\n %85 = icmp slt i64 %36, 0, !dbg !91\n %86 = sub i64 -1, %36, !dbg !91\n %87 = select i1 %85, i64 %86, i64 %36, !dbg !91\n %88 = sdiv i64 %87, 4, !dbg !91\n %89 = sub i64 -1, %88, !dbg !91\n %90 = select i1 %85, i64 %89, i64 %88, !dbg !91\n %91 = mul i64 %81, %64, !dbg !91\n %92 = add i64 %90, %91, !dbg !91\n %93 = mul i64 %81, 256, !dbg !91\n %94 = icmp slt i64 %40, 0, !dbg !91\n %95 = sub i64 -1, %40, !dbg !91\n %96 = select i1 %94, i64 %95, i64 %40, !dbg !91\n %97 = sdiv i64 %96, 4, !dbg !91\n %98 = sub i64 -1, %97, !dbg !91\n %99 = select i1 %94, i64 %98, i64 %97, !dbg !91\n %100 = add i64 %93, %99, !dbg !91\n %101 = mul i64 %78, %65, !dbg !91\n %102 = add i64 %100, %101, !dbg !91\n %103 = getelementptr inbounds ptr, ptr %0, i32 4, !dbg !91\n %104 = alloca i64, i64 8, align 8, !dbg !91\n %105 = load i64, ptr %103, align 4, !dbg !91\n %106 = or i64 %105, 4293970975, !dbg !91\n store i64 %106, ptr %104, align 4, !dbg !91\n %107 = getelementptr inbounds i64, ptr %103, i32 1, !dbg !91\n %108 = load i64, ptr %107, align 4, !dbg !91\n %109 = getelementptr inbounds i64, ptr %104, i32 1, !dbg !91\n store i64 %108, ptr %109, align 4, !dbg !91\n %110 = getelementptr inbounds i64, ptr %103, i32 2, !dbg !91\n %111 = load i64, ptr %110, align 4, !dbg !91\n %112 = getelementptr inbounds i64, ptr %104, i32 2, !dbg !91\n store i64 %111, ptr %112, align 4, !dbg !91\n %113 = getelementptr inbounds i64, ptr %103, i32 3, !dbg !91\n %114 = load i64, ptr %113, align 4, !dbg !91\n %115 = getelementptr inbounds i64, ptr %104, i32 3, !dbg !91\n store i64 %114, ptr %115, align 4, !dbg !91\n %116 = getelementptr inbounds i64, ptr %103, i32 4, !dbg !91\n %117 = load i64, ptr %116, align 4, !dbg !91\n %118 = getelementptr inbounds i64, ptr %104, i32 4, !dbg !91\n store i64 %117, ptr %118, align 4, !dbg !91\n %119 = getelementptr inbounds i64, ptr %103, i32 5, !dbg !91\n %120 = load i64, ptr %119, align 4, !dbg !91\n %121 = getelementptr inbounds i64, ptr %104, i32 5, !dbg !91\n store i64 %120, ptr %121, align 4, !dbg !91\n %122 = getelementptr inbounds i64, ptr %103, i32 6, !dbg !91\n %123 = load i64, ptr %122, align 4, !dbg !91\n %124 = getelementptr inbounds i64, ptr %104, i32 6, !dbg !91\n store i64 %123, ptr %124, align 4, !dbg !91\n %125 = getelementptr inbounds i64, ptr %103, i32 7, !dbg !91\n %126 = load i64, ptr %125, align 4, !dbg !91\n %127 = getelementptr inbounds i64, ptr %104, i32 7, !dbg !91\n store i64 %126, ptr %127, align 4, !dbg !91\n %128 = alloca { ptr, i64, i64, ptr, i64, i64, ptr, i64, i64, i64, i64, i64, i32, i32, i32, i32, ptr }, i64 1, align 8, !dbg !91\n %129 = insertvalue { ptr, i64, i64, ptr, i64, i64, ptr, i64, i64, i64, i64, i64, i32, i32, i32, i32, ptr } undef, ptr %63, 0, !dbg !91\n %130 = insertvalue { ptr, i64, i64, ptr, i64, i64, ptr, i64, i64, i64, i64, i64, i32, i32, i32, i32, ptr } %129, i64 %84, 1, !dbg !91\n %131 = insertvalue { ptr, i64, i64, ptr, i64, i64, ptr, i64, i64, i64, i64, i64, i32, i32, i32, i32, ptr } %130, i64 %61, 2, !dbg !91\n %132 = insertvalue { ptr, i64, i64, ptr, i64, i64, ptr, i64, i64, i64, i64, i64, i32, i32, i32, i32, ptr } %131, ptr %63, 3, !dbg !91\n %133 = insertvalue { ptr, i64, i64, 
ptr, i64, i64, ptr, i64, i64, i64, i64, i64, i32, i32, i32, i32, ptr } %132, i64 %92, 4, !dbg !91\n %134 = insertvalue { ptr, i64, i64, ptr, i64, i64, ptr, i64, i64, i64, i64, i64, i32, i32, i32, i32, ptr } %133, i64 %64, 5, !dbg !91\n %135 = insertvalue { ptr, i64, i64, ptr, i64, i64, ptr, i64, i64, i64, i64, i64, i32, i32, i32, i32, ptr } %134, ptr %67, 6, !dbg !91\n %136 = insertvalue { ptr, i64, i64, ptr, i64, i64, ptr, i64, i64, i64, i64, i64, i32, i32, i32, i32, ptr } %135, i64 %102, 7, !dbg !91\n %137 = insertvalue { ptr, i64, i64, ptr, i64, i64, ptr, i64, i64, i64, i64, i64, i32, i32, i32, i32, ptr } %136, i64 %65, 8, !dbg !91\n %138 = insertvalue { ptr, i64, i64, ptr, i64, i64, ptr, i64, i64, i64, i64, i64, i32, i32, i32, i32, ptr } %137, i64 1, 9, !dbg !91\n %139 = insertvalue { ptr, i64, i64, ptr, i64, i64, ptr, i64, i64, i64, i64, i64, i32, i32, i32, i32, ptr } %138, i64 1, 10, !dbg !91\n %140 = insertvalue { ptr, i64, i64, ptr, i64, i64, ptr, i64, i64, i64, i64, i64, i32, i32, i32, i32, ptr } %139, i64 %48, 11, !dbg !91\n %141 = insertvalue { ptr, i64, i64, ptr, i64, i64, ptr, i64, i64, i64, i64, i64, i32, i32, i32, i32, ptr } %140, i32 16, 12, !dbg !91\n %142 = insertvalue { ptr, i64, i64, ptr, i64, i64, ptr, i64, i64, i64, i64, i64, i32, i32, i32, i32, ptr } %141, i32 16, 13, !dbg !91\n %143 = insertvalue { ptr, i64, i64, ptr, i64, i64, ptr, i64, i64, i64, i64, i64, i32, i32, i32, i32, ptr } %142, i32 1, 14, !dbg !91\n %144 = insertvalue { ptr, i64, i64, ptr, i64, i64, ptr, i64, i64, i64, i64, i64, i32, i32, i32, i32, ptr } %143, i32 1281, 15, !dbg !91\n %145 = insertvalue { ptr, i64, i64, ptr, i64, i64, ptr, i64, i64, i64, i64, i64, i32, i32, i32, i32, ptr } %144, ptr %104, 16, !dbg !91\n store { ptr, i64, i64, ptr, i64, i64, ptr, i64, i64, i64, i64, i64, i32, i32, i32, i32, ptr } %145, ptr %128, align 8, !dbg !91\n call void @iree_uk_mmt4d(ptr %128), !dbg !91\n %146 = add i64 %81, %72, !dbg !91\n br label %80, !dbg !91\n\n147: ; preds = %80\n %148 = add i64 %78, %76, !dbg !91\n br label %77, !dbg !91\n\n149: ; preds = %77\n ret i32 0, !dbg !91\n}\n
; Function Attrs: nounwind\ndefine dso_local noundef i32 @iree_uk_mmt4d(ptr noundef %0) local_unnamed_addr #10 {\n %2 = getelementptr inbounds %struct.iree_uk_mmt4d_params_t, ptr %0, i64 0, i32 9\n %3 = load i64, ptr %2, align 8, !tbaa !1001\n %4 = icmp eq i64 %3, 0\n br i1 %4, label %133, label %5\n\n5: ; preds = %1\n %6 = getelementptr inbounds %struct.iree_uk_mmt4d_params_t, ptr %0, i64 0, i32 10\n %7 = load i64, ptr %6, align 8, !tbaa !1002\n %8 = icmp eq i64 %7, 0\n br i1 %8, label %133, label %9\n\n9: ; preds = %5\n %10 = getelementptr inbounds %struct.iree_uk_mmt4d_params_t, ptr %0, i64 0, i32 11\n %11 = load i64, ptr %10, align 8, !tbaa !19\n %12 = icmp eq i64 %11, 0\n br i1 %12, label %13, label %18\n\n13: ; preds = %9\n %14 = getelementptr inbounds %struct.iree_uk_mmt4d_params_t, ptr %0, i64 0, i32 15\n %15 = load i32, ptr %14, align 4, !tbaa !9\n %16 = and i32 %15, 256\n %17 = icmp eq i32 %16, 0\n br i1 %17, label %18, label %133\n\n18: ; preds = %13, %9\n %19 = tail call ptr @iree_uk_mmt4d_select_tile_func(ptr noundef nonnull %0) #14\n %20 = load i64, ptr %2, align 8, !tbaa !1001\n %21 = trunc i64 %20 to i32\n %22 = load i64, ptr %6, align 8, !tbaa !1002\n %23 = trunc i64 %22 to i32\n %24 = getelementptr inbounds %struct.iree_uk_mmt4d_params_t, ptr %0, i64 0, i32 15\n %25 = load i32, ptr %24, align 4, !tbaa !9\n %26 = zext i32 %25 to i64\n %27 = shl i64 %26, 56\n %28 = add i64 %27, -72057594037927936\n %29 = ashr exact i64 %28, 56\n %30 = getelementptr inbounds [9 x i32], ptr @switch.table.iree_uk_mmt4d, i64 0, i64 %29\n %31 = load i32, ptr %30, align 4\n %32 = lshr i32 %31, 8\n %33 = and i32 %31, 7\n %34 = and i32 %32, 7\n %35 = and i32 %31, 327680\n %36 = add nsw i32 %35, -196608\n %37 = lshr exact i32 %36, 16\n %38 = zext nneg i32 %37 to i64\n %39 = zext nneg i32 %33 to i64\n %40 = getelementptr inbounds %struct.iree_uk_mmt4d_params_t, ptr %0, i64 0, i32 3\n %41 = load ptr, ptr %40, align 8, !tbaa !1003\n %42 = getelementptr inbounds %struct.iree_uk_mmt4d_params_t, ptr %0, i64 0, i32 4\n %43 = load i64, ptr %42, align 8, !tbaa !1004\n %44 = zext nneg i32 %34 to i64\n %45 = shl i64 %43, %44\n %46 = sdiv i64 %45, 8\n %47 = getelementptr inbounds i8, ptr %41, i64 %46\n %48 = getelementptr inbounds %struct.iree_uk_mmt4d_params_t, ptr %0, i64 0, i32 2\n %49 = load i64, ptr %48, align 8, !tbaa !1005\n %50 = shl i64 %49, %39\n %51 = sdiv i64 %50, 8\n %52 = getelementptr inbounds %struct.iree_uk_mmt4d_params_t, ptr %0, i64 0, i32 5\n %53 = load i64, ptr %52, align 8, !tbaa !1006\n %54 = shl i64 %53, %44\n %55 = sdiv i64 %54, 8\n %56 = getelementptr inbounds %struct.iree_uk_mmt4d_params_t, ptr %0, i64 0, i32 8\n %57 = load i64, ptr %56, align 8, !tbaa !1007\n %58 = shl i64 %57, %38\n %59 = icmp sgt i32 %21, 0\n br i1 %59, label %60, label %133\n\n60: ; preds = %18\n %61 = getelementptr inbounds %struct.iree_uk_mmt4d_params_t, ptr %0, i64 0, i32 13\n %62 = load i32, ptr %61, align 4, !tbaa !996\n %63 = getelementptr inbounds %struct.iree_uk_mmt4d_params_t, ptr %0, i64 0, i32 12\n %64 = load i32, ptr %63, align 8, !tbaa !1000\n %65 = shl i32 %62, 16\n %66 = ashr exact i32 %65, 16\n %67 = shl i32 %64, 16\n %68 = ashr exact i32 %67, 16\n %69 = mul nsw i32 %66, %68\n %70 = shl i32 %69, %37\n %71 = load ptr, ptr %0, align 8, !tbaa !1008\n %72 = getelementptr inbounds %struct.iree_uk_mmt4d_params_t, ptr %0, i64 0, i32 1\n %73 = load i64, ptr %72, align 8, !tbaa !1009\n %74 = shl i64 %73, %39\n %75 = sdiv i64 %74, 8\n %76 = getelementptr inbounds i8, ptr %71, i64 %75\n %77 = getelementptr 
inbounds %struct.iree_uk_mmt4d_params_t, ptr %0, i64 0, i32 6\n %78 = load ptr, ptr %77, align 8, !tbaa !1010\n %79 = getelementptr inbounds %struct.iree_uk_mmt4d_params_t, ptr %0, i64 0, i32 7\n %80 = load i64, ptr %79, align 8, !tbaa !1011\n %81 = shl i64 %80, %38\n %82 = getelementptr inbounds i8, ptr %78, i64 %81\n %83 = icmp sgt i32 %23, 0\n %84 = sext i32 %70 to i64\n br i1 %83, label %90, label %85\n\n85: ; preds = %60\n %86 = and i32 %21, 3\n %87 = icmp ult i32 %21, 4\n br i1 %87, label %121, label %88\n\n88: ; preds = %85\n %89 = and i32 %21, 2147483644\n br label %107\n\n90: ; preds = %60, %102\n %91 = phi i32 [ %105, %102 ], [ 0, %60 ]\n %92 = phi ptr [ %103, %102 ], [ %82, %60 ]\n %93 = phi ptr [ %104, %102 ], [ %76, %60 ]\n tail call void @llvm.prefetch.p0(ptr %92, i32 1, i32 1, i32 1)\n tail call void @llvm.prefetch.p0(ptr %93, i32 0, i32 3, i32 1)\n tail call void @llvm.prefetch.p0(ptr %47, i32 0, i32 3, i32 1)\n br label %94\n\n94: ; preds = %94, %90\n %95 = phi i32 [ 0, %90 ], [ %100, %94 ]\n %96 = phi ptr [ %47, %90 ], [ %99, %94 ]\n %97 = phi ptr [ %92, %90 ], [ %98, %94 ]\n tail call void %19(ptr noundef %97, ptr noundef %93, ptr noundef %96, ptr noundef nonnull %0) #14\n %98 = getelementptr inbounds i8, ptr %97, i64 %84\n %99 = getelementptr inbounds i8, ptr %96, i64 %55\n %100 = add nuw nsw i32 %95, 1\n %101 = icmp eq i32 %100, %23\n br i1 %101, label %102, label %94, !llvm.loop !1012\n\n102: ; preds = %94\n %103 = getelementptr inbounds i8, ptr %92, i64 %58\n %104 = getelementptr inbounds i8, ptr %93, i64 %51\n %105 = add nuw nsw i32 %91, 1\n %106 = icmp eq i32 %105, %21\n br i1 %106, label %133, label %90, !llvm.loop !1013\n\n107: ; preds = %107, %88\n %108 = phi ptr [ %82, %88 ], [ %117, %107 ]\n %109 = phi ptr [ %76, %88 ], [ %118, %107 ]\n %110 = phi i32 [ 0, %88 ], [ %119, %107 ]\n tail call void @llvm.prefetch.p0(ptr %108, i32 1, i32 1, i32 1)\n tail call void @llvm.prefetch.p0(ptr %109, i32 0, i32 3, i32 1)\n tail call void @llvm.prefetch.p0(ptr %47, i32 0, i32 3, i32 1)\n %111 = getelementptr inbounds i8, ptr %108, i64 %58\n %112 = getelementptr inbounds i8, ptr %109, i64 %51\n tail call void @llvm.prefetch.p0(ptr %111, i32 1, i32 1, i32 1)\n tail call void @llvm.prefetch.p0(ptr %112, i32 0, i32 3, i32 1)\n tail call void @llvm.prefetch.p0(ptr %47, i32 0, i32 3, i32 1)\n %113 = getelementptr inbounds i8, ptr %111, i64 %58\n %114 = getelementptr inbounds i8, ptr %112, i64 %51\n tail call void @llvm.prefetch.p0(ptr %113, i32 1, i32 1, i32 1)\n tail call void @llvm.prefetch.p0(ptr %114, i32 0, i32 3, i32 1)\n tail call void @llvm.prefetch.p0(ptr %47, i32 0, i32 3, i32 1)\n %115 = getelementptr inbounds i8, ptr %113, i64 %58\n %116 = getelementptr inbounds i8, ptr %114, i64 %51\n tail call void @llvm.prefetch.p0(ptr %115, i32 1, i32 1, i32 1)\n tail call void @llvm.prefetch.p0(ptr %116, i32 0, i32 3, i32 1)\n tail call void @llvm.prefetch.p0(ptr %47, i32 0, i32 3, i32 1)\n %117 = getelementptr inbounds i8, ptr %115, i64 %58\n %118 = getelementptr inbounds i8, ptr %116, i64 %51\n %119 = add i32 %110, 4\n %120 = icmp eq i32 %119, %89\n br i1 %120, label %121, label %107, !llvm.loop !1013\n\n121: ; preds = %107, %85\n %122 = phi ptr [ %82, %85 ], [ %117, %107 ]\n %123 = phi ptr [ %76, %85 ], [ %118, %107 ]\n %124 = icmp eq i32 %86, 0\n br i1 %124, label %133, label %125\n\n125: ; preds = %121, %125\n %126 = phi ptr [ %129, %125 ], [ %122, %121 ]\n %127 = phi ptr [ %130, %125 ], [ %123, %121 ]\n %128 = phi i32 [ %131, %125 ], [ 0, %121 ]\n tail call void 
@llvm.prefetch.p0(ptr %126, i32 1, i32 1, i32 1)\n tail call void @llvm.prefetch.p0(ptr %127, i32 0, i32 3, i32 1)\n tail call void @llvm.prefetch.p0(ptr %47, i32 0, i32 3, i32 1)\n %129 = getelementptr inbounds i8, ptr %126, i64 %58\n %130 = getelementptr inbounds i8, ptr %127, i64 %51\n %131 = add i32 %128, 1\n %132 = icmp eq i32 %131, %86\n br i1 %132, label %133, label %125, !llvm.loop !1014\n\n133: ; preds = %121, %125, %102, %1, %5, %13, %18\n ret i32 0\n}\n
; Function Attrs: nofree norecurse nosync nounwind memory(read, argmem: readwrite, inaccessiblemem: readwrite)\ndefine dso_local void @iree_uk_mmt4d_tile_f32f32f32_16x16x1_x86_64_avx512_base(ptr noalias nocapture noundef %0, ptr noalias nocapture noundef readonly %1, ptr noalias nocapture noundef readonly %2, ptr nocapture noundef readonly %3) #4 {\n tail call void @llvm.experimental.noalias.scope.decl(metadata !367)\n tail call void @llvm.experimental.noalias.scope.decl(metadata !370)\n tail call void @llvm.experimental.noalias.scope.decl(metadata !372)\n tail call void @llvm.prefetch.p0(ptr %1, i32 0, i32 3, i32 1), !noalias !374\n tail call void @llvm.prefetch.p0(ptr %2, i32 0, i32 3, i32 1), !noalias !375\n %5 = getelementptr inbounds %struct.iree_uk_mmt4d_params_t, ptr %3, i64 0, i32 15\n %6 = load i32, ptr %5, align 4, !tbaa !9, !noalias !376\n %7 = and i32 %6, 256\n %8 = icmp eq i32 %7, 0\n br i1 %8, label %41, label %9\n\n9: ; preds = %4\n %10 = load <16 x float>, ptr %0, align 1, !tbaa !17, !alias.scope !367, !noalias !377\n %11 = getelementptr inbounds float, ptr %0, i64 16\n %12 = load <16 x float>, ptr %11, align 1, !tbaa !17, !alias.scope !367, !noalias !377\n %13 = getelementptr inbounds float, ptr %0, i64 32\n %14 = load <16 x float>, ptr %13, align 1, !tbaa !17, !alias.scope !367, !noalias !377\n %15 = getelementptr inbounds float, ptr %0, i64 48\n %16 = load <16 x float>, ptr %15, align 1, !tbaa !17, !alias.scope !367, !noalias !377\n %17 = getelementptr inbounds float, ptr %0, i64 64\n %18 = load <16 x float>, ptr %17, align 1, !tbaa !17, !alias.scope !367, !noalias !377\n %19 = getelementptr inbounds float, ptr %0, i64 80\n %20 = load <16 x float>, ptr %19, align 1, !tbaa !17, !alias.scope !367, !noalias !377\n %21 = getelementptr inbounds float, ptr %0, i64 96\n %22 = load <16 x float>, ptr %21, align 1, !tbaa !17, !alias.scope !367, !noalias !377\n %23 = getelementptr inbounds float, ptr %0, i64 112\n %24 = load <16 x float>, ptr %23, align 1, !tbaa !17, !alias.scope !367, !noalias !377\n %25 = getelementptr inbounds float, ptr %0, i64 128\n %26 = load <16 x float>, ptr %25, align 1, !tbaa !17, !alias.scope !367, !noalias !377\n %27 = getelementptr inbounds float, ptr %0, i64 144\n %28 = load <16 x float>, ptr %27, align 1, !tbaa !17, !alias.scope !367, !noalias !377\n %29 = getelementptr inbounds float, ptr %0, i64 160\n %30 = load <16 x float>, ptr %29, align 1, !tbaa !17, !alias.scope !367, !noalias !377\n %31 = getelementptr inbounds float, ptr %0, i64 176\n %32 = load <16 x float>, ptr %31, align 1, !tbaa !17, !alias.scope !367, !noalias !377\n %33 = getelementptr inbounds float, ptr %0, i64 192\n %34 = load <16 x float>, ptr %33, align 1, !tbaa !17, !alias.scope !367, !noalias !377\n %35 = getelementptr inbounds float, ptr %0, i64 208\n %36 = load <16 x float>, ptr %35, align 1, !tbaa !17, !alias.scope !367, !noalias !377\n %37 = getelementptr inbounds float, ptr %0, i64 224\n %38 = load <16 x float>, ptr %37, align 1, !tbaa !17, !alias.scope !367, !noalias !377\n %39 = getelementptr inbounds float, ptr %0, i64 240\n %40 = load <16 x float>, ptr %39, align 1, !tbaa !17, !alias.scope !367, !noalias !377\n br label %41\n\n41: ; preds = %4, %9\n %42 = phi <16 x float> [ %40, %9 ], [ zeroinitializer, %4 ]\n %43 = phi <16 x float> [ %38, %9 ], [ zeroinitializer, %4 ]\n %44 = phi <16 x float> [ %36, %9 ], [ zeroinitializer, %4 ]\n %45 = phi <16 x float> [ %34, %9 ], [ zeroinitializer, %4 ]\n %46 = phi <16 x float> [ %32, %9 ], [ zeroinitializer, %4 ]\n %47 = phi <16 x 
float> [ %30, %9 ], [ zeroinitializer, %4 ]\n %48 = phi <16 x float> [ %28, %9 ], [ zeroinitializer, %4 ]\n %49 = phi <16 x float> [ %26, %9 ], [ zeroinitializer, %4 ]\n %50 = phi <16 x float> [ %24, %9 ], [ zeroinitializer, %4 ]\n %51 = phi <16 x float> [ %22, %9 ], [ zeroinitializer, %4 ]\n %52 = phi <16 x float> [ %20, %9 ], [ zeroinitializer, %4 ]\n %53 = phi <16 x float> [ %18, %9 ], [ zeroinitializer, %4 ]\n %54 = phi <16 x float> [ %16, %9 ], [ zeroinitializer, %4 ]\n %55 = phi <16 x float> [ %14, %9 ], [ zeroinitializer, %4 ]\n %56 = phi <16 x float> [ %12, %9 ], [ zeroinitializer, %4 ]\n %57 = phi <16 x float> [ %10, %9 ], [ zeroinitializer, %4 ]\n %58 = getelementptr inbounds %struct.iree_uk_mmt4d_params_t, ptr %3, i64 0, i32 11\n %59 = load i64, ptr %58, align 8, !tbaa !19, !noalias !376\n %60 = icmp sgt i64 %59, 0\n br i1 %60, label %61, label %167\n\n61: ; preds = %41, %61\n %62 = phi <16 x float> [ %161, %61 ], [ %42, %41 ]\n %63 = phi <16 x float> [ %156, %61 ], [ %43, %41 ]\n %64 = phi <16 x float> [ %151, %61 ], [ %44, %41 ]\n %65 = phi <16 x float> [ %146, %61 ], [ %45, %41 ]\n %66 = phi <16 x float> [ %141, %61 ], [ %46, %41 ]\n %67 = phi <16 x float> [ %136, %61 ], [ %47, %41 ]\n %68 = phi <16 x float> [ %131, %61 ], [ %48, %41 ]\n %69 = phi <16 x float> [ %126, %61 ], [ %49, %41 ]\n %70 = phi <16 x float> [ %121, %61 ], [ %50, %41 ]\n %71 = phi <16 x float> [ %116, %61 ], [ %51, %41 ]\n %72 = phi <16 x float> [ %111, %61 ], [ %52, %41 ]\n %73 = phi <16 x float> [ %106, %61 ], [ %53, %41 ]\n %74 = phi <16 x float> [ %101, %61 ], [ %54, %41 ]\n %75 = phi <16 x float> [ %96, %61 ], [ %55, %41 ]\n %76 = phi <16 x float> [ %91, %61 ], [ %56, %41 ]\n %77 = phi <16 x float> [ %86, %61 ], [ %57, %41 ]\n %78 = phi i64 [ %165, %61 ], [ 0, %41 ]\n %79 = phi ptr [ %164, %61 ], [ %1, %41 ]\n %80 = phi ptr [ %162, %61 ], [ %2, %41 ]\n %81 = load <16 x float>, ptr %80, align 1, !tbaa !17, !alias.scope !372, !noalias !375\n %82 = getelementptr inbounds float, ptr %80, i64 128\n tail call void @llvm.prefetch.p0(ptr nonnull %82, i32 0, i32 3, i32 1), !noalias !375\n %83 = load float, ptr %79, align 4, !tbaa !331, !alias.scope !370, !noalias !374\n %84 = insertelement <16 x float> poison, float %83, i64 0\n %85 = shufflevector <16 x float> %84, <16 x float> poison, <16 x i32> zeroinitializer\n %86 = tail call <16 x float> @llvm.fma.v16f32(<16 x float> %85, <16 x float> %81, <16 x float> %77)\n %87 = getelementptr inbounds float, ptr %79, i64 1\n %88 = load float, ptr %87, align 4, !tbaa !331, !alias.scope !370, !noalias !374\n %89 = insertelement <16 x float> poison, float %88, i64 0\n %90 = shufflevector <16 x float> %89, <16 x float> poison, <16 x i32> zeroinitializer\n %91 = tail call <16 x float> @llvm.fma.v16f32(<16 x float> %90, <16 x float> %81, <16 x float> %76)\n %92 = getelementptr inbounds float, ptr %79, i64 2\n %93 = load float, ptr %92, align 4, !tbaa !331, !alias.scope !370, !noalias !374\n %94 = insertelement <16 x float> poison, float %93, i64 0\n %95 = shufflevector <16 x float> %94, <16 x float> poison, <16 x i32> zeroinitializer\n %96 = tail call <16 x float> @llvm.fma.v16f32(<16 x float> %95, <16 x float> %81, <16 x float> %75)\n %97 = getelementptr inbounds float, ptr %79, i64 3\n %98 = load float, ptr %97, align 4, !tbaa !331, !alias.scope !370, !noalias !374\n %99 = insertelement <16 x float> poison, float %98, i64 0\n %100 = shufflevector <16 x float> %99, <16 x float> poison, <16 x i32> zeroinitializer\n %101 = tail call <16 x float> @llvm.fma.v16f32(<16 x 
float> %100, <16 x float> %81, <16 x float> %74)\n %102 = getelementptr inbounds float, ptr %79, i64 4\n %103 = load float, ptr %102, align 4, !tbaa !331, !alias.scope !370, !noalias !374\n %104 = insertelement <16 x float> poison, float %103, i64 0\n %105 = shufflevector <16 x float> %104, <16 x float> poison, <16 x i32> zeroinitializer\n %106 = tail call <16 x float> @llvm.fma.v16f32(<16 x float> %105, <16 x float> %81, <16 x float> %73)\n %107 = getelementptr inbounds float, ptr %79, i64 5\n %108 = load float, ptr %107, align 4, !tbaa !331, !alias.scope !370, !noalias !374\n %109 = insertelement <16 x float> poison, float %108, i64 0\n %110 = shufflevector <16 x float> %109, <16 x float> poison, <16 x i32> zeroinitializer\n %111 = tail call <16 x float> @llvm.fma.v16f32(<16 x float> %110, <16 x float> %81, <16 x float> %72)\n %112 = getelementptr inbounds float, ptr %79, i64 6\n %113 = load float, ptr %112, align 4, !tbaa !331, !alias.scope !370, !noalias !374\n %114 = insertelement <16 x float> poison, float %113, i64 0\n %115 = shufflevector <16 x float> %114, <16 x float> poison, <16 x i32> zeroinitializer\n %116 = tail call <16 x float> @llvm.fma.v16f32(<16 x float> %115, <16 x float> %81, <16 x float> %71)\n %117 = getelementptr inbounds float, ptr %79, i64 7\n %118 = load float, ptr %117, align 4, !tbaa !331, !alias.scope !370, !noalias !374\n %119 = insertelement <16 x float> poison, float %118, i64 0\n %120 = shufflevector <16 x float> %119, <16 x float> poison, <16 x i32> zeroinitializer\n %121 = tail call <16 x float> @llvm.fma.v16f32(<16 x float> %120, <16 x float> %81, <16 x float> %70)\n %122 = getelementptr inbounds float, ptr %79, i64 8\n %123 = load float, ptr %122, align 4, !tbaa !331, !alias.scope !370, !noalias !374\n %124 = insertelement <16 x float> poison, float %123, i64 0\n %125 = shufflevector <16 x float> %124, <16 x float> poison, <16 x i32> zeroinitializer\n %126 = tail call <16 x float> @llvm.fma.v16f32(<16 x float> %125, <16 x float> %81, <16 x float> %69)\n %127 = getelementptr inbounds float, ptr %79, i64 9\n %128 = load float, ptr %127, align 4, !tbaa !331, !alias.scope !370, !noalias !374\n %129 = insertelement <16 x float> poison, float %128, i64 0\n %130 = shufflevector <16 x float> %129, <16 x float> poison, <16 x i32> zeroinitializer\n %131 = tail call <16 x float> @llvm.fma.v16f32(<16 x float> %130, <16 x float> %81, <16 x float> %68)\n %132 = getelementptr inbounds float, ptr %79, i64 10\n %133 = load float, ptr %132, align 4, !tbaa !331, !alias.scope !370, !noalias !374\n %134 = insertelement <16 x float> poison, float %133, i64 0\n %135 = shufflevector <16 x float> %134, <16 x float> poison, <16 x i32> zeroinitializer\n %136 = tail call <16 x float> @llvm.fma.v16f32(<16 x float> %135, <16 x float> %81, <16 x float> %67)\n %137 = getelementptr inbounds float, ptr %79, i64 11\n %138 = load float, ptr %137, align 4, !tbaa !331, !alias.scope !370, !noalias !374\n %139 = insertelement <16 x float> poison, float %138, i64 0\n %140 = shufflevector <16 x float> %139, <16 x float> poison, <16 x i32> zeroinitializer\n %141 = tail call <16 x float> @llvm.fma.v16f32(<16 x float> %140, <16 x float> %81, <16 x float> %66)\n %142 = getelementptr inbounds float, ptr %79, i64 12\n %143 = load float, ptr %142, align 4, !tbaa !331, !alias.scope !370, !noalias !374\n %144 = insertelement <16 x float> poison, float %143, i64 0\n %145 = shufflevector <16 x float> %144, <16 x float> poison, <16 x i32> zeroinitializer\n %146 = tail call <16 x float> 
@llvm.fma.v16f32(<16 x float> %145, <16 x float> %81, <16 x float> %65)\n %147 = getelementptr inbounds float, ptr %79, i64 13\n %148 = load float, ptr %147, align 4, !tbaa !331, !alias.scope !370, !noalias !374\n %149 = insertelement <16 x float> poison, float %148, i64 0\n %150 = shufflevector <16 x float> %149, <16 x float> poison, <16 x i32> zeroinitializer\n %151 = tail call <16 x float> @llvm.fma.v16f32(<16 x float> %150, <16 x float> %81, <16 x float> %64)\n %152 = getelementptr inbounds float, ptr %79, i64 14\n %153 = load float, ptr %152, align 4, !tbaa !331, !alias.scope !370, !noalias !374\n %154 = insertelement <16 x float> poison, float %153, i64 0\n %155 = shufflevector <16 x float> %154, <16 x float> poison, <16 x i32> zeroinitializer\n %156 = tail call <16 x float> @llvm.fma.v16f32(<16 x float> %155, <16 x float> %81, <16 x float> %63)\n %157 = getelementptr inbounds float, ptr %79, i64 15\n %158 = load float, ptr %157, align 4, !tbaa !331, !alias.scope !370, !noalias !374\n %159 = insertelement <16 x float> poison, float %158, i64 0\n %160 = shufflevector <16 x float> %159, <16 x float> poison, <16 x i32> zeroinitializer\n %161 = tail call <16 x float> @llvm.fma.v16f32(<16 x float> %160, <16 x float> %81, <16 x float> %62)\n %162 = getelementptr inbounds float, ptr %80, i64 16\n %163 = getelementptr inbounds float, ptr %79, i64 128\n tail call void @llvm.prefetch.p0(ptr nonnull %163, i32 0, i32 3, i32 1), !noalias !374\n %164 = getelementptr inbounds float, ptr %79, i64 16\n %165 = add nuw nsw i64 %78, 1\n %166 = icmp eq i64 %165, %59\n br i1 %166, label %167, label %61, !llvm.loop !333\n\n167: ; preds = %61, %41\n %168 = phi <16 x float> [ %42, %41 ], [ %161, %61 ]\n %169 = phi <16 x float> [ %43, %41 ], [ %156, %61 ]\n %170 = phi <16 x float> [ %44, %41 ], [ %151, %61 ]\n %171 = phi <16 x float> [ %45, %41 ], [ %146, %61 ]\n %172 = phi <16 x float> [ %46, %41 ], [ %141, %61 ]\n %173 = phi <16 x float> [ %47, %41 ], [ %136, %61 ]\n %174 = phi <16 x float> [ %48, %41 ], [ %131, %61 ]\n %175 = phi <16 x float> [ %49, %41 ], [ %126, %61 ]\n %176 = phi <16 x float> [ %50, %41 ], [ %121, %61 ]\n %177 = phi <16 x float> [ %51, %41 ], [ %116, %61 ]\n %178 = phi <16 x float> [ %52, %41 ], [ %111, %61 ]\n %179 = phi <16 x float> [ %53, %41 ], [ %106, %61 ]\n %180 = phi <16 x float> [ %54, %41 ], [ %101, %61 ]\n %181 = phi <16 x float> [ %55, %41 ], [ %96, %61 ]\n %182 = phi <16 x float> [ %56, %41 ], [ %91, %61 ]\n %183 = phi <16 x float> [ %57, %41 ], [ %86, %61 ]\n store <16 x float> %183, ptr %0, align 1, !tbaa !17, !alias.scope !367, !noalias !377\n %184 = getelementptr inbounds float, ptr %0, i64 16\n store <16 x float> %182, ptr %184, align 1, !tbaa !17, !alias.scope !367, !noalias !377\n %185 = getelementptr inbounds float, ptr %0, i64 32\n store <16 x float> %181, ptr %185, align 1, !tbaa !17, !alias.scope !367, !noalias !377\n %186 = getelementptr inbounds float, ptr %0, i64 48\n store <16 x float> %180, ptr %186, align 1, !tbaa !17, !alias.scope !367, !noalias !377\n %187 = getelementptr inbounds float, ptr %0, i64 64\n store <16 x float> %179, ptr %187, align 1, !tbaa !17, !alias.scope !367, !noalias !377\n %188 = getelementptr inbounds float, ptr %0, i64 80\n store <16 x float> %178, ptr %188, align 1, !tbaa !17, !alias.scope !367, !noalias !377\n %189 = getelementptr inbounds float, ptr %0, i64 96\n store <16 x float> %177, ptr %189, align 1, !tbaa !17, !alias.scope !367, !noalias !377\n %190 = getelementptr inbounds float, ptr %0, i64 112\n store <16 x float> %176, 
ptr %190, align 1, !tbaa !17, !alias.scope !367, !noalias !377\n %191 = getelementptr inbounds float, ptr %0, i64 128\n store <16 x float> %175, ptr %191, align 1, !tbaa !17, !alias.scope !367, !noalias !377\n %192 = getelementptr inbounds float, ptr %0, i64 144\n store <16 x float> %174, ptr %192, align 1, !tbaa !17, !alias.scope !367, !noalias !377\n %193 = getelementptr inbounds float, ptr %0, i64 160\n store <16 x float> %173, ptr %193, align 1, !tbaa !17, !alias.scope !367, !noalias !377\n %194 = getelementptr inbounds float, ptr %0, i64 176\n store <16 x float> %172, ptr %194, align 1, !tbaa !17, !alias.scope !367, !noalias !377\n %195 = getelementptr inbounds float, ptr %0, i64 192\n store <16 x float> %171, ptr %195, align 1, !tbaa !17, !alias.scope !367, !noalias !377\n %196 = getelementptr inbounds float, ptr %0, i64 208\n store <16 x float> %170, ptr %196, align 1, !tbaa !17, !alias.scope !367, !noalias !377\n %197 = getelementptr inbounds float, ptr %0, i64 224\n store <16 x float> %169, ptr %197, align 1, !tbaa !17, !alias.scope !367, !noalias !377\n %198 = getelementptr inbounds float, ptr %0, i64 240\n store <16 x float> %168, ptr %198, align 1, !tbaa !17, !alias.scope !367, !noalias !377\n ret void\n}\n
; Function Attrs: nofree norecurse nosync nounwind\ndefine internal noundef i32 @matmul_dynamic_dispatch_3_mmt4d_DxDxDx16x16x1_f32(ptr noalias nocapture nonnull readonly align 16 %0, ptr noalias nocapture nonnull readonly align 16 %1, ptr noalias nocapture nonnull readonly align 16 %2) #1 !dbg !90 {\n %.elt7 = getelementptr inbounds %iree_hal_executable_dispatch_state_v0_t.19, ptr %1, i64 0, i32 4, !dbg !91\n %.unpack8 = load i32, ptr %.elt7, align 4, !dbg !91\n %.elt9 = getelementptr inbounds %iree_hal_executable_dispatch_state_v0_t.19, ptr %1, i64 0, i32 5, !dbg !91\n %.unpack10 = load i32, ptr %.elt9, align 16, !dbg !91\n %.elt17 = getelementptr inbounds %iree_hal_executable_dispatch_state_v0_t.19, ptr %1, i64 0, i32 9, !dbg !91\n %.unpack18 = load ptr, ptr %.elt17, align 8, !dbg !91\n %.elt19 = getelementptr inbounds %iree_hal_executable_dispatch_state_v0_t.19, ptr %1, i64 0, i32 10, !dbg !91\n %.unpack20 = load ptr, ptr %.elt19, align 16, !dbg !91\n %4 = getelementptr i32, ptr %.unpack18, i64 4, !dbg !91\n %5 = load i64, ptr %4, align 4, !dbg !91\n %6 = getelementptr i32, ptr %.unpack18, i64 6, !dbg !91\n %7 = load i32, ptr %6, align 4, !dbg !91\n %8 = getelementptr i32, ptr %.unpack18, i64 7, !dbg !91\n %9 = load i32, ptr %8, align 4, !dbg !91\n %10 = getelementptr i32, ptr %.unpack18, i64 8, !dbg !91\n %11 = load i64, ptr %10, align 4, !dbg !91\n %12 = getelementptr i32, ptr %.unpack18, i64 10, !dbg !91\n %13 = load i64, ptr %12, align 4, !dbg !91\n %14 = shl i64 %13, 4, !dbg !91\n %15 = getelementptr i32, ptr %.unpack18, i64 14, !dbg !91\n %16 = load i64, ptr %15, align 4, !dbg !91\n %17 = shl i64 %16, 8, !dbg !91\n %18 = zext i32 %7 to i64, !dbg !91\n %19 = zext i32 %9 to i64, !dbg !91\n %20 = shl nuw i64 %19, 32, !dbg !91\n %21 = or disjoint i64 %20, %18, !dbg !91\n %22 = load ptr, ptr %.unpack20, align 8, !dbg !91\n %23 = getelementptr ptr, ptr %.unpack20, i64 1, !dbg !91\n %24 = load ptr, ptr %23, align 8, !dbg !91\n %25 = load %iree_hal_executable_workgroup_state_v0_t.20, ptr %2, align 16, !dbg !91\n %26 = extractvalue %iree_hal_executable_workgroup_state_v0_t.20 %25, 0, !dbg !91\n %27 = zext i32 %26 to i64, !dbg !91\n %28 = zext i32 %.unpack8 to i64, !dbg !91\n %29 = extractvalue %iree_hal_executable_workgroup_state_v0_t.20 %25, 1, !dbg !91\n %30 = zext i32 %29 to i64, !dbg !91\n %31 = zext i32 %.unpack10 to i64, !dbg !91\n %32 = icmp sgt i64 %5, %30, !dbg !91\n br i1 %32, label %.preheader.lr.ph, label %._crit_edge58, !dbg !91\n\n.preheader.lr.ph: ; preds = %3\n %33 = getelementptr i32, ptr %.unpack18, i64 3, !dbg !91\n %34 = load i32, ptr %33, align 4, !dbg !91\n %35 = zext i32 %34 to i64, !dbg !91\n %36 = shl nuw i64 %35, 32, !dbg !91\n %37 = getelementptr i32, ptr %.unpack18, i64 2, !dbg !91\n %38 = load i32, ptr %37, align 4, !dbg !91\n %39 = zext i32 %38 to i64, !dbg !91\n %40 = or disjoint i64 %36, %39, !dbg !91\n %41 = getelementptr i32, ptr %.unpack18, i64 1, !dbg !91\n %42 = load i32, ptr %41, align 4, !dbg !91\n %43 = zext i32 %42 to i64, !dbg !91\n %44 = shl nuw i64 %43, 32, !dbg !91\n %45 = load i32, ptr %.unpack18, align 4, !dbg !91\n %46 = zext i32 %45 to i64, !dbg !91\n %47 = or disjoint i64 %44, %46, !dbg !91\n %48 = icmp sgt i64 %11, %27\n %.lobit = ashr i64 %44, 63\n %49 = xor i64 %47, %.lobit\n %50 = sdiv i64 %49, 4\n %51 = xor i64 %50, %.lobit\n %.lobit24 = ashr i64 %36, 63\n %52 = xor i64 %40, %.lobit24\n %53 = sdiv i64 %52, 4\n %54 = xor i64 %53, %.lobit24\n %55 = icmp eq i64 %21, 0\n %56 = shl i64 %21, 9\n %57 = icmp sgt i64 %21, 0\n br label 
%.preheader, !dbg !91\n\n.preheader: ; preds = %._crit_edge, %.preheader.lr.ph\n %58 = phi i64 [ %30, %.preheader.lr.ph ], [ %228, %._crit_edge ]\n br i1 %48, label %.lr.ph, label %._crit_edge, !dbg !91\n\n.lr.ph: ; preds = %.preheader\n %59 = mul i64 %17, %58\n %60 = add i64 %59, %54\n %61 = mul i64 %56, %58\n %62 = ashr exact i64 %61, 3\n %63 = getelementptr inbounds i8, ptr %22, i64 %62\n %64 = shl i64 %60, 2\n %invariant.gep = getelementptr i8, ptr %24, i64 %64, !dbg !91\n br label %65, !dbg !91\n\n65: ; preds = %iree_uk_mmt4d.exit, %.lr.ph\n %66 = phi i64 [ %27, %.lr.ph ], [ %226, %iree_uk_mmt4d.exit ]\n br i1 %55, label %iree_uk_mmt4d.exit, label %67, !dbg !91\n\n67: ; preds = %65\n %68 = mul i64 %14, %66, !dbg !91\n %69 = add i64 %68, %51, !dbg !91\n %70 = shl i64 %69, 5, !dbg !91\n %71 = ashr exact i64 %70, 3, !dbg !91\n %72 = getelementptr inbounds i8, ptr %22, i64 %71, !dbg !91\n %73 = shl i64 %66, 10, !dbg !91\n %gep = getelementptr i8, ptr %invariant.gep, i64 %73, !dbg !91\n tail call void @llvm.prefetch.p0(ptr %gep, i32 1, i32 1, i32 1), !dbg !91\n tail call void @llvm.prefetch.p0(ptr %63, i32 0, i32 3, i32 1), !dbg !91\n tail call void @llvm.prefetch.p0(ptr %72, i32 0, i32 3, i32 1), !dbg !91\n tail call void @llvm.experimental.noalias.scope.decl(metadata !92), !dbg !91\n tail call void @llvm.experimental.noalias.scope.decl(metadata !95), !dbg !91\n tail call void @llvm.experimental.noalias.scope.decl(metadata !97), !dbg !91\n tail call void @llvm.experimental.noalias.scope.decl(metadata !99), !dbg !91\n tail call void @llvm.experimental.noalias.scope.decl(metadata !102), !dbg !91\n tail call void @llvm.experimental.noalias.scope.decl(metadata !104), !dbg !91\n tail call void @llvm.prefetch.p0(ptr %63, i32 0, i32 3, i32 1), !dbg !91, !noalias !106\n tail call void @llvm.prefetch.p0(ptr %72, i32 0, i32 3, i32 1), !dbg !91, !noalias !107\n %74 = load <16 x float>, ptr %gep, align 1, !dbg !91, !tbaa !108, !alias.scope !111, !noalias !112\n %75 = getelementptr inbounds float, ptr %gep, i64 16, !dbg !91\n %76 = load <16 x float>, ptr %75, align 1, !dbg !91, !tbaa !108, !alias.scope !111, !noalias !112\n %77 = getelementptr inbounds float, ptr %gep, i64 32, !dbg !91\n %78 = load <16 x float>, ptr %77, align 1, !dbg !91, !tbaa !108, !alias.scope !111, !noalias !112\n %79 = getelementptr inbounds float, ptr %gep, i64 48, !dbg !91\n %80 = load <16 x float>, ptr %79, align 1, !dbg !91, !tbaa !108, !alias.scope !111, !noalias !112\n %81 = getelementptr inbounds float, ptr %gep, i64 64, !dbg !91\n %82 = load <16 x float>, ptr %81, align 1, !dbg !91, !tbaa !108, !alias.scope !111, !noalias !112\n %83 = getelementptr inbounds float, ptr %gep, i64 80, !dbg !91\n %84 = load <16 x float>, ptr %83, align 1, !dbg !91, !tbaa !108, !alias.scope !111, !noalias !112\n %85 = getelementptr inbounds float, ptr %gep, i64 96, !dbg !91\n %86 = load <16 x float>, ptr %85, align 1, !dbg !91, !tbaa !108, !alias.scope !111, !noalias !112\n %87 = getelementptr inbounds float, ptr %gep, i64 112, !dbg !91\n %88 = load <16 x float>, ptr %87, align 1, !dbg !91, !tbaa !108, !alias.scope !111, !noalias !112\n %89 = getelementptr inbounds float, ptr %gep, i64 128, !dbg !91\n %90 = load <16 x float>, ptr %89, align 1, !dbg !91, !tbaa !108, !alias.scope !111, !noalias !112\n %91 = getelementptr inbounds float, ptr %gep, i64 144, !dbg !91\n %92 = load <16 x float>, ptr %91, align 1, !dbg !91, !tbaa !108, !alias.scope !111, !noalias !112\n %93 = getelementptr inbounds float, ptr %gep, i64 160, !dbg !91\n %94 = load <16 
x float>, ptr %93, align 1, !dbg !91, !tbaa !108, !alias.scope !111, !noalias !112\n %95 = getelementptr inbounds float, ptr %gep, i64 176, !dbg !91\n %96 = load <16 x float>, ptr %95, align 1, !dbg !91, !tbaa !108, !alias.scope !111, !noalias !112\n %97 = getelementptr inbounds float, ptr %gep, i64 192, !dbg !91\n %98 = load <16 x float>, ptr %97, align 1, !dbg !91, !tbaa !108, !alias.scope !111, !noalias !112\n %99 = getelementptr inbounds float, ptr %gep, i64 208, !dbg !91\n %100 = load <16 x float>, ptr %99, align 1, !dbg !91, !tbaa !108, !alias.scope !111, !noalias !112\n %101 = getelementptr inbounds float, ptr %gep, i64 224, !dbg !91\n %102 = load <16 x float>, ptr %101, align 1, !dbg !91, !tbaa !108, !alias.scope !111, !noalias !112\n %103 = getelementptr inbounds float, ptr %gep, i64 240, !dbg !91\n %104 = load <16 x float>, ptr %103, align 1, !dbg !91, !tbaa !108, !alias.scope !111, !noalias !112\n br i1 %57, label %.preheader.i, label %iree_uk_mmt4d_tile_f32f32f32_16x16x1_x86_64_avx512_base.exit, !dbg !91\n\n.preheader.i: ; preds = %.preheader.i, %67\n %105 = phi <16 x float> [ %204, %.preheader.i ], [ %104, %67 ], !dbg !91\n %106 = phi <16 x float> [ %199, %.preheader.i ], [ %102, %67 ], !dbg !91\n %107 = phi <16 x float> [ %194, %.preheader.i ], [ %100, %67 ], !dbg !91\n %108 = phi <16 x float> [ %189, %.preheader.i ], [ %98, %67 ], !dbg !91\n %109 = phi <16 x float> [ %184, %.preheader.i ], [ %96, %67 ], !dbg !91\n %110 = phi <16 x float> [ %179, %.preheader.i ], [ %94, %67 ], !dbg !91\n %111 = phi <16 x float> [ %174, %.preheader.i ], [ %92, %67 ], !dbg !91\n %112 = phi <16 x float> [ %169, %.preheader.i ], [ %90, %67 ], !dbg !91\n %113 = phi <16 x float> [ %164, %.preheader.i ], [ %88, %67 ], !dbg !91\n %114 = phi <16 x float> [ %159, %.preheader.i ], [ %86, %67 ], !dbg !91\n %115 = phi <16 x float> [ %154, %.preheader.i ], [ %84, %67 ], !dbg !91\n %116 = phi <16 x float> [ %149, %.preheader.i ], [ %82, %67 ], !dbg !91\n %117 = phi <16 x float> [ %144, %.preheader.i ], [ %80, %67 ], !dbg !91\n %118 = phi <16 x float> [ %139, %.preheader.i ], [ %78, %67 ], !dbg !91\n %119 = phi <16 x float> [ %134, %.preheader.i ], [ %76, %67 ], !dbg !91\n %120 = phi <16 x float> [ %129, %.preheader.i ], [ %74, %67 ], !dbg !91\n %121 = phi i64 [ %208, %.preheader.i ], [ 0, %67 ], !dbg !91\n %122 = phi ptr [ %207, %.preheader.i ], [ %63, %67 ], !dbg !91\n %123 = phi ptr [ %205, %.preheader.i ], [ %72, %67 ], !dbg !91\n %124 = load <16 x float>, ptr %123, align 1, !dbg !91, !tbaa !108, !alias.scope !113, !noalias !107\n %125 = getelementptr inbounds float, ptr %123, i64 128, !dbg !91\n tail call void @llvm.prefetch.p0(ptr nonnull %125, i32 0, i32 3, i32 1), !dbg !91, !noalias !107\n %126 = load float, ptr %122, align 4, !dbg !91, !tbaa !114, !alias.scope !116, !noalias !106\n %127 = insertelement <16 x float> poison, float %126, i64 0, !dbg !91\n %128 = shufflevector <16 x float> %127, <16 x float> poison, <16 x i32> zeroinitializer, !dbg !91\n %129 = tail call <16 x float> @llvm.fma.v16f32(<16 x float> %128, <16 x float> %124, <16 x float> %120), !dbg !91\n %130 = getelementptr inbounds float, ptr %122, i64 1, !dbg !91\n %131 = load float, ptr %130, align 4, !dbg !91, !tbaa !114, !alias.scope !116, !noalias !106\n %132 = insertelement <16 x float> poison, float %131, i64 0, !dbg !91\n %133 = shufflevector <16 x float> %132, <16 x float> poison, <16 x i32> zeroinitializer, !dbg !91\n %134 = tail call <16 x float> @llvm.fma.v16f32(<16 x float> %133, <16 x float> %124, <16 x float> %119), !dbg 
!91\n %135 = getelementptr inbounds float, ptr %122, i64 2, !dbg !91\n %136 = load float, ptr %135, align 4, !dbg !91, !tbaa !114, !alias.scope !116, !noalias !106\n %137 = insertelement <16 x float> poison, float %136, i64 0, !dbg !91\n %138 = shufflevector <16 x float> %137, <16 x float> poison, <16 x i32> zeroinitializer, !dbg !91\n %139 = tail call <16 x float> @llvm.fma.v16f32(<16 x float> %138, <16 x float> %124, <16 x float> %118), !dbg !91\n %140 = getelementptr inbounds float, ptr %122, i64 3, !dbg !91\n %141 = load float, ptr %140, align 4, !dbg !91, !tbaa !114, !alias.scope !116, !noalias !106\n %142 = insertelement <16 x float> poison, float %141, i64 0, !dbg !91\n %143 = shufflevector <16 x float> %142, <16 x float> poison, <16 x i32> zeroinitializer, !dbg !91\n %144 = tail call <16 x float> @llvm.fma.v16f32(<16 x float> %143, <16 x float> %124, <16 x float> %117), !dbg !91\n %145 = getelementptr inbounds float, ptr %122, i64 4, !dbg !91\n %146 = load float, ptr %145, align 4, !dbg !91, !tbaa !114, !alias.scope !116, !noalias !106\n %147 = insertelement <16 x float> poison, float %146, i64 0, !dbg !91\n %148 = shufflevector <16 x float> %147, <16 x float> poison, <16 x i32> zeroinitializer, !dbg !91\n %149 = tail call <16 x float> @llvm.fma.v16f32(<16 x float> %148, <16 x float> %124, <16 x float> %116), !dbg !91\n %150 = getelementptr inbounds float, ptr %122, i64 5, !dbg !91\n %151 = load float, ptr %150, align 4, !dbg !91, !tbaa !114, !alias.scope !116, !noalias !106\n %152 = insertelement <16 x float> poison, float %151, i64 0, !dbg !91\n %153 = shufflevector <16 x float> %152, <16 x float> poison, <16 x i32> zeroinitializer, !dbg !91\n %154 = tail call <16 x float> @llvm.fma.v16f32(<16 x float> %153, <16 x float> %124, <16 x float> %115), !dbg !91\n %155 = getelementptr inbounds float, ptr %122, i64 6, !dbg !91\n %156 = load float, ptr %155, align 4, !dbg !91, !tbaa !114, !alias.scope !116, !noalias !106\n %157 = insertelement <16 x float> poison, float %156, i64 0, !dbg !91\n %158 = shufflevector <16 x float> %157, <16 x float> poison, <16 x i32> zeroinitializer, !dbg !91\n %159 = tail call <16 x float> @llvm.fma.v16f32(<16 x float> %158, <16 x float> %124, <16 x float> %114), !dbg !91\n %160 = getelementptr inbounds float, ptr %122, i64 7, !dbg !91\n %161 = load float, ptr %160, align 4, !dbg !91, !tbaa !114, !alias.scope !116, !noalias !106\n %162 = insertelement <16 x float> poison, float %161, i64 0, !dbg !91\n %163 = shufflevector <16 x float> %162, <16 x float> poison, <16 x i32> zeroinitializer, !dbg !91\n %164 = tail call <16 x float> @llvm.fma.v16f32(<16 x float> %163, <16 x float> %124, <16 x float> %113), !dbg !91\n %165 = getelementptr inbounds float, ptr %122, i64 8, !dbg !91\n %166 = load float, ptr %165, align 4, !dbg !91, !tbaa !114, !alias.scope !116, !noalias !106\n %167 = insertelement <16 x float> poison, float %166, i64 0, !dbg !91\n %168 = shufflevector <16 x float> %167, <16 x float> poison, <16 x i32> zeroinitializer, !dbg !91\n %169 = tail call <16 x float> @llvm.fma.v16f32(<16 x float> %168, <16 x float> %124, <16 x float> %112), !dbg !91\n %170 = getelementptr inbounds float, ptr %122, i64 9, !dbg !91\n %171 = load float, ptr %170, align 4, !dbg !91, !tbaa !114, !alias.scope !116, !noalias !106\n %172 = insertelement <16 x float> poison, float %171, i64 0, !dbg !91\n %173 = shufflevector <16 x float> %172, <16 x float> poison, <16 x i32> zeroinitializer, !dbg !91\n %174 = tail call <16 x float> @llvm.fma.v16f32(<16 x float> %173, <16 x float> 
%124, <16 x float> %111), !dbg !91\n %175 = getelementptr inbounds float, ptr %122, i64 10, !dbg !91\n %176 = load float, ptr %175, align 4, !dbg !91, !tbaa !114, !alias.scope !116, !noalias !106\n %177 = insertelement <16 x float> poison, float %176, i64 0, !dbg !91\n %178 = shufflevector <16 x float> %177, <16 x float> poison, <16 x i32> zeroinitializer, !dbg !91\n %179 = tail call <16 x float> @llvm.fma.v16f32(<16 x float> %178, <16 x float> %124, <16 x float> %110), !dbg !91\n %180 = getelementptr inbounds float, ptr %122, i64 11, !dbg !91\n %181 = load float, ptr %180, align 4, !dbg !91, !tbaa !114, !alias.scope !116, !noalias !106\n %182 = insertelement <16 x float> poison, float %181, i64 0, !dbg !91\n %183 = shufflevector <16 x float> %182, <16 x float> poison, <16 x i32> zeroinitializer, !dbg !91\n %184 = tail call <16 x float> @llvm.fma.v16f32(<16 x float> %183, <16 x float> %124, <16 x float> %109), !dbg !91\n %185 = getelementptr inbounds float, ptr %122, i64 12, !dbg !91\n %186 = load float, ptr %185, align 4, !dbg !91, !tbaa !114, !alias.scope !116, !noalias !106\n %187 = insertelement <16 x float> poison, float %186, i64 0, !dbg !91\n %188 = shufflevector <16 x float> %187, <16 x float> poison, <16 x i32> zeroinitializer, !dbg !91\n %189 = tail call <16 x float> @llvm.fma.v16f32(<16 x float> %188, <16 x float> %124, <16 x float> %108), !dbg !91\n %190 = getelementptr inbounds float, ptr %122, i64 13, !dbg !91\n %191 = load float, ptr %190, align 4, !dbg !91, !tbaa !114, !alias.scope !116, !noalias !106\n %192 = insertelement <16 x float> poison, float %191, i64 0, !dbg !91\n %193 = shufflevector <16 x float> %192, <16 x float> poison, <16 x i32> zeroinitializer, !dbg !91\n %194 = tail call <16 x float> @llvm.fma.v16f32(<16 x float> %193, <16 x float> %124, <16 x float> %107), !dbg !91\n %195 = getelementptr inbounds float, ptr %122, i64 14, !dbg !91\n %196 = load float, ptr %195, align 4, !dbg !91, !tbaa !114, !alias.scope !116, !noalias !106\n %197 = insertelement <16 x float> poison, float %196, i64 0, !dbg !91\n %198 = shufflevector <16 x float> %197, <16 x float> poison, <16 x i32> zeroinitializer, !dbg !91\n %199 = tail call <16 x float> @llvm.fma.v16f32(<16 x float> %198, <16 x float> %124, <16 x float> %106), !dbg !91\n %200 = getelementptr inbounds float, ptr %122, i64 15, !dbg !91\n %201 = load float, ptr %200, align 4, !dbg !91, !tbaa !114, !alias.scope !116, !noalias !106\n %202 = insertelement <16 x float> poison, float %201, i64 0, !dbg !91\n %203 = shufflevector <16 x float> %202, <16 x float> poison, <16 x i32> zeroinitializer, !dbg !91\n %204 = tail call <16 x float> @llvm.fma.v16f32(<16 x float> %203, <16 x float> %124, <16 x float> %105), !dbg !91\n %205 = getelementptr inbounds float, ptr %123, i64 16, !dbg !91\n %206 = getelementptr inbounds float, ptr %122, i64 128, !dbg !91\n tail call void @llvm.prefetch.p0(ptr nonnull %206, i32 0, i32 3, i32 1), !dbg !91, !noalias !106\n %207 = getelementptr inbounds float, ptr %122, i64 16, !dbg !91\n %208 = add nuw nsw i64 %121, 1, !dbg !91\n %209 = icmp eq i64 %208, %21, !dbg !91\n br i1 %209, label %iree_uk_mmt4d_tile_f32f32f32_16x16x1_x86_64_avx512_base.exit, label %.preheader.i, !dbg !91, !llvm.loop !117\n\niree_uk_mmt4d_tile_f32f32f32_16x16x1_x86_64_avx512_base.exit: ; preds = %.preheader.i, %67\n %210 = phi <16 x float> [ %104, %67 ], [ %204, %.preheader.i ], !dbg !91\n %211 = phi <16 x float> [ %102, %67 ], [ %199, %.preheader.i ], !dbg !91\n %212 = phi <16 x float> [ %100, %67 ], [ %194, %.preheader.i ], 
!dbg !91\n %213 = phi <16 x float> [ %98, %67 ], [ %189, %.preheader.i ], !dbg !91\n %214 = phi <16 x float> [ %96, %67 ], [ %184, %.preheader.i ], !dbg !91\n %215 = phi <16 x float> [ %94, %67 ], [ %179, %.preheader.i ], !dbg !91\n %216 = phi <16 x float> [ %92, %67 ], [ %174, %.preheader.i ], !dbg !91\n %217 = phi <16 x float> [ %90, %67 ], [ %169, %.preheader.i ], !dbg !91\n %218 = phi <16 x float> [ %88, %67 ], [ %164, %.preheader.i ], !dbg !91\n %219 = phi <16 x float> [ %86, %67 ], [ %159, %.preheader.i ], !dbg !91\n %220 = phi <16 x float> [ %84, %67 ], [ %154, %.preheader.i ], !dbg !91\n %221 = phi <16 x float> [ %82, %67 ], [ %149, %.preheader.i ], !dbg !91\n %222 = phi <16 x float> [ %80, %67 ], [ %144, %.preheader.i ], !dbg !91\n %223 = phi <16 x float> [ %78, %67 ], [ %139, %.preheader.i ], !dbg !91\n %224 = phi <16 x float> [ %76, %67 ], [ %134, %.preheader.i ], !dbg !91\n %225 = phi <16 x float> [ %74, %67 ], [ %129, %.preheader.i ], !dbg !91\n store <16 x float> %225, ptr %gep, align 1, !dbg !91, !tbaa !108, !alias.scope !111, !noalias !112\n store <16 x float> %224, ptr %75, align 1, !dbg !91, !tbaa !108, !alias.scope !111, !noalias !112\n store <16 x float> %223, ptr %77, align 1, !dbg !91, !tbaa !108, !alias.scope !111, !noalias !112\n store <16 x float> %222, ptr %79, align 1, !dbg !91, !tbaa !108, !alias.scope !111, !noalias !112\n store <16 x float> %221, ptr %81, align 1, !dbg !91, !tbaa !108, !alias.scope !111, !noalias !112\n store <16 x float> %220, ptr %83, align 1, !dbg !91, !tbaa !108, !alias.scope !111, !noalias !112\n store <16 x float> %219, ptr %85, align 1, !dbg !91, !tbaa !108, !alias.scope !111, !noalias !112\n store <16 x float> %218, ptr %87, align 1, !dbg !91, !tbaa !108, !alias.scope !111, !noalias !112\n store <16 x float> %217, ptr %89, align 1, !dbg !91, !tbaa !108, !alias.scope !111, !noalias !112\n store <16 x float> %216, ptr %91, align 1, !dbg !91, !tbaa !108, !alias.scope !111, !noalias !112\n store <16 x float> %215, ptr %93, align 1, !dbg !91, !tbaa !108, !alias.scope !111, !noalias !112\n store <16 x float> %214, ptr %95, align 1, !dbg !91, !tbaa !108, !alias.scope !111, !noalias !112\n store <16 x float> %213, ptr %97, align 1, !dbg !91, !tbaa !108, !alias.scope !111, !noalias !112\n store <16 x float> %212, ptr %99, align 1, !dbg !91, !tbaa !108, !alias.scope !111, !noalias !112\n store <16 x float> %211, ptr %101, align 1, !dbg !91, !tbaa !108, !alias.scope !111, !noalias !112\n store <16 x float> %210, ptr %103, align 1, !dbg !91, !tbaa !108, !alias.scope !111, !noalias !112\n br label %iree_uk_mmt4d.exit, !dbg !91\n\niree_uk_mmt4d.exit: ; preds = %iree_uk_mmt4d_tile_f32f32f32_16x16x1_x86_64_avx512_base.exit, %65\n %226 = add i64 %66, %28, !dbg !91\n %227 = icmp slt i64 %226, %11, !dbg !91\n br i1 %227, label %65, label %._crit_edge, !dbg !91\n\n._crit_edge: ; preds = %iree_uk_mmt4d.exit, %.preheader\n %228 = add i64 %58, %31, !dbg !91\n %229 = icmp slt i64 %228, %5, !dbg !91\n br i1 %229, label %.preheader, label %._crit_edge58, !dbg !91\n\n._crit_edge58: ; preds = %._crit_edge, %3\n ret i32 0, !dbg !91\n}\n
.section .text.matmul_dynamic_dispatch_3_mmt4d_DxDxDx16x16x1_f32,\"ax\",@progbits\n .p2align 4, 0x90\n .type matmul_dynamic_dispatch_3_mmt4d_DxDxDx16x16x1_f32,@function\nmatmul_dynamic_dispatch_3_mmt4d_DxDxDx16x16x1_f32:\n.Lfunc_begin3:\n .loc 1 1 0 is_stmt 1\n .cfi_startproc\n push rbp\n .cfi_def_cfa_offset 16\n .cfi_offset rbp, -16\n mov rbp, rsp\n .cfi_def_cfa_register rbp\n.Ltmp6:\n push r15\n push r14\n push r13\n push r12\n push rbx\n .cfi_offset rbx, -56\n .cfi_offset r12, -48\n .cfi_offset r13, -40\n .cfi_offset r14, -32\n .cfi_offset r15, -24\n .loc 1 1 1 prologue_end\n mov rcx, qword ptr [rsi + 24]\n mov edi, dword ptr [rdx + 4]\n mov rax, qword ptr [rcx + 16]\n mov qword ptr [rbp - 48], rdi\n mov qword ptr [rbp - 112], rax\n cmp rax, rdi\n jle .LBB3_11\n mov eax, dword ptr [rsi + 16]\n mov edi, dword ptr [rsi + 12]\n mov r12, qword ptr [rsi + 32]\n mov rsi, qword ptr [rcx + 40]\n mov r9, qword ptr [rcx + 56]\n mov ebx, dword ptr [rcx + 4]\n mov r10d, dword ptr [rcx]\n mov r11, qword ptr [rcx + 24]\n mov r14, qword ptr [rcx + 32]\n mov r8, rsi\n shl r8, 4\n mov qword ptr [rbp - 104], rax\n shl r9, 8\n mov rax, qword ptr [r12 + 8]\n shl rbx, 32\n mov qword ptr [rbp - 128], r8\n mov r8d, dword ptr [rcx + 12]\n mov qword ptr [rbp - 96], r9\n mov r9d, dword ptr [rcx + 8]\n or r10, rbx\n sar rbx, 63\n xor r10, rbx\n lea r15, [r10 + 3]\n mov qword ptr [rbp - 80], rax\n mov eax, dword ptr [rdx]\n shl r8, 32\n or r9, r8\n test r10, r10\n cmovns r15, r10\n sar r8, 63\n sar r15, 2\n xor r9, r8\n xor r15, rbx\n lea rcx, [r9 + 3]\n test r9, r9\n mov qword ptr [rbp - 56], rax\n cmovns rcx, r9\n imul rax, rsi\n mov r9, qword ptr [r12]\n imul rsi, rdi\n mov qword ptr [rbp - 120], r15\n sar rcx, 2\n xor rcx, r8\n shl rax, 6\n mov qword ptr [rbp - 88], rcx\n mov rcx, r11\n shl rcx, 9\n shl rsi, 6\n lea rax, [rax + 4*r15]\n mov qword ptr [rbp - 72], rcx\n mov qword ptr [rbp - 64], rax\n jmp .LBB3_2\n .p2align 4, 0x90\n.LBB3_10:\n .loc 1 0 1 is_stmt 0\n mov rax, qword ptr [rbp - 48]\n .loc 1 1 1\n add rax, qword ptr [rbp - 104]\n mov qword ptr [rbp - 48], rax\n cmp rax, qword ptr [rbp - 112]\n jge .LBB3_11\n.LBB3_2:\n .loc 1 0 1\n cmp r14, qword ptr [rbp - 56]\n .loc 1 1 1\n jle .LBB3_10\n .loc 1 0 1\n mov rax, qword ptr [rbp - 96]\n mov rcx, qword ptr [rbp - 48]\n mov r10, qword ptr [rbp - 72]\n mov rdx, qword ptr [rbp - 80]\n mov r8, qword ptr [rbp - 64]\n imul rax, rcx\n add rax, qword ptr [rbp - 88]\n imul r10, rcx\n sar r10, 3\n lea r13, [r9 + r10]\n .loc 1 1 1\n lea r15, [rdx + 4*rax]\n mov rax, qword ptr [rbp - 56]\n jmp .LBB3_4\n .p2align 4, 0x90\n.LBB3_8:\n add rdx, r15\n vmovups zmmword ptr [rdx], zmm15\n vmovups zmmword ptr [rdx + 64], zmm14\n vmovups zmmword ptr [rdx + 128], zmm13\n vmovups zmmword ptr [rdx + 192], zmm12\n vmovups zmmword ptr [rdx + 256], zmm11\n vmovups zmmword ptr [rdx + 320], zmm10\n vmovups zmmword ptr [rdx + 384], zmm9\n vmovups zmmword ptr [rdx + 448], zmm8\n vmovups zmmword ptr [rdx + 512], zmm7\n vmovups zmmword ptr [rdx + 576], zmm6\n vmovups zmmword ptr [rdx + 640], zmm5\n vmovups zmmword ptr [rdx + 704], zmm4\n vmovups zmmword ptr [rdx + 768], zmm3\n vmovups zmmword ptr [rdx + 832], zmm2\n vmovups zmmword ptr [rdx + 896], zmm1\n vmovups zmmword ptr [rdx + 960], zmm0\n.LBB3_9:\n add rax, rdi\n add r8, rsi\n cmp rax, r14\n jge .LBB3_10\n.LBB3_4:\n .loc 1 0 1\n test r11, r11\n .loc 1 1 1\n je .LBB3_9\n .loc 1 0 1\n mov rcx, qword ptr [rbp - 128]\n .loc 1 1 1\n mov rdx, rax\n shl rdx, 10\n prefetchw byte ptr [r15 + rdx]\n prefetcht0 byte ptr [r13]\n imul rcx, 
rax\n add rcx, qword ptr [rbp - 120]\n shl rcx, 5\n sar rcx, 3\n prefetcht0 byte ptr [r9 + rcx]\n prefetcht0 byte ptr [r13]\n prefetcht0 byte ptr [r9 + rcx]\n vmovups zmm15, zmmword ptr [r15 + rdx]\n vmovups zmm14, zmmword ptr [r15 + rdx + 64]\n vmovups zmm13, zmmword ptr [r15 + rdx + 128]\n vmovups zmm12, zmmword ptr [r15 + rdx + 192]\n vmovups zmm11, zmmword ptr [r15 + rdx + 256]\n vmovups zmm10, zmmword ptr [r15 + rdx + 320]\n vmovups zmm9, zmmword ptr [r15 + rdx + 384]\n vmovups zmm8, zmmword ptr [r15 + rdx + 448]\n vmovups zmm7, zmmword ptr [r15 + rdx + 512]\n vmovups zmm6, zmmword ptr [r15 + rdx + 576]\n vmovups zmm5, zmmword ptr [r15 + rdx + 640]\n vmovups zmm4, zmmword ptr [r15 + rdx + 704]\n vmovups zmm3, zmmword ptr [r15 + rdx + 768]\n vmovups zmm2, zmmword ptr [r15 + rdx + 832]\n vmovups zmm1, zmmword ptr [r15 + rdx + 896]\n vmovups zmm0, zmmword ptr [r15 + rdx + 960]\n test r11, r11\n jle .LBB3_8\n .loc 1 0 1\n lea rcx, [8*r8]\n mov r12, r9\n mov rbx, r11\n sar rcx, 3\n add rcx, 512\n .p2align 4, 0x90\n.LBB3_7:\n .loc 1 1 1\n vmovups zmm16, zmmword ptr [r12 + rcx - 512]\n prefetcht0 byte ptr [r12 + rcx]\n vfmadd231ps zmm15, zmm16, dword ptr [r12 + r10]{1to16}\n vfmadd231ps zmm14, zmm16, dword ptr [r12 + r10 + 4]{1to16}\n vfmadd231ps zmm13, zmm16, dword ptr [r12 + r10 + 8]{1to16}\n vfmadd231ps zmm12, zmm16, dword ptr [r12 + r10 + 12]{1to16}\n vfmadd231ps zmm11, zmm16, dword ptr [r12 + r10 + 16]{1to16}\n vfmadd231ps zmm10, zmm16, dword ptr [r12 + r10 + 20]{1to16}\n vfmadd231ps zmm9, zmm16, dword ptr [r12 + r10 + 24]{1to16}\n vfmadd231ps zmm8, zmm16, dword ptr [r12 + r10 + 28]{1to16}\n vfmadd231ps zmm7, zmm16, dword ptr [r12 + r10 + 32]{1to16}\n vfmadd231ps zmm6, zmm16, dword ptr [r12 + r10 + 36]{1to16}\n vfmadd231ps zmm5, zmm16, dword ptr [r12 + r10 + 40]{1to16}\n vfmadd231ps zmm4, zmm16, dword ptr [r12 + r10 + 44]{1to16}\n vfmadd231ps zmm3, zmm16, dword ptr [r12 + r10 + 48]{1to16}\n vfmadd231ps zmm2, zmm16, dword ptr [r12 + r10 + 52]{1to16}\n vfmadd231ps zmm1, zmm16, dword ptr [r12 + r10 + 56]{1to16}\n vfmadd231ps zmm0, zmm16, dword ptr [r12 + r10 + 60]{1to16}\n prefetcht0 byte ptr [r12 + r10 + 512]\n add r12, 64\n dec rbx\n jne .LBB3_7\n jmp .LBB3_8\n.LBB3_11:\n xor eax, eax\n .loc 1 1 1 epilogue_begin\n pop rbx\n pop r12\n pop r13\n pop r14\n pop r15\n pop rbp\n .cfi_def_cfa rsp, 8\n vzeroupper\n ret\n.Ltmp7:\n.Lfunc_end3:\n .size matmul_dynamic_dispatch_3_mmt4d_DxDxDx16x16x1_f32, .Lfunc_end3-matmul_dynamic_dispatch_3_mmt4d_DxDxDx16x16x1_f32\n .cfi_endproc\n
Matrix multiplication (matmul) is an important operation in ML workloads that poses specific challenges to code generation. For example, matmul makes repeated accesses to the same data, which makes locality of reference a top concern.
Moreover, modern CPU instruction set architectures (ISAs) offer specialized SIMD instructions that the matmul implementation needs to use to achieve optimal performance, and these instructions expect data to be in a particular layout.
This article is about an in-development MLIR operation, linalg.mmt4d, offering a compilation path for linalg.matmul that is designed from the ground up for these efficiency considerations.
We are still in the early implementation phase of this linalg.mmt4d plan, but we feel confident that we know where we are going because what we are really doing here is importing into the compiler what we have learned working on optimized matrix multiplication libraries, particularly Ruy. We know what loop schedule and kernel we want the compiler to generate: essentially the same as we wrote in Ruy, give or take additional optimizations such as fusions and constant folding that become possible now that we are doing this within a compiler. This allows us to focus on how we get the compiler to generate that schedule and kernel with purely algebraic transformations that compose and enable further compiler optimizations.
At the basis of this work is the extensible op system of the Linalg dialect in the MLIR compiler toolkit. In this case, a general purpose, mixed precision mmt4d op is defined via a high level description directly in the compiler and is then available both to users of the compiler (as a linalg.mmt4d op) and for direct emission via Python based IR construction (i.e. for direct integration into high level frameworks without rebuilding the compiler). The ability to define such new special forms cheaply, and without any systemic framework level cost, is part of the extensibility and composition story that we expect will become increasingly important in development and deployment scenarios in the future. In this case, it let us springboard off of high quality code generation that was already well integrated and composed well with other features of the compiler.
Let us start by discussing IREE's existing matmul code generation and highlight the issues that mmt4d aims to overcome.
The existing approach operates in-place on the source matrices. When we discuss "tiling" in this paragraph, we refer exclusively to the traversal: how these source matrices are traversed by the matmul loop. There is no "tiled layout" here, which will be the key difference with mmt4d below.
The destination matrix is tiled into workgroup (CPU thread) tiles, then each workgroup tile is tiled to fit some level of CPU cache, and finally each tile is further tiled to fit the target architecture's registers (e.g. 8x8).
That multi-level tiling means that the code works like the following loop nest:
import numpy as np

def tiled_matmul(A, B, C, tile_m, tile_n, tile_k, tile_m_v, tile_n_v, tile_k_v):
    m = A.shape[0]
    k = A.shape[1]
    n = B.shape[1]
    for m1 in range(0, m, tile_m):
        for n1 in range(0, n, tile_n):
            for k1 in range(0, k, tile_k):
                # First level of tiling views...
                lhs_tile = A[m1:m1+tile_m, k1:k1+tile_k]
                rhs_tile = B[k1:k1+tile_k, n1:n1+tile_n]
                dst_tile = C[m1:m1+tile_m, n1:n1+tile_n]
                for mv in range(0, tile_m, tile_m_v):
                    for nv in range(0, tile_n, tile_n_v):
                        for kv in range(0, tile_k, tile_k_v):
                            # Register tiling views...
                            lhs_tile_v = lhs_tile[mv:mv+tile_m_v, kv:kv+tile_k_v]
                            rhs_tile_v = rhs_tile[kv:kv+tile_k_v, nv:nv+tile_n_v]
                            # kernel.
                            dst_tile[mv:mv+tile_m_v, nv:nv+tile_n_v] += np.matmul(lhs_tile_v, rhs_tile_v)
    return C
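As a quick sanity check (our own addition, not part of the original loop nest), this pseudocode can be validated against numpy's matmul, using illustrative sizes that divide evenly into the chosen tile sizes:

import numpy as np

# Illustrative sizes only; they must divide evenly into the tile sizes below,
# since the pseudocode does not handle remainders or padding.
M, K, N = 64, 32, 48
A = np.random.rand(M, K).astype(np.float32)
B = np.random.rand(K, N).astype(np.float32)
C = np.zeros((M, N), dtype=np.float32)

C = tiled_matmul(A, B, C, tile_m=16, tile_n=16, tile_k=8,
                 tile_m_v=8, tile_n_v=8, tile_k_v=4)
assert np.allclose(C, A @ B, rtol=1e-5)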
The two main problems with this approach are:
Overhead to meet SIMD ISA layout requirements: In practice, the kernel needs to use specific SIMD instructions to perform the arithmetic. They expect small tiles of the matrices to be loaded in registers, in a specific layout. If the matrix data wasn't already stored in memory in such a tiled layout, then the kernel has to perform such a data rearrangement on the fly, incurring substantial overhead. For NxN matrix multiplication, the kernel performs O(N³) work on O(N²) data, so doing that rearrangement there means O(N³) overhead where O(N²) should have sufficed, as this could have been done as a pre-processing step on O(N²) data.
Inefficient memory traversal: For efficiency reasons, we always need tile_m_v>1 and tile_n_v>1. That is because the higher these values, the fewer memory-load instructions are needed overall; this is also dictated by the SIMD instructions that we want to use. But that means that the kernel simultaneously accesses multiple rows or columns of the left-hand and right-hand side matrices. And in this existing approach, they are stored in a linear layout, not in a tiled layout, so these accesses are not contiguous in memory. This is detrimental to memory access performance, and in particular to CPU cache behavior, in multiple ways. One is that these multiple non-contiguous accesses may alias each other in the L1 cache because of low associativity.
For the reasons above, an efficient matmul implementation must reorder data into a tiled layout that matches the target SIMD ISA and makes the memory access patterns as contiguous as possible.
IREE/MLIR defaults to bufferizing all tensors into a "row-major" order, meaning that the last-enumerated dimension is the one that is contiguous in memory. As we prefer not to write custom bufferization code, we can't specify an alternative layout for a tensor. Fortunately, it is possible to represent a 2D tiled layout as a 4D layout. For example, tensor<2x2x2x2xf32> can represent a 4x4 matrix made of 2x2 tiles, each of which is 2x2. The row-major layout on tensor<2x2x2x2xf32> makes each 2x2 tile contiguous and row-major, and arranges the 2x2 tiles themselves into a row-major 2x2 layout in the overall 4x4 matrix.
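To make the tensor<2x2x2x2xf32> example concrete, here is a small numpy illustration of our own (assuming the reshape-and-transpose packing described later in this article) showing how the packed row-major 4D layout makes each 2x2 tile contiguous:

import numpy as np

# A 4x4 matrix whose elements are their own row-major memory offsets, 0..15.
m = np.arange(16, dtype=np.float32).reshape(4, 4)

# View it as 2x2 tiles of 2x2 elements:
# dimensions are (tile_row, tile_col, row_in_tile, col_in_tile).
tiled = m.reshape(2, 2, 2, 2).transpose(0, 2, 1, 3)

# tiled[0, 0] is the top-left 2x2 tile: [[0, 1], [4, 5]]. In the original 2D
# row-major buffer these four elements are not contiguous, but once the 4D
# array is materialized in row-major order, each 2x2 tile occupies four
# consecutive elements.
packed = np.ascontiguousarray(tiled)
print(packed.reshape(4, 4))  # each printed row is one contiguous 2x2 tile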
Such a row-major-tiled layout is exactly what we need for the left-hand side of a matrix multiplication, because matrix multiplication traverses the left-hand side matrix row by row. But for the right-hand side matrix, we want a column-major-tiled layout. To solve this problem, we decided to implement not plain matrix multiplication, but matrix multiplication by a transposed right-hand side, which is where the t in linalg.mmt4d comes from. Such an op is now happy with both the left-hand and right-hand sides being row-major-tiled.
The following example illustrates that. In these diagrams, each matrix element is labeled with its memory offset.
To compute the 2x2 block in the destination matrix, we have to load two yellow blocks from the LHS and RHS matrices respectively and compute their matmul result (i.e. call the kernel), then the two blue blocks, and so on. As we can see, each tile loads data that is not contiguous. It would be better if we rearranged the elements into the following layout:
Now tiles are stored contiguously in memory and the kernel can simply load them from memory into the registers that will be directly consumed by the SIMD instructions performing the multiplications. Moreover, the kernel is now loading from just two contiguous data streams, a simple memory access pattern which is sure to be efficient (regarding caches, etc) on any reasonable target hardware.
We introduce a linalg.mmt4d operation that performs such a matrix multiplication on matrices in a tiled layout represented as 4D tensors. That leaves the question of how to represent, within the linalg dialect, the conversions between ordinary matrices represented as 2D tensors and these tiled matrices represented as 4D tensors. Moreover, these conversions should be tileable and decompose well. Thankfully, the transformation from 2D to 4D can be written as a reshape followed by a transpose, as in the following diagram:
So we can think of the outermost two dimensions of the 4D representation as the tile position in the overall matrix, and the innermost two as the element position within one tile. Hopefully the following Python pseudocode makes it more concrete:
def pack_2d_4d(operand, parallel_size, reduction_size):
    i1 = operand.shape[0] // parallel_size   # M1
    i2 = parallel_size                       # M0
    j1 = operand.shape[1] // reduction_size  # K1
    j2 = reduction_size                      # K0
    operand_4d = np.reshape(operand, [i1, i2, j1, j2])
    return np.transpose(operand_4d, [0, 2, 1, 3])  # [M1, K1, M0, K0]
The mmt4d operation then follows the same structure as the multi-level tiling. For simplicity, we consider here the case where no L1 tiling is required, only the first level of distribution to workgroups:
def mmt4d(A, B, C, M0, N0, K0):
    M = A.shape[0]
    N = B.shape[1]
    Bt = np.transpose(B, [1, 0])
    A4d = pack_2d_4d(A, M0, K0)
    Bt4d = pack_2d_4d(Bt, N0, K0)
    M1 = A4d.shape[0]
    N1 = Bt4d.shape[0]
    K1 = A4d.shape[1]
    for m1 in range(0, M1):
        for n1 in range(0, N1):
            for k1 in range(0, K1):
                # Tile views that are contiguous in memory.
                lhs_tile = np.reshape(A4d[m1, k1, :, :], [M0, K0])
                rhs_tile = np.reshape(Bt4d[n1, k1, :, :], [N0, K0])
                # Inner kernel.
                C[m1, n1, :, :] += np.matmul(lhs_tile, np.transpose(rhs_tile, [1, 0]))
    # 4D -> 2D
    C2d = unpack_4d_2d(C)
    return C2d
The resulting 4D tiled matrix still needs to be rearranged back to the original layout as a 2D tensor:
def unpack_4d_2d(operand):
    i1 = operand.shape[0]  # M1
    j1 = operand.shape[1]  # N1
    i2 = operand.shape[2]  # M0
    j2 = operand.shape[3]  # N0
    operand_transposed = operand.transpose([0, 2, 1, 3])  # [M1, M0, N1, N0]
    return operand_transposed.reshape([i1 * i2, j1 * j2])  # [M, N]
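Putting the three snippets together, a small end-to-end check of our own (with arbitrarily chosen sizes that divide evenly into the tile sizes) confirms that pack, mmt4d, and unpack reproduce the plain 2D matmul:

import numpy as np

# Illustrative sizes: M and N must be multiples of M0 and N0, and K of K0.
M, K, N = 8, 6, 10
M0, N0, K0 = 2, 2, 3

A = np.random.rand(M, K).astype(np.float32)
B = np.random.rand(K, N).astype(np.float32)
# 4D accumulator in the [M1, N1, M0, N0] tiled layout, initialized to zero.
C4d = np.zeros((M // M0, N // N0, M0, N0), dtype=np.float32)

C2d = mmt4d(A, B, C4d, M0, N0, K0)
assert np.allclose(C2d, A @ B, rtol=1e-5)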
We benchmarked various float32 matmul problems of different sizes, and the results showed that mmt4d is faster than the existing matmul implementation for bigger matrices, as we can see in the following chart:
The SIMD instruction being used here is the simplest kind, a vector*scalar multiplication, and the storage orders of the matrices allow the existing implementation to directly load the vectors from the source matrices without any rearrangement overhead. So this case is particularly friendly to the existing code, which is why the mmt4d code is only faster for bigger matrices. To understand why mmt4d is faster in that case, we collected statistics of L1 cache misses:
This shows that in this case, the better cache-friendliness of mmt4d, thanks to its simple contiguous memory access pattern, accounts for its higher performance.
As we proceed with increasingly sophisticated SIMD targets, starting with the dot-product instructions found in current mobile devices for the int8 case and eventually generalizing to all data types, all the way to float32, over the next few years with upcoming ARM SIMD instructions, the advantage of mmt4d will widen for all sizes, not just the larger ones.
Part of why we feel confident about the eventual performance that our approach will achieve is that, as mentioned in the introduction, we are rebuilding within the compiler an existing library's schedule and kernel, and we have benchmark results about it.
We introduced a 4D tiled representation for 2D matrix-matrix multiplication, built on decomposable algebraic transformations that require only reshapes and transposes of the input operands. We discussed and empirically showed how this addresses the major drawbacks of row-major linear matmul by providing a flexible way to match different ISA layouts along with better cache locality, achieving near-peak performance.
As was mentioned in the introduction, this work is under active development, and the next immediate steps are to prove the rest of the hypothesis by:
Handling dynamic sizes and padding to the next multiple of the target tile size.
Implementing the integer case (int32 += int8 * int8).
Implementing the dispatch to different SIMD ISA variants at runtime.
Implementing cache-friendly traversal for larger matmuls and multi-threading by interfacing with IREE's runtime dispatch.
Improving the generated code by fusing the 4d tiled layout with the producers and consumers of the linalg.mmt4d.
IREE can now execute TensorFlow Lite (TFLite) models through the use of TOSA, an open standard of common tensor operations and a part of MLIR core. TOSA's high-level representation of tensor operations provides a common front-end for ingesting models from different frameworks. In this case we ingest a TFLite FlatBuffer and compile it to TOSA IR, which IREE takes as an input format to compile to its various backends.
Using TFLite as a frontend for IREE provides an alternative ingestion method for already existing models that could benefit from IREE's design. This enables models already designed for on-device inference to have an alternative path for execution without requiring any additional porting, while benefiting from IREE's improvements in buffer management, its work dispatch system, and its compact binary format. With continued improvements to IREE/MLIR's compilation performance, more optimized versions can be compiled and distributed to target devices without an update to the client-side environment.
Today, we have validated floating point support for a variety of models, including MobileNet (v1, v2, and v3) and MobileBERT. More work is in progress to support fully quantized models and TFLite's hybrid quantization, along with dynamic shape support.
TFLite with IREE is available in Python and Java. We have a colab notebook that shows how to use IREE's Python bindings and TFLite compiler tools to compile a pre-trained TFLite model from a FlatBuffer and run it using IREE. We also have an Android Java app that was forked from an existing TFLite demo app, swapping out the TFLite library for our own AAR. More information on IREE's TFLite frontend is available here.
These pages cover topics useful for project maintainers and contributors.
Some of these pages may be stale. Contributions are always welcome!
This page contains a list of best practices for getting the most out of IREE, spanning model authoring, ahead-of-time compilation, and runtime use. Treat these as a collection of ideas to consider or areas to start benchmarking when working on your own applications.
Common themes include:
If your model is stateful, prefer to store that state directly within your program rather than externalizing it through arguments and return values. By keeping state inside your program the compiler is better able to reason about it and function calls will have lower overhead.
If you do externalize state, try to pack that state into a limited number of arguments.
See the variables and state sample for further guidance on tracking and using state.
While IREE aims to support general dynamic shapes use, it is better able to optimize parts of programs where shapes are static. Slow varying dimensions like batch index or timestamp are safer uses of dynamic shapes than faster varying dimensions like the x/y/channel dimensions of images.
See the dynamic shapes sample for further guidance on using dynamic shapes.
TODO: which compiler targets to use (try both CUDA and Vulkan?)
TODO: use the most specific LLVM target triple you can?
IREE runs its own suite of benchmarks continuously using the definitions at https://github.com/iree-org/iree/tree/main/benchmarks. The flags set for these benchmarks represent the latest manually tuned values for workloads we track closely and referencing them may help with your own search for peak performance. You can use these flags in your own explorations, but note that as compiler performance matures, the existing flags will gradually be replaced with attributes for autotuning or command line options for experimental features.
TODO: sample code, profile numbers
When running on the CPU, the task system flags specified in iree/task/api.c give control over how worker threads will be created. For example, the --task_topology_group_count=3 flag can be set to explicitly run on three workers rather than rely on heuristic selection that defaults to one worker per detected physical core.
--task_topology_group_count=3
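For example, a minimal sketch of such an invocation with iree-run-module, assuming a module already compiled to module.vmfb with an exported main function (both names are placeholders here):
iree-run-module --device=local-task --task_topology_group_count=3 \\\n  --module=module.vmfb --function=main --input=1x4xf32=0\n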
If running on a single thread or on a system with no threading support, the local-sync HAL driver can be used instead of the multithreaded local-task HAL driver to reduce dependencies and code size. When running with the local-sync driver, all execution happens inline on the thread invoking the IREE runtime and will block until it has completed.
local-sync
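A hedged sketch of selecting the single-threaded driver with iree-run-module (module and function names are again placeholders):
iree-run-module --device=local-sync --module=module.vmfb --function=main --input=1x4xf32=0\n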
When using IREE's runtime libraries, try to front-load queries, particularly queries using strings that look up into maps like iree_runtime_session_call_by_name, so that hot sections of code are doing the minimum amount of work: routing inputs through buffers, scheduling runtime calls, and routing outputs through other buffers.
iree_runtime_session_call_by_name
Vulkan is a new generation graphics and compute API that provides high-efficiency, cross-platform access to modern GPUs used in a wide variety of devices from PCs and consoles to mobile phones and embedded platforms.
This page lists steps and tips for setting up and troubleshooting a Vulkan development environment. The information here is meant to be generic.
Vulkan adopts a layered architecture, which aims to better support extensibility. There are four components involved in this architecture:
The Vulkan loader sits between the Vulkan application, which calls Vulkan APIs, and the ICDs, which implement these Vulkan APIs. Vulkan layers augment the Vulkan system to provide optional features like validation and debugging. The Vulkan loader composes a chain of requested layers, which process the Vulkan application's API calls one by one, and finally redirects the API calls made by the Vulkan application to one or more ICDs.
It's highly recommended to read the Architecture of the Vulkan Loader Interfaces Overview to get a general understanding of what these components are and how they interact with one another.
You need to install the Vulkan SDK from LunarG to get the Vulkan loader.
Typically the Vulkan SDK will be installed at C:\\VulkanSDK\\<version>\\ and there will be an environment variable VULKAN_SDK pointing to it. You can run the vulkancube executable under the Bin\\ subdirectory of the Vulkan SDK to make sure everything works properly. If not, you probably need to check whether the graphics card is Vulkan capable or update the driver.
C:\\VulkanSDK\\<version>\\
VULKAN_SDK
vulkancube
Bin\\
For Ubuntu 20.04/22.04, it's recommended to directly install the full Vulkan SDK from LunarG's APT sources for the loader and various developer tools.
If you want to have a minimal environment, the following packages should be installed for a proper Vulkan runtime:
libvulkan1
libvulkan.so
mesa-vulkan-drivers
nvidia-vulkan-icd
nvidia-driver-*
The above packages provide the Vulkan loader and ICDs. With them a Vulkan application should be able to run. You may additionally want to install
vulkaninfo
In order to develop Vulkan applications, you additionally need the following packages:
VkLayer_standard_validation
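As a sketch, the runtime and developer packages above can be installed with apt; the exact package names may vary between Ubuntu versions (on recent releases, vulkan-tools provides vulkaninfo and vulkan-validationlayers provides the validation layers):
sudo apt install libvulkan1 mesa-vulkan-drivers vulkan-tools vulkan-validationlayers\n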
For other Linux distros, please consult the corresponding package management tools for the packages needed. (And please feel free to update this doc regarding them.)
You can also download and install the Vulkan SDK tarball from LunarG. It packages the loader with many useful layers and other shader tools.
You can also build the Vulkan SDK component projects like Vulkan-Loader and Vulkan-ValidationLayers from source. But note that when building these components separately, you need to make sure they are consistent with one another (e.g., using the same version of Vulkan headers) to function together.
Please make sure your Android device is Vulkan capable. Vulkan has been supported on Android since Android 7, but we track newer Android versions (10+) closely and haven't set a clear minimum version yet.
If multiple versions of the Vulkan loader exist, you may also need to set LD_LIBRARY_PATH and LD_PRELOAD to load the desired version of the loader. For example:
LD_LIBRARY_PATH
LD_PRELOAD
LD_LIBRARY_PATH={PATH_TO_VULKAN_SDK}/x86_64/lib/\nLD_PRELOAD=libvulkan.so.1\n
This can also be done by sourcing the proper setup-env.sh from one of the downloaded Vulkan SDKs.
setup-env.sh
There are a few environment variables that can alter the default Vulkan loader behavior and print verbose information, notably:
VK_LOADER_DEBUG
all
VK_ICD_FILENAMES
VK_INSTANCE_LAYERS
VK_LAYER_LUNARG_api_dump
VK_LAYER_LUNARG_core_validation
VK_LAYER_PATH
Please see the Vulkan loader's documentation for a detailed explanation of these variables.
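For example, to dump verbose loader output while running a Vulkan tool (vulkaninfo here, but any Vulkan application works):
VK_LOADER_DEBUG=all vulkaninfo\n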
Bazel runs tests in a sandbox and environment variables must be passed through to the test runner. Consider putting environment setup in a user.bazelrc to save typing. For example:
user.bazelrc
test --test_env=\"LD_LIBRARY_PATH=/absolute/path/to/vulkan/sdk/x86_64/lib/\"\ntest --test_env=\"LD_PRELOAD=libvulkan.so.1\"\ntest --test_env=\"VK_LAYER_PATH=/absolute/path/to/additional/layers/:$VK_LAYER_PATH\"\n
vkCreateInstance
Since Android 8 Oreo, Android re-architected the OS framework with project Treble. Framework libraries and vendor libraries have a more strict and clear separation. Their dependencies are carefully scrutinized and only selected cases are allowed. This is enforced with linker namespaces.
/data/local/tmp is the preferred directory for automating native binary tests built using the NDK toolchain. Such tests should be allowed to access libraries like libvulkan.so for their functionality. However, there was an issue with fully treblized Android 10 where /data/local/tmp did not have access to the linker namespaces needed by libvulkan.so. This should be fixed now, but as is typical in the Android ecosystem, it can take a long time for the fix to propagate, if it ever does.
/data/local/tmp
A known workaround is to symlink the vendor Vulkan implementation under /vendor/lib[64] as libvulkan.so under /data/local/tmp and use LD_LIBRARY_PATH=/data/local/tmp when invoking IREE executables.
/vendor/lib[64]
LD_LIBRARY_PATH=/data/local/tmp
For Qualcomm Adreno GPUs, the vendor Vulkan implementation is at /vendor/lib[64]/hw/vulkan.*.so. So for example for Snapdragon 865:
/vendor/lib[64]/hw/vulkan.*.so
adb shell ln -s /vendor/lib64/hw/vulkan.kona.so /data/local/tmp/libvulkan.so\n
For ARM Mali GPUs, there is only one monolithic driver (/vendor/lib[64]/libGLES_mali.so) for OpenGL and Vulkan and the Vulkan vendor driver (/vendor/lib[64]/hw/vulkan.*.so) is just a symlink to it. So for example:
/vendor/lib[64]/libGLES_mali.so
adb shell ln -s /vendor/lib64/libGLES_mali.so /data/local/tmp/libvulkan.so\n
Physical devices enumerated on NVIDIA drivers can be affected by the DISPLAY environment variable. If you are running under an SSH session to Linux or using Chrome Remote Desktop and have problems with physical device enumeration, you probably want to check the DISPLAY environment variable and set it to point to a display on the server side, for example:
DISPLAY
export DISPLAY=:0\n
This page walks through building IREE from source using the Bazel build system.
Warning
Bazel build support is primarily for internal project infrastructure. We strongly recommend using CMake instead.
Our Bazel configuration is also only tested on Linux. Windows and macOS may be unstable.
Install Bazel, matching IREE's .bazelversion by following the official docs.
.bazelversion
Install a compiler such as Clang (GCC is not fully supported).
sudo apt install clang\n
Set environment variables for Bazel:
export CC=clang\nexport CXX=clang++\n
Install Python build requirements:
python -m pip install -r runtime/bindings/python/iree/runtime/build_requirements.txt\n
Install Homebrew:
/bin/bash -c \"$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install.sh)\"\n
Install Bazel, matching IREE's .bazelversion by following the official docs or via Homebrew:
brew install bazel\n
Tip
You can simplify installation by using a package manager like Scoop or Chocolatey.
Also install MSYS2 by following Bazel's documentation.
Install Python3 (docs here) and Python build requirements:
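python -m pip install -r runtime/bindings/python/iree/runtime/build_requirements.txt\n
(This is presumably the same requirements file used in the Linux instructions above.)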
Install the full Visual Studio or \"Build Tools For Visual Studio\" from the downloads page then set the BAZEL_VS environment variable:
BAZEL_VS
> $env:BAZEL_VS = \"C:\\Program Files (x86)\\Microsoft Visual Studio\\2022\\BuildTools\"\n
Configure Bazel:
# This generates a `configured.bazelrc` file by analyzing your environment.\n# Skipping this step will make it difficult to select your platform/compiler.\npython3 configure_bazel.py\n
(No Linux-specific tips for configuring)
(No macOS-specific tips for configuring)
Clone to a short path like C:\\projects\\ to avoid issues with Windows maximum path lengths (260 characters).
C:\\projects\\
configure_bazel.py only detects that you have Windows and will output the default --config=windows to configured.bazelrc, which assumes the latest version of MSVC. To avoid some warnings, you may want to replace it with (for example) --config=msvc2022.
configure_bazel.py
--config=windows
configured.bazelrc
--config=msvc2022
Run all tests:
bazel test -k //...\n
Run all tests except those that require CUDA:
bazel test -k //... \\\n --iree_drivers=local-sync,local-task,vulkan \\\n --test_tag_filters=\"-driver=cuda,-target=cuda\" \\\n --build_tag_filters=\"-driver=cuda,-target=cuda\"\n
Run all tests except those that require a GPU (any API):
bazel test -k //... \\\n --iree_drivers=local-sync,local-task,vulkan \\\n --test_tag_filters=\"-driver=vulkan,-driver=metal,-driver=cuda,-target=cuda\" \\\n --build_tag_filters=\"-driver=cuda,-target=cuda\"\n
See the build_tools/bazel/build_test_all.sh script for examples of other flags and environment variables that can be used to configure what Bazel runs.
build_tools/bazel/build_test_all.sh
In general, build artifacts will be under the bazel-bin directory at the top level.
bazel-bin
You can put a user.bazelrc at the root of the repository and it will be ignored by git.
build --disk_cache=/tmp/bazel-cache\n\n# Use --config=debug to compile IREE and LLVM without optimizations\n# and with assertions enabled.\nbuild:debug --config=asserts --compilation_mode=opt '--per_file_copt=iree|llvm@-O0' --strip=never\n\n# Use --config=asserts to enable assertions. This has to be done globally:\n# Code compiled with and without assertions can't be linked together (ODR violation).\nbuild:asserts --compilation_mode=opt '--copt=-UNDEBUG'\n
build --disk_cache=c:/bazelcache\nbuild:debug --compilation_mode=dbg --copt=/O2 --per_file_copt=iree@/Od --strip=never\n
Build all of IREE's 'tools' directory:
bazel build tools/...\n
Check out what was built:
ls bazel-bin/tools/\n./bazel-bin/tools/iree-compile --help\n
Translate a MLIR file and execute a function in the compiled module:
# iree-run-mlir <compiler flags> [input.mlir] <runtime flags>\n$ ./bazel-bin/tools/iree-run-mlir \\\n --iree-hal-target-backends=vmvx --print-mlir \\\n ./samples/models/simple_abs.mlir \\\n --input=f32=-2\n
CMAKE_BUILD_TYPE
Sets the build type. Possible values are Release, Debug, RelWithDebInfo and MinSizeRel. If unset, build type is set to Release.
CMAKE_<LANG>_COMPILER
This is the command that will be used as the <LANG> compiler, where <LANG> is C or CXX in IREE. These variables can be set to compile IREE with clang and clang++, respectively. Once set, these variables can not be changed.
<LANG>
C
CXX
clang++
This gives a brief explanation of IREE-specific CMake options and variables.
IREE_ENABLE_RUNTIME_TRACING
Enables instrumented runtime tracing. Defaults to OFF.
OFF
IREE_ENABLE_COMPILER_TRACING
Enables instrumented compiler tracing. This requires that IREE_ENABLE_RUNTIME_TRACING also be set. Defaults to OFF.
IREE_BUILD_COMPILER
Builds the IREE compiler. Defaults to ON.
ON
IREE_BUILD_TESTS
Builds IREE unit tests. Defaults to ON.
IREE_BUILD_DOCS
Builds IREE documentation files. Defaults to OFF.
IREE_BUILD_SAMPLES
Builds IREE sample projects. Defaults to ON.
Builds the IREE python bindings. Defaults to OFF.
IREE_BUILD_BINDINGS_TFLITE
Builds the IREE TFLite C API compatibility shim. Defaults to ON.
IREE_BUILD_BINDINGS_TFLITE_JAVA
Builds the IREE TFLite Java bindings with the C API compatibility shim. Defaults to ON.
IREE_BUILD_EXPERIMENTAL_REMOTING
Builds experimental remoting component. Defaults to OFF.
IREE_HAL_DRIVER_DEFAULTS
Default setting for each IREE_HAL_DRIVER_* option.
IREE_HAL_DRIVER_*
Individual options enabling the build for each runtime HAL driver.
IREE_TARGET_BACKEND_DEFAULTS
Default setting for each IREE_TARGET_BACKEND_* option.
IREE_TARGET_BACKEND_*
Individual options enabling the build for each compiler target backend.
IREE_INPUT_*
Individual options enabling each set of input dialects.
IREE_OUTPUT_FORMAT_C
Enables the vm-c compiler output format, using MLIR EmitC. Defaults to ON.
IREE_DEV_MODE
Configure settings to optimize for IREE development (as opposed to CI or release). Defaults to OFF. For example, this will downgrade some compiler diagnostics from errors to warnings.
IREE_ENABLE_LLD
Use lld when linking. Defaults to OFF. This option is equivalent to -DIREE_USE_LINKER=lld. The options IREE_ENABLE_LLD and IREE_USE_LINKER cannot be set at the same time.
-DIREE_USE_LINKER=lld
IREE_USE_LINKER
IREE_ENABLE_ASAN
Enable address sanitizer if the current build type is Debug and the compiler supports it.
IREE_ENABLE_MSAN
Enable memory sanitizer if the current build type is Debug and the compiler supports it.
IREE_ENABLE_TSAN
Enable thread sanitizer if the current build type is Debug and the compiler supports it.
IREE_ENABLE_UBSAN
Enable undefined behavior sanitizer if the current build type is Debug and the compiler supports it.
When cross compiling (using a toolchain file like android.toolchain.cmake), first build and install IREE's tools for your host configuration, then use the IREE_HOST_BIN_DIR CMake option to point the cross compiled build at the host tools.
android.toolchain.cmake
IREE_HOST_BIN_DIR
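A rough sketch of this two-step flow for Android, assuming the Android NDK is available at $ANDROID_NDK and using placeholder ABI/platform values (arm64-v8a, android-29):
# 1. Build and install host tools.\ncmake -G Ninja -B ../iree-build-host/ -DCMAKE_INSTALL_PREFIX=../iree-build-host/install .\ncmake --build ../iree-build-host/ --target install\n\n# 2. Cross compile, pointing at the installed host tools.\ncmake -G Ninja -B ../iree-build-android/ \\\n  -DCMAKE_TOOLCHAIN_FILE=\"$ANDROID_NDK/build/cmake/android.toolchain.cmake\" \\\n  -DANDROID_ABI=arm64-v8a -DANDROID_PLATFORM=android-29 \\\n  -DIREE_HOST_BIN_DIR=\"$(realpath ../iree-build-host/install/bin)\" \\\n  -DIREE_BUILD_COMPILER=OFF .\ncmake --build ../iree-build-android/\n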
ccache is a compilation cache. In principle, just prepending compiler invocations with ccache is all one needs to enable it, e.g.
ccache clang foo.c -c -o foo.o\n
takes care of executing clang with these arguments and caches the output file foo.o. The next invocation then skips executing clang altogether.
foo.o
When the cache is hit, the speedup is such that the \"compilation\" becomes essentially free. However, ccache only caches compilation, not linking.
Here are a few scenarios where ccache helps:
cmake
ccache is available on most platforms. On Debian-based Linux distributions, do:
sudo apt install ccache\n
The one ccache setting that you probably need to configure is the maximum cache size. The default 5G is too small for our purposes. To set the cache max size, do this once:
5G
ccache --max-size=20G\n
Tip: At the moment (late 2020), most of the code we're building is third_party/llvm-project so the fundamental limiting factor to how far we can cache away rebuilds is how often that dependency gets updated. Given how frequently it currently is updated, I'm finding that 20G is enough to make the ccache size not be the limiting factor.
third_party/llvm-project
20G
Use the CMake COMPILER_LAUNCHER functionality by setting CMAKE_C_COMPILER_LAUNCHER=ccache and CMAKE_CXX_COMPILER_LAUNCHER=ccache in your CMake configuration, for example as shown below.
CMAKE_C_COMPILER_LAUNCHER=ccache
CMAKE_CXX_COMPILER_LAUNCHER=ccache
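A typical configure command with ccache enabled might look like this (build directory name is a placeholder):
cmake -G Ninja -B ../iree-build/ \\\n  -DCMAKE_C_COMPILER_LAUNCHER=ccache \\\n  -DCMAKE_CXX_COMPILER_LAUNCHER=ccache \\\n  .\n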
Notes:
Ninja
Makefile
cmake -G
The ccache -s command dumps statistics, including a cache hit count and ratio. It's convenient to run periodically with watch in a separate terminal:
ccache -s
watch
watch -n 0.1 ccache -s # update the stats readout every 0.1 seconds\n
Emscripten is a complete compiler toolchain to WebAssembly, using LLVM, with a special focus on speed, size, and the Web platform. Emscripten can be used to compile parts of IREE to WebAssembly for execution within web browsers or other Wasm runtimes.
IREE's runtime can be compiled through Emscripten in some limited configurations. More of the runtime will be supported over time.
IREE's compiler can be compiled through Emscripten with local changes. More work is needed for this to be generally supported.
Read https://emscripten.org/docs/getting_started/downloads.html and run
./emsdk install latest\n./emsdk activate latest\nsource ./emsdk_env.sh\n
Build and install at least the compiler tools on your host machine, or install them from a binary distribution:
$ cmake -G Ninja -B ../iree-build-host/ \\\n -DCMAKE_C_COMPILER=clang \\\n -DCMAKE_CXX_COMPILER=clang++ \\\n -DCMAKE_INSTALL_PREFIX=../iree-build-host/install \\\n .\n$ cmake --build ../iree-build-host/ --target install\n
$ emcmake cmake -G Ninja -B ../iree-build-emscripten/ \\\n -DCMAKE_BUILD_TYPE=Release \\\n -DIREE_HOST_BIN_DIR=$(realpath ../iree-build-host/install/bin) \\\n -DIREE_BUILD_TESTS=OFF \\\n -DIREE_BUILD_COMPILER=OFF \\\n .\n
Build:
cmake --build ../iree-build-emscripten/ \\\n --target iree_samples_simple_embedding_simple_embedding_vmvx_sync\n
Copy the outputs from the build process (e.g. simple_embedding_vmvx_sync.js and simple_embedding_vmvx_sync.wasm) into your application and follow instructions at either https://webassembly.org/getting-started/developers-guide/ or https://developer.mozilla.org/en-US/docs/WebAssembly/Loading_and_running.
simple_embedding_vmvx_sync.js
simple_embedding_vmvx_sync.wasm
This doc shows how to use LLDB to debug native binaries on Android. For a more complete explanation, see the official LLDB documentation on remote debugging.
We assume the following setup:
adb
adb shell
<your-binary> [program args...]
Push the toolchain files, including lldb-server, to your device:
lldb-server
adb shell \"mkdir -p /data/local/tmp/tools\"\nadb push \"$ANDROID_NDK\"/toolchains/llvm/prebuilt/linux-x86_64/lib64/clang/14.0.6/lib/linux/aarch64/* /data/local/tmp/tools\n
You may need to adjust the clang toolchain version to match the one in your NDK. You can find it with find \"$ANDROID_NDK/toolchains/llvm/prebuilt\" -name lldb-server.
find \"$ANDROID_NDK/toolchains/llvm/prebuilt\" -name lldb-server
Set up port forwarding. We are going to use port 5039 but you are free to pick a different one:
adb forward tcp:5039 tcp:5039\n
Start an lldb-server in a new interactive adb shell:
adb shell\n/data/local/tmp/tools/lldb-server platform --listen '*:5039' --server\n
Launch lldb, connect to the server and run the binary:
lldb
lldb -o 'platform select remote-android' \\\n -o 'platform connect connect://:5039' \\\n -o 'platform shell cd /data/local/tmp'\ntarget create <your-binary>\nrun [program args...]\n
You can either use the system lldb or a prebuilt under \"$ANDROID_NDK\"/toolchains/llvm/prebuilt/linux-x86_64/lib64/clang/14.0.6/lib/linux/<your-host-arch>.
\"$ANDROID_NDK\"/toolchains/llvm/prebuilt/linux-x86_64/lib64/clang/14.0.6/lib/linux/<your-host-arch>
Explanation: each -o (short for --one-shot) tells lldb to execute a command on startup. You can run those manually in the lldb shell, if you prefer. Then, we tell lldb which working directory to use, where to find the executable, and what command line arguments to use.
-o
--one-shot
So the IREE compiler used to compile a program quickly, but it is now slower. What do you do?
Try to answer as many of these questions as you can:
When did compilation get slower?
A specific git commit is ideal, but \"sometime in the last week\" is a good starting point. You'll ultimately want to find a culprit release or git commit that changed the compiler code.
How much slower did compilation get?
Be specific - did it jump from 1 minute to 2 minutes, or 1 minute to 1 hour? Identifying the scale of the regression can help set the priority to investigate it.
What is the full compile command?
Try to extract the input program and full list of flags passed to the compiler binary so that others can reproduce what you're seeing. Try to distill this as much as possible to using just native tools (no Python or other framework layers).
What environment is the compiler running in?
Are you using a Debug build, or a release build? What operating system and what size of machine is running the compiler (e.g. a Linux developer machine, or a smaller system)?
If you only have a rough idea of when something changed and want to narrow that down to a specific code change, bisecting can help.
git bisect
Building the compiler from source and using git bisect will let you pinpoint specific commits in IREE, though it typically won't let you step through changes in submodules (e.g. MLIR updates in third_party/llvm-project/).
third_party/llvm-project/
Tip: Configure ccache if you'll be rebuilding the compiler while bisecting
A manual workflow with git bisect looks like this:
git bisect start --first-parent\ngit bisect good [<rev>]\ngit bisect bad [<rev>]\n\n# Read the prompts from the command as it runs\n# At each step, test the compiler:\n# git submodule update\n# cmake --build build/ --target iree-compile\n# ./build/tools/iree-compile <args>\n# attach Tracy, observe timing, print IR, etc. to determine if fast or slow\n# if fast, `git bisect good`\n# if slow, `git bisect bad`\n# repeat\n
An automated workflow can use git bisect run and a script:
git bisect run
# run_bisect.sh\ngit submodule update\ncmake --build build/ --target iree-compile\n# Other logic here\n
git bisect start --first-parent\ngit bisect good [<rev>]\ngit bisect bad [<rev>]\ngit bisect run run_bisect.sh\n
#!/bin/bash\n\nset -xeuo pipefail\n\n# --------------------------------------------------------------------------- #\n# Settings #\n# --------------------------------------------------------------------------- #\n\nINPUT_FILE_PATH=\"/path/to/program.mlirbc\"\nTMP_DIR=\"../iree-tmp\"\n\ndeclare -a COMPILER_FLAGS=(\n \"--iree-input-type=stablehlo\"\n \"--iree-hal-target-backends=cuda\"\n \"--iree-hal-cuda-llvm-target-arch=sm_80\"\n)\n\nTIMEOUT_SECONDS_FOR_COMPILING_EACH_SOURCE=10\n\n# --------------------------------------------------------------------------- #\n# Utility functions #\n# --------------------------------------------------------------------------- #\n\n# Call to have `git bisect` skip this commit (don't mark as good _or_ bad)\n# https://git-scm.com/docs/git-bisect#_bisect_run\nskip_on_error() {\n >&2 echo \"** Skipping due to error: $1 **\"\n exit 125 # Special exit code for `git bisect skip`\n}\n\n# --------------------------------------------------------------------------- #\n# Main script #\n# --------------------------------------------------------------------------- #\n\n# Store git version hash, so we can dump artifacts to unique directories later.\nGIT_SHA=\"$(git rev-parse --short HEAD)\"\n\necho \"** Building iree-compile at ${GIT_SHA} **\"\n\n# The `git bisect` command only checks out a commit, so update submodules.\ngit submodule update\n\n# Build the compiler. You'll want ccache configured to make this fast!\ncmake --build ../iree-build/ --target iree-compile || skip_on_error \"CMake build failed\"\n\n# Run the compiler, dumping executable sources and stopping.\nSOURCES_DIR=\"${TMP_DIR}/sources-${GIT_SHA}\"\necho \"** Running iree-compile at ${GIT_SHA}, dumping sources to ${SOURCES_DIR} **\"\n../iree-build/tools/iree-compile \\\n ${INPUT_FILE_PATH} \\\n ${COMPILER_FLAGS[@]} \\\n --iree-hal-dump-executable-sources-to=${SOURCES_DIR} \\\n --compile-to=executable-sources \\\n -o /dev/null\n\n# Run the compiler again on each executable individually.\necho \"** Running iree-compile at ${GIT_SHA} for each executable source **\"\nSOURCES=($(ls -1 ${SOURCES_DIR}))\nfor SOURCE in \"${SOURCES[@]}\"; do\n echo \" * Compiling: ${SOURCE} *\"\n timeout --verbose ${TIMEOUT_SECONDS_FOR_COMPILING_EACH_SOURCE} \\\n ../iree-build/tools/iree-compile ${SOURCES_DIR}/${SOURCE} \\\n ${COMPILER_FLAGS[@]} \\\n --compile-mode=hal-executable \\\n -o /dev/null\ndone\n
If you want to understand why the compiler is fast or slow, or if you want to compare performance in detail between two versions, consider these profiling options.
The -mlir-timing flag enables Pass Timing instrumentation. Once the compiler finishes running, this prints a report like
-mlir-timing
===-------------------------------------------------------------------------===\n ... Pass execution timing report ...\n===-------------------------------------------------------------------------===\n Total Execution Time: 0.0203 seconds\n\n ---Wall Time--- --- Name ---\n 0.0047 ( 55.9%) Canonicalizer\n 0.0019 ( 22.2%) VerifierPass\n 0.0016 ( 18.5%) LLVMLoweringPass\n 0.0003 ( 3.4%) CSE\n 0.0002 ( 1.9%) (A) DominanceInfo\n 0.0084 (100.0%) Total\n
This is easy data to collect, especially remotely over SSH, but it might not paint a complete picture and requires waiting for compilation to finish.
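As an illustrative (hedged) invocation, the flag can simply be appended to an existing compile command; program.mlir and the target backend here are placeholders:
iree-compile --iree-hal-target-backends=llvm-cpu -mlir-timing program.mlir -o /dev/null\n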
See our documentation on profiling with Tracy, in particular the section on tracing iree-compile. For compile time regressions, pay particular attention to the compilation phases (Flow/Stream/HAL), how many times TranslateExecutablesPass runs, and if there are outlier passes that take significantly longer to run than others.
TranslateExecutablesPass
Here are some previous analyses for inspiration:
Example slow trace:
Example fast trace:
Example sampling statistics showing 10s of minutes in LLVM codegen:
perf
pprof
These Linux tools allow for fine-grained profiling. Below we present a list of steps to profile iree-compile and visualize the results as a flame graph.
Compile IREE tools with debug information (line tables at minimum) and frame pointers. You can do that by selecting the RelWithDebInfo build type and adding -fno-omit-frame-pointer to your compiler flags:
-fno-omit-frame-pointer
cmake <Your-CMAKE-Flags> \\\n -DCMAKE_BUILD_TYPE=RelWithDebInfo \\\n -DCMAKE_CXX_FLAGS=\"-fno-omit-frame-pointer\" \\\n -DCMAKE_C_FLAGS=\"-fno-omit-frame-pointer\"\n
Set perf event scope/access to the appropriate level with perf_event_paranoid.
perf_event_paranoid
echo 0 | sudo tee /proc/sys/kernel/perf_event_paranoid\n
Run iree-compile under the perf profiler and collect profile data. This requires sudo.
sudo
sudo perf record -F 999 -g -- tools/iree-compile <Your-Compile-Arguments>\nsudo chown \"$USER:$USER\" perf.data\n
Use pprof to process perf.data from the previous step and start a local http server with the visualized profile. See the pprof's README for installation instructions and make sure to build perf_data_converter and add it to your PATH.
perf.data
perf_data_converter
PATH
pprof -http ':' perf.data\n
Debugging an MLIR-based compiler like IREE usually involves reading IR at some point. For compile time regressions, it helps to snapshot the IR at a few key phases and look for differences between fast compilation and slow compilation.
Here is one useful flag combination:
--mlir-disable-threading \\\n--mlir-elide-elementsattrs-if-larger=8 \\\n--mlir-print-ir-after=iree-hal-materialize-interfaces\n
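A sketch of using that combination with iree-compile; program.mlir and the target backend are placeholders, and since the IR dumps go to stderr, redirecting them to a file makes diffing between compiler versions easier:
iree-compile --iree-hal-target-backends=llvm-cpu \\\n  --mlir-disable-threading \\\n  --mlir-elide-elementsattrs-if-larger=8 \\\n  --mlir-print-ir-after=iree-hal-materialize-interfaces \\\n  program.mlir -o /dev/null 2> ir-after-materialize-interfaces.txt\n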
This page aims to provide general approaches and practical tips for debugging GPU compiler/runtime correctness/performance issues in IREE.
GPUs fundamentally have similar architectures and software stacks. We target GPUs from various vendors using different GPU APIs, but they share quite a lot of common infrastructure in IREE. So the approaches and tips here should be widely applicable.
Tips that are specific to a particular kind of problem/component/GPU are prefixed with icons to make that clear. Here is what those icons represent--
The difficulty of debugging typically lies in isolating the problematic component and pinpointing the culprit; once that is done, the solution usually follows naturally.
There are many components in the IREE stack; hierarchically we can categorize them into either the compiler or runtime bucket:
Any of the above components/layers can have bugs. It's important to reduce the potential surface area to make the problem more tractable.
Once we have a more isolated case, the general methodology to pinpoint the exact culprit is to
The above procedure is for facing a large problem with no clue, for example, when bringing up a new model end-to-end via IREE.
Most of the time, though, we can leverage existing facilities to avoid going down the full top-down hierarchical debugging procedure. For example, for a regression on an existing model, CI or git bisect might directly tell us the culprit commit.
git bisect
For issues with strong signals like crashing, it's also easier to pinpoint the exact culprit with dedicated tools--we can leverage various sanitizers or debuggers.
If we are facing a large problem without a clear clue, we need to isolate the problematic compiler or runtime layer first, typically by comparing with a working solution:
[correctness/performance]
Sanitize the environment first. Asking these questions and making sure the environment is proper can save you hours of debugging sometimes:
We have multiple GPU targets/drivers in IREE--LLVMGPU/CUDA, LLVMGPU/HIP, SPIR-V/Vulkan, SPIR-V/Metal.
For the same GPU, we typically have two paths to target, e.g., CUDA/HIP or Vulkan for NVIDIA/AMD GPUs, Metal or Vulkan for Apple GPUs.
If one path is correct/performant, we can diff against it to try to isolate the problem--the common/shared compiler/runtime code is likely okay; what differs between paths is likely problematic.
[correctness/performance] [vulkan]
Vulkan supports different GPUs. Similarly, if one GPU gives correct/performant result, we diff against it to find clues.
Even more compiler/runtime code is shared here; what's problematic is likely different capabilities triggering different CodeGen pipelines and thereby revealing bugs in a particular pipeline, or driver issues from a particular vendor.
[correctness]
If the CPU is working properly, we can use the same dispatch region formation and diff against the CPU dispatches one by one to isolate the problem. See this issue as an example.
Passing --iree-flow-trace-dispatch-tensors and/or --iree-flow-break-dispatch= to iree-compile is quite helpful for inspecting the output after each/all dispatch(es), as sketched below.
--iree-flow-trace-dispatch-tensors
--iree-flow-break-dispatch=
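A rough sketch of using the tracing flag end to end (model.mlir, module.vmfb, and the entry function name are placeholders); the traced tensor contents are printed while the module runs:
iree-compile --iree-hal-target-backends=llvm-cpu \\\n  --iree-flow-trace-dispatch-tensors \\\n  model.mlir -o module.vmfb\niree-run-module --device=local-task --module=module.vmfb --function=main --input=1x4xf32=0\n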
iree-reduce is a great tool to reduce and isolate issues programmatically. See more details here.
iree-reduce
Once we identified that the problem is due to some compiler issue, we can investigate by comparing with different paths and inputs:
For the same dispatch, we may have different CodeGen pipelines, e.g., for matmul we can have a simple SIMT pipeline or one using tensor/matrix cores. We can try to switch between different pipelines to isolate the problem.
Assuming we have a small repro, we can also try to see if there are \"patterns\" in the wrong result (e.g., this issue). Or mutate the input to see if the failure has some \"consistency\".
The --mlir-print-ir-* and --debug* flags to iree-opt are our best friends. Sometimes it just takes eyeballing the IRs between stages to find clues.
--mlir-print-ir-*
--debug*
[performance]
For identifying performance issues, we typically need to use:
On the other side, if we suspect that it's a runtime issue, here are some useful approaches and tips:
Tracy profiling is a great way to view how the application runs dynamically. It can help to show problematic GPU API call sequences and performance bottlenecks.
-DIREE_ENABLE_RUNTIME_TRACING=ON
IREE_PY_RUNTIME=tracy
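As a sketch, the first command enables tracing in a from-source runtime build, while the environment variable selects the Tracy-instrumented variant of the Python runtime package (your_script.py is a placeholder):
cmake -G Ninja -B ../iree-build/ -DIREE_ENABLE_RUNTIME_TRACING=ON .\ncmake --build ../iree-build/\nIREE_PY_RUNTIME=tracy python your_script.py\n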
GPU validation can sometimes give us hints:
export METAL_DEVICE_WRAPPER_TYPE=1
--vulkan_validation_layers=true
export VK_INSTANCE_LAYERS=VK_LAYER_LUNARG_standard_validation
export VK_LAYER_PATH=$VULKAN_SDK/etc/vulkan/explicit_layer.d
export LD_LIBRARY_PATH=$VULKAN_SDK/lib
Turning on verbose output can give us more information:
-DCMAKE_C_FLAGS=-DIREE_VM_EXECUTION_TRACING_FORCE_ENABLE=1
--vulkan_debug_verbosity=4
export VK_INSTANCE_LAYERS=VK_LAYER_LUNARG_api_dump
Try different \"debugging modes\" provided by HAL drivers:
--cuda_use_streams=
true
false
--cuda_async_allocations=false
--metal_serial_command_dispatch=true
--metal_command_buffer_retain_resources=true
--metal_resource_hazard_tracking=true
--vulkan_robust_buffer_access=true
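For example, a hedged sketch of toggling some of these modes when running a module on the CUDA HAL driver (module and function names are placeholders):
iree-run-module --device=cuda --cuda_use_streams=false --cuda_async_allocations=false \\\n  --module=module.vmfb --function=main --input=1x4xf32=0\n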
This document includes tips for triaging integration test correctness issues. Feel free to reach out to @hanhanW or ask questions on Discord for more help.
Once a suspicious dispatch is identified, we can create a test case based on the dispatch function. The dispatch function can be derived after the OutlineDispatchRegions pass. The function signatures have to be modified manually: you'll have to turn flow.dispatch.tensor.load variables into function arguments and replace flow.dispatch.tensor.store with a return op.
OutlineDispatchRegions
flow.dispatch.tensor.load
flow.dispatch.tensor.store
return
Note: This only works when the dispatch formation logic is identical between runs.
Follow the README to run the model. The MLIR files will be generated, and you'll find the saved files in the log. E.g.,
[ RUN ] MobilenetV2Int8Test.test_compile_tflite\nI0401 17:27:04.084272 140182373025024 test_util.py:119] Setting up for IREE\nI0401 17:27:04.085064 140182373025024 binaries.py:218] Invoke IREE Pipeline:\n /tmp/iree-experimental/iree-experimental.venv/lib/python3.9/site-packages/iree/tools/tflite/iree-import-tflite\n /tmp/iree-experimental/tflitehub/tmp/mobilenet_v2_int8_test.py/model.tflite\n --mlir-print-debuginfo\n --save-temp-tfl-input=/tmp/iree-experimental/tflitehub/tmp/mobilenet_v2_int8_test.py/tflite.mlir\n --save-temp-iree-input=/tmp/iree-experimental/tflitehub/tmp/mobilenet_v2_int8_test.py/tosa.mlir\n
Unfortunately, the artifacts are not dumped in the runs. There is an issue for tracking this. A workaround can be found in the issue.
These are steps to reproduce/address failures in TF/TFLite integration tests. These instructions are most stable on Linux, though they may work with a few tweaks on Windows and macOS.
All steps here assume starting from the IREE root directory.
First create a Python virtual environment to install packages into:
python -m venv iree-tf.venv\nsource iree-tf.venv/bin/activate\n\n# Install test requirements\npython -m pip install -r ./integrations/tensorflow/test/requirements.txt\n
Install IREE's tools and Python bindings or build them from source
Install distributed packages
# Install packages from nightly releases\n# This should work for most cases, as the importers change infrequently\npython -m pip install \\\n iree-compiler iree-runtime iree-tools-tf iree-tools-tflite \\\n --find-links https://iree.dev/pip-release-links.html\n
OR build from source
# Build Python bindings from source\ncmake -G Ninja -B ../iree-build/ -DIREE_BUILD_PYTHON_BINDINGS=ON .\ncmake --build ../iree-build/\n\n# Add IREE built-from-source Python packages to PYTHONPATH\nsource .env\n\n# Install IREE TF/TFLite Python packages\npython -m pip install integrations/tensorflow/python_projects/iree_tf\npython -m pip install integrations/tensorflow/python_projects/iree_tflite\n
Run the python test command line
The command can be obtained from the run file. For example, if iree_tfl_tests/llvmcpu_posenet_i8.run failed,
iree_tfl_tests/llvmcpu_posenet_i8.run
cd integrations/tensorflow/test/\ncat iree_tfl_tests/llvmcpu_posenet_i8.run\n\n# REQUIRES: llvmcpu\n# RUN: %PYTHON -m iree_tfl_tests.posenet_i8_test --target_backend=llvmcpu --artifacts_dir=%t\n\ncd python/\npython -m iree_tfl_tests.posenet_i8_test --target_backend=llvmcpu --artifacts_dir=/tmp/posenet_i8_failure\n
Note that the command can only be run under integrations/tensorflow/test/python directory.
integrations/tensorflow/test/python
Extract intermediate files and use with native tools
The test will create an iree_input.mlir in the specified temp directory. That file can then be fed into iree-compile (built locally) to reproduce the error:
iree_input.mlir
iree-compile \\\n --iree-hal-target-backends=llvm-cpu \\\n --iree-input-type=stablehlo \\\n iree_input.mlir\n
.github/workflows/build_package.yml
build_tools/github_actions/build_dist.py
c
iree-build
iree-install
The source IREE commit SHA is embedded into pip releases in a few places. Starting in a python venv, you can find the IREE commit from both the shell:
\"$(find . -name 'iree-compile' -executable)\" --version\nIREE (https://iree.dev):\n IREE compiler version 20231016.553 @ f1cb2692a086738d7f16274b9b3af6d2c15ef133\n LLVM version 18.0.0git\n Optimized build\n
and the Python API:
python -c \"import iree.compiler.version as v; print(v.REVISIONS['IREE'])\"\nf1cb2692a086738d7f16274b9b3af6d2c15ef133\n
The Linux releases are done in a manylinux2014 docker container. At the time of this writing, it has gcc 9.3.1 and Python versions 3.5 - 3.9 under /opt/python. Note that this docker image approximates a 2014 era RHEL distro, patched with backported (newer) dev packages. It builds with gcc and the BFD linker unless you arrange otherwise. yum can be used to get some packages.
/opt/python
yum
Get a docker shell (see exact docker image in build_package.yml workflow):
docker run --rm -it -v $(pwd):/work/c stellaraccident/manylinux2014_x86_64-bazel-4.2.2:latest /bin/bash\n
Remember that docker runs as root unless you take steps otherwise. Don't write files in the /work/c directory, to avoid scattering root-owned files on your workstation.
/work/c
The default system Python is 2.x, so you must select one of the more modern ones:
export PATH=/opt/python/cp39-cp39/bin:$PATH\n
Build core installation:
# (from within docker)\ncd /work\npython ./c/build_tools/github_actions/build_dist.py main-dist\n\n# Also supports:\n# main-dist\n# py-runtime-pkg\n# py-xla-compiler-tools-pkg\n# py-tflite-compiler-tools-pkg\n# py-tf-compiler-tools-pkg\n
You can git bisect on the host and keep running the above in the docker container. Note that every time you run build_dist.py, it deletes the cmake cache but otherwise leaves the build directory (so it pays the configure cost but is otherwise incremental). You can just cd iree-build and run ninja for faster iteration (after the first build or if changing cmake flags). Example:
build_dist.py
cd iree-build
ninja
Extended debugging in the manylinux container:
cd /work/iree-build\n# If doing extended debugging in the container, these may make you happier.\nyum install ccache devtoolset-9-libasan-devel gdb\n\n# Get an LLVM symbolizer.\nyum install llvm9.0\nln -s /usr/bin/llvm-symbolizer-9.0 /usr/bin/llvm-symbolizer\n\n# You can manipulate cmake flags. These may get you a better debug experience.\ncmake -DCMAKE_BUILD_TYPE=RelWithDebInfo -DIREE_ENABLE_ASAN=ON -DCMAKE_EXE_LINKER_FLAGS=-fuse-ld=gold -DCMAKE_C_COMPILER_LAUNCHER=ccache -DCMAKE_CXX_COMPILER_LAUNCHER=ccache .\n\nninja\n\n# Or you may need this if buggy LLVM tools (like mlir-tblgen) are leaking :(\nASAN_OPTIONS=\"detect_leaks=0\" ninja\n
Other tips:
main-dist
py-runtime-pkg
To avoid interrupting the regular releases published on the IREE github, you can test any changes to the release process on your own fork. Some setup is required before these github actions will work on your fork and development branch.
You can run schedule_candidate_release.yml with a workflow dispatch from the actions tab. If you want to test using a commit other than the latest green on your main branch, modify the section that identifies the latest green commit to search from another commit or just hardcode one.
schedule_candidate_release.yml
main
To speed up build_package.yml, you may want to comment out some of the builds here. The py-pure-pkgs build takes only ~2 minutes and the py-runtime-pkg build takes ~5, while the others can take several hours.
build_package.yml
py-pure-pkgs
From your development branch, you can manually run the Schedule Snapshot Release action, which invokes the Build Release Packages action, which finally invokes the Validate and Publish Release action. If you already have a draft release and know the release id, package version, and run ID from a previous Build Release Packages run, you can also manually run just the Validate and Publish Release action.
AddressSanitizer, MemorySanitizer and ThreadSanitizer are tools provided by clang to detect certain classes of errors in C/C++ programs. They consist of compiler instrumentation (so your program's executable code is modified) and runtime libraries (so e.g. the malloc function may get replaced).
malloc
They are abbreviated as \"ASan\", \"MSan\" and \"TSan\" respectively.
They all incur large overhead, so only enable them while debugging.
See this documentation on leak detection. It is only enabled by default on some platforms.
To enable ASan:
cmake -DIREE_ENABLE_ASAN=ON ...\n
Several _asan tests like iree/tests/e2e/stablehlo_ops/check_llvm-cpu_local-task_asan_abs.mlir are also defined when using this configuration. These tests include AddressSanitizer in compiled CPU code as well by using these iree-compile flags:
_asan
iree/tests/e2e/stablehlo_ops/check_llvm-cpu_local-task_asan_abs.mlir
--iree-llvmcpu-link-embedded=false\n--iree-llvmcpu-sanitize=address\n
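A sketch of compiling a test module with those flags so that the generated CPU code is also ASan-instrumented (input.mlir and the output name are placeholders):
iree-compile --iree-hal-target-backends=llvm-cpu \\\n  --iree-llvmcpu-link-embedded=false \\\n  --iree-llvmcpu-sanitize=address \\\n  input.mlir -o module.vmfb\n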
You may want to use ASan when using the python bindings. One way to achieve this is to build Python (or whatever executable is going to use IREE as a shared library) with ASan. Another option is to link to the ASan runtime dynamically instead of linking it statically into an executable.
Using clang-12 (other versions should also work) as an example, configure IREE with something like:
cmake \\\n -DIREE_ENABLE_ASAN=ON \\\n -DCMAKE_EXE_LINKER_FLAGS=-shared-libasan \\\n -DCMAKE_SHARED_LINKER_FLAGS=-shared-libasan \\\n -DCMAKE_C_COMPILER=clang-12 \\\n -DCMAKE_CXX_COMPILER=clang++-12 \\\n ...\n
Then when running things the ASan runtime will have to be preloaded.
LD_PRELOAD=/usr/lib/llvm-12/lib/clang/12.0.0/lib/linux/libclang_rt.asan-x86_64.so \\\nASAN_SYMBOLIZER_PATH=/usr/lib/llvm-12/bin/llvm-symbolizer \\\n python ...\n
On Ubuntu the corresponding ASan runtime is provided by a package like libclang-common-12-dev depending on your Clang version. E.g.
libclang-common-12-dev
sudo apt install libclang-common-12-dev llvm-12 clang-12\n
Note that during the build you would also need to preload the ASan runtime, since the build executes its own binaries that are linked against the runtime.
LD_PRELOAD=/usr/lib/llvm-12/lib/clang/12.0.0/lib/linux/libclang_rt.asan-x86_64.so \\\nASAN_OPTIONS=detect_leaks=0 \\\nASAN_SYMBOLIZER_PATH=/usr/lib/llvm-12/bin/llvm-symbolizer \\\n cmake --build ...\n
If you want to run the IREE CUDA runtime driver, it is likely you would need:
ASAN_OPTIONS=\"protect_shadow_gap=0\"\n
Like this
LD_PRELOAD=/usr/lib/llvm-12/lib/clang/12.0.0/lib/linux/libclang_rt.asan-x86_64.so \\\nASAN_SYMBOLIZER_PATH=/usr/lib/llvm-12/bin/llvm-symbolizer \\\nASAN_OPTIONS=\"protect_shadow_gap=0\" \\\n python ...\n
For best results and to avoid false positives/negatives, TSan needs all userspace code to be compiled with TSan. This includes libstdc++ or libc++. libstdc++ is usually the default C++ runtime on Linux.
libstdc++
libc++
Building GCC 12's libstdc++ on Ubuntu 22.04 with Clang results in build errors. It seems that GCC and Clang share their TSan implementation. They may be interoperable, but to avoid problems we should build everything with GCC. This means using GCC both as a compiler and linker.
Get GCC 12.3 source code.
git clone --depth 1 --branch releases/gcc-12.3.0 \\\n https://github.com/gcc-mirror/gcc.git\n
SRC_DIR=$PWD/gcc\nBIN_DIR=$PWD/gcc/build\n
Building all dependencies of libstdc++ with TSan results in errors during linking of libgcc, which is a dependency of libstdc++. It would be desirable to build everything with TSan, but this excludes libgcc, as the TSan runtime libtsan itself depends on it. We therefore build libgcc without TSan. We still build it so that libstdc++'s configuration can find gthr-default.h, which is generated while building libgcc; if it is not found, C++ threads will silently have missing symbols.
libgcc
libtsan
gthr-default.h
LIBGCC_BIN_DIR=$BIN_DIR/libgcc\nmkdir -p $LIBGCC_BIN_DIR\ncd $LIBGCC_BIN_DIR\n\n$SRC_DIR/configure \\\n CC=gcc-12 \\\n CXX=g++-12 \\\n --disable-multilib \\\n --disable-bootstrap \\\n --enable-languages=c,c++\n\nmake -j$(nproc) --keep-going all-target-libgcc\n
Now build libstdc++.
LIBSTDCXX_BIN_DIR=$BIN_DIR/libstdc++\nmkdir -p $LIBSTDCXX_BIN_DIR\nLIBSTDCXX_INSTALL_DIR=$BIN_DIR/install/libstdc++\nmkdir -p $LIBSTDCXX_INSTALL_DIR\n\nGTHREAD_INCLUDE_DIR=$LIBGCC_BIN_DIR/x86_64-pc-linux-gnu/libgcc\nCXX_AND_C_FLAGS=\"-I$GTHREAD_INCLUDE_DIR -g -fno-omit-frame-pointer -fsanitize=thread\"\n\ncd $LIBSTDCXX_BIN_DIR\n$SRC_DIR/libstdc++-v3/configure \\\n CC=gcc-12 \\\n CXX=g++-12 \\\n CFLAGS=\"$CXX_AND_C_FLAGS\" \\\n CXXFLAGS=\"$CXX_AND_C_FLAGS\" \\\n LDFLAGS=\"-fsanitize=thread\" \\\n --prefix=$LIBSTDCXX_INSTALL_DIR \\\n --disable-multilib \\\n --disable-libstdcxx-pch \\\n --enable-libstdcxx-threads=yes \\\n --with-default-libstdcxx-abi=new\n\nmake -j$(nproc)\nmake install\n
When running programs you would need to use the sanitized version of libstdc++.
LD_LIBRARY_PATH=\"$LIBSTDCXX_INSTALL_DIR/lib\" \\\n my-program ...\n
To enable TSan:
cmake -DIREE_ENABLE_TSAN=ON ...\n
Several _tsan tests like iree/tests/e2e/stablehlo_ops/check_llvm-cpu_local-task_tsan_abs.mlir are also defined when using this configuration. These tests include ThreadSanitizer in compiled CPU code as well by using these iree-compile flags:
_tsan
iree/tests/e2e/stablehlo_ops/check_llvm-cpu_local-task_tsan_abs.mlir
Note that an IREE runtime built with TSan cannot load an IREE-compiled LLVM/CPU module unless those flags are used, so other tests are excluded using the notsan label.
notsan
In theory that should be a simple matter of
-DIREE_ENABLE_MSAN=ON\n
However, that requires making and using a custom build of libc++ with MSan as explained in this documentation.
As of April 2022, all of IREE's tests succeeded with MSan on Linux/x86-64, provided that the vulkan driver was disabled (due to lack of MSan instrumentation in the NVIDIA Vulkan driver).
Enabling UBSan in the IREE build is a simple matter of setting the IREE_ENABLE_UBSAN CMake option:
cmake -DIREE_ENABLE_UBSAN=ON ...\n
Note that both ASan and UBSan can be enabled in the same build.
On desktop platforms, getting nicely symbolized reports is covered in this documentation. The gist of it is make sure that llvm-symbolizer is in your PATH, or make the ASAN_SYMBOLIZER_PATH environment variable point to it.
llvm-symbolizer
ASAN_SYMBOLIZER_PATH
On Android it's more complicated due to this Android NDK issue. Fortunately, we have a script to perform the symbolization. Copy the raw output from the sanitizer and feed it into the stdin of the build_tools/scripts/android_symbolize.sh script, with the ANDROID_NDK environment variable pointing to the NDK root directory, like this:
stdin
build_tools/scripts/android_symbolize.sh
ANDROID_NDK=~/android-ndk-r21d ./build_tools/scripts/android_symbolize.sh < /tmp/asan.txt\n
Where /tmp/asan.txt is where you've pasted the raw sanitizer report.
/tmp/asan.txt
This script will happily just echo any line that isn't a stack frame. That means you can feed it the whole ASan report at once, and it will output a symbolized version of it. DO NOT run it on a single stack at a time! That is unlike the symbolizer tool that's being added in NDK r22, and one of the reasons why we prefer to keep our own script. For more details see this comment.
ASan
This document lists technical details regarding the CUDA implementation of IREE's Hardware Abstraction Layer, called a CUDA HAL driver.
IREE provides a Hardware Abstraction Layer (HAL) as a common interface to different compute accelerators. IREE HAL's design draws inspiration from modern GPU architecture and APIs, so implementing a HAL driver using CUDA is mostly straightforward, though there are places where we need emulation because there are no directly mapping concepts or mechanisms.
IREE HAL's design draws inspiration from modern GPU APIs--it provides explicit control of low-level GPU objects. The compiler is expected to plan object lifetimes and schedule workloads and synchronization in an optimized way; the IREE HAL implementation and the underlying GPU driver stack are expected to be a thin layer without much smarts and magic.
Therefore when implementing the IREE HAL using CUDA, we use the driver API instead of the runtime API. At runtime the HAL CUDA driver will load the libcuda.so/nvcuda.dll library dynamically and query a subset of the CUDA driver API used in HAL via the cuGetProcAddress() API.
libcuda.so
nvcuda.dll
cuGetProcAddress()
There is no direct CUDA construct that maps to the IREE HAL iree_hal_driver_t abstraction. We use it to hold the dynamic symbols loaded for all devices and to handle device enumeration and creation.
iree_hal_driver_t
iree_hal_cuda_device_t implements iree_hal_device_t to provide the interface to a CUDA GPU device by wrapping a CUdevice. For each device, right now we create two CUstreams--one for issuing commands for memory allocation and kernel launches as instructed by the program; the other for issuing host callback functions after dispatched command buffers complete. See the synchronization section regarding the details.
iree_hal_cuda_device_t
iree_hal_device_t
CUdevice
CUstream
The CUDA HAL driver supports async allocation (iree_hal_device_queue_alloca() and iree_hal_device_queue_dealloca()) via CUDA stream ordered memory allocation.
iree_hal_device_queue_alloca()
iree_hal_device_queue_dealloca()
The async_allocations field in the iree_hal_cuda_device_params_t struct allows enabling this feature.
async_allocations
iree_hal_cuda_device_params_t
iree_hal_command_buffer_t is a recording of commands to issue to the GPU; when the command buffer is submitted to the device it's then actually executed on the GPU asynchronously.
iree_hal_command_buffer_t
Two implementations of iree_hal_command_buffer_t exist in the CUDA HAL driver--one backed by CUgraph and the other backed by CUstream.
CUgraph
CUgraph conceptually matches iree_hal_command_buffer_t better given it's a recording of commands to issue to the GPU. Also, using the CUgraph API allows easily encoding fine-grained dependencies between dispatches without having to create multiple streams. Therefore, the CUgraph-backed implementation is the more natural one. Note, though, that the CUgraph API is meant to be used for recording once and replaying multiple times, and there may be a performance penalty to using the CUgraph API for one-shot command buffers.
The CUstream-backed implementation just issues commands directly to a CUstream when recording. Commands issued to a CUstream can be immediately sent to the GPU for execution; there is no recording and replaying separation. In order to match the recording semantics of iree_hal_command_buffer_t when using the CUstream-backed command buffer, we first record the command buffer into an in-memory iree_hal_deferred_command_buffer_t, and then, when applying the command buffer, replay it against a newly created CUstream-backed implementation.
iree_hal_deferred_command_buffer_t
The command_buffer_mode field in the iree_hal_cuda_device_params_t struct allows selecting which implementation to use.
command_buffer_mode
The allocator will forward allocation requests to cuMemHostAlloc() for host local memory, cuMemAlloc() for device local and host invisible memory, and cuMemAllocManaged() for device local and host visible memory.
cuMemHostAlloc()
cuMemAlloc()
cuMemAllocManaged()
CUDA buffers are represented either as a host pointer or a device pointer of type CUdeviceptr.
CUdeviceptr
iree_hal_executable_t maps naturally to CUmodule.
iree_hal_executable_t
CUmodule
The compiler generates a FlatBuffer containing a PTX image as well as a list of entry point functions and their associated metadata (names, workgroup size, dynamic shared memory size, etc.). At runtime, the CUDA HAL driver loads the PTX image and creates CUfunctions out of it for various entry points.
CUfunction
iree_hal_event_t right now is not used in the compiler so it's not yet implemented in the CUDA HAL driver.
iree_hal_event_t
The IREE HAL uses semaphores to synchronize work between host CPU threads and device GPU streams. It's a unified primitive that covers all directions--host to host, host to device, device to host, and device to device, and allows flexible signal and wait ordering--signal before wait, or wait before signal. There is also no limit on the number of waits on the same value.
The core state of a HAL semaphore consists of a monotonically increasing 64-bit integer value, which forms a timeline--signaling the semaphore to a larger value advances the timeline and unblocks work waiting on some earlier values. The semantics closely mirrors Vulkan timeline semaphore.
In CUDA, there are no direct equivalent primitives providing all the capabilities needed by the HAL semaphore abstraction:
cuStreamWriteValue64()
cuStreamWaitValue64()
cuSignalExternalSemaphoresAsync()
cuWaitExternalSemaphoresAsync()
CUexternalSemaphore
cuImportExternalSemaphore()
Therefore, to implement the support, we need to leverage multiple native CPU or CUDA primitives under the hood.
CUevent
The main synchronization mechanism is CUDA event--CUevent. As a functionality and integration baseline, we use CUevent to implement the IREE HAL semaphore abstraction.
CUevent natively supports the following capabilities:
cuEventSynchronize()
cuGraphAddEventWaitNode()
cuEventRecord()
cuGraphAddEventRecordNode()
We need to fill the remaining capability gaps. Before going into details, the overall approach would be to:
Concretely, for a given HAL semaphore, looking at the four directions:
A CPU thread signals the semaphore timeline to a new value.
If there are CPU waits, it is purely on the CPU side. We just need to use common CPU notification mechanisms. In IREE we have iree_event_t wrapping various low-level OS primitives for it. So we can just use that to represent a wait timepoint. We need to keep track of all CPU wait timepoints in the timeline. After a new signaled value, go through the timeline and notify all those waiting on earlier values.
iree_event_t
If there are GPU waits, given that there is no way to signal a CUevent from the CPU, one way to handle this is to cache and defer the submission batches by ourselves until the CPU signals past the desired value. To support this, we would need to implement a deferred/pending actions queue.
GPU signals can only be through a CUevent object, which has a binary state. We need to advance the timeline too. One way is to use cuLaunchHostFunc() to advance from the CPU side with iree_hal_semaphore_list_signal(). This additionally would mean we can reuse the logic from CPU signaling to unblock CPU waits.
cuLaunchHostFunc()
iree_hal_semaphore_list_signal()
After advancing the timeline from the CPU side with cuLaunchHostFunc(), we can release more workload from the deferred/pending actions queue to the GPU. Though, per the documentation of cuLaunchHostFunc(), \"the host function must not make any CUDA API calls.\" So we cannot do that directly inside cuLaunchHostFunc(); we need to notify another separate thread to call CUDA APIs to push more work to the GPU. So the deferred/pending actions queue should have an associated thread.
For GPU waits, we can also leverage the same logic--using CPU signaling to unblock deferred GPU queue actions. Though this works, it is not performant, given that the CPU is involved in GPU-internal synchronization. We want to use CUevent instead:
Another problem is that per the cuLaunchHostFunc() doc, \"the function will be called after currently enqueued work and will block work added after it.\" We don't want the blocking behavior involving host. So we can use a dedicated CUstream for launching the host function, waiting on the CUevent from the original stream too. We can also handle resource deallocation together there.
To summarize, we need the following data structures to implement the HAL semaphore (a rough sketch of the pending-actions piece follows the list):
iree_event_pool_t
iree_hal_cuda_event_t
iree_hal_cuda_event_pool_t
iree_hal_cuda_timepoint_t
iree_hal_cuda_timepoint_pool_t
iree_hal_cuda_timeline_semaphore_t
iree_hal_cuda_queue_action_t
iree_hal_cuda_pending_queue_actions_t
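To make the last two items more concrete, here is a purely hypothetical sketch of the shape of a pending/deferred actions queue with its associated worker thread; the actual iree_hal_cuda_queue_action_t and iree_hal_cuda_pending_queue_actions_t differ in detail:

// Hypothetical shape of the deferred/pending actions queue. All CUDA API
// calls happen on the worker thread, never inside cuLaunchHostFunc() callbacks.
#include <pthread.h>
#include <stdbool.h>
#include <stdint.h>

typedef struct queue_action_t {
  // Timeline value that must be reached before this action may be issued.
  uint64_t wait_value;
  // Opaque payload: e.g. a recorded command buffer to submit to a stream.
  void* payload;
  struct queue_action_t* next;
} queue_action_t;

typedef struct {
  pthread_mutex_t mutex;
  pthread_cond_t notify;      // signaled from host functions / CPU signals
  queue_action_t* pending;    // actions waiting for their timeline values
  bool exit_requested;
} pending_queue_actions_t;

// Worker thread body: issues actions whose wait values have been reached.
void* pending_queue_worker(void* arg) {
  pending_queue_actions_t* q = (pending_queue_actions_t*)arg;
  pthread_mutex_lock(&q->mutex);
  while (!q->exit_requested) {
    // ... scan |pending| for ready actions, submit them to the GPU stream,
    // then sleep until the next notification arrives ...
    pthread_cond_wait(&q->notify, &q->mutex);
  }
  pthread_mutex_unlock(&q->mutex);
  return NULL;
}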
A not-so-concise walkthrough of various IREE features that are in the design process and planned for future versions. A lot of the questions around how the IREE IR is designed and why certain components exist (such as the VM) hopefully become much clearer when seeing where we want to go with the infrastructure we are building (as opposed to where we currently are with our MVP slice). This document is not meant to encompass the entire design of any individual feature; if there's interest, please say hi on the iree-discuss mailing list.
It's assumed that any work related to quantization/compression has happened prior to lowering into IREE dialects. Our plan is to use the proposed Quantization Transforms to achieve both training and inference-time quantization of types in a way that preserves maximum accuracy. IREE will support running with original unquantized floats in all cases, allowing for a smooth on-ramp to quantization and the gains in performance and reduction in model size that come from it.
As future work IREE would like to move beyond these transformation-directed approaches to quantization and interface directly to frontends which have a defined enough type system to represent accurate quantized (and otherwise compressed) computations directly, not relying exclusively on compiler-side type inference transforms.
flow
The flow dialect is designed to allow us to extract as much concurrency as possible from a program and partition IR into the scheduling and execution domains. Today we have the IR structure and transformation flow in place but have not yet got to the most interesting things such an infrastructure enables. A majority of the largest performance, latency, and memory usage improvements IREE can offer are determined first here and all following lowerings benefit. The fastest code is the code you don't execute and the smallest allocation is the allocation you don't make ;)
flow.stream
A majority of the readbacks we have today (manifested as flow.tensor.load.* ops) will be removed when we have an HLO tensor->primitive conversion. There will still be cases when readbacks are required for correctness but they usually fall into a small set of usage patterns. For those that don't, this is one place where IREE will warn about performance issues, allowing programs that perform suboptimally while encouraging authors to adjust their input model to enable better behavior. The IREE VM also has specific support for hiding readback latency in an efficient way via coroutines.
The most common case we are currently seeing in the IR is that of dynamic copies where the offsets are dependent on the result of previous computations. Source models may have top-k + gather operations, for example. These appear as a flow.stream, a flow.tensor.load, and then another flow.stream that uses the loaded value for a flow.tensor.update (or other operation):
%index_tensor = flow.ex.stream.fragment(...) -> tensor<i32> { ... }
%index = flow.tensor.load %index_tensor : tensor<i32>
%result = flow.ex.stream.fragment(%arg0 = %index : i32, ...) -> ... {
  %0 = flow.dispatch ...
  %1 = flow.tensor.update %0, %arg2[%index] : tensor<10xf32> -> tensor<1x10xf32>
  ...
}
Today the flow.tensor.update turns into HAL command buffer transfer operations that must have their offsets known at recording time. This is a limitation of vkCmdCopyBuffer but not a fundamental limitation of any hardware. In fact several drivers implement copies as small built-in shader programs meaning that we could perform the same expansion here with the right primitives. This would allow, in the above example, both the index to be computed and the tensor to be updated within the same stream to entirely remove the host round-trip.
The current flow.ex.stream.fragment, as denoted by the experimental tag, is a temporary implementation designed to get the concept of streams lowered to the HAL dialect. For streams to be effective at modeling larger concurrency scopes they need to be able to move across branches in the CFG. This intuitively follows exactly what one would do if recording commands in C:
vkCmdCopyBuffer(cmd, ...);
if (some_flag) {
  vkCmdBindPipeline(cmd, ..., pipeline_a);
} else {
  vkCmdBindPipeline(cmd, ..., pipeline_b);
}
vkCmdDispatch(cmd, ...);
The corresponding flow IR:
  flow.stream.append[%s0](...) {
    flow.tensor.update ...
  }
  %b = arith.cmpi ne %some_flag, ...
  cond_br %b, ^a(%s0), ^b(%s0)
^a(%s1):
  flow.stream.append[%s1](...) {
    flow.dispatch @pipeline_a, ...
  }
  br ^end(%s1)
^b(%s2):
  flow.stream.append[%s2](...) {
    flow.dispatch @pipeline_b, ...
  }
  br ^end(%s2)
^end(%s3):
  ...
This allows the entire stream to be lowered into one command buffer without the need for any host round-trips. The conversion into the flow dialect will walk the CFG and attempt to thread the flow.stream values through so long as there are no external dependencies.
flow.dispatch
While the flow.stream threading through the CFG can remove many of the simpler conditional dispatches there will always be some that will have their execution dependent on the result of prior dispatches. For these a flow.cond_dispatch will allow a condition to be provided that must be true for the dispatch to actually be performed.
For targets that natively support predication in their command buffers (such as D3D12's ID3D12GraphicsCommandList::SetPredication) this provides a host round-trip-free way of conditionally executing dispatches and transfers. Unfortunately Vulkan support is still lacking, though Nvidia supports the VK_EXT_conditional_rendering extension that exposes the same behavior.
For targets that do not support predication natively it's still possible to emulate predication with indirect dispatches. In this model the workgroup counts normally used to dispatch execution are sourced from another device buffer at the time the dispatch is made instead of sourced from the command buffer at the time the dispatch is recorded. Degenerate dispatches with counts of 0, 0, 0 allow for effective neutering of the dispatch with minimal overhead (vs. the significant penalty of a host round-trip!).
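A minimal Vulkan sketch of this emulation, assuming earlier device work wrote the VkDispatchIndirectCommand (possibly 0, 0, 0) into counts_buffer; the function and parameter names are illustrative:

// Sketch of predication emulated via indirect dispatch: the workgroup counts
// are read from |counts_buffer| at execution time, so a prior dispatch or
// transfer can write 0,0,0 there to neuter this dispatch with no host
// round-trip.
#include <vulkan/vulkan.h>

void record_predicated_dispatch(VkCommandBuffer cmd,
                                VkPipeline pipeline,
                                VkBuffer counts_buffer,
                                VkDeviceSize counts_offset) {
  vkCmdBindPipeline(cmd, VK_PIPELINE_BIND_POINT_COMPUTE, pipeline);
  // The VkDispatchIndirectCommand {x, y, z} at |counts_offset| is produced on
  // the device by earlier work in the same command buffer or submission.
  vkCmdDispatchIndirect(cmd, counts_buffer, counts_offset);
}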
By modeling such predication at the flow level we are able to lower into the HAL with target-aware predication semantics and fuse indirect dispatch workgroup count calculations into existing dispatches already being performed such that overhead is reduced.
flow.executable
While still in the flow dialect, the executables are target-agnostic. This makes simple IR tree diffing a potential solution to deduplication. Since most of the dispatches originate from the same source-language library calls in input frameworks there's a high likelihood of duplication, and depending on when inlining is performed we may have stronger or weaker ability to perform the deduplication. Thanks to the MLIR canonicalization pass (that ensures ops are rearranged into consistent canonical representations) the IR comparisons can be done rather trivially.
Common subexpression elimination is performed many times during lowering, however there comes a point where the CSE can introduce false dependencies and additional allocations that are otherwise avoidable. For example, if a broadcasting operation is CSE'd and the result is then used by two or more operations that are scheduled independently, what would have been a relatively cheap lowering of the broadcast to a simple index remapping now becomes an additional dispatch, materialization of an intermediate tensor, and a barrier:
%bcast = "mhlo.broadcast_in_dim"(%cst) : (tensor<f32>) -> tensor<1024x10xf32>
%mul1 = mhlo.multiply %arg0, %bcast : tensor<1024x10xf32>
// (pretend something here that prevents fusion)
%mul2 = mhlo.multiply %arg1, %bcast : tensor<1024x10xf32>
%bcast = flow.dispatch.region(%cst : tensor<f32>) -> tensor<1024x10xf32> {
  %0 = "mhlo.broadcast_in_dim"(%cst) : (tensor<f32>) -> tensor<1024x10xf32>
  return %0 : tensor<1024x10xf32>
}
// a barrier will be required here
%mul1 = flow.dispatch.region(%arg0 : tensor<1024x10xf32>, %bcast : tensor<1024x10xf32>) -> tensor<1024x10xf32> {
  %1 = mhlo.multiply %arg0, %bcast : tensor<1024x10xf32>
  return %1 : tensor<1024x10xf32>
}
%mul2 = flow.dispatch.region(%arg1 : tensor<1024x10xf32>, %bcast : tensor<1024x10xf32>) -> tensor<1024x10xf32> {
  %2 = mhlo.multiply %arg1, %bcast : tensor<1024x10xf32>
  return %2 : tensor<1024x10xf32>
}
Instead, the broadcast should be rematerialized inside both dispatch regions, as the cost of doing so in compute resources is significantly lower and the intermediate tensor is then not required at all. Though at first it may seem counter-intuitive to undo such a critical optimization as CSE (both to code size and often to compute), it's something we must carefully balance while looking at the whole system. It gets even more important when considering multi-device execution as the cost of sharing memory and synchronizing may be extremely non-trivial.
While still within the flow dialect we have the ability to easily split streams and safely shuffle around operations. Target execution backends can opt into such behavior to ensure that device restrictions such as maximum in-flight memory, maximum scheduling depth, and capabilities are observed. For heterogeneous configurations the intent is that certain operations, dispatches, and streams can be attributed to specify which device categories they should be lowered to. The constraint solving that takes place can be provided with generic heuristics ("big GEMMs go on the accelerator"), profile-guided databases based on benchmarks, learned traits via ML, etc.
hal
As the IREE HAL is designed almost 1:1 with a compute-only Vulkan API many of the techniques classically used in real-time graphics apply. The benefit we have by modeling our usage of such a low-level API in IR is that the normal work - some of which is very non-trivial - for managing allocations, tracking resource lifetime, and ensuring proper synchronization/barriers is something we can apply the full force of an offline compiler against.
hal.interface
The hal.interface op specifies the ABI between the scheduler and the device containing the buffer bindings and additional non-buffer data (parameters, shapes, specialization flags, etc). Today a naïve ordering is used uniformly for all targets; however, it is possible for target backends to opt into providing their own interfaces based on target configuration. The same hal.executable may have multiple interfaces and the same backend may use one or more. This is useful for when target capabilities may vary at runtime, such as the number of available storage buffer bindings in Vulkan. By exposing a few hal.interface variants with different binding amounts the Vulkan backend could make better use of the larger number of bindings available at runtime while still providing support for smaller configurations.
Once we have multiple hal.interfaces defined for executables the scheduler needs to emit HAL ops that properly switch between them. By having a canonical form for bindings we can ensure that only the differences between the interfaces will need additional code.
Though the flow dialect attempts to fuse as many ops as possible into dispatch regions, it's not always possible for all target backends to schedule a region as a single dispatch. A classic example is algorithms like parallel reduction commonly used on GPUs that may require many dispatches to identical executables, while other algorithms may vary the executables they use based on the input parameters such as shape or the target runtime device support.
By default the flow.dispatch executable translation to hal.executables is performed 1:1 and it is assumed that a single dispatch is required. Extending target backends with scheduling interfaces (enabling them to opt into different scheduling behavior) will allow the backends to emit any number of hal.executables and any stream commands (such as additional dispatches or transfers) they may need. This is effectively equivalent to what would otherwise be done at runtime, only here we are still operating on IR prior to buffer allocation and can use the hal ringbuffer primitive. Through this we can elide many of the allocations that would otherwise be required at runtime (and the concurrency-limiting false dependencies that usually come along with scratch memory).
Since the algorithm used may vary based on the parameters of the dispatch (such as the shape of the reduction, which may be dynamically determined), scheduling specialization may occur even when targeting a single backend. In many cases folding and canonicalization can eliminate the overhead, as the same IR is present whether one dynamically computed workgroup size is used or another.
Many explicit hardware APIs require knowing how buffers are used alongside where they should be located. For example, this additional information determines caching policy on buffer accesses (write-through, write-back, etc), visibility of writes across compute units, and the possible MMU properties that may need to be maintained/matched for the buffer. By using the SSA-form value-semantics of the MLIR tensor as used in the flow dialect we have complete information of where buffers may be used, or at least where they enter or leave regions from which we can derive such information.
Analysis passes can run over IR to attribute tensors such that when allocation is performed during lowering to the hal dialect we do so from an allocator compatible with where the buffer will be used, with memory types chosen based on the potential cost and location of operations performed (write-only on host vs. read-write on host and device, etc), and with usage bits indicating what kind of operations may be performed on the buffer. Many of these are local transformations as most buffers are only live within very small regions such as the flow.stream encompassing their usage.
Traditional systems need to either use very permissive buffer properties or heuristics that can introduce additional non-trivial overhead when such heuristics are incorrect. For example, OpenGL had several such usage hints that drivers were then able to use, but almost no drivers behaved as desired in all cases and it led to additional memory ghosting, copies, readbacks, and unpredictable performance. For almost all uses of the buffers within an IREE invocation we instead can know precisely where and how buffers may need to be moved and do it a minimum number of times if it is required.
For targets that may require runtime preprocessing of their executables prior to dispatch, such as SPIR-V or MSL, the IREE HAL provides a caching and batch compilation mechanism based on Vulkan's Pipeline Cache.
Today each executable is compiled on-demand and cached only for the process lifetime. Though some drivers may provide their own caching we can make better use of the explicit caching and compilation behavior with the additional information we have in the compiler.
For any given entry point (or group of entry points) into an IREE module we can perform reachability analysis to know which executables may be executed when that entry point is invoked. In this way we can emit pre-invocation compilation checks (similar to a std::call_once block) that provide all required executables for compilation and allow more efficient compilation through multithreading the compiler invocations. These same compilation caching functions can be exposed and invoked manually by an application to force pre-compilation when it is least likely to impact the user, such as a post-install/first-run step or concurrently while other application features are loading.
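A hedged C sketch of the idea, using pthread_once where C++ code would use std::call_once; compile_reachable_executables() is a hypothetical stand-in for the generated preparation function:

// Sketch: ensure all executables reachable from an entry point are compiled
// exactly once before the first invocation is executed.
#include <pthread.h>

static pthread_once_t entry_point_prepare_once = PTHREAD_ONCE_INIT;

static void compile_reachable_executables(void) {
  // Hand the full set of executables required by this entry point to the
  // device's pipeline cache so compilation can be batched and multithreaded.
}

int invoke_entry_point(void) {
  // Equivalent in spirit to a std::call_once block guarding preparation.
  pthread_once(&entry_point_prepare_once, compile_reachable_executables);
  // ... record and submit the actual work ...
  return 0;
}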
We can use zero or more scoped caches for executables within a module. Completely dynamic modules (such as those emitted in eager-mode usage) may avoid the caching overhead entirely, while modules that have several primary usage modes (such as training and inference) may choose to use independent caches for each such mode.
The caches generated can then be retrieved and saved by the hosting application. Upon the next execution the application can provide the caches and if still valid they will be used to avoid compilation.
An advantage of representing executable binaries in IR after translation is that we can apply various post-compilation compression and minification techniques while still knowing precisely where the executable will be used. This is extremely important for SPIR-V as it is not designed to be a small at-rest format. Though the biggest lever we have to control generated code size is higher-level deduplication and specialization, there will still be a sufficiently large number of executable binaries we will need to embed within the final modules, and having targeted approaches for reducing their size beyond just "gzip everything" is very powerful.
For example, SMOL-V is a fantastic lossless SPIR-V compression technique that, when coupled with modern dictionary-based compression algorithms, can save significant binary size. As a data point, the SPIR-V corpus SMOL-V uses for testing goes from 4.8MiB of raw SPIR-V to 348KiB of compressed SMOL-V.
Combined with Batched Executable Caching and Precompilation we can easily use shared dictionaries and other cross-artifact compression in a relatively plug-in way.
It's still an area that needs more research but one goal of the IREE design was to enable efficient target- and context-aware compression of large constants (typically model weights/parameters/embeddings). This may mean reusing existing hardware compression formats on GPUs, ML accelerator-specific formats, or very-low-bit-depth (1-4 bit per value) quantization techniques that cannot be directly used without first decompressing. The inspiration here is formats like Crunch and Basis Universal that perform "supercompression", and we may even be able to use these directly as then we can make use of GPU hardware samplers to do the 4-bit to 32-bit decompression, etc.
The IREE HAL - much like Vulkan it is based on - eschews much of the state that traditional APIs have in favor of (mostly) immutable state objects (pipeline layouts, pipeline states, descriptor sets, etc). There are still a few stateful entry points in the API, though, and deduplicating or reordering redundant calls can reduce IR, API, and execution overhead.
The key place this will have the largest impact is around descriptor set bindings and push descriptors, both of which are state and can have non-trivial setup overhead. A canonicalization for such commands that inspects the target hal.command_buffer to see if the same state was set prior and code motion to move such commands out of loop bodies when possible would be helpful.
A core concept of the IREE scheduler that allows for overlapping in-flight invocations is that of the resource timeline. This identifies module state that can be in use by multiple invocations and assigns timeline milestones denoting when the resource will be in the appropriate state for the current invocation to proceed. Conceptually it is like an epoch-based synchronization mechanism as commonly found in garbage collectors to allow for lock-free asynchronous memory reclamation.
The advantage we have in the IR is that we know both the usage of all resources thanks to buffer usage tracking and the synchronization domains of all resources (in most cases). This allows us to effectively assign one timeline semaphore per writeable resource while in practice having far fewer than 1:1, as for example if two resources are only ever written in the same command buffer only one semaphore is needed to signal the completion of both writes.
By transforming IR to sink all resource reads and writes closest to where the value is used we can enlarge the time windows that can overlap across invocations that may share those resources. This is similar to what out-of-order CPUs do with register renaming/reorder buffers/etc and something we can apply some traditional instruction scheduling techniques to (only here our 'instructions' are entire command buffer dispatches/transfers).
Two degenerate cases of this approach are that of resource indirection (util.ptr<tensor<T>>) and dynamic resource shapes. In these two cases it may not be possible to continue recording commands even if we are able to ensure execution is appropriately synchronized. This is where indirect dispatch, predication, indirect command buffers, and VM coroutines can all help cover for the times where we are unable to transform away the indirection or emit shape logic without data dependencies.
(When properly implemented) almost all buffers required during execution never escape the command buffers they are used in or a single VM invocation. We can trivially identify this from the explicit captures of flow.stream and flow.dispatch ops and the fact that all tensor types have value-semantics. Only those tensor values loaded-from/stored-to module state or that cross the exported module function boundary need special consideration while almost everything else can live transiently only so long as it is required during execution.
Thanks to this information about buffer usage and lifetime we can use a ringbuffer to store the transient tensor data and other required data reservations such as uniform buffers used to pass dynamic parameters (shapes, flags, etc) into dispatches. This gives the compiler and the application a knob that allows them to control maximum concurrency (by having a very large ringbuffer) or maximum memory usage (by having a minimally small ringbuffer).
Allocating tensors from the ringbuffer does not require sophisticated runtime packing as we can emit IR to calculate required sizes for dynamically shaped tensors. Whether a basic block reserves %sz = arith.constant 42 : index bytes or %sz = std.muli %cst, %dyn_dim : index bytes doesn't materially change how the allocations are performed. Since almost all usage involves simple write head bumps there is no need for ahead-of-time memory planning or large fixed allocations, and since no buffer within the ringbuffer can alias we can have coarse (read: low overhead) guarantees about the availability of certain regions of the ringbuffer ("when this event is signaled all prior ringbuffer writes have completed").
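A simplified sketch of such a write-head-bump reservation (alignment handling and the timeline availability check are elided; the names are illustrative):

// Sketch of transient allocation from a ringbuffer via simple write-head
// bumps; the requested size may be a compile-time constant or computed from
// dynamic dimensions, and the runtime only sees the final byte count.
#include <stddef.h>
#include <stdint.h>

typedef struct {
  uint8_t* base;
  size_t capacity;
  size_t write_head;  // bumped for each transient reservation
} ringbuffer_t;

// Returns the offset of a |size|-byte reservation within the ringbuffer.
size_t ringbuffer_reserve(ringbuffer_t* rb, size_t size) {
  if (rb->write_head + size > rb->capacity) {
    // Wrap around; real code must also check against the resource timeline
    // that the region being reused has been released by prior work.
    rb->write_head = 0;
  }
  size_t offset = rb->write_head;
  rb->write_head += size;
  return offset;
}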
Usually any planning we may want to perform can be done in IR via code motion. For example applying traditional algorithms used to reduce register pressure will help us attain narrower live windows within the ringbuffer leading to a larger number of in-flight operations for the same ringbuffer memory usage.
We may end up using both a classical ringbuffer and a variant known as the bip buffer because it is better for descriptor set utilization (as we can provide many dispatch parameters with a single base offset bound once at the beginning of a region).
Function calls made across modules (either from C++ into the VM, VM->VM, or VM->C++) should be able to define timeline semaphores used to wait and signal on the call. We can do this by making all exports automatically have the semaphores and then making invocations populate them if they were not provided by the caller. In this way we can allow multiple invocations of exported functions to chain naturally with internal asynchronous workloads, turning most IREE invocations into just recording of command buffers that can never block.
When combined with VM coroutine support we even have the ability to interleave any required host execution between the wait and signal semaphores provided such that the caller never knows on which device execution is taking place. It's still possible to provide synchronous wrappers that emulate blocking behavior but by having the core system designed around a single system-supported primitive we avoid the need for additional things like interrupt watchdog threads, implicit blocking, and other pitfalls.
One approach to using multiple cores on a CPU is to perform interior parallelization of operations using constructs such as OpenMP or library-call-based custom thread pools (gemmlowp). This works when each individual operation is relatively costly vs. potential pipeline bubbles caused by work spinning down near the end of an operation and spinning up at the beginning of the next.
IREE is designed to handle many more workloads - some of which have very narrow shapes but very deep pipelines (like search algorithms) - such that the above approach of multithreading within ops becomes a bottleneck. These workloads are traditionally very poorly handled by frameworks and issues with oversubscription, pipeline stalls, and suboptimal system schedulers (such as on Android) can lead to more time being spent thrashing about than actually executing real work.
The approach we take here is to treat the cores of a CPU as if they were computation units on a GPU, each able to perform some set of heterogeneous work independent of other units. This means that the concurrency we are trying to model at the flow level and communicate to the runtime via the hal (which explicitly states which dispatches can overlap and the size of the workgroups) can trivially be used to distribute this work over many cores exactly as a GPU would do it. Integration with library calls that may require their own threading (such as Ruy) requires that they be able to use the IREE thread pool instead of their own.
In this way we can avoid pipeline bubbles and other latency-inducing unpredictable scheduling. This does not mean that we treat individual units of work at the same scale as we would for GPUs, but instead that we tile and have one or more processing units that allow us to work on those tiles. Whether the tile size is defined by a library call contract, heuristics, or empirically is TBD, but expect workgroup sizes in the thousands to millions of invocations vs. normal GPU workgroup sizes in the dozens to hundreds of invocations.
To achieve this style of scheduling efficiently we'll likely use something like marl as the scheduler. Marl provides cross-platform low-overhead fibers and is compatible with this style of scheduling as it was built for the Swiftshader software rasterizer.
Even if IREE was only targeting CPUs the assertion is that we would still want to schedule this way and it's only an incidental benefit that if building for heterogeneous targets the scheduling code may be shared (just with a different divisor for workgroup count calculations).
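As an illustrative sketch of this scheduling style (not IREE's actual task system), distributing the workgroups of a single dispatch across a small pool of CPU workers could look like:

// Sketch: treat CPU cores like GPU compute units by having each worker pull
// workgroup indices from a shared dispatch grid. Names and the tile function
// are hypothetical.
#include <pthread.h>
#include <stdint.h>

typedef void (*workgroup_fn_t)(uint32_t group_x, uint32_t group_y, void* ctx);

typedef struct {
  workgroup_fn_t fn;
  void* ctx;
  uint32_t count_x, count_y;  // workgroup counts, as a GPU dispatch would use
  uint32_t next;              // atomically incremented work index
} dispatch_t;

static void* worker(void* arg) {
  dispatch_t* d = (dispatch_t*)arg;
  for (;;) {
    // Each worker grabs the next workgroup index independently, mirroring how
    // GPU compute units consume a dispatch grid.
    uint32_t i = __atomic_fetch_add(&d->next, 1, __ATOMIC_RELAXED);
    if (i >= d->count_x * d->count_y) break;
    d->fn(i % d->count_x, i / d->count_x, d->ctx);
  }
  return NULL;
}

void dispatch_on_cpu(dispatch_t* d, int num_workers) {
  pthread_t threads[8];
  if (num_workers > 8) num_workers = 8;
  for (int t = 0; t < num_workers; ++t) pthread_create(&threads[t], NULL, worker, d);
  for (int t = 0; t < num_workers; ++t) pthread_join(threads[t], NULL);
}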
vm
The VM is designed as a dynamic linkage ABI, stable bytecode representation, and intermediate lowering IR. Many of the optimizations we can perform on it will benefit all use cases (such as when lowering to LLVM IR) by allowing higher-level program transformations around synchronization that are difficult to perform on arbitrary LLVM IR.
One of the largest features currently missing from the VM is coroutines (aka user-mode fiber scheduling). Coroutines are what will allow us to have multiple in-flight invocations into a module - some of which may be waiting on external events - without the need for complex multithreading logic or state machine machinations.
In many cases once semaphores are exposed to callers we will not need to yield in the VM. The user will call into the module with provided semaphores, the work to perform will be recorded to one or more command buffers and submitted to the device, and then control will return to the caller immediately.
In cases requiring host readbacks that we were not able to remove, however, additional VM code may need to run prior to when the final semaphore is signaled. To preserve the asynchronous interface and immediate execution guarantees the compiler can emit explicit yield points (vm.yield) that are known-good locations for yielding (such as most resources not required after the yield having been flushed/discarded, partial synchronization scope availability if other work may be able to execute concurrently irrespective of the yielded coroutine, etc).
When the VM encounters the yield at runtime it will suspend the coroutine until a defined condition is met. Many coroutines can be in various states at any given time and - thanks to the resource timeline - can still be memory safe. For example if two stateless invocations are made with a common wait semaphore both can be recorded and submitted without waiting on each other. If there is internal module state accessed the invocations are implicitly ordered by invocation order (similar to what Vulkan calls API order) based on internal resource timeline semaphores.
Waking the coroutines can be performed by an application-provided callback in the case of the application already having a periodic event which is doing bookkeeping (such as frame end callbacks when rendering or Looper idle events on Android), giving direct control over the frequency and location at which IREE performs additional work. A helper will be provided as well that runs a dedicated IREE thread to do this, but the expectation is that applications can often do a better (and importantly more predictable) job.
By utilizing coroutines IREE will have a way to fill traditional pipeline bubbles even with execution from the same module (let alone across modules) in the situation where host readbacks or other logic is required. This increases overall throughput and utilization while reducing host wakeups as many coroutines can be processed at once to submit new work to the device queues, though it does not help reduce per-invocation latency.
External code such as the HAL implementation or user ops may provide the wait handles used for continuation. For example, the HAL can expose a function that yields and wakes only when one or more timeline semaphores reach their target values:
// submit work
hal.device.yield %semaphore4 >= %sem4_target, %semaphore5 >= %sem5_target
// continue here, possibly much later in time
Though coroutines help throughput there is a way we've found to reduce latency that's been documented as cellular batching. This same technique has been implemented in prior internal systems and is one of the motivating design goals for IREE's creation. The core idea is to identify small uniform work that can be partitioned and scheduled greedily such as to enable batching or reduce associated invocation costs (such as refreshing accelerator SRAM/caches with new parameters). This usually manifests as finding large GEMM/GEMV operations using the same fixed parameters and either dynamically increasing the batch size by adding the waiting work (without deferring the actual execution time) or sequencing them back to back to ensure better cache utilization. Which approach is taken depends on any data dependencies that may be present (such as LSTM state feedback edges).
With the foundation of coroutines in IREE it's possible to yield execution at any given point - including during command buffer recording - and wake on specific conditions. A majority of the logic can be built into the module itself with very little need for runtime machinery, as shared VM variables can be used to track pending work across invocations (even from different parts of the program) and flush based on logic wholly controlled by the user or compiler (such as count/max time latency/etc limits). This allows for the large variety of scheduling behaviors various applications may want to use, from zero-latency batch-only-within-this-invocation, to Nagle's-Algorithm-esque time- or limit-based behavior, to some learned model-specific windowing.
Design work is still required on how to represent this in IR, but the current thought is to model the regions in which deferred execution is possible and beneficial and allow additional transformations during lowering to the VM. This is similar to how the async-await behavior works in C# where the async keyword is just sugar that expands to additional generated helper utilities.
A simple strawman representation for sequential dispatch may look like:
hal.scheduling_policy @defer_policy {
  // max time, max count, max live memory, etc
}
...
hal.command_buffer.dispatch.deferred @defer_policy, @dispatch, ...
// vm.yield added here during lowering
There are many cases to explore and as cellular batching can have performance benefits of several orders of magnitude it'll be one of the primary areas of research in the long term.
For scenarios where dynamic module loading is not required and entire modules can be compiled into applications we can lower the VM IR to LLVM IR within MLIR's transformation pipeline. Instead of embedding vm.call ops that are dispatched at runtime to things like the HAL we can instead lower to llvm::CallInst to runtime-resolved function pointers. This still enables all of the flexibility of heterogeneous/runtime-determined devices, pluggable diagnostics, and backend composition without any need for FlatBuffers or the VM bytecode interpreter.
The VM was designed to make such a lowering easy and the C-style struct-based function pointer registration for runtime modules was designed to make emitting code that used it fairly robust even when linked in dynamically such as when embedded in shared objects.
An extension of this is what we've been calling 'runtimeless mode', where the IREE VM linkage code is statically linked into the binary alongside the generated module LLVM IR. If only a single HAL backend is linked in then (with some build-fu) we should be able to get call devirtualization to reduce code size to precisely the functionality used by the module.
Currently the VM only supports two types: i32 and vm.ref<T>. This is an intentional limitation such that we can determine what is really needed to express the scheduling we perform, with the idea being that such a limited model will make it easier to use techniques like indirect command buffers to compile the VM itself to an accelerator executable that dispatches work without host involvement.
As we port more models we may find a few primitives that are worth bringing into the VM design, even at the cost of potential complications to future porting. These include types like f32 (for simple float calculations/comparisons), list/dict (easier Python compatibility), and vector<4xf32> (for simple inline calculations that are not worth dispatch overhead/synchronization).
Though IREE will use many different tricks such as predication to build deep pipelines, there is still the requirement that command recording and submission happen on the host CPU. Though the cost of this in terms of latency and power use can be minimized by coalescing and timelines, there is still the possibility of non-trivial round-trips being introduced that limit performance. For particular applications like low-power always-on compute, or where there is significantly branchy behavior (such as search algorithms), it is important that the decision-making logic as to what is dispatched runs as close to real time as possible within the execution pipeline.
The IREE VM is designed to be runnable on-device in a secure and cooperative way (no pointers, indirect buffer handles to allow for memory space rearrangement op-to-op, deterministic execution and explicit yield points, etc).
The recent efforts to bring indirect command buffers to Vulkan and Metal's Indirect Command Buffers (that both derive inspiration from NV_command_list) are one such target for this. Either by lowering the VM IR to LLVM IR or SPIR-V, by a special conversion to target-specific forms, or by actually executing the VM bytecode directly on-device (it's ~1000 LoC) we should be able to prototype what full on-device usage is like. Even if only some VM functions the compiler deems useful to schedule on the device are used and the rest run on the host (particularly those functions calling imported functions) some of the most costly logic that creates tight coupling of the host and device scheduling can be limited.
Authored December, 2019
Updated August, 2021
A key job of the IREE compiler and runtime is capturing function call semantics from the originating system and providing mechanisms so that invocations can be performed in as similar way as possible in various target languages. In general, this requires additional metadata on top of the raw characteristics of a function. Where possible, this is done by attaching attributes to a function.
iree.abi
This is the default ABI supported by the IREE VM invocations. It attempts to provide a default calling convention that can be used without further reflection metadata but which may be enhanced with it.
It natively allows monomorphic functions to be exported where arguments and results are composed of the following types:
ND-Array buffers of Value Types
String (byte arrays)
Opaque reference object
The intent with these low level types is that calling conventions can be synthesized to bind arbitrary high level, domain/language specific signatures to these types, possibly by way of additional reflection metadata.
The above are all representable with native constructs in the VM:
ValueType: iree_vm_value
Simple ND-Array Buffer: iree_hal_buffer_view / tensor<>
String: iree_vm_list / !util.list<i8>
Tuple: !util.list<?>
TypedList (homogenous): !util.list<T> (element type T)
While the above features of the native ABI may be sufficient for direct use by various programs, many programs and callers will need to represent various higher level types, consistently mapping them to the above facilities. This section describes calling conventions for various higher level types which do not map 1:1 to the above. Not all source language types are representable, and extending these calling conventions (and the fundamental types above) is demand driven.
All of these calling conventions presume that the arity of the arguments/results of the raw function matches the user-level function, meaning that the calling convention is specified per argument/result. Higher-level whole function transformations may also exist for some domains but are outside of the scope of this specification.
A Structure is a common enough entity to have a dedicated calling convention. In C-like languages, this may just be a struct. In Python, it is typically a dict with an associated schema providing a name and type bound for each of its slots. In both, its slots are of fixed arity.
In this convention, such a structure is represented as a Tuple in the native calling convention (i.e. !util.list of variant type). The order of the elements of the tuple is the natural order of the structure.
Most languages interop between byte arrays (i.e. the native ABI String type) by way of applying an encoding. Such strings are just a sequence of bytes (i.e. !util.list<i8>).
High level lists which all share the same type bound are represented as a TypedList in the native ABI.
NDArrays of reference types are considered separately from those of value types. Internally, the code generated for them is completely different from what gets generated for numeric based arrays (i.e. has ref-counting, ownership semantics, non-POD, etc). These types are permitted for completeness, not necessarily performance: by nature they are already indirected and have overheads.
In the native ABI, these are represented as a composite tuple type (i.e. today a list since sugar for tuple is not yet defined): !iree.tuple<!util.list<T>, !util.list<index>>. The first element of the tuple is the list of values, packed with a C-Layout and the second element is the list of dimension sizes.
Additional reflection metadata may be encoded in a custom JSON form, providing additional typing hints for arguments and results. If present, this will be a reflection attribute with key d, containing a serialized JSON object.
The JSON object contains:
a: a list of type records for each argument
r: a list of type records for each result
Type records are one of:
A string naming a primitive type:
i[0-9]+ (e.g. i32): integer type
f[0-9]+ (e.g. f32): floating point type
bf16: brain floating point type
JSON null: A null reference value
"unknown": An unknown/unmapped type
An array, interpreted as a tuple describing a compound type.
A compound type tuple has a type identifier as its first element, followed by type-specific fields:
["named", "key", {slot_type}]
["ndarray", {element_type}, {rank}, {dim...}]
["slist", {slot_type...}]
["stuple", {slot_type...}]
["sdict", ["key", {slot_type}]...]
["py_homogeneous_list", {element_type}]
This document lists technical details regarding the HIP implementation of IREE's Hardware Abstraction Layer, called the HIP HAL driver.
IREE provides a Hardware Abstraction Layer (HAL) as a common interface to different compute accelerators. IREE HAL's design draws inspiration from modern GPU architectures and APIs, so implementing a HAL driver using HIP is mostly straightforward, though there are places where we need emulation given there are no directly mapping concepts or mechanisms. The HIP HAL driver draws inspiration from the CUDA HAL driver, and its code structure is based off of that implementation.
Unlike CUDA, HIP doesn't provide two separate APIs with the same functionality split between a driver API and a runtime API. Instead, it extends the HIP API with the module and context control APIs that the CUDA driver API exclusively offers. At runtime the HIP HAL driver dynamically loads the libamdhip64.so/amdhip64.dll library.
There is no direct HIP construct that maps to the IREE HAL iree_hal_driver_t abstraction. We use it to hold the dynamic symbols loaded for all devices and to handle device enumeration and creation.
iree_hal_hip_device_t implements iree_hal_device_t to provide the interface to a HIP GPU device by wrapping a hipDevice_t. For each device, right now we create two hipStream_ts--one for issuing commands for memory allocation and kernel launches as instructed by the program; the other for issuing host callback functions after dispatched command buffers complete. See the synchronization section for details.
The HIP HAL driver supports async allocation (iree_hal_device_queue_alloca() and iree_hal_device_queue_dealloca()) via HIP stream-ordered memory allocation.
The async_allocations field in the iree_hal_hip_device_params_t struct allows one to enable this feature.
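The underlying mechanism is HIP's stream-ordered allocation API; a minimal sketch of the pattern (error handling elided):

// Sketch of stream-ordered allocation: the allocation and free are ordered
// against other work on the stream rather than synchronizing the whole device.
#include <hip/hip_runtime.h>

void queue_alloca_dealloca_example(hipStream_t stream, size_t size) {
  void* buffer = NULL;
  hipMallocAsync(&buffer, size, stream);  // becomes usable in stream order
  // ... enqueue kernels/copies on |stream| that use |buffer| ...
  hipFreeAsync(buffer, stream);           // reclaimed once prior work is done
}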
Two implementations of iree_hal_command_buffer_t exist in the HIP HAL driver--one backed by hipGraph_t and the other backed by hipStream_t.
hipGraph_t
hipGraph_t conceptually matches iree_hal_command_buffer_t better given it's a recording of commands to issue to the GPU. Also, using the hipGraph_t API allows us to easily encode fine-grained dependencies between dispatches without having to create multiple streams. Therefore, the hipGraph_t-backed implementation is the more natural one. Note, though, that the hipGraph_t API is meant to be used for recording once and replaying multiple times, and there may be a performance penalty to using the hipGraph_t API for one-shot command buffers.
The hipStream_t-backed implementation just issues commands directly to a hipStream_t when recording. Commands issued to hipStream_t can be immediately sent to the GPU for execution; there is no recording and replaying separation. In order to match the recording semantics of iree_hal_command_buffer_t, to use the hipStream_t-backed command buffer, we need to first record the command buffer into an in-memory iree_hal_deferred_command_buffer_t, and then when applying the command buffer, we create a new hipStream_t-backed implementation.
The command_buffer_mode field in the iree_hal_hip_device_params_t struct allows one to select which implementation to use.
The allocator will forward allocation requests to hipHostMalloc() for host local memory, hipMalloc() for device local and host invisible memory, and hipMallocManaged() for device local and host visible memory.
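A simplified sketch of that mapping; the flag choices shown are illustrative, while the real allocator derives them from the requested memory type and usage bits:

// Sketch of how allocation requests map onto HIP allocation APIs depending on
// the requested memory type; error handling and flag selection are simplified.
#include <hip/hip_runtime.h>

hipError_t allocate_for_memory_type(int host_local, int host_visible,
                                    size_t size, void** out_ptr) {
  if (host_local) {
    // Host-local memory (e.g. staging/readback buffers).
    return hipHostMalloc(out_ptr, size, hipHostMallocDefault);
  } else if (host_visible) {
    // Device-local + host-visible: unified/managed memory.
    return hipMallocManaged(out_ptr, size, hipMemAttachGlobal);
  } else {
    // Device-local + host-invisible: plain device memory.
    return hipMalloc(out_ptr, size);
  }
}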
HIP buffers are represented either as a host pointer or a device pointer of type hipDeviceptr_t.
iree_hal_executable_t maps naturally to hipModule_t.
The compiler generates a FlatBuffer containing a HSACO image as well as a list of entry point functions and their associated metadata (names, workgroup size, dynamic shared memory size, etc.). At runtime, the HIP HAL driver loads the HSACO image and creates hipFunction_ts out of it for various entry points.
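A minimal sketch of that loading step; parsing the FlatBuffer that yields the HSACO image and entry point names is elided:

// Sketch: load the compiler-produced HSACO image as a hipModule_t and resolve
// one entry point as a hipFunction_t for later launching.
#include <hip/hip_runtime.h>

hipError_t load_entry_point(const void* hsaco_image, const char* entry_name,
                            hipModule_t* out_module,
                            hipFunction_t* out_function) {
  hipError_t err = hipModuleLoadData(out_module, hsaco_image);
  if (err != hipSuccess) return err;
  // One hipFunction_t per exported entry point; launched later using the
  // workgroup size and shared memory size from the associated metadata.
  return hipModuleGetFunction(out_function, *out_module, entry_name);
}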
iree_hal_event_t right now is not used in the compiler so it's not yet implemented in the HIP HAL driver.
In HIP, there are no direct equivalent primitives providing all the capabilities needed by the HAL semaphore abstraction. Therefore, to implement the support, we need to leverage multiple native CPU or HIP primitives under the hood.
hipEvent_t
The main synchronization mechanism is HIP event--hipEvent_t. As a functionality and integration baseline, we use hipEvent_t to implement the IREE HAL semaphore abstraction.
hipEvent_t natively supports the following capabilities:
Waiting: hipEventSynchronize() on the host or hipGraphAddEventWaitNode() in a HIP graph
Signaling: hipEventRecord() on a stream or hipGraphAddEventRecordNode() in a HIP graph
If there are GPU waits, given that there is no way to signal a hipEvent_t from the CPU, one way to handle this is to cache and defer the submission batches ourselves until the CPU signals past the desired value. To support this, we would need to implement a deferred/pending actions queue.
GPU signals can only happen through a hipEvent_t object, which has a binary state. We need to advance the timeline too. One way is to use hipLaunchHostFunc() to advance from the CPU side with iree_hal_semaphore_list_signal(). This additionally means we can reuse the logic from CPU signaling to unblock CPU waits.
After advancing the timeline from the CPU side with hipLaunchHostFunc(), we can release more workload from the deferred/pending actions queue to the GPU. Though, per the documentation of hipLaunchHostFunc(), "the host function must not make any HIP API calls." So we cannot do that directly inside hipLaunchHostFunc(); we need to notify another separate thread to call HIP APIs and push more work to the GPU. So the deferred/pending actions queue should have an associated thread.
For GPU waits, we can also leverage the same logic--using CPU signaling to unblock deferred GPU queue actions. Though this is not performant, given that the CPU is involved in GPU-internal synchronization; we want to use hipEvent_t instead where possible.
Another problem is that, per the hipLaunchHostFunc() doc, "the function will be called after currently enqueued work and will block work added after it." We don't want this blocking behavior on the stream carrying GPU work, so we can use a dedicated hipStream_t for launching the host function, having that dedicated stream also wait on the hipEvent_t from the original stream. We can also handle resource deallocation together there. To summarize, we need the following data structures to implement the HAL semaphore:
iree_hal_hip_event_t
iree_hal_hip_event_pool_t
iree_hal_hip_timepoint_t
iree_hal_hip_timepoint_pool_t
iree_hal_hip_timeline_semaphore_t
iree_hal_hip_queue_action_t
iree_hal_hip_pending_queue_actions_t
Authored June, 2022
This documents the behavior of the user-visible invocation mechanism IREE uses to schedule program execution. Internally IREE uses a very similar modeling for tracking its internal workloads and in kind carries that down to target APIs and devices that themselves use a very similar model. The intent is to expose the device model in an abstracted way that allows for the full capture and communication of the execution intent to be propagated to the hardware that executes it. Though here we focus on the user-visible portion of execution there is really only one \"IREE execution model\" and the entire stack follows the same design. At its core this design is just an instantiation of an out-of-order execution algorithm such as those originating from the 1960's.
stateDiagram
  state UserApplication {
    direction BT
    state Context0 {
      ModuleA-->ModuleAState0
      ModuleB-->ModuleBState0
    }
    state Context1 {
      ModuleA-->ModuleAState1
      ModuleB-->ModuleBState1
      ModuleC-->ModuleCState1
    }
    state ModuleA {
      @func1
      @func2
    }
    state ModuleB {
      @func3
      @func4
    }
    state ModuleC {
      @func5
    }
  }
An IREE program is a collection of modules instantiated in a context from which invocations can be made. Invocations are ordered on a user-controlled timeline that uses fences to define the execution order requirements to enable out-of-order execution. A hosting user application may have multiple programs or multiple instances of the same program available and running invocations at a time across multiple timelines.
Modules define executable code and data that can be loaded, linked, and run à la ELF shared libraries. Modules may be implemented as C/C++, generated bytecode or C sources from the IREE compiler, or any other mechanism that can run code and implement the iree_vm_module_t interface. Modules on their own are read-only and can be reused across many contexts.
Traditional ML runtimes would use a model (graph, etc) as their module representation. In IREE everything is a module including runtime subsystems like the HAL and user-provided custom code. This ensures that anything IREE can do can be externalized and replaced by users without needing to modify the core IREE code.
A collection of modules are linked and instantiated in a context. Each context operates independently and carries its own copies of mutable module state. Invocations execute within a context scope and hosting applications coordinate across contexts as required. Contexts are cheap to create (microseconds) and retain (~100B + program state) such that users can decide how to manage them based on their scenario.
Traditional ML runtimes would call these "sessions" but in IREE everything is a program. Whether the program is stateful or stateless and how the program is invoked is up to the program author.
An invocation represents a single call into a module exported function using the program state stored in a context. Users can decide whether to perform synchronous blocking invocations or asynchronous non-blocking invocations per-call; the behavior of the invocation is independent from the target function and a user program may contain a mix of both.
As an example a user program may synchronously invoke a @query_output_shapes function to preallocate storage for an asynchronous @execute_in_place function to write into.
A timeline represents the observable order of execution. Users define their own timelines and communicate them to IREE via fences. Timelines do not match up with the order of invocations unless the user dictates they must by way of fences. In the absence of fences all invocations execute in an arbitrary order and they may execute concurrently just as threads in C with no barriers.
Each timeline can be thought of as an independent clock domain that may operate asynchronously at its own frequency with only fences acting to tie separate timelines together. This directly mirrors real hardware constraints like clock domain crossing as each execution scope (thread on core, driver calls to queues, kernel queues to device queues, device queues to compute unit queues, etc) is naturally operating at different rates and well-designed systems must tolerate that variability.
A fence is a specific point of progress in one or more timelines acting as a barrier, fork, or join point. Fences only guard execution ordering and not any particular resources though users can use them to guard resources by defining when in time the resources are available for use.
Waits on fences are wait-until operations specifying that the timeline must reach at least a specific point. This allows for flexible reordering and deferral of execution as executors can pull forward scheduled work based on policy (run similar work together, etc).
The HAL is an optional feature of IREE that is used to provide a consistent interface across execution resources. It is used internally by IREE programs to define and submit work to devices and signal across them but may also be used by users to directly interface with hardware in a compatible way. Exposing the HAL API allows for users to efficiently manage their data and custom execution without expensive marshaling. Most users will only interact with HAL buffers as they work with their data but more advanced integrations can directly insert IREE into existing device contexts to transparently share scheduling and resources or insert their own code into IREE to pipeline custom execution.
NOTE: this defines an execution scheme that IREE supports but a user may use one or more such schemes in a single program - just as a C application may mix single- and multi-threaded code within itself for different components.
The combination of invocations, timelines, and fences allows users to provide future knowledge to lower layers of the system by declaring their availability requirements and the lower layers are then able to execute the work out-of-order so long as the specified requirements are met. The primary goal when designing for such a system is to specify as few requirements as possible in order to provide the maximum amount of scheduling freedom to the implementation.
This makes timelines one of the most critical components of the interface. The purpose of invocations is to schedule work against one or more timelines and what happens within the invocations is an implementation detail of the program.
Here we say "a user invokes a function to schedule execution on a timeline" vs. a more traditional "a user invokes a function to execute work", and this manifests in the IREE ABI as invocations taking fences defining specific points on timelines which the user may observe:
# Fences are effectively just timeline + integer tuples and are cheap to hold.
wait_fence = my_timeline.at(t)
signal_fence = my_timeline.at(t+1)
# Schedule work against the timeline.
# All work prior to t must complete before execution can occur and after
# execution the timeline will advance to t+1.
async_invoke(@some_fn, wait_fence, signal_fence)
# The invocation may have returned immediately after the work was scheduled;
# until the fence is reached no actual execution may have occurred. To
# synchronize the user code with the timeline the user can block until the fence
# is reached.
signal_fence.wait()
To the user this would appear as:
sequenceDiagram
    User->>@some_func: invoke
    activate @some_func
    @some_func->>User: ;
    @some_func-->>@some_func: wait t
    @some_func-->>User: signal t+1
    deactivate @some_func
This means from the user's perspective the actual operations performed by the invocation are not important: the only thing the user can observe in this situation is when the timeline reaches t+1 as they specified. Whether internally the invocation needs many steps to complete as there are timelines internal to the program is an implementation detail. Actual execution may look like this:
sequenceDiagram
    User->>@some_func: invoke
    activate @some_func
    @some_func->>User: ;
    @some_func->>@some_func: ;
    @some_func-->>Device A: ;
    Device A-->>Device A: wait t
    activate Device A
    @some_func->>@some_func: ;
    @some_func-->>Device B: ;
    activate Device B
    @some_func->>@some_func: ;
    Device A-->>@some_func: ;
    deactivate Device A
    @some_func->>@some_func: ;
    @some_func-->>Device B: ;
    activate Device B
    deactivate @some_func
    Device B-->>User: signal t+1
    deactivate Device B
    deactivate Device B
Even in this simple user-synchronous example the system is able to internally run several concurrent timelines with a minimal number of synchronization points and the lowest possible latency as the user is immediately notified without any intermediate layers needing to be woken, scheduled, executed, and passed on.
The true power of timelines comes from the ability to pipeline execution. Users define DAGs with fences and can construct arbitrarily complex execution topologies whether from the same program or across multiple programs:
stateDiagram
    direction LR
    state fence0 <<fork>>
    [*] --> fence0
    fence0 --> @fn0
    state fence1 <<fork>>
    @fn0 --> fence1
    fence1 --> @fn1
    fence1 --> @fn2
    state fence2 <<join>>
    @fn1 --> fence2
    @fn2 --> fence2
    @fn3 --> fence2
    fence0 --> @fn4
    @fn4 --> fence2
    fence2 --> [*]
This is a simple extension to the synchronous example using the same primitives:
# Timeline is defined by the user.\nfence_a = my_timeline.at(t)\nfence_b = my_timeline.at(t+1)\nfence_c = my_timeline.at(t+2)\n# Invocations are launched using the fences and may not complete immediately.\nasync_invoke(@fn0, fence_a, fence_b)\nasync_invoke(@fn1, fence_b, fence_c)\nasync_invoke(@fn2, fence_b, fence_c)\nasync_invoke(@fn3, None, fence_c)\nasync_invoke(@fn4, fence_a, fence_c)\n# Blocking here but no need to; could pass fence_c on to other invocations.\nfence_c.wait()\n
The critical point is that the user never had to wait for any particular invocation to complete before being able to schedule more work against the timeline, even if those invocations could not themselves complete synchronously. The lower layers of the system are able to fully model the execution as early as possible without needing to communicate with (and, importantly, synchronize with) the user.
Users define the semantics of their programs themselves. For example, if the user knows the precise shape of an output buffer, they can preallocate the buffer and pass it in. If they don't, they can factor out the shape calculation and invoke that synchronously in order to compute the shape, allocate the appropriately sized buffer, and pass that in. Or they could decide to only deal with synchronous invocations and return a program-allocated buffer view with the appropriate shape in their callback. IREE does not dictate the design of user programs and as such enables mixed stateful/stateless, asynchronous/synchronous, and arbitrary scheduling models (enqueue/drain, windowing, etc).
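For instance, the factored-out shape calculation pattern could look like the following in the pseudo-API used above (the @compute_output_shape, allocate_buffer, and @run_model names are hypothetical, for illustration only):

# Pseudocode in the same style as the examples above; names are illustrative.
# 1. Synchronously invoke the factored-out shape calculation.
output_shape = invoke(@compute_output_shape, input_shape)
# 2. Preallocate an appropriately sized output buffer under user control.
output_buffer = allocate_buffer(output_shape)
# 3. Schedule the main work asynchronously against the user's timeline.
wait_fence = my_timeline.at(t)
signal_fence = my_timeline.at(t + 1)
async_invoke(@run_model, wait_fence, signal_fence, input_buffer, output_buffer)
# The user decides when (or whether) to block.
signal_fence.wait()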
Inputs and outputs to invocations are provided by the user as primitive values (integers, floats, etc), supported builtin types (lists, byte buffers/strings), custom user types, and HAL types like buffers or buffer views (buffers + shape and type metadata). One or more wait fences can be used to order invocation access to one or more inputs by indicating that the resource is not available until a certain fence is reached. Similarly one or more signal fences can be used to order subsequent access to the resources by indicating the advancement of the timeline when they are available.
# wait_fence_a must be reached before buffer_a and buffer_b can be read.\n# wait_fence_b must be reached before buffer_c can be read.\n# buffer_a will be ready to read when signal_fence_a has been reached.\nasync_invoke(@fn,\n (wait_fence_a, buffer_a, buffer_b),\n 42, # no ordering required on value types\n (wait_fence_b, buffer_c),\n (signal_fence_a, buffer_a))\n
The above example demonstrates an in-place operation on buffer_a. It's also possible for invocations to return values:
result = invoke(@sum, 1, 2) # = 3\n
When executed asynchronously, a callback (or any construct that can be built upon callbacks, like promises/futures) can receive the results:
def my_callback(result):\n print(result) # 3\nasync_invoke(@sum, 1, 2, my_callback)\n
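A future-based interface is a thin layer over the callback form; here is a minimal sketch in the same pseudocode style, assuming only the async_invoke call shown above:

from concurrent.futures import Future

def invoke_as_future(fn, *args):
    # Bridge the callback-style async_invoke into a Future the caller can
    # compose with other futures, poll, or block on.
    future = Future()
    async_invoke(fn, *args, future.set_result)
    return future

result_future = invoke_as_future(@sum, 1, 2)
print(result_future.result())  # 3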
Invocations generally have only a few KB of overhead and pipelined command buffers take only a small amount more. Storage buffers, however, can easily take hundreds of MB per invocation for I/O and transient state. This compounds as program usage becomes more complex or multiple programs are involved. IREE supports traditional host-ordered allocations (à la malloc/free) for persistent buffers like large constants/read-only data or user-managed ringbuffers. Stream-ordered allocations are also supported to allow for pooled buffer reservations that can be allocated in a scheduled order alongside program execution.
For more detailed examples see the CUDA blog posts describing their implementation: part 1, part 2.
With stream-ordered allocations each allocation and deallocation operation is scheduled with wait and signal fences just as with invocations. This allows these allocation operations to execute remotely on device without host program involvement. For example, scheduling alloca0/dealloca0 and alloca1/dealloca1 interleaved with the function execution allows for the transient memory required for executing @fn0 to remain uncommitted until immediately before it is executed, committed during execution, and then decommitted immediately after execution. The memory required for passing data from @fn0 to the subsequent @fn1 and @fn2 survives until after they have completed executing before being decommitted. By using the same scheduling primitives as execution the allocation topology can be as arbitrarily complex as the invocation topology:
stateDiagram\n direction LR\n state fence0a <<fork>>\n [*] --> fence0a\n state fence0b <<fork>>\n fence0a --> alloca0\n fence0a --> alloca1\n alloca0 --> fence0b\n alloca1 --> fence0b\n fence0b --> @fn0\n state fence1a <<fork>>\n @fn0 --> fence1a\n state fence1b <<fork>>\n fence1a --> dealloc0\n dealloc0 --> fence1b\n fence1b --> @fn1\n fence1b --> @fn2\n state fence2a <<join>>\n @fn1 --> fence2a\n @fn2 --> fence2a\n state fence2b\n fence2a --> dealloc1\n state fence2b <<join>>\n dealloc1 --> fence2b\n fence2b --> [*]
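Expressed with the same pseudo-API as the earlier examples, the topology above could be scheduled as follows (async_alloca and async_dealloca are hypothetical stand-ins for the stream-ordered allocation operations):

fence_0a = my_timeline.at(t)
fence_0b = my_timeline.at(t + 1)
fence_1a = my_timeline.at(t + 2)
fence_1b = my_timeline.at(t + 3)
fence_2a = my_timeline.at(t + 4)
fence_2b = my_timeline.at(t + 5)

# Reserve transient memory in stream order; nothing is committed yet.
transient0 = async_alloca(size0, wait_fence=fence_0a, signal_fence=fence_0b)
transient1 = async_alloca(size1, wait_fence=fence_0a, signal_fence=fence_0b)

# Execution and deallocation are scheduled against the same timeline.
async_invoke(@fn0, fence_0b, fence_1a, transient0, transient1)
async_dealloca(transient0, wait_fence=fence_1a, signal_fence=fence_1b)
async_invoke(@fn1, fence_1b, fence_2a, transient1)
async_invoke(@fn2, fence_1b, fence_2a, transient1)
async_dealloca(transient1, wait_fence=fence_2a, signal_fence=fence_2b)

# Blocking here only for illustration; fence_2b could be passed onward instead.
fence_2b.wait()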
When operating in this way allocations from the host-perspective are just reservations for a slice of pooled storage that will be committed at some point in the future. Likewise deallocations from the host-perspective release the prior reservation and schedule the paired decommit at some point in the future. Scheduling N sequential invocations thus requires only enough committed storage for a single invocation in addition to the I/O (unless that too is stream-ordered).
This scheduling behavior allows for both minimal peak memory consumption regardless of the number of programs or invocation pipeline depth and sharing of committed storage across programs: the memory consumption of a program at rest is near zero when stateless and the sum of all state when stateful. Target devices that natively support stream-ordered allocations (like CUDA) can even share pools across processes.
Another feature, enabled by the forward-progress guarantee of fences, is that execution can continue even when memory is constrained, so long as the pool can service a single request at a time. A device can serialize two independent invocations that each require 400MB of transient memory on a system with only 512MB available, with no user-visible impact besides increased latency. This does require the user to ensure that the work they schedule is actually possible to run, or to rely on the target system supporting paging to soften strict pool quotas.
Stream-ordered allocations performed by the user for invocation inputs can be declared as transferred to the program. This allows the program to eagerly deallocate or reuse the input storage while still preserving the internal scheduling requirements of the program.
A stateful program may contain internal timelines that it uses to order its own execution. Take for example this simple stateful program:
class TrivialKernel(Program):\n _x0 = Program.export_global(x_type)\n def get(self):\n return self._x0\n def set(self, x=x_type):\n self._x0 = x\n def matmul(self, x=y_type):\n self._x0 = self._matmul(x, self._x0)\n @Program.kernel\n def _matmul(x, x0):\n return jnp.matmul(x, x0)\n
Each invocation of matmul needs to be executed in-order with prior invocations as there is a data dependency established on self._x0. Attempts to get or set must also be sequenced correctly with the matmul invocations. A basic usage like this:
m = TrivialKernel()\nm.set(input)\nm.matmul(a)\nm.matmul(b)\nm.matmul(c)\noutput = m.get()\nprint(output) # implicit wait\n
Would be executed as:
sequenceDiagram\n activate User\n User->>TrivialKernel: @set(input)\n activate TrivialKernel\n TrivialKernel-->>Device: ;\n deactivate TrivialKernel\n activate Device\n TrivialKernel->>User: ;\n User->>TrivialKernel: @matmul(a)\n activate TrivialKernel\n TrivialKernel-->>Device: ;\n deactivate TrivialKernel\n TrivialKernel->>User: ;\n User->>TrivialKernel: @matmul(b)\n activate TrivialKernel\n TrivialKernel-->>Device: ;\n deactivate TrivialKernel\n TrivialKernel->>User: ;\n User->>TrivialKernel: @matmul(c)\n activate TrivialKernel\n TrivialKernel-->>Device: ;\n deactivate TrivialKernel\n TrivialKernel->>User: ;\n User->>TrivialKernel: @get()\n activate TrivialKernel\n TrivialKernel-->>Device: ;\n deactivate TrivialKernel\n TrivialKernel->>User: ;\n Device-->>Device: ;\n deactivate User\n User->>User: (wait)\n Device-->>User: (signal)\n deactivate Device\n activate User\n User->>User: print(output)\n deactivate User
Note that although the user provided no timeline of their own, execution is still ordered correctly due to the internal timeline constructed by the program. If the user wanted to also pipeline execution with another program, they could do so by providing their own fences.
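For example, assuming hypothetical asynchronous variants of the methods above that accept explicit fences, the same program could be pipelined with other work like this:

fence_a = my_timeline.at(t)
fence_b = my_timeline.at(t + 1)
fence_c = my_timeline.at(t + 2)

m = TrivialKernel()
m.set_async(input, wait_fence=fence_a, signal_fence=fence_b)
m.matmul_async(a, wait_fence=fence_b, signal_fence=fence_c)

# Other work (from this program or another) can be scheduled against fence_c
# without the user ever blocking in between.
async_invoke(@consume_result, fence_c, my_timeline.at(t + 3))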
This document lists technical details regarding the Metal implementation of IREE's Hardware Abstraction Layer (HAL), called the Metal HAL driver.
IREE provides a Hardware Abstraction Layer (HAL) as a common interface to different compute accelerators. IREE HAL's design draws inspiration from modern GPU architecture and APIs; so implementing a HAL driver using modern GPU APIs is generally straightforward. This applies to the Metal HAL driver.
Currently the Metal HAL driver expects Metal 3 capabilities. Metal 3 was released in late 2022 and is supported on macOS Ventura, iOS 16, and later. It covers recent Apple silicon GPUs, including A13+ and M1+ chips.
In the future, we expect to expand support to cover Metal 2 capabilities. Metal 2 introduces useful features, like argument buffers, that are necessary for performance and make the IREE HAL implementation simpler. Metal 2 was released in late 2017 and is supported since macOS High Sierra and iOS 11, so it is already dominant on both macOS and iOS.
The Metal framework only exposes Objective-C and Swift APIs, while the Metal HAL driver needs to implement the common HAL abstraction definitions, which are in C. To minimize dependencies and binary size and to increase performance, we use Metal's Objective-C API for implementing the Metal HAL driver. Header (.h) and implementation (.m) files are put adjacent to each other.
Objective-C uses reference counting for tracking object lifetime and managing memory. This is traditionally done manually by sending retain and release messages to Objective-C objects. Modern Objective-C allows developers to opt in to Automatic Reference Counting (ARC), letting the compiler automatically deduce and insert retain/release where possible to ease the burden of manual management.
We don't use ARC in the Metal HAL driver given that IREE has its own object refcount and lifetime management mechanism. Metal HAL GPU objects are tracked with that to be consistent with others. Each Metal HAL GPU object retains the underlying Metal id<MTL*> object on construction and releases on destruction.
Metal is one of the main modern GPU APIs that provide explicit control over the hardware. The mapping between IREE HAL classes and Metal protocols is relatively straightforward:
- iree_hal_device_t: MTLDevice
- iree_hal_command_buffer_t: MTLCommandBuffer
- iree_hal_semaphore_t: MTLSharedEvent
- iree_hal_allocator_t: no direct Metal counterpart (wraps MTLDevice allocation APIs)
- iree_hal_buffer_t: MTLBuffer
- iree_hal_executable_t: MTLLibrary
- iree_hal_executable_cache_t: no direct Metal counterpart
- iree_hal_descriptor_set_layout_t and iree_hal_pipeline_layout_t: no direct Metal counterpart (see the section on resource descriptors)
In the following subsections, we go over each pair to provide more details.
There is no native driver abstraction in Metal. IREE's Metal HAL driver still provides an iree_hal_metal_driver_t struct to implement the common iree_hal_driver_t struct. iree_hal_metal_driver_t just retains all available Metal devices in the system during its lifetime, to guarantee that we have the same id<MTLDevice> for device querying and creation.
iree_hal_metal_device_t implements iree_hal_device_t to provide the interface to a Metal GPU device by wrapping an id<MTLDevice>. Upon construction, iree_hal_metal_device_t creates and retains one queue for both dispatch and transfer during its lifetime. In the future we expect to support multiple queues for better concurrency.
In IREE HAL, command buffers are created directly from the iree_hal_device_t and are also submitted there via iree_hal_device_queue_execute(). Each execution takes a batch of command buffers, together with a list of waiting iree_hal_semaphore_ts and a list of signaling iree_hal_semaphore_ts. There is no direct mapping of such a structure in Metal, so we perform the submission in three steps:
- Create a new MTLCommandBuffer to encodeWaitForEvent:value: for all waiting semaphores and commit it.
- Commit the command buffers in the batch.
- Create a new MTLCommandBuffer to encodeSignalEvent:value: for all signaling semaphores and commit it.
Such submission enables asynchronous execution of the workload on the GPU.
Queue-ordered asynchronous allocations via iree_hal_device_queue_alloc are not fully supported yet; they currently translate to a blocking wait followed by allocation.
Collectives support is not yet implemented.
The Metal HAL driver supports profiling via MTLCaptureManager. We can capture either to a trace file or to Xcode.
To profile from the command line, pass --device_profiling_mode=queue --device_profiling_file=/path/to/metal.gputrace to the IREE binaries.
Command buffers are where IREE HAL and Metal API have a major difference.
IREE HAL command buffers follow the flat Vulkan recording model, where all memory and dispatch commands are recorded directly into a command buffer. Metal instead adopts a multi-level command recording model: memory and dispatch commands are not recorded into a command buffer directly; they must go through an additional level of blit/compute encoders. Implementing IREE's HAL using Metal therefore requires switching encoders for interleaved memory and dispatch commands. Additionally, certain IREE HAL API features do not have a direct mapping in Metal APIs, e.g., the various forms of IREE HAL execution/memory barriers. Translating them requires looking at both previous and next commands to decide the proper mapping.
For these reasons, it's beneficial to have a complete view of the full command buffer and extra flexibility during recording, in order to fix up past commands or inspect future ones.
Therefore, to implement IREE HAL command buffers on Metal, we record in two steps using a linked list of command segments: first we create segments to keep track of all IREE HAL commands and their associated data; then, when finalizing the command buffer, we iterate through all the segments and record their contents into a proper MTLCommandBuffer. The linked list lets us organize the command sequence with low overhead, and deferred recording gives us a complete picture of the command buffer by the time recording actually starts.
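Conceptually, the two-step recording looks like the following Python-flavored sketch (not the driver's actual C implementation; encode_segment stands in for the per-command Metal encoding logic):

class SegmentedCommandBuffer:
    # Records IREE HAL commands as lightweight segments, then replays them
    # into a real MTLCommandBuffer once the full sequence is known.
    def __init__(self):
        self.segments = []  # a linked list in the actual implementation

    def fill_buffer(self, target, offset, length, pattern):
        self.segments.append(("fill", target, offset, length, pattern))

    def dispatch(self, executable, entry_point, workgroup_count):
        self.segments.append(("dispatch", executable, entry_point, workgroup_count))

    def finalize(self, mtl_command_buffer):
        # Only now, with the complete command stream visible, encode into the
        # MTLCommandBuffer, switching blit/compute encoders and fixing up
        # commands as needed.
        for segment in self.segments:
            encode_segment(mtl_command_buffer, segment)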
The Metal HAL driver currently only supports one-shot command buffers, which map to MTLCommandBuffers.
Metal APIs for filling and copying buffers have alignment restrictions on the offset and length, while iree_hal_command_buffer_{fill|copy|update}_buffer() is more flexible. For the cases that aren't directly supported by Metal APIs, we use polyfill compute kernels to perform the memory operation using GPU threads.
iree_hal_semaphore_t allows host->device, device->host, host->host, and device->device synchronization. It maps to Vulkan's timeline semaphore; in the Metal world, the counterpart is MTLSharedEvent. Most of the iree_hal_semaphore_t APIs are straightforward to implement on top of MTLSharedEvent, with iree_hal_semaphore_wait() as an exception: a listener is registered on the MTLSharedEvent with notifyListener:atValue:block: to signal the semaphore and wake the current thread, which has been put to sleep waiting on it.
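The wake-up logic is the familiar pattern of a notification listener setting a thread-blocking primitive; a rough Python-flavored sketch (not the actual Objective-C code; notify_listener_at_value stands in for notifyListener:atValue:block:):

import threading

def semaphore_wait(shared_event, value, timeout=None):
    # Block the calling thread until the shared event reaches `value`.
    wakeup = threading.Event()

    def on_reached():
        # Runs on the listener's dispatch queue once the value is reached.
        wakeup.set()

    shared_event.notify_listener_at_value(value, on_reached)
    return wakeup.wait(timeout)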
At the moment the Metal HAL driver has a very simple iree_hal_allocator_t implementation. It just wraps a MTLDevice and redirects all allocation requests to it, with no paging, pooling, or slab allocation. It is meant to be used together with common allocator layers like the caching allocator.
IREE's iree_hal_buffer_t maps to Metal's MTLBuffer. See Object Lifetime Management for more details.
IREE's iree_hal_executable_t represents a GPU program archive in a driver-defined format. It maps naturally to Metal's MTLLibrary. An entry point in a MTLLibrary is a MTLFunction. We define iree_hal_metal_kernel_params_t to wrap a MTLLibrary, its MTLFunctions, and the MTLComputePipelineState objects constructed from those MTLFunctions.
IREE's iree_hal_executable_cache_t models a cache of prepared GPU executables for a particular device. At the moment the Metal HAL driver does not perform any caching of GPU programs; it simply reads the program from the FlatBuffer and hands it over to the Metal driver.
See Resource descriptors for more details.
Metal has Metal Shading Language (MSL) for authoring graphics shaders and compute kernels. MSL source code can be directly consumed by the Metal framework at run-time; it can also be compiled first into an opaque library using command-line tools at build-time.
IREE uses compilers to lower ML models expressed with high-level op semantics down to a GPU-native source format. This is also the case for the Metal HAL driver. Metal does not provide an open intermediate language, so we reuse the SPIR-V code generation pipeline and then cross-compile the generated SPIR-V into MSL source with SPIRV-Cross. This is a fairly common practice for targeting multiple GPU APIs in the graphics programming world; for example, MoltenVK, the Vulkan implementation on macOS/iOS, does the same for shaders/kernels. The path is quite robust, as demonstrated by various games running on top of MoltenVK.
Therefore, in IREE we have a MetalSPIRVTargetBackend, which pulls in the common SPIR-V passes to form the compilation pipeline. The difference is that it provides a suitable SPIR-V target environment to drive the compilation, which one can derive from the Metal GPU families being targeted. The serialization step also differs from VulkanSPIRVTargetBackend: following the normal SPIR-V serialization step, we additionally invoke SPIRV-Cross to cross-compile the generated SPIR-V into MSL, and then compile and/or serialize the MSL source/library.
IREE uses FlatBuffer to encode the whole workload module, including both GPU shader/kernel (called executable in IREE terminology) and CPU scheduling logic. The GPU executables are embedded as part of the module's FlatBuffer, which are mmapped when IREE runs.
For the Metal HAL driver, this means we need to embed the MSL kernels inside the module FlatBuffer. Right now we can either encode the MSL source strings and compile them at Metal run-time, or directly encode the compiled library instead.
When dispatching a compute kernel in Metal, we need to specify the number of threadgroups in the grid and the number of threads in each threadgroup. Both are 3-D vectors. IREE HAL, which follows Vulkan, calls them the workgroup count and workgroup size, respectively.
In the Vulkan programming model, workgroup count and workgroup size are specified in different places: the former is given when invoking vkCmdDispatch(), while the latter is encoded in the dispatched SPIR-V code. This split does not match the Metal model, where we specify both in the API with dispatchThreads:threadsPerThreadgroup:.
As described under shader/kernel compilation, MSL kernels are cross-compiled from SPIR-V code and then embedded in the module FlatBuffer. The module FlatBuffer provides us a way to convey the threadgroup/workgroup size information extracted from the SPIR-V code: we encode an additional 3-D vector for each entry point and use it as the threadgroup size when later dispatching the MTLFunction corresponding to that entry point.
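In effect, dispatch recording combines the workgroup count carried by the HAL dispatch command with the per-entry-point workgroup size read from the module metadata; a rough sketch with hypothetical names:

def record_dispatch(compute_encoder, kernel_params, entry_point, workgroup_count):
    # workgroup_count is the 3-D vector from the recorded HAL dispatch command.
    # threadgroup_size is the extra 3-D vector stored per entry point in the
    # module FlatBuffer, extracted from the SPIR-V during compilation.
    threadgroup_size = kernel_params.threadgroup_sizes[entry_point]
    compute_encoder.set_compute_pipeline_state(kernel_params.pipelines[entry_point])
    compute_encoder.dispatch_threadgroups(workgroup_count, threadgroup_size)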
A descriptor is an opaque handle pointing to a resource that is accessed in the compute kernel. IREE's HAL models several concepts related to GPU resource management explicitly, including descriptor set layouts (iree_hal_descriptor_set_layout_t) and pipeline layouts (iree_hal_pipeline_layout_t).
However, this does not fully match Metal's paradigm. In the Metal framework, the closest concept to descriptor sets is the argument buffer. There is no direct correspondence to descriptor set layouts and pipeline layouts; rather, the layout is implicitly encoded in Metal shaders as MSL structs. The APIs for creating argument buffers do not encourage early creation without pipelines: one typically creates them for each MTLFunction.
All of this means it's better to defer the creation of the argument buffer until compute pipeline creation and dispatch. Therefore, the Metal HAL driver's iree_hal_metal_descriptor_set_layout_t and iree_hal_metal_pipeline_layout_t are just containers holding the layout information for later use when recording command buffer dispatches.
Metal HAL driver command buffer dispatch recording performs the following steps with the currently active MTLComputeCommandEncoder:
- Bind the MTLComputePipelineState for the dispatched entry point.
- Encode push constants with setBytes:length:atIndex:.
- For each descriptor set at set number S: create a MTLArgumentEncoder for the set, bind each buffer at binding number B with setBuffer:offset:atIndex:, call useResource:usage: on each bound buffer to make it resident, and bind the filled argument buffer to the compute encoder.
- Encode the dispatch with dispatchThreadgroups:threadsPerThreadgroup:.
We'd love to accept your patches and contributions to this project.
Note - coordinating efforts
Please file issues or reach out on any of our other communication channels before doing substantial work; this will ensure that others don't duplicate the work and that there's a chance to discuss any design issues.
This project follows the OpenXLA Code of Conduct.
Contributors must certify that they wrote or otherwise have the right to submit the code they are contributing to the project.
By making a contribution to this project, I certify that:
The contribution was created in whole or in part by me and I have the right to submit it under the open source license indicated in the file; or
The contribution is based upon previous work that, to the best of my knowledge, is covered under an appropriate open source license and I have the right under that license to submit that work with modifications, whether created in whole or in part by me, under the same open source license (unless I am permitted to submit under a different license), as indicated in the file; or
The contribution was provided directly to me by some other person who certified 1., 2. or 3. and I have not modified it.
I understand and agree that this project and the contribution are public and that a record of the contribution (including all personal information I submit with it, including my sign-off) is maintained indefinitely and may be redistributed consistent with this project or the open source license(s) involved.
Signing is enforced by the DCO GitHub App. This requires that all commits included in pull requests include a Signed-off-by line:
This is my commit message\n\nSigned-off-by: Random J Developer <random@developer.example.org>\n
Git will automatically append this message if you use the -s option:
git commit -s -m 'This is my commit message'\n
Users of Visual Studio Code can add "git.alwaysSignOff": true to their settings.
\"git.alwaysSignOff\": true,
For more information about DCO enforcement and git workflows, see the dcoapp/app repository.
CLA is being replaced with DCO. Both are enabled while we migrate.
Contributions to this project must be accompanied by a Contributor License Agreement (CLA). Head over to https://cla.developers.google.com/ to see your current agreements on file or to sign a new one.
The AUTHORS file keeps track of those who have made significant contributions to the project.
The .github/CODEOWNERS file lets maintainers opt in to PR reviews modifying certain paths.
The MAINTAINERS.md file documents official maintainers for project components.
Most of the code style is derived from the Google Style Guides for the appropriate language and is generally not something we accept changes on (as clang-format and other linters set that for us). The C++ compiler portion of the project follows the MLIR/LLVM style guide.
Improvements to code structure and clarity are welcome but please file issues to track such work first. Pure style changes are unlikely to be accepted unless they are applied consistently across the project.
Formatters like clang-format (C/C++) and Black (Python) can be set to run automatically in your editor of choice.
The script at build_tools/scripts/lint.sh can also be used to run the full suite of lint checks.
With few exceptions, features should be accompanied by automated tests.
We use a mix of in-tree and out-of-tree unit and integration tests. For more information about the types of tests used across the project, refer to the testing guide.
All submissions, including submissions by maintainers, require review. We use GitHub pull requests (PRs) for this purpose. Consult GitHub Help for more information on using pull requests.
We use GitHub Actions to automatically build and test various parts of the project.
Some workflows only run on commits after they are merged. See the CI behavior manipulation section below to learn how to customize this behavior.
After review and presubmit checks, PRs should typically be merged using "squash and merge".
It is assumed that the PR author will merge their change unless they ask someone else to merge it for them (e.g. because they don't have write access yet).
Access to affiliated repositories is divided into tiers:
All access tiers first require joining the iree-org GitHub organization.
Fill out this form to request access
Once you are a member of the iree-org GitHub organization, you can request to join any of the teams on https://github.com/orgs/iree-org/teams.
Note: other GitHub organizations
Work on IREE sometimes spans other GitHub organizations like shark-infra. Reach out to a project member if you would also like access to repositories in those organizations.
Most work should be done on repository forks. For developers with write access, when creating a branch in the common iree-org/iree repository, please follow these naming guidelines:
- users/[username]/* for individual work, e.g. users/cooldeveloper/my-awesome-feature
- shared/* for branches shared by multiple developers, e.g. shared/pytorch-performance-sprint
- integrates/* for dependency integration work, e.g. integrates/integrate-llvm-20240501
Branches that do not meet these guidelines may be deleted, especially if they appear to be stale.
IREE supports building from source with both Bazel and CMake.
IREE uses GitHub Actions for CI. The primary CI is configured in the ci.yml workflow file.
In addition to the default runners GitHub provides, IREE uses self-hosted runners to run many of its workflow jobs. These enable access to additional compute and custom configurations such as accelerators. Configuration scripting is checked in to this repository (see the README for that directory).
The setup step of the CI determines which CI jobs to run. This is controlled by the configure_ci.py script. It will generally run a pre-determined set of jobs on presubmit with some jobs kept as post-submit only. If changes are only to a certain set of excluded files that we know don't affect CI (e.g. Markdown files), then it will skip the jobs.
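The path-based filtering amounts to checking whether every changed file matches a known "doesn't affect CI" pattern. A simplified sketch of the idea (the patterns and helper below are illustrative, not the real configure_ci.py logic):

import fnmatch

# Illustrative patterns only; the real list lives in configure_ci.py.
SKIPPABLE_PATH_PATTERNS = ["*.md", "docs/*", "*.gitignore"]

def should_skip_ci(changed_files):
    # Skip the CI jobs only if *every* changed file matches a skippable pattern.
    return all(
        any(fnmatch.fnmatch(path, pattern) for pattern in SKIPPABLE_PATH_PATTERNS)
        for path in changed_files
    )

print(should_skip_ci(["README.md", "docs/website/index.md"]))  # True
print(should_skip_ci(["README.md", "compiler/src/lib.cpp"]))   # False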
You can customize which jobs run using git trailers in the PR description.
The available options are
ci-skip: jobs,to,skip\nci-extra: extra,jobs,to,run\nci-exactly: exact,set,of,jobs,to,run\nskip-ci: free form reason\nskip-llvm-integrate-benchmark: free form reason\nbenchmark-extra: extra,benchmarks,to,run\nrunner-env: [testing|prod]\n
skip-ci
skip-ci skips all jobs. It is mutually exclusive with the other ci-* options and is synonymous with ci-skip: all.
skip-ci: free form reason\n
ci-skip, ci-extra, ci-exactly
The ci-* options instruct the setup script on which jobs to include or exclude from its run. They take a comma-separated list of jobs which must be from the set of top-level job identifiers in the ci.yml file, or the special keyword "all" to indicate all jobs.
ci-skip: jobs,to,skip\nci-extra: extra,jobs,to,run\nci-exactly: exact,set,of,jobs,to,run\n
In all these cases, the setup does not make any effort to ensure that job dependencies are satisfied. Thus, if you request skipping the build_all job, all the jobs that depend on it will fail, not be skipped.
benchmark-extra, skip-llvm-integrate-benchmark
benchmark-extra: extra,benchmarks,to,run\nskip-llvm-integrate-benchmark: free form reason\n
Benchmarks don't run by default on PRs, and must be specifically requested.
The benchmark-extra option allows specifying additional benchmark presets to run as part of benchmarking. It accepts a comma-separated list of benchmark presets. This combines with labels added to the PR (which are a more limited set of options). See the benchmark suites documentation.
Benchmarks do run by default on PRs detected to be an integration of LLVM into IREE, but this behavior can be disabled with skip-llvm-integrate-benchmark.
runner-env
The runner-env option controls which runner environment to target for our self-hosted runners. We maintain a test environment to allow testing out new configurations prior to rolling them out. This trailer is for advanced users who are working on the CI infrastructure itself.
runner-env: [testing|prod]\n
Copy/paste any of these at the bottom of a PR description to change what the CI runs.
Also run Windows and macOS builds that are normally post-merge only:
ci-extra: build_test_all_windows,build_test_all_macos_arm64,build_test_all_macos_x86_64\n
Also run GPU tests on NVIDIA A100 runners (opt-in due to low availability):
ci-extra: test_nvidia_a100\n
Skip all CI builds and tests, e.g. for comment-only changes:
skip-ci: Comment-only change.\n
Only run Bazel builds, e.g. for changes only affecting Bazel rules:
ci-exactly: build_test_all_bazel\n
For example, this PR opted in to running the build_test_all_windows job:
The enabled jobs can be viewed from the Summary page of an action run:
This guide provides an overview of IREE's project structure and main tools for developers.
IREE's main intermediate dialects include Flow, HAL, Stream, and VM.
IREE's core compiler accepts programs in supported input MLIR dialects (e.g. stablehlo, tosa, linalg). Import tools and APIs may be used to convert from framework-specific formats like TensorFlow SavedModel to MLIR modules. While programs are ultimately compiled down to modules suitable for running on some combination of IREE's target deployment platforms, IREE's developer tools can run individual compiler passes, translations, and other transformations step by step.
iree-opt is a tool for testing IREE's compiler passes. It is similar to mlir-opt and runs sets of IREE's compiler passes on .mlir input files. See "conversion" in MLIR's Glossary for more information. Transformations performed by iree-opt can range from individual passes performing isolated manipulations to broad pipelines that encompass a sequence of steps.
Test .mlir files that are checked in typically include a RUN block at the top of the file that specifies which passes should be performed and if FileCheck should be used to test the generated output.
Here's an example of a small compiler pass running on a test file:
$ ../iree-build/tools/iree-opt \\\n --split-input-file \\\n --mlir-print-ir-before-all \\\n --iree-util-drop-compiler-hints \\\n $PWD/compiler/src/iree/compiler/Dialect/Util/Transforms/test/drop_compiler_hints.mlir\n
For a more complex example, here's how to run IREE's complete transformation pipeline targeting the VMVX backend on the fullyconnected.mlir model file:
$ ../iree-build/tools/iree-opt \\\n --iree-transformation-pipeline \\\n --iree-hal-target-backends=vmvx \\\n $PWD/tests/e2e/stablehlo_models/fullyconnected.mlir\n
iree-compile is IREE's main compiler driver for generating binaries from supported input MLIR assembly.
For example, to translate samples/models/simple_abs.mlir to an IREE module:
$ ../iree-build/tools/iree-compile \\\n --iree-hal-target-backends=vmvx \\\n $PWD/samples/models/simple_abs.mlir \\\n -o /tmp/simple_abs_vmvx.vmfb\n
The iree-run-module program takes an already translated IREE module as input and executes an exported function using the provided inputs.
This program can be used in sequence with iree-compile to translate a .mlir file to an IREE module and then execute it. Here is an example command that executes the simple simple_abs_vmvx.vmfb compiled from simple_abs.mlir above on IREE's local-task CPU device:
$ ../iree-build/tools/iree-run-module \\\n --module=/tmp/simple_abs_vmvx.vmfb \\\n --device=local-task \\\n --function=abs \\\n --input=f32=-2\n
Input scalars are passed as value and input buffers are passed as [shape]xtype=[value].
- Scalar: --input=1234
- tensor<i32>: --input=i32=1234
- tensor<1xi32>: --input=1xi32=1234
- tensor<2xi32>: --input="2xi32=12 34"
- tensor<2x3xi32>: --input="2x3xi32=[1 2 3][4 5 6]"
See these test files for advanced usage examples:
Source file: tools/test/iree-run-module.mlir
// RUN: (iree-compile --iree-hal-target-backends=vmvx %s | iree-run-module --device=local-task --module=- --function=abs --input=\"2xf32=-2 3\") | FileCheck %s\n// RUN: (iree-compile --iree-hal-target-backends=llvm-cpu %s | iree-run-module --device=local-task --module=- --function=abs --input=\"2xf32=-2 3\") | FileCheck %s\n\n// CHECK-LABEL: EXEC @abs\nfunc.func @abs(%input : tensor<2xf32>) -> (tensor<2xf32>) {\n %result = math.absf %input : tensor<2xf32>\n return %result : tensor<2xf32>\n}\n // INPUT-BUFFERS: result[1]: hal.buffer_view\n // INPUT-BUFFERS-NEXT: 2xf32=-2.0 3.0\n
Source file: tools/test/iree-run-module-inputs.mlir
// Passing no inputs is okay.\n\n// RUN: (iree-compile --iree-hal-target-backends=vmvx %s | \\\n// RUN: iree-run-module --device=local-sync --module=- --function=no_input) | \\\n// RUN: FileCheck --check-prefix=NO-INPUT %s\n// NO-INPUT-LABEL: EXEC @no_input\nfunc.func @no_input() {\n return\n}\n\n// -----\n\n// Scalars use the form `--input=value`. Type (float/int) should be omitted.\n// * The VM does not use i1/i8 types, so i32 VM types are returned instead.\n\n// RUN: (iree-compile --iree-hal-target-backends=vmvx %s | \\\n// RUN: iree-run-module --device=local-sync \\\n// RUN: --module=- \\\n// RUN: --function=scalars \\\n// RUN: --input=1 \\\n// RUN: --input=5 \\\n// RUN: --input=1234 \\\n// RUN: --input=-3.14) | \\\n// RUN: FileCheck --check-prefix=INPUT-SCALARS %s\n// INPUT-SCALARS-LABEL: EXEC @scalars\nfunc.func @scalars(%arg0: i1, %arg1: i8, %arg2 : i32, %arg3 : f32) -> (i1, i8, i32, f32) {\n // INPUT-SCALARS: result[0]: i32=1\n // INPUT-SCALARS: result[1]: i32=5\n // INPUT-SCALARS: result[2]: i32=1234\n // INPUT-SCALARS: result[3]: f32=-3.14\n return %arg0, %arg1, %arg2, %arg3 : i1, i8, i32, f32\n}\n\n// -----\n\n// Buffers (\"tensors\") use the form `--input=[shape]xtype=[value]`.\n// * If any values are omitted, zeroes will be used.\n// * Quotes should be used around values with spaces.\n// * Brackets may also be used to separate element values.\n\n// RUN: (iree-compile --iree-hal-target-backends=vmvx %s | \\\n// RUN: iree-run-module --device=local-sync \\\n// RUN: --module=- \\\n// RUN: --function=buffers \\\n// RUN: --input=i32=5 \\\n// RUN: --input=2xi32 \\\n// RUN: --input=\"2x3xi32=1 2 3 4 5 6\") | \\\n// RUN: FileCheck --check-prefix=INPUT-BUFFERS %s\n// INPUT-BUFFERS-LABEL: EXEC @buffers\nfunc.func @buffers(%arg0: tensor<i32>, %arg1: tensor<2xi32>, %arg2: tensor<2x3xi32>) -> (tensor<i32>, tensor<2xi32>, tensor<2x3xi32>) {\n // INPUT-BUFFERS: result[0]: hal.buffer_view\n // INPUT-BUFFERS-NEXT: i32=5\n // INPUT-BUFFERS: result[1]: hal.buffer_view\n // INPUT-BUFFERS-NEXT: 2xi32=0 0\n // INPUT-BUFFERS: result[2]: hal.buffer_view\n // INPUT-BUFFERS-NEXT: 2x3xi32=[1 2 3][4 5 6]\n return %arg0, %arg1, %arg2 : tensor<i32>, tensor<2xi32>, tensor<2x3xi32>\n}\n\n// -----\n\n// Buffer values can be read from binary files with `@some/file.bin`.\n// * numpy npy files from numpy.save or previous tooling output can be read to\n// provide 1+ values.\n// * Some data types may be converted (i32 -> si32 here) - bug?\n\n// RUN: (iree-compile --iree-hal-target-backends=vmvx %s -o=%t.vmfb && \\\n// RUN: iree-run-module --device=local-sync \\\n// RUN: --module=%t.vmfb \\\n// RUN: --function=npy_round_trip \\\n// RUN: --input=2xi32=11,12 \\\n// RUN: --input=3xi32=1,2,3 \\\n// RUN: --output=@%t.npy \\\n// RUN: --output=+%t.npy && \\\n// RUN: iree-run-module --device=local-sync \\\n// RUN: --module=%t.vmfb \\\n// RUN: --function=npy_round_trip \\\n// RUN: --input=*%t.npy) | \\\n// RUN: FileCheck --check-prefix=INPUT-NUMPY %s\n\n// INPUT-NUMPY-LABEL: EXEC @npy_round_trip\nfunc.func @npy_round_trip(%arg0: tensor<2xi32>, %arg1: tensor<3xi32>) -> (tensor<2xi32>, tensor<3xi32>) {\n // INPUT-NUMPY: result[0]: hal.buffer_view\n // INPUT-NUMPY-NEXT: 2xsi32=11 12\n // INPUT-NUMPY: result[1]: hal.buffer_view\n // INPUT-NUMPY-NEXT: 3xsi32=1 2 3\n return %arg0, %arg1 : tensor<2xi32>, tensor<3xi32>\n}\n
Source file: tools/test/iree-run-module-outputs.mlir
// Tests that execution providing no outputs is ok.\n\n// RUN: (iree-compile --iree-hal-target-backends=vmvx %s | \\\n// RUN: iree-run-module --device=local-sync --module=- --function=no_output) | \\\n// RUN: FileCheck --check-prefix=NO-OUTPUT %s\n// NO-OUTPUT-LABEL: EXEC @no_output\nfunc.func @no_output() {\n return\n}\n\n// -----\n\n// Tests the default output printing to stdout.\n\n// RUN: (iree-compile --iree-hal-target-backends=vmvx %s | \\\n// RUN: iree-run-module --device=local-sync --module=- --function=default) | \\\n// RUN: FileCheck --check-prefix=OUTPUT-DEFAULT %s\n// OUTPUT-DEFAULT-LABEL: EXEC @default\nfunc.func @default() -> (i32, tensor<f32>, tensor<?x4xi32>) {\n // OUTPUT-DEFAULT: result[0]: i32=123\n %0 = arith.constant 123 : i32\n // OUTPUT-DEFAULT: result[1]: hal.buffer_view\n // OUTPUT-DEFAULT-NEXT: f32=4\n %1 = arith.constant dense<4.0> : tensor<f32>\n // OUTPUT-DEFAULT: result[2]: hal.buffer_view\n // OUTPUT-DEFAULT-NEXT: 2x4xi32=[0 1 2 3][4 5 6 7]\n %2 = flow.tensor.dynamic_constant dense<[[0,1,2,3],[4,5,6,7]]> : tensor<2x4xi32> -> tensor<?x4xi32>\n return %0, %1, %2 : i32, tensor<f32>, tensor<?x4xi32>\n}\n\n// -----\n\n// Tests explicit output to npy files by producing a concatenated .npy and then\n// printing the results in python. This also verifies our npy files can be\n// parsed by numpy.\n\n// RUN: (iree-compile --iree-hal-target-backends=vmvx %s | \\\n// RUN: iree-run-module --device=local-sync --module=- --function=numpy \\\n// RUN: --output= \\\n// RUN: --output=@%t.npy \\\n// RUN: --output=+%t.npy) && \\\n// RUN: \"%PYTHON\" %S/echo_npy.py %t.npy | \\\n// RUN: FileCheck --check-prefix=OUTPUT-NUMPY %s\nfunc.func @numpy() -> (i32, tensor<f32>, tensor<?x4xi32>) {\n // Output skipped:\n %0 = arith.constant 123 : i32\n // OUTPUT-NUMPY{LITERAL}: 4.0\n %1 = arith.constant dense<4.0> : tensor<f32>\n // OUTPUT-NUMPY-NEXT{LITERAL}: [[0 1 2 3]\n // OUTPUT-NUMPY-NEXT{LITERAL}: [4 5 6 7]]\n %2 = flow.tensor.dynamic_constant dense<[[0,1,2,3],[4,5,6,7]]> : tensor<2x4xi32> -> tensor<?x4xi32>\n return %0, %1, %2 : i32, tensor<f32>, tensor<?x4xi32>\n}\n\n// -----\n\n// Tests output to binary files by round-tripping the output of a function into\n// another invocation reading from the binary files. Each output is written to\n// its own file (optimal for alignment/easier to inspect).\n\n// RUN: (iree-compile --iree-hal-target-backends=vmvx %s -o=%t.vmfb && \\\n// RUN: iree-run-module --device=local-sync \\\n// RUN: --module=%t.vmfb \\\n// RUN: --function=write_binary \\\n// RUN: --output=@%t.0.bin \\\n// RUN: --output=@%t.1.bin && \\\n// RUN: iree-run-module --device=local-sync \\\n// RUN: --module=%t.vmfb \\\n// RUN: --function=echo_binary \\\n// RUN: --input=f32=@%t.0.bin \\\n// RUN: --input=2x4xi32=@%t.1.bin) | \\\n// RUN: FileCheck --check-prefix=OUTPUT-BINARY %s\n\n// Tests output to binary files by round-tripping the output of a function into\n// another invocation reading from the binary files. 
The values are appended to\n// a single file and read from the single file.\n\n// RUN: (iree-compile --iree-hal-target-backends=vmvx %s -o=%t.vmfb && \\\n// RUN: iree-run-module --device=local-sync \\\n// RUN: --module=%t.vmfb \\\n// RUN: --function=write_binary \\\n// RUN: --output=@%t.bin \\\n// RUN: --output=+%t.bin && \\\n// RUN: iree-run-module --device=local-sync \\\n// RUN: --module=%t.vmfb \\\n// RUN: --function=echo_binary \\\n// RUN: --input=f32=@%t.bin \\\n// RUN: --input=2x4xi32=+%t.bin) | \\\n// RUN: FileCheck --check-prefix=OUTPUT-BINARY %s\n\nfunc.func @write_binary() -> (tensor<f32>, tensor<?x4xi32>) {\n %0 = arith.constant dense<4.0> : tensor<f32>\n %1 = flow.tensor.dynamic_constant dense<[[0,1,2,3],[4,5,6,7]]> : tensor<2x4xi32> -> tensor<?x4xi32>\n return %0, %1 : tensor<f32>, tensor<?x4xi32>\n}\nfunc.func @echo_binary(%arg0: tensor<f32>, %arg1: tensor<?x4xi32>) -> (tensor<f32>, tensor<?x4xi32>) {\n // OUTPUT-BINARY{LITERAL}: f32=4\n // OUTPUT-BINARY{LITERAL}: 2x4xi32=[0 1 2 3][4 5 6 7]\n return %arg0, %arg1 : tensor<f32>, tensor<?x4xi32>\n}\n
Source file: tools/test/iree-run-module-expected.mlir
// RUN: (iree-compile --iree-hal-target-backends=vmvx %s | iree-run-module --device=local-task --module=- --function=abs --input=f32=-2 --expected_output=f32=-2 --expected_output=f32=2.0) | FileCheck %s --check-prefix=SUCCESS-MATCHES\n// RUN: (iree-compile --iree-hal-target-backends=vmvx %s | iree-run-module --device=local-task --module=- --function=abs --input=f32=-2 --expected_output=f32=-2 --expected_output=\"(ignored)\") | FileCheck %s --check-prefix=SUCCESS-IGNORED\n// RUN: (iree-compile --iree-hal-target-backends=vmvx %s | iree-run-module --device=local-task --module=- --function=abs --input=f32=-2 --expected_output=f32=-2 --expected_output=f32=2.1 --expected_f32_threshold=0.1) | FileCheck %s --check-prefix=SUCCESS-THRESHOLD\n// RUN: (iree-compile --iree-hal-target-backends=vmvx %s | not iree-run-module --device=local-task --module=- --function=abs --input=f32=-2 --expected_output=f32=123 --expected_output=f32=2.0) | FileCheck %s --check-prefix=FAILED-FIRST\n// RUN: (iree-compile --iree-hal-target-backends=vmvx %s | not iree-run-module --device=local-task --module=- --function=abs --input=f32=-2 --expected_output=f32=-2 --expected_output=f32=4.5) | FileCheck %s --check-prefix=FAILED-SECOND\n// RUN: (iree-compile --iree-hal-target-backends=vmvx %s | not iree-run-module --device=local-task --module=- --function=abs --input=f32=-2 --expected_output=f32=-2 --expected_output=4xf32=2.0) | FileCheck %s --check-prefix=FAILED-SHAPE\n\n// SUCCESS-MATCHES: [SUCCESS]\n// SUCCESS-THRESHOLD: [SUCCESS]\n// SUCCESS-IGNORED: [SUCCESS]\n// FAILED-FIRST: [FAILED] result[0]: element at index 0 (-2) does not match the expected (123)\n// FAILED-SECOND: [FAILED] result[1]: element at index 0 (2) does not match the expected (4.5)\n// FAILED-SHAPE: [FAILED] result[1]: metadata is f32; expected that the view matches 4xf32\n\nfunc.func @abs(%input: tensor<f32>) -> (tensor<f32>, tensor<f32>) {\n %result = math.absf %input : tensor<f32>\n return %input, %result : tensor<f32>, tensor<f32>\n}\n
The iree-check-module program takes an already translated IREE module as input and executes it as a series of googletest tests. This is the test runner for the IREE check framework.
$ ../iree-build/tools/iree-compile \\\n --iree-input-type=stablehlo \\\n --iree-hal-target-backends=vmvx \\\n $PWD/tests/e2e/stablehlo_ops/abs.mlir \\\n -o /tmp/abs.vmfb\n
$ ../iree-build/tools/iree-check-module \\\n --device=local-task \\\n --module=/tmp/abs.vmfb\n
The iree-run-mlir program takes a .mlir file as input, translates it to an IREE bytecode module, and executes the module.
It is designed for testing and debugging, not production uses, and therefore does some additional work that usually must be explicit, like marking every function as exported by default and running all of them.
For example, to execute the contents of samples/models/simple_abs.mlir:
# iree-run-mlir <compiler flags> [input.mlir] <runtime flags>\n$ ../iree-build/tools/iree-run-mlir \\\n --iree-hal-target-backends=vmvx \\\n $PWD/samples/models/simple_abs.mlir \\\n --input=f32=-2\n
The iree-dump-module program prints the contents of an IREE module FlatBuffer file.
For example, to inspect the module translated above:
../iree-build/tools/iree-dump-module /tmp/simple_abs_vmvx.vmfb\n
All the IREE tools support reading input values from a file. This is quite useful for debugging. Pass --help to each tool to see which flag to set. The inputs are expected to be newline-separated. Each input should be either a scalar or a buffer. Scalars should be in the format type=value and buffers should be in the format [shape]xtype=[value]. For example:
1x5xf32=1,-2,-3,4,-5\n1x5x3x1xf32=15,14,13,12,11,10,9,8,7,6,5,4,3,2,1\n
IREE provides a flag to enable tracing the inputs and outputs of each dispatch function, which makes it easier to narrow down test cases, since IREE breaks an ML workload into multiple dispatch functions. When the flag is on, IREE inserts trace points before and after each dispatch function: the first trace op is for inputs and the second is for outputs, so there are two events per dispatch function.
The IREE compiler is built using MLIR, so it naturally supports the common MLIR debugging workflows. For areas where IREE differentiates itself, this page lists other helpful tips and tricks.
Tools such as iree-compile take options via command-line flags. Pass --help to see the full list:
$ iree-compile --help\n\nOVERVIEW: IREE compilation driver\n\nUSAGE: iree-compile [options] <input file or '-' for stdin>\n\nOPTIONS:\n ...\n
Tip - Options and the Python bindings
If you are using the Python bindings, options can be passed via the extra_args=[\"--flag\"] argument:
extra_args=[\"--flag\"]
import iree.compiler as ireec\n\ninput_mlir = \"\"\"\nfunc.func @abs(%input : tensor<f32>) -> (tensor<f32>) {\n %result = math.absf %input : tensor<f32>\n return %result : tensor<f32>\n}\"\"\"\n\ncompiled_module = ireec.tools.compile_str(\n input_mlir,\n target_backends=[\"llvm-cpu\"],\n extra_args=[\"--mlir-timing\"])\n
The IREE compiler generates FlatBuffer files using the .vmfb file extension, short for \"Virtual Machine FlatBuffer\", which can then be loaded and executed using IREE's runtime.
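For instance, with the Python bindings a compiled .vmfb can be loaded and invoked along these lines (assuming the iree.runtime convenience loader is available in your build; the module path and abs function match the simple_abs example used elsewhere on this page):

import numpy as np
import iree.runtime as ireert

# Load the compiled module and bind it to a CPU driver.
vm_module = ireert.load_vm_flatbuffer_file(
    "/tmp/simple_abs_vmvx.vmfb", driver="local-task")

# Call the exported function by name.
result = vm_module.abs(np.array(-2.0, dtype=np.float32))
print(result)  # ~2.0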
The IREE compiler can output different formats with the --output-format= flag:
- --output-format=vm-bytecode: VM bytecode modules
- --output-format=vm-c: C source modules
VM Bytecode files are usable across a range of deployment scenarios, while C source modules provide low level connection points for constrained environments like bare metal platforms.
By default, .vmfb files can be opened as zip files (this can be disabled with --iree-vm-emit-polyglot-zip=false):
$ unzip -d simple_abs_cpu ./simple_abs_cpu.vmfb\n\nArchive: ./simple_abs_cpu.vmfb\n extracting: simple_abs_cpu/module.fb\n extracting: simple_abs_cpu/abs_dispatch_0_system_elf_x86_64.so\n
The embedded binary (here an ELF shared object with CPU code) can be parsed by standard tools:
$ readelf -Ws ./simple_abs_cpu/abs_dispatch_0_system_elf_x86_64.so\n\nSymbol table '.dynsym' contains 2 entries:\n Num: Value Size Type Bind Vis Ndx Name\n 0: 0000000000000000 0 NOTYPE LOCAL DEFAULT UND\n 1: 0000000000001760 17 FUNC GLOBAL DEFAULT 7 iree_hal_executable_library_query\n\nSymbol table '.symtab' contains 42 entries:\n Num: Value Size Type Bind Vis Ndx Name\n 0: 0000000000000000 0 NOTYPE LOCAL DEFAULT UND\n 1: 0000000000000000 0 FILE LOCAL DEFAULT ABS abs_dispatch_0\n 2: 0000000000001730 34 FUNC LOCAL DEFAULT 7 abs_dispatch_0_generic\n 3: 00000000000034c0 80 OBJECT LOCAL DEFAULT 8 iree_hal_executable_library_query_v0\n 4: 0000000000001780 111 FUNC LOCAL DEFAULT 7 iree_h2f_ieee\n 5: 00000000000017f0 207 FUNC LOCAL DEFAULT 7 iree_f2h_ieee\n ...\n
The iree-dump-module tool can also be used to see information about a given .vmfb file:
$ iree-dump-module simple_abs.vmfb\n\n//===---------------------------------------------------------------------===//\n// @module : version 0\n//===---------------------------------------------------------------------===//\n\nRequired Types:\n [ 0] i32\n [ 1] i64\n [ 2] !hal.allocator\n [ 3] !hal.buffer\n ...\n\nModule Dependencies:\n hal, version >= 0, required\n\nImported Functions:\n [ 0] hal.allocator.allocate(!vm.ref<?>, i32, i32, i64) -> (!vm.ref<?>)\n [ 1] hal.devices.get(i32) -> (!vm.ref<?>)\n ...\n\nExported Functions:\n [ 0] abs(!vm.ref<?>) -> (!vm.ref<?>)\n [ 1] __init() -> ()\n\n...\n
The --iree-hal-dump-executable-* flags instruct the compiler to save files related to \"executable translation\" (code generation for a specific hardware target) into a directory of your choosing. If you are interested in seeing which operations in your input program were fused into a compute kernel or what device code was generated for a given program structure, these flags are a great starting point.
- --iree-hal-dump-executable-files-to: dump all of the executable-related files below
- --iree-hal-dump-executable-sources-to: dump the source .mlir for each executable
- --iree-hal-dump-executable-intermediates-to: dump intermediate files (e.g. .o object files)
- --iree-hal-dump-executable-binaries-to: dump the translated binaries (e.g. .so, .spv, .ptx)
- --iree-hal-dump-executable-benchmarks-to: dump standalone benchmark files that can be compiled and run with iree-benchmark-module
$ mkdir -p /tmp/iree/simple_abs/\n\n$ iree-compile simple_abs.mlir \\\n --iree-hal-target-backends=llvm-cpu \\\n --iree-llvmcpu-link-embedded=false \\\n --iree-hal-dump-executable-files-to=/tmp/iree/simple_abs \\\n -o /tmp/iree/simple_abs/simple_abs_cpu.vmfb\n\n$ ls /tmp/iree/simple_abs\n\nmodule_abs_dispatch_0.mlir\nmodule_abs_dispatch_0_system_elf_x86_64_benchmark.mlir\nmodule_abs_dispatch_0_system_elf_x86_64.codegen.bc\nmodule_abs_dispatch_0_system_elf_x86_64.linked.bc\nmodule_abs_dispatch_0_system_elf_x86_64.optimized.bc\nmodule_abs_dispatch_0_system_elf_x86_64.o\nmodule_abs_dispatch_0_system_elf_x86_64.s\nmodule_abs_dispatch_0_system_elf_x86_64.so\nsimple_abs_cpu.vmfb\n
Tip - Embedded and system linking
The default value of --iree-llvmcpu-link-embedded=true generates embedded ELF files. By disabling that flag, the compiler will produce platform-standard .so files for Linux, .dll files for Windows, etc. While embedded ELF files can be smaller and more portable, inspection of artifacts is easier with platform-standard shared object files.
The .bc intermediate files use the LLVM BitCode format, which can be disassembled using llvm-dis:
// Build `llvm-dis` from source as needed:\n$ cmake --build iree-build/ --target llvm-dis\n$ iree-build/llvm-project/bin/llvm-dis --help\n\n$ cd /tmp/iree/simple_abs/\n$ llvm-dis module_abs_dispatch_0_system_elf_x86_64.codegen.bc\n$ cat module_abs_dispatch_0_system_elf_x86_64.codegen.ll\n\n; ModuleID = 'module_abs_dispatch_0_system_elf_x86_64.codegen.bc'\nsource_filename = \"abs_dispatch_0\"\ntarget triple = \"x86_64-linux-gnu\"\n\n%iree_hal_executable_library_header_t = type { i32, ptr, i32, i32 }\n%iree_hal_executable_dispatch_attrs_v0_t = type { i16, i16 }\n\n...\n\ndefine internal i32 @abs_dispatch_0_generic(\n ptr noalias nonnull align 16 %0,\n ptr noalias nonnull align 16 %1,\n ptr noalias nonnull align 16 %2) #0 {\n %4 = load %iree_hal_executable_dispatch_state_v0_t, ptr %1, align 8,\n %5 = extractvalue %iree_hal_executable_dispatch_state_v0_t %4, 10,\n %6 = load ptr, ptr %5, align 8,\n %7 = ptrtoint ptr %6 to i64,\n %8 = and i64 %7, 63,\n %9 = icmp eq i64 %8, 0,\n call void @llvm.assume(i1 %9),\n %10 = load %iree_hal_executable_dispatch_state_v0_t, ptr %1, align 8,\n %11 = extractvalue %iree_hal_executable_dispatch_state_v0_t %10, 10,\n %12 = getelementptr ptr, ptr %11, i32 1,\n %13 = load ptr, ptr %12, align 8,\n %14 = ptrtoint ptr %13 to i64,\n %15 = and i64 %14, 63,\n %16 = icmp eq i64 %15, 0,\n call void @llvm.assume(i1 %16),\n %17 = load float, ptr %6, align 4,\n %18 = call float @llvm.fabs.f32(float %17),\n store float %18, ptr %13, align 4,\n ret i32 0,\n}\n\n...\n
$ mkdir -p /tmp/iree/simple_abs/\n\n$ iree-compile simple_abs.mlir \\\n --iree-hal-target-backends=vulkan-spirv \\\n --iree-hal-dump-executable-files-to=/tmp/iree/simple_abs \\\n -o /tmp/iree/simple_abs/simple_abs_vulkan.vmfb\n\n$ ls /tmp/iree/simple_abs\n\nmodule_abs_dispatch_0.mlir\nmodule_abs_dispatch_0_vulkan_spirv_fb_benchmark.mlir\nmodule_abs_dispatch_0_vulkan_spirv_fb.mlir\nmodule_abs_dispatch_0_vulkan_spirv_fb.spv\nsimple_abs_vulkan.vmfb\n
The .spv files use the SPIR-V binary format, which can be disassembled using spirv-dis from SPIR-V Tools:
$ cd /tmp/iree/simple_abs/\n$ spirv-dis module_abs_dispatch_0_vulkan_spirv_fb.spv\n\n; SPIR-V\n; Version: 1.0\n; Generator: Khronos; 22\n; Bound: 20\n; Schema: 0\n OpCapability Shader\n OpExtension \"SPV_KHR_storage_buffer_storage_class\"\n %18 = OpExtInstImport \"GLSL.std.450\"\n OpMemoryModel Logical GLSL450\n OpEntryPoint GLCompute %abs_dispatch_0_generic \"abs_dispatch_0_generic\"\n OpExecutionMode %abs_dispatch_0_generic LocalSize 1 1 1\n OpName %__resource_var_0_0_ \"__resource_var_0_0_\"\n OpName %__resource_var_0_1_ \"__resource_var_0_1_\"\n OpName %abs_dispatch_0_generic \"abs_dispatch_0_generic\"\n OpDecorate %_arr_float_uint_1 ArrayStride 4\n OpMemberDecorate %_struct_2 0 Offset 0\n OpDecorate %_struct_2 Block\n OpDecorate %__resource_var_0_0_ Binding 0\n OpDecorate %__resource_var_0_0_ DescriptorSet 0\n OpDecorate %__resource_var_0_1_ Binding 1\n OpDecorate %__resource_var_0_1_ DescriptorSet 0\n %float = OpTypeFloat 32\n %uint = OpTypeInt 32 0\n %uint_1 = OpConstant %uint 1\n%_arr_float_uint_1 = OpTypeArray %float %uint_1\n %_struct_2 = OpTypeStruct %_arr_float_uint_1\n%_ptr_StorageBuffer__struct_2 = OpTypePointer StorageBuffer %_struct_2\n%__resource_var_0_0_ = OpVariable %_ptr_StorageBuffer__struct_2 StorageBuffer\n%__resource_var_0_1_ = OpVariable %_ptr_StorageBuffer__struct_2 StorageBuffer\n %void = OpTypeVoid\n %9 = OpTypeFunction %void\n %uint_0 = OpConstant %uint 0\n%_ptr_StorageBuffer_float = OpTypePointer StorageBuffer %float\n%abs_dispatch_0_generic = OpFunction %void None %9\n %12 = OpLabel\n %15 = OpAccessChain %_ptr_StorageBuffer_float %__resource_var_0_0_ %uint_0 %uint_0\n %16 = OpLoad %float %15\n %17 = OpExtInst %float %18 FAbs %16\n %19 = OpAccessChain %_ptr_StorageBuffer_float %__resource_var_0_1_ %uint_0 %uint_0\n OpStore %19 %17\n OpReturn\n OpFunctionEnd\n
$ mkdir -p /tmp/iree/simple_abs/\n\n$ iree-compile simple_abs.mlir \\\n --iree-hal-target-backends=cuda \\\n --iree-hal-dump-executable-files-to=/tmp/iree/simple_abs \\\n -o /tmp/iree/simple_abs/simple_abs_cuda.vmfb\n\n$ ls /tmp/iree/simple_abs\n\nmodule_abs_dispatch_0_cuda_nvptx_fb_benchmark.mlir\nmodule_abs_dispatch_0_cuda_nvptx_fb.codegen.bc\nmodule_abs_dispatch_0_cuda_nvptx_fb.linked.bc\nmodule_abs_dispatch_0_cuda_nvptx_fb.optimized.bc\nmodule_abs_dispatch_0_cuda_nvptx_fb.ptx\nmodule_abs_dispatch_0.mlir\nsimple_abs_cuda.vmfb\n
// Build `llvm-dis` from source as needed:\n$ cmake --build iree-build/ --target llvm-dis\n$ iree-build/llvm-project/bin/llvm-dis --help\n\n$ cd /tmp/iree/simple_abs/\n$ llvm-dis module_abs_dispatch_0_cuda_nvptx_fb.codegen.bc\n$ cat module_abs_dispatch_0_cuda_nvptx_fb.codegen.ll\n\n; ModuleID = 'module_abs_dispatch_0_cuda_nvptx_fb.codegen.bc'\nsource_filename = \"abs_dispatch_0\"\n\ndeclare ptr @malloc(i64)\n\ndeclare void @free(ptr)\n\ndeclare float @__nv_fabsf(float)\n\ndefine void @abs_dispatch_0_generic(ptr noalias readonly align 16 %0, ptr noalias align 16 %1) {\n %3 = ptrtoint ptr %0 to i64\n %4 = and i64 %3, 63\n %5 = icmp eq i64 %4, 0\n call void @llvm.assume(i1 %5)\n %6 = ptrtoint ptr %1 to i64\n %7 = and i64 %6, 63\n %8 = icmp eq i64 %7, 0\n call void @llvm.assume(i1 %8)\n %9 = load float, ptr %0, align 4\n %10 = call float @__nv_fabsf(float %9)\n store float %10, ptr %1, align 4\n ret void\n}\n\n!nvvm.annotations = !{!0, !1, !2, !3}\n\n!0 = !{ptr @abs_dispatch_0_generic, !\"kernel\", i32 1}\n!1 = !{ptr @abs_dispatch_0_generic, !\"maxntidx\", i32 1}\n!2 = !{ptr @abs_dispatch_0_generic, !\"maxntidy\", i32 1}\n!3 = !{ptr @abs_dispatch_0_generic, !\"maxntidz\", i32 1}\n
The benchmark files produced by --iree-hal-dump-executable-benchmarks-to can be compiled in isolation and passed to iree-benchmark-module, where they exercise the full IREE runtime for a single executable:
$ iree-compile simple_abs.mlir \\\n --iree-hal-target-backends=llvm-cpu \\\n --iree-hal-dump-executable-benchmarks-to=/tmp/iree/simple_abs/ \\\n -o /dev/null\n\n$ iree-compile \\\n /tmp/iree/simple_abs/module_abs_dispatch_0_embedded_elf_x86_64_benchmark.mlir \\\n -o /tmp/iree/simple_abs/module_abs_dispatch_0_benchmark.vmfb\n\n$ iree-benchmark-module \\\n /tmp/iree/simple_abs/module_abs_dispatch_0_benchmark.vmfb\n
The binary files produced by --iree-hal-dump-executable-binaries-to can be passed to iree-benchmark-executable where they are benchmarked directly, without using the IREE VM, HAL APIs, task system, etc. Note that this interface is much lower level and you must specify all push constants / binding parameters manually:
$ iree-compile simple_abs.mlir \\\n  --iree-hal-target-backends=llvm-cpu \\\n  --iree-hal-dump-executable-binaries-to=/tmp/iree/simple_abs/ \\\n  -o /dev/null\n\n$ iree-benchmark-executable \\\n  --device=local-sync \\\n  --executable_format=embedded-elf-x86_64 \\\n  --executable_file=/tmp/iree/simple_abs/module_abs_dispatch_0_embedded_elf_x86_64.so \\\n  --entry_point=0 \\\n  --binding=f32=-2.5 \\\n  --binding=f32=0 \\\n  --workgroup_count=1,1,1\n
See the comments in tools/iree-benchmark-executable-main.c and the test file at tools/test/iree-benchmark-executable.mlir for more information and examples.
IREE compiles programs through a series of broad phases:
graph LR\n accTitle: Compilation phases overview\n accDescr: Input to ABI to Flow to Stream to HAL to VM\n\n A([Input])\n A --> B([ABI])\n B --> C([Flow])\n C --> D([Stream])\n D --> E([HAL])\n E --> F([VM])
These are the phase names available for use with the --compile-to and --compile-from flags described below:
input
abi
preprocessing
global-optimization
flow
stream
executable-sources
executable-configurations
executable-targets
hal
vm
end
For an accurate list of phases, see the source code or check the help output with a command such as:
iree-compile --help | sed -n '/--compile-to/,/--/p' | head -n -1\n
You can output a program snapshot at intermediate phases with the --compile-to=<phase name> flag:
$ cat simple_abs.mlir\n\nfunc.func @abs(%input : tensor<f32>) -> (tensor<f32>) {\n %result = math.absf %input : tensor<f32>\n return %result : tensor<f32>\n}\n\n$ iree-compile simple_abs.mlir --compile-to=abi\n\nmodule {\n func.func @abs(%arg0: !hal.buffer_view) -> !hal.buffer_view attributes {iree.abi.stub} {\n %0 = hal.tensor.import %arg0 \"input 0\" : !hal.buffer_view -> tensor<f32>\n %1 = math.absf %0 : tensor<f32>\n %2 = hal.tensor.export %1 \"output 0\" : tensor<f32> -> !hal.buffer_view\n return %2 : !hal.buffer_view\n }\n}\n
This is similar to the --mlir-print-ir-after= flag, but at clearly defined pipeline phases.
Compilation can be continued from any intermediate phase. This allows for interactive workflows - compile to a phase, make edits to the .mlir file, then resume compilation and continue through the pipeline:
$ iree-compile simple_abs.mlir --compile-to=abi -o simple_abs_abi.mlir\n\n$ sed \\\n -e 's/math.absf/math.exp/' \\\n -e 's/@abs/@exp/' \\\n simple_abs_abi.mlir > simple_exp_abi.mlir\n\n$ iree-compile simple_exp_abi.mlir \\\n --iree-hal-target-backends=llvm-cpu \\\n -o simple_exp_cpu.vmfb\n
or explicitly resume from an intermediate phase with --compile-from=<phase name>:
$ iree-compile simple_exp_abi.mlir \\\n --iree-hal-target-backends=llvm-cpu \\\n --compile-from=abi \\\n -o simple_exp_cpu.vmfb\n
The --dump-compilation-phases-to flag can be used to dump program IR after each phase, much like --compile-to but without exiting early:
$ iree-compile simple_abs.mlir \\\n --iree-hal-target-backends=llvm-cpu \\\n --dump-compilation-phases-to=/tmp/iree/simple_abs \\\n -o /tmp/iree/simple_abs/simple_abs_cpu.vmfb\n\n$ ls /tmp/iree/simple_abs -1v\n\nsimple_abs.1.input.mlir\nsimple_abs.2.abi.mlir\nsimple_abs.3.preprocessing.mlir\nsimple_abs.4.global-optimization.mlir\nsimple_abs.5.flow.mlir\nsimple_abs.6.stream.mlir\nsimple_abs.7.executable-sources.mlir\nsimple_abs.8.executable-configurations.mlir\nsimple_abs.9.executable-targets.mlir\nsimple_abs.10.hal.mlir\nsimple_abs.11.vm.mlir\n
As with --compile-to, these files can be used together with --compile-from:
$ iree-compile simple_abs.2.abi.mlir \\\n --iree-hal-target-backends=llvm-cpu \\\n --compile-from=abi \\\n -o simple_exp_cpu.vmfb\n
Altogether, these phases and flags can be used to, for example, inspect and hand-edit the IR at well-defined points in the pipeline and resume compilation from any phase.
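For instance, assuming iree-compile follows the usual MLIR tool convention of accepting - as a stdin/stdout placeholder (an assumption to verify on your build), two phases can even be chained in a single shell pipeline:
# Stop after the 'flow' phase, then resume from it in a second invocation.\n# The '-' stdin/stdout placeholders are an assumption borrowed from standard\n# MLIR tooling conventions.\niree-compile simple_abs.mlir --compile-to=flow -o - | \\\n  iree-compile - --compile-from=flow \\\n    --iree-hal-target-backends=llvm-cpu \\\n    -o simple_abs_cpu.vmfb\n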
IREE cuts automated releases via a workflow that is triggered daily. The only constraint placed on the commit that is released is that it has passed all CI checks. These are published on GitHub with the \"pre-release\" status. For debugging this process, see the Release debugging playbook.
We periodically promote one of these candidates to a \"stable\" release by removing the \"pre-release\" status. This makes it show up as a \"latest\" release on GitHub. We also push the Python packages for this release to PyPI.
When selecting a candidate we use the following criteria:
When you've identified a potential candidate, email the iree-discuss list with the proposal and solicit feedback. People may point out known regressions or request that some feature make the cut.
(Authorized users only) Push to PyPI using pypi_deploy.sh
Open the release on GitHub. Rename the release from \"candidate\" to \"stable\", uncheck the option for \"pre-release\", and check the option for \"latest\".
Like the IREE project in general, IREE tests are divided into a few different components and use different tooling depending on the needs of that component.
There are also more *_test_suite targets that group test targets with the same configuration together.
Tests for the IREE compilation pipeline are written as lit tests in the same style as MLIR.
By convention, IREE includes tests for:
.../IR/test/{OP_CATEGORY}_ops.mlir
.../IR/test/{OP_CATEGORY}_folding.mlir
.../test/*.mlir
For the test iree/compiler/Dialect/VM/Conversion/MathToVM/test/arithmetic_ops.mlir:
With CMake, run this from the build directory:
ctest -R iree/compiler/Dialect/VM/Conversion/MathToVM/test/arithmetic_ops.mlir.test\n
With Bazel, run this from the repo root:
bazel test //compiler/src/iree/compiler/Dialect/VM/Conversion/MathToVM/test:arithmetic_ops.mlir.test\n
For advice on writing MLIR compiler tests, see the MLIR testing guide. Tests should be .mlir files in a test directory adjacent to the functionality they are testing. Instead of mlir-opt, use iree-opt, which registers IREE dialects and passes and doesn't register some unnecessary core ones.
As with most parts of the IREE compiler, these should not have a dependency on the runtime.
In the Bazel BUILD file, create an iree_lit_test_suite rule. We usually create a single suite that globs all .mlir files in the directory and is called \"lit\".
load(\"//iree/build_tools/bazel:iree_lit_test.bzl\", \"iree_lit_test_suite\")\n\niree_lit_test_suite(\n name = \"lit\",\n srcs = glob([\"*.mlir\"]),\n tools = [\n \"@llvm-project//llvm:FileCheck\",\n \"//tools:iree-opt\",\n ],\n)\n
There is a corresponding CMake function, calls to which will be generated by our Bazel to CMake converter.
iree_lit_test_suite(\n NAME\n lit\n SRCS\n \"arithmetic_ops.mlir\"\n DATA\n FileCheck\n iree-opt\n)\n
You can also create a test for a single file with iree_lit_test.
Tests for the runtime C++ code use the GoogleTest testing framework. They should generally follow the style and best practices of that framework.
For the test /runtime/src/iree/base/bitfield_test.cc:
ctest -R iree/base/bitfield_test\n
bazel test //runtime/src/iree/base:bitfield_test\n
Parallel testing for ctest can be enabled via the CTEST_PARALLEL_LEVEL environment variable. For example:
export CTEST_PARALLEL_LEVEL=$(nproc)\n
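For example, to parallelize a single invocation without exporting the variable (the test filter here is illustrative):
# Use all available cores for just this ctest run.\nCTEST_PARALLEL_LEVEL=$(nproc) ctest -R iree/base\n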
To use the Vulkan backend as test driver, you may need to select between a Vulkan implementation from SwiftShader and multiple Vulkan-capable hardware devices. This can be done via environment variables. See the generic Vulkan setup page for details regarding these variables.
For Bazel, you can persist the configuration in user.bazelrc to save typing. For example:
test:vkswiftshader --test_env=\"LD_LIBRARY_PATH=...\"\ntest:vkswiftshader --test_env=\"VK_LAYER_PATH=...\"\ntest:vknative --test_env=\"LD_LIBRARY_PATH=...\"\ntest:vknative --test_env=\"VK_LAYER_PATH=...\"\n
Then you can use bazel test --config=vkswiftshader to select SwiftShader as the Vulkan implementation. Similarly for other implementations.
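For example, with such a config defined, the Vulkan runtime tests could be run against SwiftShader like this (the wildcard target pattern below is illustrative):
# Run the Vulkan HAL driver tests using the SwiftShader configuration.\nbazel test --config=vkswiftshader //runtime/src/iree/hal/drivers/vulkan/...\n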
For advice on writing tests in the GoogleTest framework, see the GoogleTest primer. Test files for source file foo.cc with build target foo should live in the same directory with source file foo_test.cc and build target foo_test. You should #include iree/testing/gtest.h instead of any of the gtest or gmock headers.
As with all parts of the IREE runtime, these should not have a dependency on the compiler.
In the Bazel BUILD file, create a cc_test target with your test file as the source and any necessary dependencies. Usually, you can link in a standard gtest main function. Use iree/testing:gtest_main instead of the gtest_main that comes with gtest.
cc_test(\n name = \"arena_test\",\n srcs = [\"arena_test.cc\"],\n deps = [\n \":arena\",\n \"//iree/testing:gtest_main\",\n ],\n)\n
We have created a corresponding CMake function iree_cc_test that mirrors the Bazel rule's behavior. Our Bazel to CMake converter should generally derive the CMakeLists.txt file from the BUILD file:
iree_cc_test(\n NAME\n arena_test\n SRCS\n \"arena_test.cc\"\n DEPS\n ::arena\n iree::testing::gtest_main\n)\n
There are other, more specific test targets, such as iree_hal_cts_test_suite, which are designed to test specific runtime support with template configurations and are not supported by Bazel rules.
Here \"end-to-end\" means from the input accepted by the IREE core compiler (dialects like TOSA, StableHLO, Linalg) to execution using the IREE runtime components. It does not include tests of the integrations with ML frameworks (e.g. TensorFlow, PyTorch) or bindings to other languages (e.g. Python).
We avoid using the more traditional lit tests used elsewhere in the compiler for runtime execution tests. Lit tests require running the compiler tools on the test platform through shell or python scripts that act on files from a local file system. On platforms like Android, the web, and embedded systems, each of these features is either not available or is severely limited.
Instead, to test these flows we use a custom framework called check. The check framework compiles test programs on the host machine into standalone test binary files that can be pushed to test devices (such as Android phones) where they run with gtest style assertions (e.g. check.expect_almost_eq(lhs, rhs)).
The files needed by these tests are not built by default with CMake. You'll need to build the special iree-test-deps target to generate test files prior to running CTest (from the build directory):
cmake --build . --target iree-test-deps\n
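After the test dependencies are built, the check tests can then be run with CTest as usual (the filter below is illustrative):
# From the build directory, run the e2e check tests.\nctest -R tests/e2e/stablehlo_ops\n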
Because of their dependencies, running the e2e model tests in generated_e2e_model_tests.cmake requires setting -DIREE_BUILD_E2E_TEST_ARTIFACTS=ON when configuring CMake. Also see IREE Benchmark Suite Prerequisites for required packages.
For the test tests/e2e/stablehlo_ops/floor.mlir compiled for the VMVX target backend and running on the VMVX driver (here they match exactly, but in principle there is a many-to-many mapping from backends to drivers):
ctest -R tests/e2e/stablehlo_ops/check_vmvx_local-task_floor.mlir\n
bazel test tests/e2e/stablehlo_ops:check_vmvx_local-task_floor.mlir\n
Similarly, you can use environment variables to select Vulkan implementations for running tests as explained in the Runtime tests section.
These tests live in tests/e2e. A single test consists of a .mlir source file specifying an IREE module where each exported function takes no inputs and returns no results and corresponds to a single test case.
As an example, here are some tests for the MHLO floor operation:
func.func @tensor() {\n %input = util.unfoldable_constant dense<[0.0, 1.1, 2.5, 4.9]> : tensor<4xf32>\n %result = \"mhlo.floor\"(%input) : (tensor<4xf32>) -> tensor<4xf32>\n check.expect_almost_eq_const(%result, dense<[0.0, 1.0, 2.0, 4.0]> : tensor<4xf32>): tensor<4xf32>\n return\n}\n\nfunc.func @scalar() {\n %input = util.unfoldable_constant dense<101.3> : tensor<f32>\n %result = \"mhlo.floor\"(%input) : (tensor<f32>) -> tensor<f32>\n check.expect_almost_eq_const(%result, dense<101.0> : tensor<f32>): tensor<f32>\n return\n}\n\nfunc.func @negative() {\n %input = util.unfoldable_constant dense<-1.1> : tensor<f32>\n %result = \"mhlo.floor\"(%input) : (tensor<f32>) -> tensor<f32>\n check.expect_almost_eq_const(%result, dense<-2.0> : tensor<f32>): tensor<f32>\n return\n}\n
Test cases are created in gtest for each public function exported by the module.
Note the use of util.unfoldable_constant to specify test constants. If we were to use a regular constant, the compiler would fold away everything at compile time and our test would not actually test the runtime. unfoldable_constant adds a barrier that prevents folding. To prevent folding/constant propagation on an arbitrary SSA value, you can use util.optimization_barrier.
Next we use this input constant to exercise the runtime feature under test (in this case, just a single floor operation). Finally, we use a check dialect operation to make an assertion about the output. There are a few different assertion operations. Here we use the expect_almost_eq_const op: almost because we are comparing floats and want to allow for floating-point imprecision, and const because we want to compare against a constant value. This last part is just syntactic sugar around:
%expected = arith.constant dense<101.0> : tensor<f32>\ncheck.expect_almost_eq(%result, %expected) : tensor<f32>\n
The output of running this test looks like:
[==========] Running 4 tests from 1 test suite.\n[----------] Global test environment set-up.\n[----------] 4 tests from module\n[ RUN ] module.tensor\n[ OK ] module.tensor (76 ms)\n[ RUN ] module.scalar\n[ OK ] module.scalar (79 ms)\n[ RUN ] module.double\n[ OK ] module.double (55 ms)\n[ RUN ] module.negative\n[ OK ] module.negative (54 ms)\n[----------] 4 tests from module (264 ms total)\n\n[----------] Global test environment tear-down\n[==========] 4 tests from 1 test suite ran. (264 ms total)\n[ PASSED ] 4 tests.\n
The \"module\" name for the test suite comes from the default name for an implicit MLIR module. To give the test suite a more descriptive name, use an explicit named top-level module in this file.
A single .mlir source file can be turned into a test target with the iree_check_test Bazel macro (and corresponding CMake function).
load(\"//build_tools/bazel:iree_check_test.bzl\", \"iree_check_test\")\n\niree_check_test(\n name = \"check_vmvx_local-task_floor.mlir\",\n src = \"floor.mlir\",\n driver = \"local-task\",\n target_backend = \"vmvx\",\n)\n
The target naming convention is \"check_backend_driver_src\". The generated test will automatically be tagged with a \"driver=local-task\" tag, which can help filter tests by driver (especially when many tests are generated, as below).
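For example, Bazel's tag filtering can then be used to run only the tests for a given driver (the package path below is illustrative):
# Run only tests tagged with driver=local-task in this package.\nbazel test --test_tag_filters=driver=local-task //tests/e2e/stablehlo_ops:all\n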
Usually we want to create a suite of tests across many backends and drivers. This can be accomplished with additional macros. For a single backend/driver pair:
load(\"//build_tools/bazel:iree_check_test.bzl\", \"iree_check_single_backend_test_suite\")\n\niree_check_single_backend_test_suite(\n name = \"check_vmvx_local-task\",\n srcs = glob([\"*.mlir\"]),\n driver = \"local-task\",\n target_backend = \"vmvx\",\n)\n
This will generate a separate test target for each file in srcs with a name following the convention above as well as a Bazel test_suite called \"check_vmvx_local-task\" that will run all the generated tests.
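The generated suite can then be run as a single target (the package path below is illustrative):
# Runs every test generated by the iree_check_single_backend_test_suite above.\nbazel test //tests/e2e/stablehlo_ops:check_vmvx_local-task\n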
You can also generate suites across multiple pairs:
load(\"//build_tools/bazel:iree_check_test.bzl\", \"iree_check_test_suite\")\n\niree_check_test_suite(\n name = \"check\",\n srcs = [\"success.mlir\"],\n # Leave this argument off to run on all supported backend/driver pairs.\n target_backends_and_drivers = [\n (\"vmvx\", \"local-task\"),\n (\"vulkan-spirv\", \"vulkan\"),\n ],\n)\n
This will create a test per source file and backend/driver pair, a test suite per backend/driver pair, and a test suite, \"check\", that will run all the tests.
The CMake functions follow a similar pattern. The calls to them are generated in our CMakeLists.txt file by bazel_to_cmake.
There are other test targets that generate tests based on template configuration and platform detection, such as iree_static_linker_test. Those targets are not supported by Bazel rules at this point.
An out-of-tree test suite is under development at nod-ai/SHARK-TestSuite for large collections of generated tests and machine learning models that are too large to fit into the main git repository.
Testing these programs follows several stages:
graph LR\n Import -. \"\\n(offline)\" .-> Compile\n Compile --> Run
This particular test suite treats importing (e.g. from ONNX, PyTorch, or TensorFlow) as an offline step and contains test cases organized into folders of programs, inputs, and expected outputs:
test_case_name/\n model.mlir\n input_0.npy\n output_0.npy\n test_data_flags.txt\n
where test_data_flags.txt contains flags for iree-run-module:
--input=@input_0.npy\n--expected_output=@output_0.npy\n
Each test case can be run using a sequence of commands like:
iree-compile model.mlir {flags} -o model.vmfb\niree-run-module --module=model.vmfb --flagfile=test_data_flags.txt\n
To run slices of the test suite, a pytest runner is included that can be configured using JSON files. The JSON files tested in the IREE repo itself are stored in build_tools/pkgci/external_test_suite/.
For example, here is part of a config file for running ONNX tests on CPU:
{\n \"config_name\": \"cpu_llvm_sync\",\n \"iree_compile_flags\": [\n \"--iree-hal-target-backends=llvm-cpu\"\n ],\n \"iree_run_module_flags\": [\n \"--device=local-sync\"\n ],\n \"skip_compile_tests\": [\n \"test_dequantizelinear\",\n \"test_slice_default_axes\"\n ],\n \"skip_run_tests\": [],\n \"expected_compile_failures\": [\n \"test_acos\",\n \"test_acos_example\",\n \"test_acosh\",\n \"test_acosh_example\",\n \"test_adagrad\",\n \"test_adagrad_multiple\",\n
To add new test cases to the external test suite:
To start running new test cases:
.github/workflows/
The external test suite only needs iree-compile and iree-run-module to run, so it is well suited for use in downstream projects that implement plugins for IREE. The conftest.py file can also be forked (or bypassed entirely) to further customize the test runner behavior.
IREE Benchmark Suites are a collection of benchmarks for IREE developers to track performance improvements/regressions during development.
The benchmark suites are run for each commit on the main branch and the results are uploaded to https://perf.iree.dev for regression analysis (for the currently supported targets). On pull requests, users can add benchmarks:* labels to trigger benchmark runs. The results will be compared with https://perf.iree.dev and posted in the comments.
Information about the definitions of the benchmark suites can be found in the IREE Benchmark Suites Configurations.
Install iree-import-tf and iree-import-tflite in your Python environment (see Tensorflow Integration and TFLite Integration).
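As a sketch, the importers are distributed as pip packages; the package names below reflect the layout at the time of writing, so check the linked integration guides for the current names:
# Install the TensorFlow and TFLite importer tools into the active Python environment\n# (package names are an assumption; see the integration guides).\npython -m pip install iree-tools-tf iree-tools-tflite\n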
IREE Benchmark Suites contain many benchmarks for different devices and model sizes, which can take a lot of space and time to build in full. Benchmarks are therefore grouped into presets to allow building and running only a subset of them. The available presets are:
Execution benchmarks:
android-cpu
android-gpu
cuda
cuda-large
vulkan-nvidia
x86_64
x86_64-large
Compilation benchmarks (to collect compilation statistics, such as module sizes):
comp-stats
comp-stats-large
Note that the *-large presets will download and build a few hundred GB of artifacts.
Set environment variables for the chosen benchmark presets, to be used in the steps below. For example:
export EXECUTION_BENCHMARK_PRESETS=\"cuda,x86_64\"\nexport COMPILATION_BENCHMARK_PRESETS=\"comp-stats\"\n
Configure IREE with -DIREE_BUILD_E2E_TEST_ARTIFACTS=ON:
cmake -GNinja -B \"${IREE_BUILD_DIR?}\" -S \"${IREE_REPO?}\" \\\n -DCMAKE_BUILD_TYPE=RelWithDebInfo \\\n -DCMAKE_C_COMPILER=clang \\\n -DCMAKE_CXX_COMPILER=clang++ \\\n -DIREE_ENABLE_LLD=ON \\\n -DIREE_BUILD_E2E_TEST_ARTIFACTS=ON\n
If you only need the imported MLIR models:
cmake --build \"${IREE_BUILD_DIR?}\" --target \\\n iree-benchmark-import-models\n # For large benchmarks (this will take > 100G disk space)\n # iree-benchmark-import-models-large\n
Otherwise, compile the benchmark suites and tools for benchmarking:
cmake --build \"${IREE_BUILD_DIR?}\" --target \\\n iree-benchmark-suites \\\n # If any *-large preset is enabled, also build this target:\n # iree-benchmark-suites-large \\\n iree-benchmark-module\nexport E2E_TEST_ARTIFACTS_DIR=\"${IREE_BUILD_DIR?}/e2e_test_artifacts\"\n
TODO(#13683): Each preset should have its own target to further reduce unnecessary builds
Export the execution benchmark config:
build_tools/benchmarks/export_benchmark_config.py execution \\\n --benchmark_presets=\"${EXECUTION_BENCHMARK_PRESETS?}\" \\\n > \"${E2E_TEST_ARTIFACTS_DIR?}/exec_config.json\"\n
Run the benchmarks (currently this is only supported on a Linux host):
build_tools/benchmarks/run_benchmarks_on_linux.py \\\n --normal_benchmark_tool_dir=\"${IREE_BUILD_DIR?}/tools\" \\\n --e2e_test_artifacts_dir=\"${E2E_TEST_ARTIFACTS_DIR?}\" \\\n --execution_benchmark_config=\"${E2E_TEST_ARTIFACTS_DIR?}/exec_config.json\" \\\n --target_device_name=\"<target_device_name, e.g. c2-standard-60>\" \\\n --output=\"${E2E_TEST_ARTIFACTS_DIR?}/benchmark_results.json\" \\\n --verbose \\\n --cpu_uarch=\"<host CPU uarch, e.g. CascadeLake>\"\n# Traces can be collected by adding:\n# --traced_benchmark_tool_dir=\"${IREE_TRACED_BUILD_DIR?}/tools\" \\\n# --trace_capture_tool=/path/to/iree-tracy-capture \\\n# --capture_tarball=captured_tracy_files.tar.gz\n
Note that:
<target_device_name> specifies the device to benchmark on, e.g. c2-standard-60 for x86_64 CPU benchmarks or a2-highgpu-1g for CUDA benchmarks.
--cpu_uarch is only needed for CPU benchmarks and should match the host CPU microarchitecture, e.g. CascadeLake for c2-standard-60.
Filters can be used to select the benchmarks:
build_tools/benchmarks/run_benchmarks_on_linux.py \\\n --normal_benchmark_tool_dir=\"${IREE_BUILD_DIR?}/tools\" \\\n --e2e_test_artifacts_dir=\"${E2E_TEST_ARTIFACTS_DIR?}\" \\\n --execution_benchmark_config=\"${E2E_TEST_ARTIFACTS_DIR?}/exec_config.json\" \\\n --target_device_name=\"c2-standard-60\" \\\n --output=\"${E2E_TEST_ARTIFACTS_DIR?}/benchmark_results.json\" \\\n --verbose \\\n --cpu_uarch=\"CascadeLake\" \\\n --model_name_regex=\"MobileBert*\" \\\n --driver_filter_regex='local-task' \\\n --mode_regex=\"4-thread\"\n
Export the compilation benchmark config:
build_tools/benchmarks/export_benchmark_config.py compilation \\\n --benchmark_presets=\"${COMPILATION_BENCHMARK_PRESETS?}\" \\\n > \"${E2E_TEST_ARTIFACTS_DIR?}/comp_config.json\"\n
Generate the compilation statistics:
build_tools/benchmarks/collect_compilation_statistics.py \\\n --compilation_benchmark_config=comp_config.json \\\n --e2e_test_artifacts_dir=\"${E2E_TEST_ARTIFACTS_DIR?}\" \\\n --build_log=\"${IREE_BUILD_DIR?}/.ninja_log\" \\\n --output=\"${E2E_TEST_ARTIFACTS_DIR?}/compile_stats_results.json\"\n
Note that you need to use Ninja to build the benchmark suites as the tool collects information from its build log.
If you want to generate a comparison report locally, you can use the diff_local_benchmarks.py script to compare two result JSON files and generate the report. For example:
build_tools/benchmarks/diff_local_benchmarks.py \\\n --base \"${E2E_TEST_ARTIFACTS_DIR?}/before_benchmark_results.json\" \\\n --target \"${E2E_TEST_ARTIFACTS_DIR?}/after_benchmark_results.json\" \\\n > report.md\n
An example that compares compilation statistics:
build_tools/benchmarks/diff_local_benchmarks.py \\\n --base-compile-stats \"${E2E_TEST_ARTIFACTS_DIR?}/before_compile_stats_results.json\" \\\n --target-compile-stats \"${E2E_TEST_ARTIFACTS_DIR?}/after_compile_stats_results.json\" \\\n > report.md\n
Each benchmark has its own benchmark ID in the benchmark suites. You will see a benchmark ID at:
https://perf.iree.dev/serie?IREE?<benchmark_id>
https://perf.iree.dev/serie?IREE?<benchmark_id>-<metric_id>
In benchmark_results.json, where each execution benchmark result carries it in the run_config_id field
In compile_stats_results.json, where each compilation benchmark result carries it in the gen_config_id field
In the reports generated by diff_local_benchmarks.py
If you don't have artifacts locally, see Fetching Benchmark Artifacts from CI to find the GCS directory of the CI artifacts. Then fetch the needed files:
# Get ${E2E_TEST_ARTIFACTS_DIR_URL} from \"Fetching Benchmark Artifacts from CI\".\nexport E2E_TEST_ARTIFACTS_DIR=\"e2e_test_artifacts\"\n\n# Download all artifacts\nmkdir \"${E2E_TEST_ARTIFACTS_DIR?}\"\ngcloud storage cp -r \"${E2E_TEST_ARTIFACTS_DIR_URL?}\" \"${E2E_TEST_ARTIFACTS_DIR?}\"\n
Run the helper tool to dump benchmark commands from benchmark configs:
build_tools/benchmarks/benchmark_helper.py dump-cmds \\\n --execution_benchmark_config=\"${E2E_TEST_ARTIFACTS_DIR?}/execution-benchmark-config.json\" \\\n --compilation_benchmark_config=\"${E2E_TEST_ARTIFACTS_DIR?}/compilation-benchmark-config.json\" \\\n --e2e_test_artifacts_dir=\"${E2E_TEST_ARTIFACTS_DIR?}\" \\\n --benchmark_id=\"<benchmark_id>\"\n
The commands below output the full list of execution and compilation benchmarks, including the benchmark names and their flags:
build_tools/benchmarks/export_benchmark_config.py execution > \"${E2E_TEST_ARTIFACTS_DIR?}/exec_config.json\"\nbuild_tools/benchmarks/export_benchmark_config.py compilation > \"${E2E_TEST_ARTIFACTS_DIR?}/comp_config.json\"\nbuild_tools/benchmarks/benchmark_helper.py dump-cmds \\\n --execution_benchmark_config=\"${E2E_TEST_ARTIFACTS_DIR?}/exec_config.json\" \\\n --compilation_benchmark_config=\"${E2E_TEST_ARTIFACTS_DIR?}/comp_config.json\"\n
On the commit of the benchmark run, you can find the list of workflow jobs by clicking the green check mark. Click any job that starts with CI /:
On the CI page, click Summary on the top-left to open the summary page. Scroll down and the links to artifacts are listed in a section titled \"Artifact Links\". Paste the content in your shell to define all needed variables for the following steps:
To fetch files from the GCS URL, the gcloud CLI tool (https://cloud.google.com/sdk/docs/install) can list the directory contents and download files (see https://cloud.google.com/sdk/gcloud/reference/storage for more usage). If you want to use CI artifacts to reproduce benchmarks locally, see Find Compile and Run Commands to Reproduce Benchmarks.
Assume you get the GCS URL variables from Get URLs of GCS artifacts.
Download artifacts:
# The GCS directory has the same structure as your local ${IREE_BUILD_DIR?}/e2e_test_artifacts.\ngcloud storage ls \"${E2E_TEST_ARTIFACTS_DIR_URL?}\"\n\n# Download all source and imported MLIR files:\ngcloud storage cp \"${E2E_TEST_ARTIFACTS_DIR_URL?}/*.mlir\" \"<target_dir>\"\n
Execution and compilation benchmark configs can be downloaded at:
# Execution benchmark config:\ngcloud storage cp \\\n \"${E2E_TEST_ARTIFACTS_DIR_URL?}/execution-benchmark-config.json\" \\\n \"${E2E_TEST_ARTIFACTS_DIR?}/exec_config.json\"\n\n# Compilation benchmark config:\ngcloud storage cp \\\n \"${E2E_TEST_ARTIFACTS_DIR_URL?}/compilation-benchmark-config.json\" \\\n \"${E2E_TEST_ARTIFACTS_DIR?}/comp_config.json\"\n
Benchmark raw results and traces can be downloaded at:
# Execution benchmark raw results\ngcloud storage cp \"${EXECUTION_BENCHMARK_RESULTS_DIR_URL?}/benchmark-results-*.json\" .\n\n# Optional: Merge raw results into a single file\nbuild_tools/benchmarks/benchmark_helper.py merge-results benchmark-results-*.json > benchmark_results.json\n\n# Execution benchmark traces\ngcloud storage cp \"${EXECUTION_BENCHMARK_RESULTS_DIR_URL?}/benchmark-traces-*.tar.gz\" .\n\n# Compilation benchmark results\ngcloud storage cp \"${COMPILATION_BENCHMARK_RESULTS_URL?}\" .\n
IREE uses benchmarks to inspect performance at varying levels of granularity. Benchmarking is implemented using the Google Benchmark library. To understand performance details and guide optimization, please refer to the IREE profiling documentation.
iree-benchmark-module is a program accepting (almost) the same inputs as iree-run-module that will benchmark the invocation of a single entry function. It measures timing for the whole process of invoking a function through the VM, including allocating and freeing output buffers. This is a high-level benchmark of an entire invocation flow. It provides a big picture view, but depends on many different variables, like an integration test. For finer-grained measurements more akin to unit tests, see Executable Benchmarks.
To use iree-benchmark-module, generate an IREE module for the target backend:
$ bazel run //tools:iree-compile -- \\\n --iree-hal-target-backends=vmvx \\\n $PWD/samples/models/simple_abs.mlir \\\n -o /tmp/module.fb\n
and then benchmark an exported function in that module:
$ bazel run //tools:iree-benchmark-module -- \\\n --module=/tmp/module.fb \\\n --device=local-task \\\n --function=abs \\\n --input=f32=-2\n
You'll see output like
Run on (12 X 4500 MHz CPU s)\nCPU Caches:\n L1 Data 32K (x6)\n L1 Instruction 32K (x6)\n L2 Unified 1024K (x6)\n L3 Unified 8448K (x1)\nLoad Average: 2.21, 1.93, 3.34\n***WARNING*** CPU scaling is enabled, the benchmark real time measurements may\n be noisy and will incur extra overhead.\n***WARNING*** Library was built as DEBUG. Timings may be affected.\n------------------------------------------------------------------------------\nBenchmark Time CPU Iterations\n------------------------------------------------------------------------------\nBM_RunModule/process_time/real_time 0.22 ms 0.23 ms 3356\n
Notice that there are a few warnings in there (you may not see all of these). The benchmark library helpfully warns about some common issues that will affect benchmark timing. When trying to obtain real benchmark numbers, you should generally build an optimized build (-c opt in Bazel) and disable CPU scaling.
bazel build -c opt //tools:iree-benchmark-module\n
Another thing to consider is that, depending on where you are running the benchmark, you might want to avoid additional programs running at the same time. Bazel itself runs a server in the background even when it's not being actively invoked, and that server can be quite a memory hog, so we'll instead invoke the binary directly. Use your favorite process manager (e.g. htop or pkill on Linux) to kill heavy-weight programs such as Chrome and Bazel.
Now we'll actually invoke the binary:
$ ./bazel-bin/tools/iree-benchmark-module \\\n --module=/tmp/module.fb \\\n --device=local-task \\\n --function=abs \\\n --input=f32=-2\n
Run on (12 X 4500 MHz CPU s)\nCPU Caches:\n L1 Data 32K (x6)\n L1 Instruction 32K (x6)\n L2 Unified 1024K (x6)\n L3 Unified 8448K (x1)\nLoad Average: 1.49, 3.42, 3.49\n------------------------------------------------------------------------------\nBenchmark Time CPU Iterations\n------------------------------------------------------------------------------\nBM_RunModule/process_time/real_time 0.011 ms 0.014 ms 61654\n
Remember to restore CPU scaling when you're done.
We also benchmark the performance of individual parts of the IREE system in isolation. IREE breaks a model down into dispatch functions. To benchmark all the dispatch functions, generate an IREE module with the --iree-flow-export-benchmark-funcs flag set:
$ build/tools/iree-compile \\\n --iree-input-type=stablehlo \\\n --iree-flow-export-benchmark-funcs \\\n --iree-hal-target-backends=vmvx \\\n tests/e2e/stablehlo_models/fullyconnected.mlir \\\n -o /tmp/fullyconnected.vmfb\n
and then benchmark all exported dispatch functions (and all exported functions) in that module:
$ build/tools/iree-benchmark-module \\\n  --module=/tmp/fullyconnected.vmfb \\\n  --device=local-task\n
If no entry_function is specified, iree-benchmark-module will register a benchmark for each exported function that takes no inputs.
You will see output like:
Run on (72 X 3700 MHz CPU s)\nCPU Caches:\n L1 Data 32 KiB (x36)\n L1 Instruction 32 KiB (x36)\n L2 Unified 1024 KiB (x36)\n L3 Unified 25344 KiB (x2)\nLoad Average: 4.39, 5.72, 6.76\n---------------------------------------------------------------------------------------------\nBenchmark Time CPU Iterations\n---------------------------------------------------------------------------------------------\nBM_main_ex_dispatch_0_benchmark/process_time/real_time 0.030 ms 0.037 ms 34065\nBM_main_ex_dispatch_1_benchmark/process_time/real_time 0.034 ms 0.042 ms 20567\nBM_main_ex_dispatch_2_benchmark/process_time/real_time 0.043 ms 0.051 ms 18576\nBM_main_ex_dispatch_3_benchmark/process_time/real_time 0.029 ms 0.036 ms 21345\nBM_main_ex_dispatch_4_benchmark/process_time/real_time 0.042 ms 0.051 ms 15880\nBM_main_ex_dispatch_5_benchmark/process_time/real_time 0.030 ms 0.037 ms 17854\nBM_main_ex_dispatch_6_benchmark/process_time/real_time 0.043 ms 0.052 ms 14919\nBM_main_benchmark/process_time/real_time 0.099 ms 0.107 ms 5892\n
Normally, the IREE VM is expected to be integrated into applications to drive model execution, so its performance is of crucial importance. We strive to introduce as little overhead as possible and have several benchmark binaries dedicated to evaluating the VM's performance. These benchmark binaries are named *_benchmark and live in the iree/vm/ directory. They also use the Google Benchmark library, as above.
When benchmarking, it's important to consider the configuration of your CPUs. Most notably, CPU scaling can give variable results, so you'll usually want to disable it. This can get pretty complex, but the most basic thing to do is to run all CPUs at maximum frequency. The other thing to consider is what CPU(s) your program is running on. Both of these get more complicated on mobile and in multithreaded workloads.
Google Benchmark provides some instructions. Note that the library will print \"CPU scaling is enabled\" warnings for any configuration that doesn't have the scaling governor set to performance. Similarly, the CPU frequency it reports is the maximum frequency of cpu0, not the frequency of the processor it's actually running on. This means that more advanced configurations should ignore these messages.
Turn off CPU scaling before benchmarking.
sudo cpupower frequency-set --governor performance\n
Restore CPU scaling after benchmarking:
sudo cpupower frequency-set --governor powersave\n
To learn more about different scaling governor settings, see https://www.kernel.org/doc/Documentation/cpu-freq/governors.txt. To restrict which CPUs you run on, use the taskset command, which takes a hexadecimal mask.
To run only on the lowest-numbered CPU, you can run:
taskset 1 sleep 20 &\n
You can confirm that the process is running on the given CPU:
ps -o psr $!\n
Note that $! indicates the process ID of the last executed background command, so you can only use this shorthand if you didn't run any commands after the sleep. For more info on taskset, see https://linux.die.net/man/1/taskset.
Read and understand the Linux instructions first.
Android doesn't give us quite as nice tooling, but the principle is basically the same. One important difference is that thermal throttling is a much bigger concern on mobile. Without a cooling plate, it is likely that high clock speeds will overheat the device and engage thermal throttling, which will ignore whatever clock speeds you may have set to prevent things from catching on fire. Therefore the naive approach above is likely not a good idea.
You will likely need to be root (use su or adb root). The commands will depend on your exact phone and number of cores. First play around and make sure you understand what everything means. Note that each CPU has its own files which are used to control its behavior, but changes to a single CPU will sometimes affect others (see /sys/devices/system/cpu/cpu0/cpufreq/affected_cpus).
Some useful files:
/proc/cpuinfo\n/sys/devices/system/cpu/possible\n/sys/devices/system/cpu/present\n/sys/devices/system/cpu/cpu0/online\n/sys/devices/system/cpu/cpu0/cpufreq/scaling_available_governors\n/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor\n/sys/devices/system/cpu/cpu0/cpufreq/scaling_available_frequencies\n/sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_max_freq\n/sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_min_freq\n/sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_cur_freq\n/sys/devices/system/cpu/cpu0/cpufreq/affected_cpus\n/sys/devices/system/cpu/cpu0/cpufreq/scaling_setspeed\n
See the clock speed of each CPU:
$ for i in `cat /sys/devices/system/cpu/present | tr '-' ' ' | xargs seq`; do \\\n paste \\\n \"/sys/devices/system/cpu/cpu${i?}/cpufreq/cpuinfo_cur_freq\" \\\n \"/sys/devices/system/cpu/cpu${i?}/cpufreq/cpuinfo_min_freq\" \\\n \"/sys/devices/system/cpu/cpu${i?}/cpufreq/cpuinfo_max_freq\"; \\\ndone\n
Before changing things, make sure to check the current scaling governor settings first so you can put them back when you're done.
$ for i in `cat /sys/devices/system/cpu/present | tr '-' ' ' | xargs seq`; do \\\n cat \"/sys/devices/system/cpu/cpu${i?}/cpufreq/scaling_governor\"; \\\ndone\n
Here's an example to run IREE in a single-threaded context on CPU 7 at its lowest clock speed.
First we'll take control of the clockspeed by setting the governor to \"userspace\".
$ for i in `cat /sys/devices/system/cpu/present | tr '-' ' ' | xargs seq`; do \\\n echo userspace > \\\n \"/sys/devices/system/cpu/cpu${i?}/cpufreq/scaling_governor\"; \\\ndone\n
We can now set individual clock speeds. We'll pin cpu7 to its minimum frequency. We choose the minimum instead of the maximum here to mitigate thermal throttling concerns:
$ cat /sys/devices/system/cpu/cpu7/cpufreq/cpuinfo_min_freq > \\\n/sys/devices/system/cpu/cpu7/cpufreq/scaling_setspeed\n
We can confirm the frequencies of all the CPUs by running the same command above. Now to run a command specifically on cpu7, use taskset 80 (hex for 10000000):
taskset 80 sleep 20 &\nps -o psr $!\n
Remember to clean up when you're done! Here we'll set the scaling governor back to schedutil because that's what it was before on the particular device this was tested on, but that governor may not exist on all devices.
$ for i in `cat /sys/devices/system/cpu/present | tr '-' ' ' | xargs seq`; do \\\n echo schedutil > \\\n \"/sys/devices/system/cpu/cpu${i?}/cpufreq/scaling_governor\"; \\\ndone\n
We provide a few scripts to set clockspeeds on Android (under build_tools/benchmarks). These are somewhat device-specific:
set_android_scaling_governor.sh sets the CPU scaling governor on all cores.
set_*_gpu_scaling_policy.sh sets the GPU frequency scaling policy for the corresponding GPU.
Sample configuration steps for Pixel 6:
adb push build_tools/benchmarks/*.sh /data/local/tmp\n
adb shell\noriole:/ # su\noriole:/ # cd /data/local/tmp\n
oriole:/ # ./set_android_scaling_governor.sh\n CPU info (before changing governor):\n cpu governor cur min max\n ------------------------------------------------\n cpu0 sched_pixel 1098000 300000 1803000\n cpu1 sched_pixel 1598000 300000 1803000\n cpu2 sched_pixel 1598000 300000 1803000\n cpu3 sched_pixel 1098000 300000 1803000\n cpu4 sched_pixel 400000 400000 2253000\n cpu5 sched_pixel 400000 400000 2253000\n cpu6 sched_pixel 500000 500000 2802000\n cpu7 sched_pixel 500000 500000 2802000\n Setting CPU frequency governor to performance\n CPU info (after changing governor):\n cpu governor cur min max\n ------------------------------------------------\n cpu0 performance 1803000 300000 1803000\n cpu1 performance 1803000 300000 1803000\n cpu2 performance 1803000 300000 1803000\n cpu3 performance 1803000 300000 1803000\n cpu4 performance 2253000 400000 2253000\n cpu5 performance 2253000 400000 2253000\n cpu6 performance 2802000 500000 2802000\n cpu7 performance 2802000 500000 2802000\noriole:/data/local/tmp # ./set_pixel6_gpu_scaling_policy.sh\n GPU info (before changing frequency scaling policy):\n policy cur min max\n --------------------------------------------------------------\n coarse_demand [adaptive] always_on 251000 151000 848000\n Setting GPU frequency scaling policy to performance\n GPU info (after changing frequency scaling policy):\n policy cur min max\n --------------------------------------------------------------\n coarse_demand adaptive [always_on] 848000 151000 848000\n
To restore the default settings when you're done:
oriole:/ # ./set_android_scaling_governor.sh sched_pixel\n...\noriole:/ # ./set_pixel6_gpu_scaling_policy.sh default\n...\n
TODO(scotttodd): Windows instructions
CPUs are able to record certain events that may be relevant when investigating the performance of a program. A common example of such an event is a \"cache miss\", when the program tries to access data in memory that isn't already in some CPU cache, causing that access to be slower than it could otherwise be.
Querying and analyzing this data can be useful, but is hard in two distinct ways:
There are two parts to this page: platform-specific information about how to query this data, and, at the end, a platform-independent explanation of how to interpret it.
The Linux kernel exposes system event counters to user-space programs by means of the perf_event_open system call. This includes both hardware event counters (such as CPU cache events) and software events from the kernel (such as page faults and context switches). Anyone may use this system call to implement a profiler, but Linux readily offers one, perf.
By default, IREE cleans up any temporary files it creates while running. Tools like perf, however, require those files to exist even after the process has exited. The IREE_PRESERVE_DYLIB_TEMP_FILES environment variable can be set to preserve the files. This is only needed for the CPU path when using the system loader.
export IREE_PRESERVE_DYLIB_TEMP_FILES=1\n
On desktop Linux we can use perf. It is provided on most Linux distributions, for instance on Debian-based distributions do:
sudo apt install linux-perf\n
Run the program to be profiled, prepending its command line with perf record. By default this will write the profile data to ./perf.data in the current directory. Sometimes this isn't ideal, such as when the current directory is under version control. An explicit output path can be given with the -o flag of perf record, and then selected as input with the -i flag of the subsequent commands analyzing the profile. Example:
perf record -o /tmp/perf.data \\\n ./tools/iree-benchmark-module \\\n --device=local-task \\\n ... command-line arguments of iree-benchmark-module as usual ...\n
By default, this samples time spent. One may specify instead an event to sample by, with the -e flag. For instance, to sample by L1 cache misses, one may do:
perf record -o /tmp/perf.data -e L1-dcache-load-misses \\\n ./tools/iree-benchmark-module \\\n --device=local-task \\\n ... command-line arguments of iree-benchmark-module as usual ...\n
perf list dumps the list of event types.
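For example, to narrow the listing down to a single category of events:
# List only cache-related events; other categories include hw, sw, and tracepoint.\nperf list cache\n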
Once you have recorded a profile, there are two main ways to analyze it: perf report and perf annotate.
perf report breaks down the event counts by symbol. In the default case where what was sampled was time, this is just an ordinary profile by symbol name, no different than what could be viewed in other profilers such as Tracy. Where it gets really interesting is when the profile was recording a specific event type, as in the above -e L1-dcache-load-misses example:
perf report -i /tmp/perf.data\n\nSamples: 6K of event 'L1-dcache-load-misses', Event count (approx.): 362571861\nOverhead Command Shared Object Symbol\n 61.53% cpu0 dylib_executablenzpx2Q.so [.] serving_default_ex_dispatch_31\n 13.30% cpu0 dylib_executablenzpx2Q.so [.] serving_default_ex_dispatch_11\n 2.11% cpu0 dylib_executablenzpx2Q.so [.] serving_default_ex_dispatch_13\n 1.90% cpu0 dylib_executablenzpx2Q.so [.] serving_default_ex_dispatch_19\n 1.54% cpu0 dylib_executablenzpx2Q.so [.] serving_default_ex_dispatch_25\n 1.49% cpu0 dylib_executablenzpx2Q.so [.] serving_default_ex_dispatch_5\n
perf annotate breaks down the event counts by instruction. Again, in the default case where what was sampled was time, this is no different than what could be viewed in Tracy, and the real motivation to use perf is when profiling by specific event types as in the above -e L1-dcache-load-misses example:
perf annotate -i perf.data\n\nSamples: 6K of event 'L1-dcache-load-misses', 4000 Hz, Event count (approx.): 362571861\nserving_default_ex_dispatch_31 /tmp/dylib_executablenzpx2Q.so [Percent: local period]\n 1.66 \u2502 movups -0x1000(%rdi),%xmm10\n 0.48 \u2502 movups -0x800(%rdi),%xmm9\n 0.82 \u2502 movups (%rdi),%xmm8\n 0.49 \u2502 movaps %xmm1,%xmm4\n 0.12 \u2502 shufps $0x0,%xmm1,%xmm4\n 0.14 \u2502 mulps %xmm5,%xmm4\n 0.28 \u2502 addps %xmm6,%xmm4\n 0.60 \u2502 movaps %xmm3,%xmm6\n 0.34 \u2502 shufps $0x0,%xmm3,%xmm6\n
perf annotate is even noisier than perf report, as it can be overly optimistic, depending on the CPU, in pinning an event to a specific instruction. Typically, this works fairly well on x86 CPUs and less well on ARM CPUs and more generally on anything mobile. Even on a desktop x86 CPU, this is noisy, as the above example (recorded on a Skylake workstation) shows: it blamed a mulps %xmm5,%xmm4 instruction for a cache miss, which doesn't make sense as that instruction only touches registers.
On Android we can use simpleperf. It's preinstalled on current Android userdebug images, and part of the Android NDK.
In theory, as Android is Linux, it should be possible to use perf. Unfortunately, perf is difficult to build for Android. Fortunately, simpleperf is readily available: it is preinstalled in Android userdebug images, and it is part of the Android NDK.
First, we record on the device:
adb shell \\\n simpleperf record -e raw-l1d-cache-refill -o /data/local/tmp/perf.data \\\n /data/local/tmp/iree-benchmark-module \\\n --device=local-task \\\n ... command-line arguments of iree-benchmark-module as usual ...\n
Then pull the recorded data from the device, and analyze on the desktop. We assume that ${ANDROID_NDK} points to the local copy of the Android NDK.
${ANDROID_NDK}
adb pull /data/local/tmp/perf.data /tmp/perf.data\n${ANDROID_NDK}/simpleperf/report.py -i /tmp/perf.data\n
This prints a breakdown of raw-l1d-cache-refill events by symbol.
Like with perf, a list of event types can be queried by the list subcommand:
adb shell simpleperf list\n
There is no simpleperf annotate. The simpleperf documentation lists a couple of ways of achieving the same thing.
However:
There are multiple layers of complexity in interpreting CPU event counts.
The first difficulty is in the fact that most of these events are normal. So just knowing that they happened is not in itself actionable.
For example, if we learn that some code causes cache misses, that isn't big news: so does all code. Maybe this code has too many cache misses, but how many is too many? Maybe this code alone accounts for a large fraction of the overall total of the whole program, but maybe even that is normal, for instance if the code being studied is the 'hot' part of the program where a large fraction of overall time is spent?
Many of these events have a meaning that varies between CPUs and that is difficult to characterize on any CPU, let alone in a way that applies to all CPUs.
For example, take the \"L2 data cache refill\". On ARM, with simpleperf, that would be raw-l2d-cache-refill. Questions:
The answers to all of the above questions are CPU-dependent. They may even vary between the CPU cores of the same Android device.
Expect noise levels above 10% in many CPU event counts on ARM CPUs. Moreover, on ARM, as discussed above, there is inaccuracy in which instruction is blamed for which event, which will increase inaccuracy of per-symbol breakdowns for very cheap symbols (and makes perf annotate impossible as noted above). Finally, be aware that some ARM CPUs may perform event count interpolation, so we may not have any access to true hardware counts.
Here is a workflow pattern that allows making significant use of CPU event counts, despite all the problems noted above:
Some things NOT to be done:
Tracy offers great insights into CPU/GPU interactions and Vulkan API usage details. However, information at a finer granularity, especially inside a particular shader dispatch, is missing. To supplement general purpose tools like Tracy, vendor-specific tools can be used.
(TODO: add some pictures for each tool)
Support for RenderDoc can be enabled by configuring cmake with -DIREE_ENABLE_RENDERDOC_PROFILING=ON. When built in to IREE the profiling functionality is available for programmatic use via the iree_hal_device_profiling_begin and iree_hal_device_profiling_end APIs.
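For example, a build with RenderDoc profiling support might be configured like this (the build directory layout follows the other examples on this page):
cmake -G Ninja -B ../iree-build/ -S . \\\n    -DIREE_ENABLE_RENDERDOC_PROFILING=ON\ncmake --build ../iree-build/ --target iree-run-module\n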
When using one of the standard IREE tools (iree-run-module, iree-benchmark-module, etc) the --device_profiling_mode=queue flag can be passed to enable capture around the entire invocation (be careful when benchmarking as the recordings can be quite large!). The default capture file name can be specified with --device_profiling_file=foo.rdc.
Capturing in the RenderDoc UI can be done by specifying the IREE tool or embedding application (iree-run-module, etc) as the launch executable and adding all arguments as normal.
Capturing from the command line can be done using renderdoccmd with the specified file appearing (by default) in the executable directory:
renderdoccmd capture tools/iree-run-module --device_profiling_mode=queue --device_profiling_file=foo.rdc ...\nstat tools/foo.rdc\nrenderdoccmd capture tools/iree-run-module --device_profiling_mode=queue --device_profiling_file=/some/path/foo.rdc ...\nstat /some/path/foo.rdc\n
There are multiple GPU vendors for the Android platforms, each offering their own tools. Android GPU Inspector (AGI) provides a cross-vendor solution. See the documentation for more details.
Vulkan supports both graphics and compute, but most tools in the Vulkan ecosystem focus on graphics. As a result, some Vulkan profiling tools expect commands to correspond to a sequence of frames presented to displays via framebuffers. This means additional steps for IREE and other Vulkan applications that solely rely on headless compute. For graphics-focused tools, we need to wrap IREE's logic inside a dummy rendering loop in order to provide the necessary markers for these tools to perform capture and analysis.
For AMD GPUs, Radeon GPU Profiler (RGP) is the tool to understand fine details of how IREE GPU performs. See the documentation for details.
For NVIDIA GPUs, NVIDIA Nsight Graphics is the tool to understand fine details of how IREE GPU performs. See the documentation for details.
Tracy is a hybrid instrumentation and sampling profiler that IREE uses for performance analysis.
Instrumentation is generic code built into the program being profiled, recording zone start and end timestamps where a developer requests them:
Most of IREE's runtime code is instrumented using the macros defined in iree/base/tracing.h:
void iree_sample_function() {\n IREE_TRACE_ZONE_BEGIN(z0);\n // All code here will be included in the zone for `iree_sample_function`.\n IREE_TRACE_ZONE_END(z0);\n}\n
Sampling collects program state and information about the machine using platform-specific APIs at a regular sampling frequency. Sampled data includes callstacks, hardware counters, and more:
While recording instrumentation data requires no special setup, recording sampling data will need some configuration depending on your operating system. Refer to the \"Automated data collection\" section in the Tracy PDF manual for full details. Generally, sampling needs:
Debug information, e.g. from -DCMAKE_BUILD_TYPE=RelWithDebInfo or Debug builds
Elevated permissions (root / sudo / administrator) on some platforms
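On Linux, one commonly used alternative to running the whole program as root is to relax the kernel's perf_event_paranoid setting; the exact level required depends on which data you want, so treat the value below as an assumption and check the Tracy manual:
# Show the current restriction level (higher = more restricted).\ncat /proc/sys/kernel/perf_event_paranoid\n\n# Temporarily relax it (resets on reboot).\nsudo sh -c 'echo 1 > /proc/sys/kernel/perf_event_paranoid'\n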
Tracy uses a client-server model with communication over a TCP socket:
graph LR\n tracyclient[\"Tracy Client\n e.g. iree-run-module\"]\n tracyserver[\"Tracy Server\"]\n network([\"Network\"])\n\n thread1[\"Thread 1\"] --> tracyclient\n thread2[\"Thread 2\"] --> tracyclient\n thread3[\"Thread 3\"] --> tracyclient\n\n tracyclient --> network\n network --> tracyserver\n\n tracyserver --> display[\"Display\"]\n tracyserver --> storage[\"Storage\"]
This allows for remote capture, such as over SSH, as well as sharing of saved traces across machines.
The primary source of Tracy documentation, including how to build the profiler UI and CLI capture tool, is the Tracy PDF manual (tracy.pdf).
You will need three things to capture a trace:
The Tracy profiler UI or CLI capture tool
An IREE tool such as iree-run-module built with tracing support enabled
A compiled program (.vmfb) to profile
The Tracy tools can either be downloaded from the official releases or they can be built from source by using either the upstream CMake build or IREE's downstream CMake build.
Build iree-run-module (or other tools like iree-benchmark-module) with tracing support:
# Sampling needs debug info from the `RelWithDebInfo` or `Debug` build type.\n\ncmake -G Ninja -B ../iree-build/ -S . \\\n -DCMAKE_BUILD_TYPE=RelWithDebInfo \\\n -DIREE_ENABLE_RUNTIME_TRACING=ON\ncmake --build ../iree-build/ --target iree-run-module\n
For more information about building from source, follow the Getting started page.
The iree-runtime Python package includes prebuilt instrumented tools. Set the IREE_PY_RUNTIME=tracy environment variable to use them:
python -m pip install iree-runtime\nIREE_PY_RUNTIME=tracy iree-run-module ...\n
You should see the following message printed to stderr:
-- Using Tracy runtime (IREE_PY_RUNTIME=tracy)
See this section in the Python bindings documentation for more details.
Compile a program to profile:
# The --iree-hal-executable-debug-level=3 flag embeds source information\n# about each executable into the .vmfb file for the runtime to pass to\n# Tracy. Without this flag, source locations are included on a best-effort\n# basis, typically coming from the input .mlir or .py file.\n\niree-compile program_input.mlir \\\n --iree-hal-target-backends={target} \\\n --iree-hal-executable-debug-level=3 \\\n -o program.vmfb\n
Run the program using the instrumented iree-run-module:
# Set the TRACY_NO_EXIT environment variable to keep short-running programs\n# from exiting before connecting.\n#\n# Some platforms need elevated permissions (root / sudo / administrator)\n# to collect sampling data using kernel facilities. If you only want to\n# collect instrumentation data or your platform does not require it, you\n# can run with more limited permissions.\n\nTRACY_NO_EXIT=1 sudo iree-run-module \\\n --module=program.vmfb \\\n --device={device} \\\n --entry_function={entry} \\\n --parameters={parameters} \\\n --input={arg0} \\\n --input={arg1} \\\n ...\n
While the program is running, connect using the Tracy profiler UI or capture tool:
The profiler UI lists available clients or can be set to connect to the next instrumented process:
The capture tool can be used programmatically and over SSH:
$ capture -o /tmp/capture.tracy\n\nConnecting to 127.0.0.1:8086...\n
View the captured trace once it finishes collecting events. Traces captured by the profiler UI can also be saved to .tracy files for sharing and archival.
Set IREE's IREE_TRACING_MODE value (defined in iree/base/tracing.h) to adjust which tracing features are enabled. Each feature adds tracing overhead and increases the size of trace files, so adjust this setting with care.
For example, to track memory allocations with callstacks:
cmake -G Ninja -B ../iree-build/ -S . \\\n -DCMAKE_BUILD_TYPE=RelWithDebInfo \\\n -DIREE_ENABLE_RUNTIME_TRACING=ON \\\n -DIREE_TRACING_MODE=4\ncmake --build ../iree-build/ --target iree-run-module\n
The Memory window in the Tracy profiler should then show callstacks for each allocation:
When using the llvm-cpu backend (--iree-hal-target-backends=llvm-cpu with --device=local-task or --device=local-sync), these options are available:
The --iree-llvmcpu-link-embedded=false flag uses the \"system\" linker (.so/.dylib/.dll) instead of the generic \"embedded\" ELF linker, allowing Tracy to look more deeply at generated code:
The IREE_PRESERVE_DYLIB_TEMP_FILES environment variable can be used on POSIX platforms to ensure that Tracy can view IREE's generated native code.
Ensure that --iree-llvmcpu-debug-symbols=true is set (it is by default).
--iree-llvmcpu-debug-symbols=true
Putting those flags and environment variables together in an example:
iree-compile program_input.mlir \\\n --iree-hal-target-backends=llvm-cpu \\\n --iree-hal-executable-debug-level=3 \\\n --iree-llvmcpu-link-embedded=false \\\n --iree-llvmcpu-debug-symbols=true \\\n -o program_full_info.vmfb\n\nTRACY_NO_EXIT=1 IREE_PRESERVE_DYLIB_TEMP_FILES=1 sudo iree-run-module \\\n --device=local-task \\\n --module=program_full_info.vmfb \\\n ...\n
Tracy's client/server connection uses TCP port 8086 by default. If the Tracy-instrumented program is running on a separate machine, this port needs to be forwarded.
In particular, when profiling on Android, this is needed:
adb forward tcp:8086 tcp:8086\n
You can also pass -p <port> to the capture tool to override the default port to connect to, or use the Tracy GUI which scans other ports too.
-p <port>
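For profiling over SSH on other platforms, a local port forward serves the same purpose. A minimal sketch, assuming a hypothetical user@remote-host running the instrumented program and the default port 8086:
# Forward local port 8086 to port 8086 on the machine running the\n# instrumented program.\nssh -L 8086:127.0.0.1:8086 user@remote-host\n\n# In another local terminal, connect with the capture tool (or profiler UI).\ncapture -o /tmp/remote_capture.tracy\n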
The initial view should look like this:
Before going further, take a second to check that your recorded profile data has all the data that it should have. Permissions issues could cause it to lack \"sampling\" or \"CPU data\" information. For example, here is what the initial view looks like when the profiled program was not run as root on Desktop Linux (where running as root is required):
Notice how the latter screenshot is lacking the following elements:
Click the 'Statistics' button at the top. It will open a window like this:
See how the above screenshot has two radio buttons at the top: 'Instrumentation' and 'Sampling'. At this point, if you don't see the 'Sampling' radio button, you need to resolve that first, as discussed above regarding possible permissions issues.
These 'Instrumentation' and 'Sampling' statistics correspond to the two kinds of data that Tracy collects about your program. In the Tracy main view, they correspond, respectively, to 'instrumentation' and 'ghost' zones. Refer to the Tracy PDF manual for a general introduction to these concepts. For each thread, the ghost icon toggles the view between these two kinds of zones.
Back to the main view, look for the part of the timeline that is of interest to you. Your area of interest might not be on the Main thread. In fact, it might be on a thread that's not visible in the initial view at all. To pan around with the mouse, hold the right mouse button down (or its keyboard equivalent on macOS). Alternatively, look for the 'Frame' control at the top of the Tracy window. Use the 'next frame' arrow button until more interesting threads appear.
IREE module code tends to run on a thread whose name contains the word worker.
worker
Once you have identified the thread of interest, you typically want to click its ghost icon to view its \"ghost\" (i.e. sampling) zones. Here is what you should get when clicking on a ghost zone:
The percentages column to the left of the disassembly shows where time is being spent. This is unique to the sampling data (ghost zones) and has no equivalent in the instrumentation data (instrumentation zones). Here is what we get clicking on the corresponding instrumentation zone:
This still has a 'Source' button, but that only shows the last C++ caller that had explicit Tracy information, so here we see a file under iree/hal whereas the ghost zone saw into the IREE compiled module that it calls into, with the source view pointing to the .mlir file.
iree/hal
Tracing iree-compile is much like tracing the runtime tools, except that both of these options need to be set with CMake: -DIREE_ENABLE_RUNTIME_TRACING=ON -DIREE_ENABLE_COMPILER_TRACING=ON:
-DIREE_ENABLE_RUNTIME_TRACING=ON -DIREE_ENABLE_COMPILER_TRACING=ON
cmake -G Ninja -B ../iree-build/ -S . \\\n -DCMAKE_BUILD_TYPE=RelWithDebInfo \\\n -DIREE_ENABLE_RUNTIME_TRACING=ON \\\n -DIREE_ENABLE_COMPILER_TRACING=ON\ncmake --build ../iree-build/ --target iree-compile\n
The steps for collecting traces are the same: run the instrumented program and connect using the Tracy profiler UI or capture tool.
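For example, a sketch of tracing a compile of the program_input.mlir file used earlier, assuming the build directory layout from the configuration step above:
# Keep the instrumented compiler alive until a profiler client connects.\nTRACY_NO_EXIT=1 ../iree-build/tools/iree-compile program_input.mlir \\\n    --iree-hal-target-backends=llvm-cpu \\\n    -o program.vmfb\n\n# In another terminal, capture the trace.\ncapture -o /tmp/iree-compile.tracy\n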
TracingUtils.h
When tracing the compiler, the LLVM/MLIR code can easily generate millions of trace events. Traces captured with sampling can thus take hours to collect, require 40GB+ of RAM to view, and take 1GB+ on disk to store.
However, sampling is especially useful in diagnosing long compile times, since only the MLIR passes are instrumented, unlike in IREE's runtime where most functions are covered.
For more tips on profiling the compiler, see the Compile time regression debugging page.
This is a known issue with how Tracy operates. One way to work around it is to manually increase the total number of files that can be kept open simultaneously and run the command with that setting:
sudo sh -c \"ulimit -n <bigNum> && <myTracyInstrumentedProgram>\"\n
Info
Tracy keeps a number of file descriptors open that, depending on the machine and its settings, may exceed the limit allowed by the system, resulting in IREE failing to open more files. In particular, it is common to have a relatively low limit when running with sudo.
You can skip this section if you don't need disassembly of CPU code.
Capstone is the disassembly framework used by Tracy. The default branch, which is what OS packages still distribute, is running a few years behind current CPU architectures.
Newer CPU architectures such as RISC-V, or newer extensions of existing architectures (e.g. new SIMD instructions in the ARM architecture) are typically only supported in the next branch. If you need that support, check out and build that branch. Consider uninstalling any OS package for capstone or otherwise ensure that your IREE build will pick up your next branch build.
next
capstone
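A minimal sketch of building and installing the next branch, assuming the upstream capstone repository and a standard CMake-based install (adjust the generator and install prefix for your system):
git clone --branch next https://github.com/capstone-engine/capstone.git\ncd capstone\ncmake -G Ninja -B build -S . -DCMAKE_BUILD_TYPE=Release\ncmake --build build\n# Install system-wide so the Tracy build can find it.\nsudo cmake --install build\n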
If you haven't opted to build capstone-next (see above section), install the OS package for capstone now (Debian-based distributions):
capstone-next
sudo apt install libcapstone-dev\n
Install other dependencies:
sudo apt install libtbb-dev libzstd-dev libglfw3-dev libfreetype6-dev libgtk-3-dev\n
If you only build the command-line tool iree-tracy-capture and not the graphical iree-tracy-profiler, you can install only:
iree-tracy-capture
iree-tracy-profiler
sudo apt install libtbb-dev libzstd-dev\n
The zstd version on Ubuntu 18.04 is old. You will need to install it from source from https://github.com/facebook/zstd.git
If you haven't opted to build capstone-next (see above section), install the system capstone now:
brew install capstone\n
brew install pkg-config glfw freetype tbb zstd\n
A CMake-based build system for Tracy is maintained as part of IREE. In your IREE host build directory, set the following CMake option:
cmake -DIREE_BUILD_TRACY=ON -DIREE_ENABLE_LLD=ON .\n
That enables building the Tracy server tools, iree-tracy-profiler and iree-tracy-capture, introduced above. It also enables building the tool iree-tracy-csvexport which can be used to export a captured trace as a CSV file (see Section 6 \"Exporting zone statistics to CSV\" in the Tracy manual).
iree-tracy-csvexport
TODO - switch to using upstream CMake project
Tracy now has an upstream CMake build for each of its components. We may be able to use this directly.
If profiling on Android/ARM, you might need the patch discussed in the next paragraph.
Consider building without assertions (cmake -DIREE_ENABLE_ASSERTIONS=OFF). At least iree-tracy-profiler has some faulty assertions that can cause the profiler UI to crash during normal usage.
cmake -DIREE_ENABLE_ASSERTIONS=OFF
Rebuild, either everything or just these specific targets:
cmake --build . --target iree-tracy-profiler iree-tracy-capture iree-tracy-csvexport\n
This should have created the iree-tracy-profiler, iree-tracy-capture, and iree-tracy-csvexport binaries:
$ find . -name iree-tracy-*\n./tracy/iree-tracy-profiler\n./tracy/iree-tracy-capture\n./tracy/iree-tracy-csvexport\n
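Once built, the CSV exporter can be run on a previously saved trace. A sketch, assuming a capture saved as /tmp/capture.tracy and that the tool writes CSV to standard output (see the Tracy manual section referenced above for the full option list):
./tracy/iree-tracy-csvexport /tmp/capture.tracy > /tmp/capture_zones.csv\n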
When profiling on an Android device, in order to get the most useful information in the trace, tweak system permissions as follows before profiling. This needs to be done again after every reboot of the Android device.
From your desktop, get a shell on the Android device:
adb shell\n
The following commands are meant to be run from that Android device shell. First, get root access:
su\n
Now run the following commands as root on the Android device:
setenforce 0\nmount -o remount,hidepid=0 /proc\necho 0 > /proc/sys/kernel/perf_event_paranoid\necho 0 > /proc/sys/kernel/kptr_restrict\n
Note: in order for this to work, the device needs to be rooted, which means that the above su command must succeed. This is sometimes confused with the adb root command, but that's not the same. adb root restarts the adbd daemon as root, which causes device shells to be root shells by default. This is unnecessary here and we don't recommend it: real Android applications never run as root, so Tracy/Android has to support running benchmarks as a regular user, and it's best to stick to this for the sake of realistic benchmarks. Internally, Tracy executes su commands to perform certain actions, so it too relies on the device being rooted without relying on the benchmark process being run as root.
adbd
IREE benchmarking gives us an accurate and reproducible view of program performance at specific levels of granularity. To analyze system behavior in more depth, there are various ways to profile IREE.
For some advanced CPU profiling needs such as querying CPU cache and other events, one may need to use some OS-specific profilers. See Profiling CPUs.
Tracy offers great insights into CPU/GPU interactions and Vulkan API usage details. However, information at a finer granularity, especially inside a particular shader dispatch, is missing. To supplement general purpose tools like Tracy, vendor-specific tools can be used. Refer to Profiling GPUs using Vulkan.
Tracy is a profiler that's been used for a wide range of profiling tasks on IREE. Refer to Profiling with Tracy.
Start here: ML frameworks overview
Guides for specific frameworks:
Start here: Deployment configurations overview
Guides for specific configurations:
Parameters in IREE are externalized storage for resources that are asynchronously accessible and device-aware. Parameters offer efficient ways to store, manipulate, and load data for large resources like the weights in a machine learning model.
Without using parameters, compiled programs include both code and data:
graph LR\n accTitle: .vmfb file without using parameters\n accDescr {\n Without using parameters, .vmfb files contain host code, device code,\n small data, and large resources all in the same file.\n }\n\n subgraph VMFB[\".vmfb file\"]\n HostCode(Host code)\n DeviceCode(Device code)\n SmallData(Small data)\n LargeResources(Large resources)\n end
Using parameters, data can be stored, transmitted, and loaded from separate sources:
graph BT\n accTitle: .vmfb file using parameters\n accDescr {\n Using parameters, .vmfb files contain host code, device code, small\n constants, and parameters. External .irpa, .safetensors, and .gguf files\n can be linked to these parameters.\n }\n\n subgraph VMFB[\".vmfb file using parameters\"]\n HostCode(Host code)\n DeviceCode(Device code)\n SmallData(Small data)\n Parameters(\"Parameters\n \u2022 scope_1::key_1\n \u2022 scope_1::key_2\n \u2022 scope_2::key_1\n \u2022 scope_2::key_2\")\n end\n\n subgraph IRPA[\".irpa file\"]\n key_1\n key_2\n end\n\n subgraph Safetensors[\".safetensors file\"]\n key_1a[key_1]\n end\n\n subgraph GGUF[\".gguf file\"]\n key_2a[key_2]\n end\n\n IRPA -. \"scope_1\" .-> Parameters\n Safetensors -. \"scope_2\" .-> Parameters\n GGUF -. \"scope_2\" .-> Parameters
Notice that parameters are identified by a scope and a unique key within that scope, not strong references to specific file paths. Data from any supported file format or \"parameter index provider\" can be loaded.
The IREE Parameter Archive (IRPA) file format (iree/schemas/parameter_archive.h) is IREE's own format optimized for deployment. Formats like GGUF and safetensors can be converted to IRPA.
iree/schemas/parameter_archive.h
The GGUF format is used by the GGML project and other projects in that ecosystem like llama.cpp.
The safetensors format is used by the Hugging Face community.
The core IREE tools are written in C and aim to be simple and pragmatic, with minimal dependencies. Other formats could be converted into supported file types:
.pt
.pth
torch.save
.ckpt
.h5
model.keras
In-tree formats for file-backed parameters are defined in the iree/io/formats/ folder. Additional formats could be defined out-of-tree to make use of external libraries as needed.
iree/io/formats/
Parameter loading from memory (or a cache, or some other location) is possible by adding new providers implementing iree_io_parameter_provider_t. The default parameter index provider operates on files on local disk.
iree_io_parameter_provider_t
The iree-create-parameters tool can create IREE Parameter Archive (.irpa) files. Each parameter in the archive can be created with either a real data value (taking up storage space in the final archive) or a splatted value (zeroed contents or a repeated value, taking up no storage space on disk).
iree-create-parameters
For a detailed list of options, pass --help:
$ iree-create-parameters --help\n\n# ============================================================================\n# \ud83d\udc7b IREE: iree-create-parameters\n# ============================================================================\n\nCreates IREE Parameter Archive (.irpa) files. Provide zero or more\nparameter value declarations and an output file with\n`--output=file.irpa` to produce a new file with zeroed or patterned\ncontents.\n\n...\n
Example creating a file with two zeroed embedded parameters and one with a repeating pattern:
$ iree-create-parameters \\\n --data=my.zeroed_param_1=4096xf32 \\\n --data=my.zeroed_param_2=2x4096xi16 \\\n --data=my.pattern_param_2=8x2xf32=2.1 \\\n --output=output_with_storage.irpa\n
Example creating a file with splatted values (no storage on disk):
$ iree-create-parameters \\\n --splat=my.splat_param_1=4096xf32=4.1 \\\n --splat=my.splat_param_2=2x4096xi16=123 \\\n --output=output_without_storage.irpa\n
Parameter archives can also be created using IREE's Python bindings:
import iree.runtime as rt\nimport numpy as np\n\nparameter_index = rt.ParameterIndex()\nparameter_index.add_buffer(\"weight\", np.zeros([32, 16]) + 2.0)\nparameter_index.add_buffer(\"bias\", np.zeros([32, 16]) + 0.5)\nparameter_index.create_archive_file(\"parameters.irpa\")\n
See the runtime/bindings/python/tests/io_test.py file for more usage examples.
runtime/bindings/python/tests/io_test.py
The iree-convert-parameters tool converts supported files into IREE Parameter Archives (.irpa) files.
iree-convert-parameters
$ iree-convert-parameters --help\n\n# ============================================================================\n# \ud83d\udc7b IREE: iree-convert-parameters\n# ============================================================================\n\nConverts supported parameter file formats into IREE Parameter Archives\n(.irpa) files. Provide one or more input parameter files in the same\nform as expected by the iree-run-module tool (`--parameters=foo.gguf`)\nand an output file with `--output=file.irpa`.\n\n...\n
Example converting from safetensors to IRPA:
$ iree-convert-parameters \\\n --parameters=input.safetensors \\\n --output=output.irpa\n
Example mutating parameters:
$ iree-convert-parameters \\\n --parameters=a.gguf \\\n --parameters=b.safetensors \\\n --exclude=unneeded_param \\\n --rename=old_name=new_name \\\n --splat=some_name=f32=4.2 \\\n --output=ab.irpa\n
Example stripping parameters and replacing them with zeros except for one with special handling:
$ iree-convert-parameters \\\n --parameters=input.irpa \\\n --strip \\\n --splat=special_param=f32=1.0 \\\n --output=output.irpa\n
The iree-dump-parameters tool outputs information about parsed parameter files.
iree-dump-parameters
$ iree-dump-parameters --help\n\n# ============================================================================\n# \ud83d\udc7b IREE: iree-dump-parameters\n# ============================================================================\n\nDumps information about parsed parameter files.\n\n...\n
Example listing all available parameters and their index information:
$ iree-dump-parameters \\\n --parameters=my_scope=my_file.gguf \\\n [--parameters=...]\n
Example extracting parameter binary contents from a file:
$ iree-dump-parameters ... \\\n --extract=scope::key0=file0.bin \\\n [--extract=...]\n
IREE command line tooling can load parameter files alongside module files:
iree-run-module --module=program.vmfb --parameters=data.irpa ...\n
For concrete examples, see these test files:
tools/test/parameters_scoped.mlir
tools/test/parameters_unscoped.mlir
See the runtime/bindings/python/tests/io_runtime_test.py file for usage examples.
runtime/bindings/python/tests/io_runtime_test.py
TODO: iree_io_parameters_module_create() sample code
iree_io_parameters_module_create()
IREE provides a flexible set of tools for various deployment scenarios. Fully featured environments can use IREE to load programs on demand and to take advantage of multi-threaded hardware, while embedded systems can bypass IREE's runtime entirely or interface with custom accelerators.
These are just the most stable configurations IREE supports. Feel free to reach out on any of IREE's communication channels if you have questions about a specific platform, hardware accelerator, or set of system features.
Compiler target backends are used to generate executable code for hardware APIs and device architectures. Compiler targets may implement special optimizations or generate distinct code for certain device/architecture/performance profiles.
When compiling programs, a list of target backends must be specified via
target_backends=[...]
vmvx
metal-spirv
metal
rocm
webgpu-spirv
webgpu
Tip - listing available backends
The list of compiler target backends can be queried:
$ iree-compile --iree-hal-list-target-backends\n\nRegistered target backends:\n cuda\n llvm-cpu\n metal\n metal-spirv\n rocm\n vmvx\n vmvx-inline\n vulkan\n vulkan-spirv\n
iree.compiler.query_available_targets()\n\n['cuda',\n 'llvm-cpu',\n 'metal',\n 'metal-spirv',\n 'rocm',\n 'vmvx',\n 'vmvx-inline',\n 'vulkan',\n 'vulkan-spirv']\n
Runtime HAL devices call into hardware APIs to load and run executable code. Devices may use multithreading or other system resources, depending on their focus and the build configuration.
Additional HAL drivers can also be defined external to the core project via IREE_EXTERNAL_HAL_DRIVERS.
IREE_EXTERNAL_HAL_DRIVERS
IREE supports model execution via CPU on bare-metal platforms. Bare metal platforms have no operating system support, and executables are built using machine-specific linker scripts and/or board support packages (BSPs).
Bare-metal deployment typically uses IREE's LLVM compiler target backend much like the CPU configuration, but using a limited subset of IREE's CPU HAL driver code at runtime to load and execute compiled programs.
Out-of-tree bare-metal platform tools and source code for the system should be ready, such as
Please follow the instructions to retrieve the IREE compiler.
The model can be compiled with the following command:
iree-compile \\\n --iree-stream-partitioning-favor=min-peak-memory \\\n --iree-hal-target-backends=llvm-cpu \\\n --iree-llvmcpu-target-triple=x86_64-pc-linux-elf \\\n --iree-llvmcpu-debug-symbols=false \\\n samples/models/simple_abs.mlir \\\n -o /tmp/simple_abs_cpu.vmfb\n
In this command:
--iree-stream-partitioning-favor=min-peak-memory
--iree-llvmcpu-target-triple
<arch>-pc-linux-elf
--iree-llvmcpu-debug-symbols=false
See generate.sh for example command-line instructions for some common architectures.
You can replace the MLIR file with other MLIR model files, following the instructions.
See the static_library demo sample for an example and instructions on running a model with IREE's static_library_loader.
static_library_loader
By default, the demo targets the host machine when compiling. To produce a bare-metal compatible model, run iree-compile as in the previous example and add the -iree-llvmcpu-static-library-output-path= flag to specify the static library destination. This will produce a .h\\.o file to link directly into the target application.
-iree-llvmcpu-static-library-output-path=
.h\\.o
A few CMake options and macros should be set to build a subset of IREE runtime libraries compatible with the bare-metal platform. We assume there is no multithreading or system library support on the bare-metal system, and that model execution is single-threaded and synchronous.
# Build the IREE runtime only\nset(IREE_BUILD_COMPILER OFF)\n\n# Tell CMake to skip targeting a specific operating system\nset(CMAKE_SYSTEM_NAME Generic)\n\n# Disable multi-thread library support\nset(IREE_ENABLE_THREADING OFF)\n\n# Only enable the local synchronous HAL driver\nset(IREE_HAL_DRIVER_DEFAULTS OFF)\nset(IREE_HAL_DRIVER_LOCAL_SYNC ON)\n\n# Only enable some executable loaders\nset(IREE_HAL_EXECUTABLE_LOADER_DEFAULTS OFF)\nset(IREE_HAL_EXECUTABLE_LOADER_EMBEDDED_ELF ON)\nset(IREE_HAL_EXECUTABLE_LOADER_VMVX_MODULE ON)\n\n# Only enable the embedded ELF executable plugin\nset(IREE_HAL_EXECUTABLE_PLUGIN_DEFAULTS OFF)\nset(IREE_HAL_EXECUTABLE_PLUGIN_EMBEDDED_ELF ON)\n\n# Disable tests until IREE supports running them on bare-metal platforms\nset(IREE_BUILD_TESTS OFF)\n\n# Build samples\nset(IREE_BUILD_SAMPLES ON)\n
Todo
Clean the list up after #6353 is fixed.
Also, set the toolchain-specific cmake file to match the tool path, target architecture, target ABI, linker script, system library path, etc.
These macros should be defined, either in C/C++ or via CMake options like
set(MY_FLAGS \"-DIREE_PLATFORM_GENERIC=1\")\nset(CMAKE_C_FLAGS ${MY_FLAGS} ${CMAKE_C_FLAGS})\nset(CMAKE_CXX_FLAGS ${MY_FLAGS} ${CMAKE_CXX_FLAGS})\n
IREE_PLATFORM_GENERIC
IREE_SYNCHRONIZATION_DISABLE_UNSAFE=1
IREE_FILE_IO_ENABLE=0
IREE_TIME_NOW_FN
IREE_TIME_NOW_FN=\\\"\\{ return 0;\\}\\\"
IREE_WAIT_UNTIL_FN
bool(uint64_t nanos)
Examples of how to set up the CMakeLists.txt and .cmake file:
See the generic platform version of simple_embedding to see how to use the IREE runtime library to build and run an IREE model for a bare-metal target.
IREE supports efficient program execution on CPU devices by using LLVM to compile all dense computations in each program into highly optimized CPU native instruction streams, which are embedded in one of IREE's deployable formats.
To compile a program for CPU execution, pick one of IREE's supported executable formats:
At runtime, CPU executables can be loaded using one of IREE's CPU HAL drivers:
Add IREE's CPU support matrix: what architectures are supported; what architectures are well optimized; etc.
Python packages are regularly published to PyPI. See the Python Bindings page for more details. The core iree-compiler package includes the LLVM-based CPU compiler:
iree-compiler
Stable release packages are published to PyPI.
python -m pip install iree-compiler\n
Nightly releases are published on GitHub releases.
python -m pip install \\\n --find-links https://iree.dev/pip-release-links.html \\\n --upgrade iree-compiler\n
iree-compile is installed to your Python module installation path. If you pip install in user mode, it is under ${HOME}/.local/bin, or %APPDATA%Python on Windows. You may want to include the path in your system's PATH environment variable:
${HOME}/.local/bin
%APPDATA%Python
export PATH=${HOME}/.local/bin:${PATH}\n
Please make sure you have followed the Getting started page to build IREE for your host platform and the Android cross-compilation or iOS cross-compilation page if you are cross compiling for a mobile device. The llvm-cpu compiler backend is compiled in by default on all platforms.
Ensure that the IREE_TARGET_BACKEND_LLVM_CPU CMake option is ON when configuring for the host.
IREE_TARGET_BACKEND_LLVM_CPU
iree-compile will be built under the iree-build/tools/ directory. You may want to include this path in your system's PATH environment variable.
iree-build/tools/
You will need to get an IREE runtime that supports the local CPU HAL driver, along with the appropriate executable loaders for your application.
You can check for CPU support by looking for the local-sync and local-task drivers:
$ iree-run-module --list_drivers\n\n cuda: CUDA (dynamic)\n local-sync: Local execution using a lightweight inline synchronous queue\n local-task: Local execution using the IREE multithreading task system\n vulkan: Vulkan 1.x (dynamic)\n
Please make sure you have followed the Getting started page to build IREE for your host platform and the Android cross-compilation page if you are cross compiling for Android. The local CPU HAL drivers are compiled in by default on all platforms.
Ensure that the IREE_HAL_DRIVER_LOCAL_TASK and IREE_HAL_EXECUTABLE_LOADER_EMBEDDED_ELF (or other executable loader) CMake options are ON when configuring for the target.
IREE_HAL_DRIVER_LOCAL_TASK
IREE_HAL_EXECUTABLE_LOADER_EMBEDDED_ELF
With the requirements out of the way, we can now compile a model and run it.
The IREE compiler transforms a model into its final deployable format in many sequential steps. A model authored with Python in an ML framework should first be converted with the corresponding framework's import tool into a format (i.e., MLIR) expected by the IREE compiler.
Using MobileNet v2 as an example, you can download the SavedModel with trained weights from TensorFlow Hub and convert it using IREE's TensorFlow importer. Then run the following command to compile with the llvm-cpu target:
iree-compile \\\n --iree-hal-target-backends=llvm-cpu \\\n mobilenet_iree_input.mlir -o mobilenet_cpu.vmfb\n
Tip - CPU targets
The --iree-llvmcpu-target-triple flag tells the compiler to generate code for a specific type of CPU. You can see the list of supported targets with iree-compile --iree-llvmcpu-list-targets, or pass \"host\" to let LLVM infer the triple from your host machine (e.g. x86_64-linux-gnu).
iree-compile --iree-llvmcpu-list-targets
x86_64-linux-gnu
$ iree-compile --iree-llvmcpu-list-targets\n\n Registered Targets:\n aarch64 - AArch64 (little endian)\n aarch64_32 - AArch64 (little endian ILP32)\n aarch64_be - AArch64 (big endian)\n arm - ARM\n arm64 - ARM64 (little endian)\n arm64_32 - ARM64 (little endian ILP32)\n armeb - ARM (big endian)\n riscv32 - 32-bit RISC-V\n riscv64 - 64-bit RISC-V\n wasm32 - WebAssembly 32-bit\n wasm64 - WebAssembly 64-bit\n x86 - 32-bit X86: Pentium-Pro and above\n x86-64 - 64-bit X86: EM64T and AMD64\n
Tip - CPU features
The --iree-llvmcpu-target-cpu-features flag tells the compiler to generate code using certain CPU \"features\", like SIMD instruction sets. Like the target triple, you can pass \"host\" to this flag to let LLVM infer the features supported by your host machine.
--iree-llvmcpu-target-cpu-features
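Putting both tips together, a sketch that compiles for the host CPU by letting LLVM infer the triple and features (the output file name is just illustrative):
iree-compile \\\n    --iree-hal-target-backends=llvm-cpu \\\n    --iree-llvmcpu-target-triple=host \\\n    --iree-llvmcpu-target-cpu-features=host \\\n    mobilenet_iree_input.mlir -o mobilenet_cpu_host.vmfb\n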
In the build directory, run the following command:
tools/iree-run-module \\\n --device=local-task \\\n --module=mobilenet_cpu.vmfb \\\n --function=predict \\\n --input=\"1x224x224x3xf32=0\"\n
The above assumes the exported function in the model is named predict and that it expects one 224x224 RGB image. We are feeding in an image with all 0 values here for brevity; see iree-run-module --help for the format to specify concrete values.
predict
iree-run-module --help
IREE can accelerate model execution on Nvidia GPUs using CUDA.
In order to use CUDA to drive the GPU, you need to have a functional CUDA environment. It can be verified by the following steps:
nvidia-smi | grep CUDA\n
If nvidia-smi does not exist, you will need to install the latest CUDA Toolkit SDK.
nvidia-smi
Python packages are regularly published to PyPI. See the Python Bindings page for more details. The core iree-compiler package includes the CUDA compiler:
Please make sure you have followed the Getting started page to build the IREE compiler, then enable the CUDA compiler target with the IREE_TARGET_BACKEND_CUDA option.
IREE_TARGET_BACKEND_CUDA
Next you will need to get an IREE runtime that includes the CUDA HAL driver.
Please make sure you have followed the Getting started page to build IREE from source, then enable the CUDA HAL driver with the IREE_HAL_DRIVER_CUDA option.
IREE_HAL_DRIVER_CUDA
With the compiler and runtime ready, we can now compile programs and run them on GPUs.
Using MobileNet v2 as an example, you can download the SavedModel with trained weights from TensorFlow Hub and convert it using IREE's TensorFlow importer. Then run one of the following commands to compile:
iree-compile \\\n --iree-hal-target-backends=cuda \\\n --iree-hal-cuda-llvm-target-arch=<...> \\\n mobilenet_iree_input.mlir -o mobilenet_cuda.vmfb\n
Note that a CUDA target architecture (iree-hal-cuda-llvm-target-arch) of the form sm_<arch_number> is needed to compile for each GPU architecture. If no architecture is specified, sm_35 is used by default.
iree-hal-cuda-llvm-target-arch
sm_<arch_number>
sm_35
Here is a table of commonly used architectures:
sm_60
sm_70
sm_80
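For example, a sketch targeting sm_80 from the table above (the output file name is just illustrative):
iree-compile \\\n    --iree-hal-target-backends=cuda \\\n    --iree-hal-cuda-llvm-target-arch=sm_80 \\\n    mobilenet_iree_input.mlir -o mobilenet_cuda_sm80.vmfb\n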
Run the following command:
iree-run-module \\\n --device=cuda \\\n --module=mobilenet_cuda.vmfb \\\n --function=predict \\\n --input=\"1x224x224x3xf32=0\"\n
Documentation coming soon!
IREE can accelerate model execution on AMD GPUs using ROCm.
In order to use ROCm to drive the GPU, you need to have a functional ROCm environment. It can be verified by the following steps:
rocm-smi | grep rocm\n
If rocm-smi does not exist, you will need to install the latest ROCm Toolkit SDK for Windows or Linux.
rocm-smi
Currently ROCm is NOT supported for the Python interface.
Please make sure you have followed the Getting started page to build the IREE compiler, then enable the ROCm compiler target with the IREE_TARGET_BACKEND_ROCM option.
IREE_TARGET_BACKEND_ROCM
Next you will need to get an IREE runtime that includes the ROCm HAL driver.
Please make sure you have followed the Getting started page to build IREE from source, then enable the experimental ROCm HAL driver with the IREE_EXTERNAL_HAL_DRIVERS=rocm option.
IREE_EXTERNAL_HAL_DRIVERS=rocm
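A minimal configuration sketch, assuming the same build directory layout used elsewhere in these guides:
cmake -G Ninja -B ../iree-build/ -S . \\\n    -DIREE_EXTERNAL_HAL_DRIVERS=rocm\ncmake --build ../iree-build/ --target iree-run-module\n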
iree-compile \\\n --iree-hal-target-backends=rocm \\\n --iree-rocm-target-chip=<...> \\\n mobilenet_iree_input.mlir -o mobilenet_rocm.vmfb\n
Note that IREE comes with bundled bitcode files, which are used for linking certain intrinsics on AMD GPUs. These will be used automatically if the --iree-rocm-bc-dir flag is empty or not set. As additional support may be needed for different chips, users can use this flag to point to an explicit directory. For example, in ROCm installations on Linux, this is often found under /opt/rocm/amdgcn/bitcode.
--iree-rocm-bc-dir
/opt/rocm/amdgcn/bitcode
Note that a ROCm target chip (iree-rocm-target-chip) of the form gfx<arch_number> is needed to compile for each GPU architecture. If no architecture is specified, gfx908 is used by default.
iree-rocm-target-chip
gfx<arch_number>
gfx908
gfx900
gfx906
gfx940
gfx942
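For example, a sketch targeting gfx942 from the list above, optionally pointing --iree-rocm-bc-dir at a typical Linux ROCm installation (the output file name is just illustrative):
iree-compile \\\n    --iree-hal-target-backends=rocm \\\n    --iree-rocm-target-chip=gfx942 \\\n    --iree-rocm-bc-dir=/opt/rocm/amdgcn/bitcode \\\n    mobilenet_iree_input.mlir -o mobilenet_rocm_gfx942.vmfb\n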
iree-run-module \\\n --device=rocm \\\n --module=mobilenet_rocm.vmfb \\\n --function=predict \\\n --input=\"1x224x224x3xf32=0\"\n
IREE can accelerate model execution on GPUs via Vulkan, a low-overhead graphics and compute API. Vulkan is cross-platform: it is available on many operating systems, including Android, Linux, and Windows. Vulkan is also cross-vendor: it is supported by most GPU vendors, including AMD, ARM, Intel, NVIDIA, and Qualcomm.
As IREE and the compiler ecosystem it operates within mature, more target-specific optimizations will be implemented. At this stage, expect reasonable performance across all GPUs, with improvements over time for specific vendors and architectures.
In order to use Vulkan to drive the GPU, you need to have a functional Vulkan environment. IREE requires Vulkan 1.1 on Android and 1.2 elsewhere. It can be verified by the following steps:
Android has mandated Vulkan 1.1 support since Android 10. You just need to make sure the device's Android version is 10 or higher.
Run the following command in a shell:
vulkaninfo | grep apiVersion\n
If vulkaninfo does not exist, you will need to install the latest Vulkan SDK. Installing via LunarG's package repository is recommended, as it places Vulkan libraries and tools under system paths so it's easy to discover.
If the listed version is lower than Vulkan 1.2, you will need to update the driver for your GPU.
If vulkaninfo does not exist, you will need to install the latest Vulkan SDK.
Vulkan expects programs running on the GPU to be expressed in the SPIR-V binary exchange format, so the model must be compiled into SPIR-V.
Python packages are regularly published to PyPI. See the Python Bindings page for more details. The core iree-compiler package includes the SPIR-V compiler:
Please make sure you have followed the Getting started page to build IREE for your host platform and the Android cross-compilation page if you are cross compiling for Android. The SPIR-V compiler backend is compiled in by default on all platforms.
Ensure that the IREE_TARGET_BACKEND_VULKAN_SPIRV CMake option is ON when configuring for the host.
IREE_TARGET_BACKEND_VULKAN_SPIRV
Next you will need to get an IREE runtime that supports the Vulkan HAL driver.
You can check for Vulkan support by looking for a matching driver and device:
$ iree-run-module --list_devices\n\n cuda://GPU-00000000-1111-2222-3333-444444444444\n local-sync://\n local-task://\n vulkan://00000000-1111-2222-3333-444444444444\n
Please make sure you have followed the Getting started page to build IREE for Linux/Windows and the Android cross-compilation page for Android. The Vulkan HAL driver is compiled in by default on non-Apple platforms.
Ensure that the IREE_HAL_DRIVER_VULKAN CMake option is ON when configuring for the target.
IREE_HAL_DRIVER_VULKAN
With the SPIR-V compiler and Vulkan runtime, we can now compile programs and run them on GPUs.
Using MobileNet v2 as an example, you can download the SavedModel with trained weights from TensorFlow Hub and convert it using IREE's TensorFlow importer. Then run the following command to compile with the vulkan-spirv target:
iree-compile \\\n --iree-hal-target-backends=vulkan-spirv \\\n --iree-vulkan-target-triple=<...> \\\n mobilenet_iree_input.mlir -o mobilenet_vulkan.vmfb\n
Currently a target triple of the form <vendor/arch>-<product>-<os> is needed to compile towards a specific GPU architecture.
<vendor/arch>-<product>-<os>
We don't support the full spectrum here(1); the following table summarizes the currently recognized triples, and an example compile command follows the table.
If no triple is specified, then a safe but more limited default will be used.
This is more of a mechanism to help us develop IREE itself; in the long term we want to perform multi-targeting to generate code for multiple architectures when no target triple is given.
valhall-unknown-{android30|android31}
adreno-unknown-{android30|android31}
{rdna1|rdna2|rdna3}-unknown-unknown
{turing|ampere}-unknown-unknown
cpu-swiftshader-unknown
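For example, a sketch targeting a desktop RDNA2 GPU using one of the triples from the table (the output file name is just illustrative):
iree-compile \\\n    --iree-hal-target-backends=vulkan-spirv \\\n    --iree-vulkan-target-triple=rdna2-unknown-unknown \\\n    mobilenet_iree_input.mlir -o mobilenet_vulkan_rdna2.vmfb\n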
tools/iree-run-module \\\n --device=vulkan \\\n --module=mobilenet_vulkan.vmfb \\\n --function=predict \\\n --input=\"1x224x224x3xf32=0\"\n
IREE supports popular machine learning frameworks using the same underlying technology.
graph LR\n accTitle: ML framework to runtime deployment workflow overview\n accDescr {\n Programs start in some ML framework.\n Programs are imported into MLIR.\n The IREE compiler uses the imported MLIR.\n Compiled programs are used by the runtime.\n }\n\n A[ML frameworks]\n B[Imported MLIR]\n C[IREE compiler]\n D[Runtime deployment]\n\n A --> B\n B --> C\n C --> D
See guides on how to use each framework with IREE:
Check out the samples in IREE's samples/ directory, as well as the iree-experimental repository.
samples/
Each machine learning framework has some \"export\" mechanism that snapshots the structure and data in your program. These exported programs can then be \"imported\" into IREE's compiler by using either a stable import format or one of IREE's importer tools.
This export/import process is specific to each frontend and typically involves a number of stages:
This fully imported form can then be compiled independently of the source language and framework.
IREE compiles MLIR files for specified sets of backends (CPU, GPU, etc). Each backend generates optimized native code custom to the input program and intended target platform. Once compiled, modules can be executed using IREE's runtime.
See the deployment configuration guides for details on selecting a compiler backend and tuning options for your choice of target platform(s) or device(s).
Compiled modules can be executed by selecting what compute devices to use, loading the module, and then executing it with the intended inputs. IREE provides several language bindings for its runtime API.
IREE's JAX support is under development. This page is still under construction.
IREE offers two ways to interface with JAX programs:
integrations/pjrt
Caution - under development
Support for a broad set of ONNX operators and data types is an active investment area. See the ONNX Op Support tracking issue for the latest status.
Machine learning models using the Open Neural Network Exchange (ONNX) format can be deployed using the IREE compiler and runtime:
graph LR\n accTitle: ONNX to runtime deployment workflow overview\n accDescr {\n Programs start as ONNX protobufs.\n Programs are imported into MLIR using iree-import-onnx.\n The IREE compiler uses the imported MLIR.\n Compiled programs are used by the runtime.\n }\n\n A[\"ONNX\\n(protobuf)\"]\n B[\"MLIR\\n(torch-mlir)\"]\n C[IREE compiler]\n D[Runtime deployment]\n\n A -- iree-import-onnx --> B\n B --> C\n C --> D
Install ONNX:
python -m pip install onnx\n
Install IREE packages, either by building from source or from pip:
python -m pip install \\\n iree-compiler[onnx] \\\n iree-runtime\n
python -m pip install \\\n --find-links https://iree.dev/pip-release-links.html \\\n --upgrade \\\n iree-compiler[onnx] \\\n iree-runtime\n
Start with a .onnx protobuf file, such as a model from https://github.com/onnx/models.
.onnx
Convert the .onnx file into MLIR using the iree-import-onnx tool:
iree-import-onnx
iree-import-onnx [model.onnx] -o [model.mlir]\n
This tool produces an MLIR file with the help of the torch-mlir project.
Once imported, the standard set of tools and APIs available for any of IREE's deployment configurations and API bindings can be used:
iree-compile \\\n model.mlir \\\n --iree-hal-target-backends=llvm-cpu \\\n -o model_cpu.vmfb\n\niree-run-module \\\n model_cpu.vmfb \\\n --device=local-task \\\n --entry_function=... \\\n --input=... \\\n ...\n
e2eshark/onnx
iree_tests/onnx
test/python/onnx_importer
If you see an error compiling a converted .mlir file like this:
$ iree-compile model.mlir --iree-hal-target-backends=llvm-cpu -o model.vmfb\n\nmodel.mlir:507:12: error: failed to legalize operation 'torch.operator' that was explicitly marked illegal\n %503 = torch.operator \"onnx.Identity\"(%arg0) : (!torch.vtensor<[?],si64>) -> !torch.vtensor<[?],si64>\n ^\n
There are several possible scenarios:
The operator is implemented but only for a more recent ONNX version. You can try upgrading your .onnx file using the ONNX Version Converter:
import onnx\noriginal_model = onnx.load_model(\"model.onnx\")\nconverted_model = onnx.version_converter.convert_version(original_model, 17)\nonnx.save(converted_model, \"model_17.onnx\")\n
and then attempting the convert -> compile again:
iree-import-onnx model_17.onnx -o model_17.mlir\niree-compile model_17.mlir ...\n
We are still validating and fixing specific models. Between bug fixes in flight and releases running behind, we don't expect that you will be able to do a lot of advanced things without using nightly releases or working with us.
Stay tuned and join the discussion in our Discord server's #pytorch channel.
#pytorch
iree-turbine (rebrand pending from SHARK-Turbine) offers a tight integration between compatible versions of IREE, torch-mlir, and PyTorch.
Both just-in-time (JIT) and ahead-of-time (AOT) workflows are supported:
graph LR\n accTitle: PyTorch integration overview\n accDescr {\n PyTorch programs can be optimized within a Python session with\n iree-turbine's just-in-time tools.\n PyTorch programs can be exported out of Python to native binaries using\n iree-turbine's ahead-of-time export toolkit.\n }\n\n subgraph Python\n pytorch(PyTorch)\n subgraph turbine [iree-turbine]\n jit(\"Eager execution (JIT)\")\n aot(\"Export toolkit (AOT)\")\n end\n\n pytorch --> jit\n jit --> pytorch\n pytorch --> aot\n end\n\n subgraph Native\n binary([\"binary (.vmfb)\"])\n end\n\n aot -.-> binary
Install a recent version of PyTorch (2.3.0+, prerelease as of April 2024):
2.3.0+
python -m pip install \\\n --pre --index-url https://download.pytorch.org/whl/test/cpu torch==2.3.0\n
Install iree-turbine:
python -m pip install iree-turbine\n
Just-in-time integration allows for Python code using TorchDynamo to optimize PyTorch models/functions using IREE, all within an interactive Python session.
graph TD\n accTitle: PyTorch JIT workflow overview\n accDescr {\n Programs start as either PyTorch nn.Module objects or callable functions.\n Programs are compiled into optimized modules using torch.compile.\n Within torch.compile, Dynamo runs the program through Turbine and IREE.\n }\n\n subgraph Python\n input([nn.Module / function])\n\n subgraph compile [\"torch.compile()\"]\n direction LR\n dynamo{{TorchDynamo}}\n turbine{{iree-turbine}}\n iree{{IREE}}\n dynamo --> turbine --> iree\n end\n\n output([Optimized module])\n input --> compile --> output\n end
For deployment outside of Python, see the ahead-of-time sections below.
Turbine integrates into PyTorch as a custom backend for torch.compile.
torch.compile
Behind the scenes, PyTorch captures the structure of the input model into a computation graph and feeds that graph through to the selected backend compiler.
import torch\n\n# Define the `nn.Module` or Python function to run.\nclass LinearModule(torch.nn.Module):\n def __init__(self, in_features, out_features):\n super().__init__()\n self.weight = torch.nn.Parameter(torch.randn(in_features, out_features))\n self.bias = torch.nn.Parameter(torch.randn(out_features))\n\n def forward(self, input):\n return (input @ self.weight) + self.bias\n\nlinear_module = LinearModule(4, 3)\n\n# Compile the program using the turbine backend.(1)\nopt_linear_module = torch.compile(linear_module, backend=\"turbine_cpu\")\n\n# Use the compiled program as you would the original program.\nargs = torch.randn(4)\nturbine_output = opt_linear_module(args)\n
core/examples/eager_mlp/mlp_eager_simple.py
The ahead-of-time toolkit allows developers to define a program's structure in Python and then export deployment-ready artifacts that can be used in IREE's deployment configurations via the API bindings.
For simple models, a one-shot export API is available.
graph LR\n accTitle: PyTorch simple AOT workflow overview\n accDescr {\n Programs start as PyTorch nn.Module objects.\n Modules are exported using the \"aot\" API.\n Exported outputs are then compiled to .vmfb files with executable binaries.\n }\n\n subgraph Python\n input([nn.Module])\n export([\"ExportOutput (MLIR)\"])\n input -- \"aot.export()\" --> export\n end\n\n subgraph Native\n binary([\"binary (.vmfb)\"])\n end\n\n export -. \"compile()\" .-> binary
import iree.runtime as ireert\nimport numpy as np\nimport shark_turbine.aot as aot\nimport torch\n\n# Define the `nn.Module` to export.\nclass LinearModule(torch.nn.Module):\n def __init__(self, in_features, out_features):\n super().__init__()\n self.weight = torch.nn.Parameter(torch.randn(in_features, out_features))\n self.bias = torch.nn.Parameter(torch.randn(out_features))\n\n def forward(self, input):\n return (input @ self.weight) + self.bias\n\nlinear_module = LinearModule(4, 3)\n\n# Export the program using the simple API.\nexample_arg = torch.randn(4)\nexport_output = aot.export(linear_module, example_arg)\n\n# Compile to a deployable artifact.\nbinary = export_output.compile(save_to=None)\n\n# Use the IREE runtime API to test the compiled program.\nconfig = ireert.Config(\"local-task\")\nvm_module = ireert.load_vm_module(\n ireert.VmModule.wrap_buffer(config.vm_instance, binary.map_memory()),\n config,\n)\ninput = np.array([1.0, 2.0, 3.0, 4.0], dtype=np.float32)\nresult = vm_module.main(input)\nprint(result.to_host())\n
core/examples/aot_mlp/mlp_export_simple.py
For more complex models, an underlying advanced API is available that gives access to more features.
graph LR\n accTitle: PyTorch advanced AOT workflow overview\n accDescr {\n Programs are represented using the aot.CompiledModule class.\n CompiledModules can extend nn.Module objects, export globals, and set\n shapes and dtypes for each function.\n Modules are exported using the \"aot\" API.\n Exported outputs are then compiled to .vmfb files with executable binaries.\n }\n\n subgraph Python\n compiledmodule(\"aot.CompiledModule\\n\\n- extend nn.Module\\n- export globals\\n- set shapes/dtypes\")\n export([\"ExportOutput (MLIR)\"])\n compiledmodule -- \"aot.export()\" --> export\n end\n\n subgraph Native\n binary([\"binary (.vmfb)\"])\n end\n\n export -. \"compile()\" .-> binary
Advanced export workflows can use the aot.CompiledModule class to define and constrain the structure of a program prior to compiling it.
aot.CompiledModule
import shark_turbine.aot as aot\n\n# A minimal program, with no functions or variables.\nclass BasicModule(aot.CompiledModule):\n ...\n\n# Create an instance of the program and convert it to MLIR.\nfrom iree.compiler.ir import Context\ninstance = BasicModule(context=Context())\nmodule_str = str(aot.CompiledModule.get_mlir_module(instance))\n\nprint(module_str)\n# module @basic {\n# }\n
Exported functions are the API entry points into a compiled program.
Simple feed-forward neural networks used for inference may have a single exported function (typically called \"forward\"), while more complex programs can have multiple computation functions, initialization functions, \"backward\" methods for training, state management functions, debugging functions, etc.
Each instance method on a aot.CompiledModule-derived class is exported. These instance methods can include calls to other aot components, such as aot.jittable compute functions:
aot
aot.jittable
class GetOnesModule(aot.CompiledModule):\n @aot.jittable\n def compute_ones():\n return torch.ones(3)\n\n def get_ones(self):\n return self.compute_ones()\n
Instance methods can use aot.AbstractTensor to specify data types:
aot.AbstractTensor
class IntSumModule(aot.CompiledModule):\n @aot.jittable\n def compute_sum(a, b):\n return a + b\n\n def sum_int32(\n self,\n a=aot.AbstractTensor(2, dtype=torch.int32),\n b=aot.AbstractTensor(2, dtype=torch.int32),\n ):\n return self.compute_sum(a, b)\n
Shapes can be made dynamic using aot.AbstractTensor and aot.jittable constraints:
class DynamicSumModule(aot.CompiledModule):\n @aot.jittable\n def compute_sum(a, b):\n return a + b\n\n def sum_dynamic(\n self,\n a=aot.AbstractTensor(None),\n b=aot.AbstractTensor(None),\n ):\n return self.compute_sum(\n a,\n b,\n constraints=[\n a.dynamic_dim(0) == b.dynamic_dim(0),\n ],\n )\n
Global variables are used to represent persistent state within a program instance.
For example, they can be used to represent the weights and biases in a neural network, and exporting these as mutable variables can allow for setting their values independently at runtime.
Individual globals can be exported using aot.export_global():
aot.export_global()
state_example = torch.zeros([1], dtype=torch.int32)\n\nclass SampleModule(aot.CompiledModule):\n value = aot.export_global(state_example, mutable=True)\n\n def get_value(self):\n return self.value\n\n def update_value(self, new_value=aot.abstractify(value)):\n self.value = new_value\n
All named parameters on a nn.Module can be exported using export_parameters():
nn.Module
export_parameters()
class SimpleParams(torch.nn.Module):\n def __init__(self):\n super().__init__()\n self.classifier = torch.nn.Linear(20, 30)\n\n def forward(self, x):\n return self.classifier(x)\n\nm = SimpleParams()\n\nclass SimpleParamsModule(aot.CompiledModule):\n params = aot.export_parameters(m)\n compute = aot.jittable(m.forward)\n\n def run(self, x=aot.AbstractTensor(128, 20)):\n return self.compute(x)\n\n # torch.nn.Linear has 'weight' and 'bias' variables:\n # https://pytorch.org/docs/stable/generated/torch.nn.Linear.html\n # Add getters for both exported parameters.\n\n def get_weight(self):\n return self.params[\"classifier.weight\"]\n\n def get_bias(self):\n return self.params[\"classifier.bias\"]\n
core/tests/aot/
core/examples/aot_mlp/mlp_export_dynamic.py
models/turbine_models/custom_models/stateless_llama.py
IREE supports compiling and running TensorFlow programs represented as tf.Module classes or stored in the SavedModel format.
tf.Module
SavedModel
graph LR\n accTitle: TensorFlow to runtime deployment workflow overview\n accDescr {\n Programs start as either TensorFlow SavedModel or tf.Module programs.\n Programs are imported into MLIR as StableHLO.\n The IREE compiler uses the imported MLIR.\n Compiled programs are used by the runtime.\n }\n\n subgraph A[TensorFlow]\n direction TB\n A1[SavedModel]\n A2[tf.Module]\n\n A1 --- A2\n end\n\n subgraph B[MLIR]\n B1[StableHLO]\n end\n\n C[IREE compiler]\n D[Runtime deployment]\n\n A -- iree-import-tf --> B\n B --> C\n C --> D
Install TensorFlow by following the official documentation:
python -m pip install tf-nightly\n
python -m pip install \\\n iree-compiler \\\n iree-runtime \\\n iree-tools-tf\n
python -m pip install \\\n --find-links https://iree.dev/pip-release-links.html \\\n --upgrade \\\n iree-compiler \\\n iree-runtime \\\n iree-tools-tf\n
IREE compilers transform a model into its final deployable format in several sequential steps. The first step for a TensorFlow model is to use either the iree-import-tf command-line tool or IREE's Python APIs to import the model into a format (i.e., MLIR) compatible with the generic IREE compilers.
IREE supports importing and using SavedModels from TensorFlow Hub.
First download the SavedModel and load it to get the serving signature, which is used as the entry point for IREE compilation flow:
import tensorflow.compat.v2 as tf\nloaded_model = tf.saved_model.load('/path/to/downloaded/model/')\nprint(list(loaded_model.signatures.keys()))\n
If there are no serving signatures in the original SavedModel, you may add them yourself by following \"Missing serving signature in SavedModel\".
Then you can import the model with iree-import-tf. You can read the options supported via iree-import-tf -help. Using MobileNet v2 as an example and assuming the serving signature is predict:
iree-import-tf -help
iree-import-tf\n --tf-import-type=savedmodel_v1 \\\n --tf-savedmodel-exported-names=predict \\\n /path/to/savedmodel -o iree_input.mlir\n
iree-import-tf is installed as /path/to/python/site-packages/iree/tools/tf/iree-import-tf. You can find out the full path to the site-packages directory via the python -m site command.
/path/to/python/site-packages/iree/tools/tf/iree-import-tf
site-packages
python -m site
-tf-import-type needs to match the SavedModel version. You can try both v1 and v2 if one of them gives an empty dump.
-tf-import-type
Next, you can compile the model in iree_input.mlir for one of IREE's supported targets by following one of the deployment configuration guides.
End-to-end execution tests can be found in IREE's integrations/tensorflow/e2e/ directory.
Sometimes SavedModels are exported without explicit serving signatures. This happens by default for TensorFlow Hub SavedModels. However, serving signatures are required as entry points for IREE compilation flow. You can use Python to load and re-export the SavedModel to give it serving signatures. For example, for MobileNet v2, assuming we want the serving signature to be predict and operating on a 224x224 RGB image:
import tensorflow.compat.v2 as tf\nloaded_model = tf.saved_model.load('/path/to/downloaded/model/')\ncall = loaded_model.__call__.get_concrete_function(\n tf.TensorSpec([1, 224, 224, 3], tf.float32))\nsignatures = {'predict': call}\ntf.saved_model.save(loaded_model,\n '/path/to/resaved/model/', signatures=signatures)\n
The above will create a new SavedModel with a serving signature, predict, and save it to /path/to/resaved/model/.
/path/to/resaved/model/
IREE supports compiling and running TensorFlow Lite (TFLite) programs stored as TFLite FlatBuffers. These files can be imported into an IREE-compatible format then compiled to a series of backends.
graph LR\n accTitle: TFLite to runtime deployment workflow overview\n accDescr {\n Programs start as TensorFlow Lite FlatBuffers.\n Programs are imported into MLIR's TOSA dialect using iree-import-tflite.\n The IREE compiler uses the imported MLIR.\n Compiled programs are used by the runtime.\n }\n\n subgraph A[TFLite]\n A1[FlatBuffer]\n end\n\n subgraph B[MLIR]\n B1[TOSA]\n end\n\n C[IREE compiler]\n D[Runtime deployment]\n\n A -- iree-import-tflite --> B\n B --> C\n C --> D
python -m pip install \\\n iree-compiler \\\n iree-runtime \\\n iree-tools-tflite\n
python -m pip install \\\n --find-links https://iree.dev/pip-release-links.html \\\n --upgrade \\\n iree-compiler \\\n iree-runtime \\\n iree-tools-tflite\n
IREE's tooling is divided into two components: import and compilation.
These two stages can be completed entirely via the command line.
WORKDIR=\"/tmp/workdir\"\nTFLITE_URL=\"https://storage.googleapis.com/iree-model-artifacts/tflite-integration-tests/posenet_i8.tflite\"\nTFLITE_PATH=${WORKDIR}/model.tflite\nIMPORT_PATH=${WORKDIR}/tosa.mlir\nMODULE_PATH=${WORKDIR}/module.vmfb\n\n# Fetch the sample model\nwget ${TFLITE_URL} -O ${TFLITE_PATH}\n\n# Import the sample model to an IREE compatible form\niree-import-tflite ${TFLITE_PATH} -o ${IMPORT_PATH}\n\n# Compile for the CPU backend\niree-compile \\\n --iree-input-type=tosa \\\n --iree-hal-target-backends=llvm-cpu \\\n ${IMPORT_PATH} \\\n -o ${MODULE_PATH}\n
The example below demonstrates downloading, compiling, and executing a TFLite model using the Python API. This includes some initial setup to declare global variables, download the sample module, and download the sample inputs.
Declare absolute paths for the sample files and import all required libraries. The default setup uses the CPU backend as the only target. This can be reconfigured to select alternative targets.
import iree.compiler.tflite as iree_tflite_compile\nimport iree.runtime as iree_rt\nimport numpy\nimport os\nimport urllib.request\n\nfrom PIL import Image\n\nworkdir = \"/tmp/workdir\"\nos.makedirs(workdir, exist_ok=True)\n\ntfliteFile = \"/\".join([workdir, \"model.tflite\"])\njpgFile = \"/\".join([workdir, \"input.jpg\"])\ntfliteIR = \"/\".join([workdir, \"tflite.mlir\"])\ntosaIR = \"/\".join([workdir, \"tosa.mlir\"])\nbytecodeModule = \"/\".join([workdir, \"iree.vmfb\"])\n\nbackends = [\"llvm-cpu\"]\nconfig = \"local-task\"\n
The TFLite sample model and input are downloaded locally.
tfliteUrl = \"https://storage.googleapis.com/iree-model-artifacts/tflite-integration-tests/posenet_i8.tflite\"\njpgUrl = \"https://storage.googleapis.com/iree-model-artifacts/tflite-integration-tests/posenet_i8_input.jpg\"\n\nurllib.request.urlretrieve(tfliteUrl, tfliteFile)\nurllib.request.urlretrieve(jpgUrl, jpgFile)\n
Once downloaded we can compile the model for the selected backends. Both the TFLite and TOSA representations of the model are saved for debugging purposes. This is optional and can be omitted.
iree_tflite_compile.compile_file(\n tfliteFile,\n input_type=\"tosa\",\n output_file=bytecodeModule,\n save_temp_tfl_input=tfliteIR,\n save_temp_iree_input=tosaIR,\n target_backends=backends,\n import_only=False)\n
After compilation is completed we configure the VmModule using the local-task configuration and compiled IREE module.
config = iree_rt.Config(\"local-task\")\ncontext = iree_rt.SystemContext(config=config)\nwith open(bytecodeModule, 'rb') as f:\n vm_module = iree_rt.VmModule.from_flatbuffer(config.vm_instance, f.read())\n context.add_vm_module(vm_module)\n
Finally, the IREE module is loaded and ready for execution. Here we load the sample image, resize it to the expected input size, and execute the module. By default, TFLite models include a single function named 'main'. The final results are printed.
im = numpy.array(Image.open(jpgFile).resize((192, 192))).reshape((1, 192, 192, 3))\nargs = [im]\n\ninvoke = context.modules.module[\"main\"]\niree_results = invoke(*args)\nprint(iree_results)\n
The tflitehub folder in the iree-experimental repository contains test scripts to compile, run, and compare various TensorFlow Lite models sourced from TensorFlow Hub.
An example smoke test of the TensorFlow Lite C API is available here.
Failures during the import step usually indicate a failure to lower from TensorFlow Lite's operations to TOSA, the intermediate representation used by IREE. Many TensorFlow Lite operations are not fully supported, particularly those that use dynamic shapes. Please reach out on one of IREE's communication channels if you notice something missing.
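If an import failure needs closer inspection, one option is to stop after the import step and examine the resulting TOSA IR before attempting full compilation. The sketch below is illustrative rather than canonical: it reuses the tfliteFile and tosaIR paths declared in the Python example above and assumes the iree-tools-tflite package is installed.
import iree.compiler.tflite as iree_tflite_compile\n\n# Run only the TFLite -> TOSA import step so the intermediate IR can be\n# inspected when a lowering failure blocks full compilation.\n# tfliteFile and tosaIR are the paths declared in the Python example above.\niree_tflite_compile.compile_file(\n tfliteFile,\n input_type=\"tosa\",\n output_file=tosaIR,\n import_only=True)\n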
IREE offers API bindings for compiling and running programs from various languages.
Automatically generated documentation for the MLIR dialects defined in the IREE repository.
Much of this describes provisions for extension within IREE, but until the core of the system has settled little work will be done to fully flesh out and document them in detail. A large majority of things that would make someone want to extend IREE can instead be accomplished more easily and with better performance using native MLIR dialects that are then processed by the IREE compiler.
IREE has a separation between its compiler and runtime, a multi-layered architecture, and a split between execution of \"host code\" that schedules compute-heavy work and SPMD \"device code\" that performs the bulk of compute operations. Each axis has a different set of extension mechanisms that can be used independently or combined.
Organized below are some of the mechanisms IREE provides for extending the core compiler and runtime and when they should(n't) be used. The goal of these progressively lower-level extension mechanisms is to make it easier for users to fall into the pit of success:
Quote
\"a well-designed system makes it easy to do the right things and annoying (but not impossible) to do the wrong things.\" - Jeff Atwood
The amount of engineering complexity for initial bring-up and maintenance increases with each subsequently lower-level approach, and it is best to start from the top and exit as fast as possible: this is a choose-your-own-adventure where you're trying to escape the dungeon with both the loot and your limbs. Avoid the temptation of immediately dropping down to making external C calls at runtime just because that's how it's been done before: it's easier, more robust, and more performant to use the system as it is intended to be used.
The primary goal when extending any framework should first be to avoid extending it at all. There is no mechanism that is free - whether in terms of engineering effort to develop and maintain over time, include in compiler deployments, or include in runtime deployments. As a system scales in deployment configurations the available mechanisms for extension increase but so too does the chaos introduced by extensions that do not also scale with that design. Users are the only ones who can determine the tradeoffs they are willing to accept: for example, the mechanism to extend device code with a custom runtime call to a C function does not work on GPUs and gets significantly more complicated on CPUs as sandboxes/enclaves are used - but if the user scenario is for local process CPU-only execution that may not matter.
Consider in normal software development when one would choose to write more code (possibly packaging it into a reusable library) vs. changing the programming language or compiler they are using to compile their code vs. changing the operating systems their code runs on. The further one gets from the problem they are trying to solve the more work, coordination, and maintenance is involved and though there are reasons to make changes across the stack they should be done only when a simpler solution would not suffice.
An author will retain more control over their logic the closer they sit to the inputs to the compiler. IREE provides several mechanisms that try to keep control with the author and robust to changes in IREE or MLIR internals and it is strongly encouraged that those looking to extend take those routes first. Contributions that help everyone are very welcome but do have a higher cost and it's often much easier to design and justify upstream changes with working examples in forks or at higher levels of the stack.
From a performance perspective the rule is to colocate code with the data it is acting on: tensor data, for example, should almost exclusively be manipulated by device code as tensors live on device. Attempting to use tensor data with host code will result in synchronization points and host/device transfers that can decimate performance. This can lead to seemingly paradoxical situations where swapping out compiler-generated code for a human-authored \"fast path\" can be slower than even the most naive compiler results. An important thing to keep in mind with compilers is that it is exceedingly difficult to produce code by hand that is consistently more performant across a broad range of deployments and the first temptation should always be to improve the compiler - extending it via other mechanisms when not required by the task is often just premature optimization.
TL;DR
Convert your custom ops into standard MLIR dialects.
+------------+ +--------+ +---------------+\n| Your input | -+-> | iree | -+-> | IREE compiler |\n+------------+ | +--------+ | +---------------+\n | +--------+ |\n +-> | linalg | -+\n | +--------+ |\n | .... |\n
The easiest, cleanest, and most robust path to extend IREE is to make use of what MLIR is designed for: composing dialects and converting between them. IREE supports several input dialects such as tosa, mhlo, linalg, and the standard arith, math, tensor, and scf dialects. Any source IR that can be turned into that mix of dialects (directly or transitively) will work with the whole IREE pipeline for all deployment configurations and targets. If it is possible to express the computation in this form, it will always be the best route: it yields small deployments without the need to modify or include any additional code at runtime and runs on all device types and execution modes.
This mechanism can also be layered with any of the subsequent lower-level ones: if some part of the operation runs on the host and some part on device then decomposing it such that it contains as many standard ops for flow control as possible and linear algebra/custom ops for the dense math will reduce the engineering effort required on both sides and lead to an easier to maintain solution even if lower-level extension is required.
A large majority of classic ML \"custom ops\" can be accomplished with this approach. When bringing up projects built on IREE it's best to concisely describe the operation in more elemental mathematical representations and then add optimizations where required knowing that things will still work even if those optimizations never happen.
To make use of this approach one just needs to follow the standard MLIR dialect conversion behavior: add a dialect with ops, add a conversion pass, and run that pass before providing the resulting IR to the IREE compiler. See Creating a Dialect.
Think of this like authoring C++ sources with templates that you compile into your application: Clang (and LLVM beyond) don't know about your library details and instead just process it as it would any other code. You can take the same source and pass it to GCC and it'll be robust to underlying changes in the system.
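As a minimal sketch of this route (the module and function names here are made up for illustration and do not come from the IREE samples), a \"custom\" softplus activation can be expressed entirely with the standard arith and math dialects and compiled with the same Python tooling used elsewhere in these docs; because only standard dialects are involved, no runtime extension is required and any target backend can be selected:
from iree import compiler as ireec\n\n# A \"custom op\" written with standard dialects only: softplus(x) = log(1 + exp(x)).\nSOFTPLUS_MLIR = \"\"\"\nmodule @custom_op_demo {\n func.func @softplus(%arg0: tensor<4xf32>) -> tensor<4xf32> {\n %one = arith.constant dense<1.0> : tensor<4xf32>\n %exp = math.exp %arg0 : tensor<4xf32>\n %add = arith.addf %exp, %one : tensor<4xf32>\n %log = math.log %add : tensor<4xf32>\n return %log : tensor<4xf32>\n }\n}\n\"\"\"\n\n# Because only standard dialects are used, the whole IREE pipeline applies and\n# any target backend can be selected (vmvx is the reference target).\ncompiled_flatbuffer = ireec.tools.compile_str(\n SOFTPLUS_MLIR,\n target_backends=[\"vmvx\"])\n
From here the compiled flatbuffer can be loaded and invoked with iree.runtime just like any other compiled module.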
Import MLIR functions in the compiler and custom modules at runtime.
// Main user module compiled by IREE:\nmodule @model {\n // Declare a synchronous external function:\n func.func private @my_custom_module.sync_func(%input: tensor<?xf32>) -> i32\n // Declare an asynchronous external function:\n func.func private @my_custom_module.async_func(%input: tensor<?xf32>) -> tensor<?xf32> attributes {\n iree.abi.model = \"coarse-fences\",\n nosideeffects\n }\n func.func @predict() {\n ...\n // Call a synchronous/blocking external function:\n %sync_result = call @my_custom_module.sync_func(%sync_input) : (tensor<?xf32>) -> i32\n ...\n ...\n // Call an asynchronous/non-blocking external function:\n %async_result = call @my_custom_module.async_func(%async_input) : (tensor<?xf32>) -> tensor<?xf32>\n ...\n }\n}\n
IREE provides dynamic linking at runtime via its VM interfaces. For code that runs on the host and requires syscalls or calling out to existing libraries - such as file IO, text processing, and JPEG decoding - this is an easy way to interop without paying attention to the more complex details of device code. An IREE module compiled using custom modules is portable and dynamically deployable so long as the custom module is registered at runtime.
This approach conceptually matches what normal native binaries do in an OS: imports are declared and at runtime they are resolved based on the available exports of modules in the system. Just as with normal systems engineering, the design of the API between modules is up to the user and, depending on rigor, can have several pitfalls, but these problems and their solutions are not IREE-specific; anyone who has designed a shared library interface can apply the same rules here around versioning, performance, etc. One does not add 2 integers via a syscall and the same holds here: custom modules and the functions within should perform a large amount of work to hide the overheads involved in cross-module calls, and users must be aware that the compiler cannot optimize across the call boundaries.
See the synchronous tensor I/O and asynchronous tensor I/O samples.
The runtime portion requires that the code be exported to the VM system by way of an iree_vm_module_t interface. A low-level native interface exists with minimal overhead and is used, for example, by the IREE HAL itself. There is also a C++ wrapper that is significantly easier to work with; however, it needs some performance improvements.
Full end-to-end examples can be found under samples/custom_modules/:
Add patterns to iree/Compiler/Codegen/ to emit target code.
The easiest and most robust path for specializations of device code is to emit such code mixed with the IREE compiler generated code at the highest possible level of abstraction within the target pipeline. For example, if the code can be represented with the vector dialect then inserting conversion patterns between linalg and vector enables the emitted code to be specialized further based on user configuration and optimized with the full set of available passes that run in the pipeline. For each level lower one goes the more flexibility they gain such as being able to emit inline assembly blocks that do anything while trading off generality and multi-targeting applicability.
How much the tradeoff matters is based on the behavior of the extension. If a pattern changing a transcendental function to an approximation can operate at the vector level then all IREE deployment targets can benefit from the pattern and as new targets are made available they will automatically receive the benefits. In contrast, a pattern at the vector level that turns generic vector operations into architecture-specific LLVM intrinsics by its nature only pertains to a single target family and can be done at a lower level. As a rule of thumb if a particular pattern is going to need ~N implementations for ~N targets that are all mostly the same it's better to try to move that higher in the stack.
At this point the complexity of extending things is still fairly constrained: a C++ pass or pattern is verified with normal lit tests and can be upstreamed easily either into MLIR or IREE (a large number of IREE patterns are upstreamed, benefiting all users of MLIR). Cross-compilation and versioning are not a factor and the IREE artifacts can be considered durable at a coarse level (outside of major target architectural changes).
Note that depending on the target there are various mechanisms for representing code in MLIR, up to including inline assembly snippets in IR via llvm.inline_asm.
There are several ways to author patterns and passes in MLIR, and there are many examples within both MLIR and IREE - one specifically being the polynomial approximation expansion patterns.
Statically link external object files into IREE executables.
For large bodies of existing device code or library calls that are available for static linkage, the work involved to reimplement them at higher levels of the stack can be cost prohibitive even if it leads to better results. In these cases, just as with a normal toolchain, one would want to declare an external function, call it, and add the object file to the linker command line. In IREE the same can be performed by way of taking compatible bitcode or native object files and linking them in with the generated code. An MLIR pattern would declare and emit the call and the target-specific IREE linker would pull in the objects.
As the linking behavior varies per target (for example, some targets like SPIR-V don't have traditional linkers), how this is performed is up to the IREE target backends. The complexity involved in producing the object files to link will also vary per backend and with the complexity of the deployment: cross-compiling for multiple architectures or compilation modes (ASAN, etc.) will require unique copies of the object files matching that precise configuration.
At this point generality is largely out, as is the ability to cleanly upstream such files. It should be apparent how a few dozen lines of C++ or PDL that avoid the need for any of this complexity are more appealing. In extremely specific cases of a single platform/architecture/version for a single program deployed via a specific artifact composition it's not so bad, but IREE is designed such that extreme specificity is an optional mode of the more general solution. This does not mean this mechanism is not useful in some situations, only that it should be a last resort when one of the easier-to-manage solutions is not viable - not a shortcut to avoid writing some C++ patterns.
As the linking behavior varies per target backend there is no general solution at this level: if targeting the CPU then the system's native linker or lld needs to be provided with the object files, while SPIR-V requires merging the SPIR-V binaries directly, and Metal shader libraries need to be constructed with the Apple-specific metallib tooling. Producing these files and performing the linking is outside the scope of IREE.
If the files can be acquired then compiler changes will be required to emit calls to them and invoke the linker with the files.
On the CPU an alternative is to use the static library output mode where IREE produces an object file and then the user invokes the linker themselves; this still requires the compiler changes to emit the calls but avoids needing to teach the compiler how to link the files.
Dynamically link external C functions at runtime from device code.
It is pitch black. You are likely to be eaten by a grue.
This is the lowest-level integration in the system and is designed to act as an escape hatch - and, as with any emergency escape hatch, it's not designed for ergonomics. Users should first try to come in through the door; needing this mechanism should trigger alarms about the approach being taken.
IREE's execution model for device code and native machine binary deployment mechanisms are designed with several constraints in order to make all of the above approaches possible and performant. Calling arbitrary C functions from deep within the system can introduce subtle (and not-so-subtle) bugs that are extremely difficult to track down, and versioning between the compiler emitting the calls and the runtime providing the implementations can cause skew unless carefully managed. Consider the methods added here like syscalls in that they must be extremely focused, and if they are ever likely to change (including being removed) then care will be needed just as with versioning or redirecting a syscall. Designing good stable interfaces is hard and a classic pit of failure.
Some things to note:
Most of the constraints here come from the SPMD parallelism model, platform-agnostic deployment format, and overall data-oriented design of IREE. Code operating in this fashion has a certain shape and that is usually not the same as big legacy single-threaded CPU-focused BLAS libraries that perform their own caching, internal thread and state management, and other shenanigans. IREE is not designed to wrap such things and if any of these notes are issues it is more an indicator that the approach needs adjustment than anything else. Trying to bypass or workaround the constraints is possible - after all IREE is an open source project and any user is welcome to fork it - but unsupported by the core IREE team.
The compiler is changed to produce calls to imports via a dynamic import table provided to each dispatch function. The import table is declared in the executable library for use at runtime. Runtime applications register an import provider to resolve named symbols in the import table to C functions that marshal arguments and results.
The compiler side needs some additional work, but an example is included in Issue 7504. The runtime side is complete; resolution is performed by a user-supplied iree_hal_executable_import_provider_t.
IREE exists in an ecosystem of projects and acts as a bridge between machine learning frameworks and a variety of hardware platforms. This glossary outlines some of those projects and technologies.
Something missing?
Don't see a project or technology here that you think should be? We welcome contributions on our GitHub page!
JAX is a Python framework supporting high-performance machine learning research by bridging automatic differentiation and ML compilers like XLA and IREE.
See the JAX Integration guide for details on how to use JAX programs with IREE.
Multi-Level Intermediate Representation (MLIR) is the compiler framework that IREE is built around. Beyond the tooling this includes a set of common dialects and transformations that IREE utilizes for its code generation system.
For general discussion on MLIR see the project's discourse forum.
Linalg is an MLIR dialect that defines Linear Algebra operations in a generalized fashion by modeling iteration spaces together with compute payloads. Linalg includes a set of commonly used operations as well as generic interfaces.
IREE uses the Linalg dialect during its code generation pipeline to define tensor operations and then generate loop structures for its various backend targets.
OpenXLA is a community-driven, open source ML compiler ecosystem.
IREE interfaces with some of the OpenXLA projects, such as StableHLO.
PyTorch is an optimized tensor library for deep learning.
PyTorch uses the Torch-MLIR project to interface with projects like IREE. See the PyTorch Integration guide for details on how to use PyTorch programs with IREE.
SPIR-V is a shader and kernel intermediate language for expressing parallel computation typically used for GPUs. It serves as a hardware agnostic assembly format for distributing complex, computationally intensive programs.
IREE uses the SPIR-V MLIR Dialect in its code generation pipeline for Vulkan and other compute APIs.
StableHLO is a set of versioned high-level operations (HLOs) for ML models with backward and forward compatibility guarantees. StableHLO aims to improve interoperability between frameworks (such as TensorFlow, JAX, and PyTorch) and ML compilers.
StableHLO has both a specification and an MLIR dialect.
IREE uses the StableHLO MLIR Dialect as one of its input formats.
Tensor Operator Set Architecture (TOSA) provides a set of tensor operations commonly employed by Deep Neural Networks. TOSA defines accuracy and compatibility constraints so frameworks that use it can trust that applications will produce similar results on a variety of hardware targets.
TOSA has both a specification and an MLIR dialect.
IREE uses the TOSA MLIR dialect as one of its input formats.
TensorFlow Lite (TFLite) is a library for deploying models on mobile and other edge devices.
IREE supports running TFLite programs that have been imported into MLIR using the TOSA dialect. See the TFLite Integration guide for details on how to use TFLite programs with IREE.
IREE also has bindings for the TFLite C API, see the runtime/bindings/tflite/ directory for details.
This page documents various supported flags for optimizing IREE programs. Each is presented with its English name, flag to enable/disable, and default state.
These flags can be passed to the:
iree.compiler.tools
iree.compiler.transforms.iree-compile.CompilerOptions(\"--flag\", \"--flag2\")
ireeCompilerOptionsSetFlags()
--iree-opt-const-eval
Performs compile-time evaluation of any global initializers which produce the initial values for global constants, storing the global directly in the program as constant data. This extracts such constant program fragments and recursively compiles them, using the runtime to evaluate the results.
Note that this only has any effect on computations in module initializer functions, not free-standing operations in the program which may produce constant-derived results. See --iree-opt-const-expr-hoisting for options to optimize these.
--iree-opt-const-expr-hoisting
Identifies all trees of constant expressions in the program and uses a heuristic to determine which would be profitable to hoist into global initializers for evaluation at module load. Together with --iree-opt-const-eval, this will convert eligible trees of expressions to purely static data embedded in the module.
The heuristic is currently relatively primitive, using static information to disable hoisting of leaf operations which are metadata only (e.g. broadcasts) or are expected to fold away as part of operator fusion. Notably, the current heuristic is likely to pessimize module size in the case of complicated programs with trees of constant, large tensors.
--iree-opt-numeric-precision-reduction
Analyzes program constant data and program flow to identify math operations which can be safely evaluated with reduced precision (currently with a minimum of 8-bit integers, but being extended to infer any bit depth) and inserts appropriate casts. In conjunction with Constant Expression Hoisting, Constant Evaluation, and other automatic optimizations, this can produce programs where large portions (up to the whole program) have had their numeric operations and constant data rewritten to lower precision types.
This feature is actively evolving and will be the subject of dedicated documentation when ready.
--iree-opt-strip-assertions
Strips all std.assert ops in the input program after useful information for optimization analysis has been extracted. Assertions provide useful user-visible error messages but can prevent critical optimizations. Assertions are not, however, a substitute for control flow, and frontends that want to check errors in optimized release builds should do so via actual code - similar to choosing between if (foo) return false; and assert(foo); in a normal program.
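As a rough sketch of how these flags can be passed programmatically (the module below is a placeholder and flag availability may vary between releases), the Python compiler tools forward additional command line flags through their extra_args parameter:
from iree import compiler as ireec\n\n# Placeholder input; any MLIR accepted by the compiler works here.\nINPUT_MLIR = \"\"\"\nmodule @demo {\n func.func @scale(%arg0: tensor<4xf32>) -> tensor<4xf32> {\n %cst = arith.constant dense<2.0> : tensor<4xf32>\n %0 = arith.mulf %arg0, %cst : tensor<4xf32>\n return %0 : tensor<4xf32>\n }\n}\n\"\"\"\n\n# Forward the optimization flags documented above to the underlying compiler.\ncompiled_flatbuffer = ireec.tools.compile_str(\n INPUT_MLIR,\n target_backends=[\"llvm-cpu\"],\n extra_args=[\n \"--iree-opt-const-eval\",\n \"--iree-opt-const-expr-hoisting\",\n \"--iree-opt-strip-assertions\",\n ])\n
The same flags can be passed directly to the iree-compile command line tool.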
API bindings allow for programmatic use of IREE's compiler and runtime components. The core IREE project is written in C1, allowing for API bindings to be written in a variety of other languages.
Want to use another language? Looking for something specific out of one of those already listed?
We welcome discussions on our communication channels and contributions on our GitHub page!
Members of the core project team and other partner groups maintain these official bindings:
See the C API reference page.
See the Python reference page.
Members of our developer community have authored bindings using other languages:
experimental/web/
runtime/bindings/tflite/java
with some C++ tools and utilities\u00a0\u21a9
The IREE compiler and IREE runtime both have their own C/C++ APIs. This page introduces the available APIs and describes how to use them from your applications.
There are multiple ways to distribute and depend on C/C++ projects, each with varying levels of portability, flexibility, and toolchain compatibility. IREE aims to support common configurations and platforms.
The IREE compiler is structured as a monolithic shared object with a dynamic plugin system allowing for extensions. The shared object exports symbols for versioned API functions.
graph TD\n accTitle: IREE compiler linkage model diagram\n accDescr {\n The libIREECompiler.so or IREECompiler.dll shared object contains pipelines,\n target backends, and general passes as private implementation details.\n Compiler plugins interface with the compiler shared object to extend it with\n custom targets, dialects, etc.\n Applications interface with the compiler shared object through the compiler\n C API's exported symbols.\n }\n\n subgraph compiler[libIREECompiler.so / IREECompiler.dll]\n pipelines(\"Pipelines\n\n \u2022 Flow\n \u2022 Stream\n \u2022 etc.\")\n\n targets(\"Target backends\n\n \u2022 llvm-cpu\n \u2022 vulkan-spirv\n \u2022 etc.\")\n\n passes(\"General passes\n\n \u2022 Const eval\n \u2022 DCE\n \u2022 etc.\")\n end\n\n plugins(\"Compiler plugins\n\n \u2022 Custom targets\n \u2022 Custom dialects\n \u2022 etc.\")\n\n application(Your application)\n\n compiler <-- \"Plugin API<br>(static or dynamic linking)\" --> plugins\n compiler -. \"Compiler C API<br>(exported symbols)\" .-> application
API definitions can be found in the following locations:
iree/compiler/embedding_api.h
iree/compiler/PluginAPI/
mlir/include/mlir-c/
The compiler API is centered around running pipelines to translate inputs to artifacts. These are modeled via sessions, invocations, sources, and outputs.
stateDiagram-v2\n accTitle: IREE compiler session and invocation state diagram\n accDescr {\n Input files are opened (or buffers are wrapped) as sources in a session.\n Sources are parsed into invocations, which run pipelines.\n Output files are written (or buffers are mapped) for compilation artifacts.\n Sessions can contain multiple sources and run multiple invocations.\n }\n\n direction LR\n InputFile --> Source1 : open file\n InputBuffer --> Source2 : wrap buffer\n\n state Session {\n Source1 --> Invocation1\n Source2 --> Invocation2\n Invocation1 --> Invocation1 : run pipeline\n Invocation2 --> Invocation2 : run pipeline\n }\n\n Invocation1 --> Output1File : write file\n Invocation1 --> Output1Buffer : map memory\n Invocation2 --> Output2Buffer : map memory
A session (iree_compiler_session_t) is a scope where one or more invocations can run.
An invocation (iree_compiler_invocation_t) is a discrete run of the compiler.
A source (iree_compiler_source_t) represents an input program, including operations and data.
An output (iree_compiler_output_t) represents a compilation artifact.
A plugin extends the compiler with some combination of target backends, options, passes, or pipelines. For documentation on compiler plugins, see compiler/PluginAPI/README.md.
This snippet shows the general layout of the API. For working examples, see the samples below.
To build a custom tool using the compiler API:
set(_IREE_COMPILER_API \"${_IREE_COMPILER_ROOT}/bindings/c/iree/compiler\")\ntarget_include_directories(${_NAME} SYSTEM PRIVATE ${_IREE_COMPILER_API})\ntarget_link_libraries(${_NAME} iree_compiler_bindings_c_loader)\n
#include <iree/compiler/embedding_api.h>\n#include <iree/compiler/loader.h>\n\nint main(int argc, char** argv) {\n // Load the compiler library then initialize it.\n ireeCompilerLoadLibrary(\"libIREECompiler.so\");\n ireeCompilerGlobalInitialize();\n\n // Create a session to track compiler state and set flags.\n iree_compiler_session_t *session = ireeCompilerSessionCreate();\n ireeCompilerSessionSetFlags(session, argc, argv);\n\n // Open a file as an input source to the compiler.\n iree_compiler_source_t *source = NULL;\n ireeCompilerSourceOpenFile(session, \"input.mlir\", &source);\n\n // Use an invocation to compile from the input source to one or more outputs.\n iree_compiler_invocation_t *inv = ireeCompilerInvocationCreate(session);\n ireeCompilerInvocationPipeline(inv, IREE_COMPILER_PIPELINE_STD);\n\n // Output the compiled artifact to a file.\n iree_compiler_output_t *output = NULL;\n ireeCompilerOutputOpenFile(\"output.vmfb\", &output);\n ireeCompilerInvocationOutputVMBytecode(inv, output);\n\n // Cleanup state.\n ireeCompilerInvocationDestroy(inv);\n ireeCompilerOutputDestroy(output);\n ireeCompilerSourceDestroy(source);\n ireeCompilerSessionDestroy(session);\n ireeCompilerGlobalShutdown();\n}\n
hello_compiler.c
integrations/pjrt/.../iree_compiler.cc
compiler/plugins
samples/compiler_plugins/
plugins/.../iree-amd-aie
The IREE runtime is structured as a modular set of library components. Each component is designed to be linked into applications directly and compiled with LTO-style optimizations.
The low level library components can be used directly or through a higher level API.
The high level 'runtime' API sits on top of the low level components. It is relatively terse but does not expose the full flexibility of the underlying systems.
graph TD\n accTitle: IREE runtime high level API diagram\n accDescr {\n The IREE runtime includes 'base', 'HAL', and 'VM' components, each with\n their own types and API methods.\n A high level \"runtime API\" sits on top of these component APIs.\n Applications can interface indirectly with the IREE runtime via this\n high level runtime API.\n }\n\n subgraph iree_runtime[IREE Runtime]\n subgraph base\n base_types(\"Types\n\n \u2022 allocator\n \u2022 status\n \u2022 etc.\")\n end\n\n subgraph hal[HAL]\n hal_types(\"Types\n\n \u2022 buffer\n \u2022 device\n \u2022 etc.\")\n\n hal_drivers(\"Drivers\n\n \u2022 local-*\n \u2022 vulkan\n \u2022 etc.\")\n end\n\n subgraph vm[VM]\n vm_types(\"Types\n\n \u2022 context\n \u2022 invocation\n \u2022 etc.\")\n end\n\n runtime_api(\"Runtime API\n\n \u2022 instance\n \u2022 session\n \u2022 call\")\n\n base_types & hal_types & hal_drivers & vm_types --> runtime_api\n end\n\n application(Your application)\n\n runtime_api --> application
Each runtime component has its own low level API. The low level APIs are typically verbose as they expose the full flexibility of each underlying system.
graph TD\n accTitle: IREE runtime low level API diagram\n accDescr {\n The IREE runtime includes 'base', 'HAL', and 'VM' components, each with\n their own types and API methods.\n Applications can interface directly with the IREE runtime via the low\n level component APIs.\n }\n\n subgraph iree_runtime[IREE Runtime]\n subgraph base\n base_types(\"Types\n\n \u2022 allocator\n \u2022 status\n \u2022 etc.\")\n end\n subgraph hal[HAL]\n hal_types(\"Types\n\n \u2022 buffer\n \u2022 device\n \u2022 etc.\")\n\n hal_drivers(\"Drivers\n\n \u2022 local-*\n \u2022 vulkan\n \u2022 etc.\")\n end\n subgraph vm[VM]\n vm_types(\"Types\n\n \u2022 context\n \u2022 invocation\n \u2022 etc.\")\n end\n end\n\n application(Your application)\n\n base_types & hal_types & hal_drivers & vm_types --> application
Runtime API header files are organized by component:
iree/runtime/api.h
iree/base/api.h
iree/vm/api.h
iree/hal/api.h
The high level API uses instances, sessions, and calls to run programs with a small API surface.
stateDiagram-v2\n accTitle: IREE runtime high level API state diagram\n accDescr {\n Instances track sessions and state: options, drivers, devices.\n Sessions track calls and state: a device and bytecode/VM modules.\n Calls track input and output lists.\n }\n\n state iree_runtime_instance_t {\n instance_state: state<br>- options<br>- drivers<br>- devices\n\n state iree_runtime_session_t {\n session_state: state<br>- device<br>- VM / bytecode modules\n state iree_runtime_call_t {\n inputs\n outputs\n }\n }\n }
An instance (iree_runtime_instance_t) isolates runtime usage and manages device resources.
A session (iree_runtime_session_t) contains a set of loaded modules and their state.
A call (iree_runtime_call_t) is a stateful VM function call builder.
Under construction, more coming soon
IREE uses its own Virtual Machine (VM) at runtime to interpret program instructions on the host system.
VM instructions may be further lowered to C source code for static or resource constrained deployment.
See the --output-format=vm-c compiler option and the samples in samples/emitc_modules/ for more information.
The VM supports generic operations like loads, stores, arithmetic, function calls, and control flow. The VM builds streams of more complex program logic and dense math into HAL command buffers that are dispatched to hardware backends.
VM modules provide all functionality to execution contexts, including access to hardware accelerators through the HAL. Compiled user programs are also modules.
stateDiagram-v2\n accTitle: Sample VM Modules\n accDescr {\n Bytecode modules contain program state, program functions, and debug\n information.\n HAL modules contain devices, executables, HAL functions, and HAL types.\n Custom modules may contain external functions and custom types.\n }\n\n state \"Bytecode module\" as bytecode {\n bytecode_contents: Module state<br>Program functions<br>Debug information\n }\n\n state \"HAL module\" as HAL {\n hal_contents: Devices<br>Executables<br>HAL functions<br>HAL types\n }\n\n state \"Custom module\" as custom {\n custom_contents: External functions<br>Custom types\n }
IREE uses a Hardware Abstraction Layer (HAL) to model and interact with hardware devices like CPUs, GPUs and other accelerators.
For other examples, see the samples below.
Source file: runtime/src/iree/runtime/demo/hello_world_terse.c
#include <stdio.h>\n\n#include \"iree/runtime/api.h\"\n#include \"iree/runtime/testdata/simple_mul_module_c.h\"\n\nstatic void iree_runtime_demo_run_session(iree_runtime_instance_t* instance);\nstatic void iree_runtime_demo_perform_mul(iree_runtime_session_t* session);\n\n//===----------------------------------------------------------------------===//\n// 1. Entry point / shared iree_runtime_instance_t setup\n//===----------------------------------------------------------------------===//\n\nint main(int argc, char** argv) {\n // Create and configure the instance shared across all sessions.\n iree_runtime_instance_options_t instance_options;\n iree_runtime_instance_options_initialize(&instance_options);\n iree_runtime_instance_options_use_all_available_drivers(&instance_options);\n iree_runtime_instance_t* instance = NULL;\n IREE_CHECK_OK(iree_runtime_instance_create(\n &instance_options, iree_allocator_system(), &instance));\n\n // All sessions should share the same instance.\n iree_runtime_demo_run_session(instance);\n\n iree_runtime_instance_release(instance);\n return 0;\n}\n\n//===----------------------------------------------------------------------===//\n// 2. Load modules and initialize state in iree_runtime_session_t\n//===----------------------------------------------------------------------===//\n\nstatic void iree_runtime_demo_run_session(iree_runtime_instance_t* instance) {\n // TODO(#5724): move device selection into the compiled modules.\n iree_hal_device_t* device = NULL;\n IREE_CHECK_OK(iree_runtime_instance_try_create_default_device(\n instance, iree_make_cstring_view(\"local-task\"), &device));\n\n // Create one session per loaded module to hold the module state.\n iree_runtime_session_options_t session_options;\n iree_runtime_session_options_initialize(&session_options);\n iree_runtime_session_t* session = NULL;\n IREE_CHECK_OK(iree_runtime_session_create_with_device(\n instance, &session_options, device,\n iree_runtime_instance_host_allocator(instance), &session));\n iree_hal_device_release(device);\n\n // Load your user module into the session (from memory, from file, etc).\n const iree_file_toc_t* module_file =\n iree_runtime_testdata_simple_mul_module_create();\n IREE_CHECK_OK(iree_runtime_session_append_bytecode_module_from_memory(\n session, iree_make_const_byte_span(module_file->data, module_file->size),\n iree_allocator_null()));\n\n // Run your functions; you should reuse the session to make multiple calls.\n iree_runtime_demo_perform_mul(session);\n\n iree_runtime_session_release(session);\n}\n\n//===----------------------------------------------------------------------===//\n// 3. 
Call a function within a module with buffer views\n//===----------------------------------------------------------------------===//\n\n// func.func @simple_mul(%arg0: tensor<4xf32>, %arg1: tensor<4xf32>) ->\n// tensor<4xf32>\nstatic void iree_runtime_demo_perform_mul(iree_runtime_session_t* session) {\n iree_runtime_call_t call;\n IREE_CHECK_OK(iree_runtime_call_initialize_by_name(\n session, iree_make_cstring_view(\"module.simple_mul\"), &call));\n\n // %arg0: tensor<4xf32>\n iree_hal_buffer_view_t* arg0 = NULL;\n static const iree_hal_dim_t arg0_shape[1] = {4};\n static const float arg0_data[4] = {1.0f, 1.1f, 1.2f, 1.3f};\n IREE_CHECK_OK(iree_hal_buffer_view_allocate_buffer_copy(\n iree_runtime_session_device(session),\n iree_runtime_session_device_allocator(session),\n IREE_ARRAYSIZE(arg0_shape), arg0_shape, IREE_HAL_ELEMENT_TYPE_FLOAT_32,\n IREE_HAL_ENCODING_TYPE_DENSE_ROW_MAJOR,\n (iree_hal_buffer_params_t){\n .type = IREE_HAL_MEMORY_TYPE_DEVICE_LOCAL,\n .access = IREE_HAL_MEMORY_ACCESS_ALL,\n .usage = IREE_HAL_BUFFER_USAGE_DEFAULT,\n },\n iree_make_const_byte_span(arg0_data, sizeof(arg0_data)), &arg0));\n IREE_CHECK_OK(iree_hal_buffer_view_fprint(\n stdout, arg0, /*max_element_count=*/4096,\n iree_runtime_session_host_allocator(session)));\n IREE_CHECK_OK(iree_runtime_call_inputs_push_back_buffer_view(&call, arg0));\n iree_hal_buffer_view_release(arg0);\n\n fprintf(stdout, \"\\n * \\n\");\n\n // %arg1: tensor<4xf32>\n iree_hal_buffer_view_t* arg1 = NULL;\n static const iree_hal_dim_t arg1_shape[1] = {4};\n static const float arg1_data[4] = {10.0f, 100.0f, 1000.0f, 10000.0f};\n IREE_CHECK_OK(iree_hal_buffer_view_allocate_buffer_copy(\n iree_runtime_session_device(session),\n iree_runtime_session_device_allocator(session),\n IREE_ARRAYSIZE(arg1_shape), arg1_shape, IREE_HAL_ELEMENT_TYPE_FLOAT_32,\n IREE_HAL_ENCODING_TYPE_DENSE_ROW_MAJOR,\n (iree_hal_buffer_params_t){\n .type = IREE_HAL_MEMORY_TYPE_DEVICE_LOCAL,\n .access = IREE_HAL_MEMORY_ACCESS_ALL,\n .usage = IREE_HAL_BUFFER_USAGE_DEFAULT,\n },\n iree_make_const_byte_span(arg1_data, sizeof(arg1_data)), &arg1));\n IREE_CHECK_OK(iree_hal_buffer_view_fprint(\n stdout, arg1, /*max_element_count=*/4096,\n iree_runtime_session_host_allocator(session)));\n IREE_CHECK_OK(iree_runtime_call_inputs_push_back_buffer_view(&call, arg1));\n iree_hal_buffer_view_release(arg1);\n\n IREE_CHECK_OK(iree_runtime_call_invoke(&call, /*flags=*/0));\n\n fprintf(stdout, \"\\n = \\n\");\n\n // -> tensor<4xf32>\n iree_hal_buffer_view_t* ret0 = NULL;\n IREE_CHECK_OK(iree_runtime_call_outputs_pop_front_buffer_view(&call, &ret0));\n IREE_CHECK_OK(iree_hal_buffer_view_fprint(\n stdout, ret0, /*max_element_count=*/4096,\n iree_runtime_session_host_allocator(session)));\n iree_hal_buffer_view_release(ret0);\n\n iree_runtime_call_deinitialize(&call);\n}\n
Source file: runtime/src/iree/runtime/demo/hello_world_explained.c
#include <stdio.h>\n\n#include \"iree/runtime/api.h\"\n\nstatic int iree_runtime_demo_main(void);\nstatic iree_status_t iree_runtime_demo_run_session(\n iree_runtime_instance_t* instance);\nstatic iree_status_t iree_runtime_demo_perform_mul(\n iree_runtime_session_t* session);\n\n#if defined(IREE_RUNTIME_DEMO_LOAD_FILE_FROM_COMMAND_LINE_ARG)\n\nstatic const char* demo_file_path = NULL;\n\n// Takes the first argument on the command line as a file path and loads it.\nint main(int argc, char** argv) {\n if (argc < 2) {\n fprintf(stderr, \"usage: session_demo module_file.vmfb\\n\");\n return 1;\n }\n demo_file_path = argv[1];\n return iree_runtime_demo_main();\n}\n\n// Loads a compiled IREE module from the file system.\nstatic iree_status_t iree_runtime_demo_load_module(\n iree_runtime_session_t* session) {\n return iree_runtime_session_append_bytecode_module_from_file(session,\n demo_file_path);\n}\n\n#elif defined(IREE_RUNTIME_DEMO_LOAD_FILE_FROM_EMBEDDED_DATA)\n\n#include \"iree/runtime/testdata/simple_mul_module_c.h\"\n\nint main(int argc, char** argv) { return iree_runtime_demo_main(); }\n\n// Loads the bytecode module directly from memory.\n//\n// Embedding the compiled output into your binary is not always possible (or\n// recommended) but is a fairly painless way to get things working on a variety\n// of targets without worrying about how to deploy files or pass flags.\n//\n// In cases like this the module file is in .rodata and does not need to be\n// freed; if the memory needs to be released when the module is unloaded then a\n// custom allocator can be provided to get a callback instead.\nstatic iree_status_t iree_runtime_demo_load_module(\n iree_runtime_session_t* session) {\n const iree_file_toc_t* module_file =\n iree_runtime_testdata_simple_mul_module_create();\n return iree_runtime_session_append_bytecode_module_from_memory(\n session, iree_make_const_byte_span(module_file->data, module_file->size),\n iree_allocator_null());\n}\n\n#else\n#error \"must specify a way to load the module data\"\n#endif // IREE_RUNTIME_DEMO_LOAD_FILE_FROM_*\n\n//===----------------------------------------------------------------------===//\n// 1. Entry point / shared iree_runtime_instance_t setup\n//===----------------------------------------------------------------------===//\n// Applications should create and share a single instance across all sessions.\n\n// This would live in your application startup/shutdown code or scoped to the\n// usage of IREE. Creating and destroying instances is expensive and should be\n// avoided.\nstatic int iree_runtime_demo_main(void) {\n // Set up the shared runtime instance.\n // An application should usually only have one of these and share it across\n // all of the sessions it has. The instance is thread-safe, while the\n // sessions are only thread-compatible (you need to lock if its required).\n iree_runtime_instance_options_t instance_options;\n iree_runtime_instance_options_initialize(&instance_options);\n iree_runtime_instance_options_use_all_available_drivers(&instance_options);\n iree_runtime_instance_t* instance = NULL;\n iree_status_t status = iree_runtime_instance_create(\n &instance_options, iree_allocator_system(), &instance);\n\n // Run the demo.\n // A real application would load its models (at startup, on-demand, etc) and\n // retain them somewhere to be reused. 
Startup time and likelihood of failure\n // varies across different HAL backends; the synchronous CPU backend is nearly\n // instantaneous and will never fail (unless out of memory) while the Vulkan\n // backend may take significantly longer and fail if there are not supported\n // devices.\n if (iree_status_is_ok(status)) {\n status = iree_runtime_demo_run_session(instance);\n }\n\n // Release the shared instance - it will be deallocated when all sessions\n // using it have been released (here it is deallocated immediately).\n iree_runtime_instance_release(instance);\n\n int ret = (int)iree_status_code(status);\n if (!iree_status_is_ok(status)) {\n // Dump nice status messages to stderr on failure.\n // An application can route these through its own logging infrastructure as\n // needed. Note that the status is a handle and must be freed!\n iree_status_fprint(stderr, status);\n iree_status_ignore(status);\n }\n return ret;\n}\n\n//===----------------------------------------------------------------------===//\n// 2. Load modules and initialize state in iree_runtime_session_t\n//===----------------------------------------------------------------------===//\n// Each instantiation of a module will live in its own session. Module state\n// like variables will be retained across calls within the same session.\n\n// Loads the demo module and uses it to perform some math.\n// In a real application you'd want to hang on to the iree_runtime_session_t\n// and reuse it for future calls - especially if it holds state internally.\nstatic iree_status_t iree_runtime_demo_run_session(\n iree_runtime_instance_t* instance) {\n // TODO(#5724): move device selection into the compiled modules.\n iree_hal_device_t* device = NULL;\n IREE_RETURN_IF_ERROR(iree_runtime_instance_try_create_default_device(\n instance, iree_make_cstring_view(\"local-task\"), &device));\n\n // Set up the session to run the demo module.\n // Sessions are like OS processes and are used to isolate modules from each\n // other and hold runtime state such as the variables used within the module.\n // The same module loaded into two sessions will see their own private state.\n iree_runtime_session_options_t session_options;\n iree_runtime_session_options_initialize(&session_options);\n iree_runtime_session_t* session = NULL;\n iree_status_t status = iree_runtime_session_create_with_device(\n instance, &session_options, device,\n iree_runtime_instance_host_allocator(instance), &session);\n iree_hal_device_release(device);\n\n // Load the compiled user module in a demo-specific way.\n // Applications could specify files, embed the outputs directly in their\n // binaries, fetch them over the network, etc.\n if (iree_status_is_ok(status)) {\n status = iree_runtime_demo_load_module(session);\n }\n\n // Build and issue the call.\n if (iree_status_is_ok(status)) {\n status = iree_runtime_demo_perform_mul(session);\n }\n\n // Release the session and free all resources.\n iree_runtime_session_release(session);\n return status;\n}\n\n//===----------------------------------------------------------------------===//\n// 3. Call a function within a module with buffer views\n//===----------------------------------------------------------------------===//\n// The inputs and outputs of a call are reusable across calls (and possibly\n// across sessions depending on device compatibility) and can be setup by the\n// application as needed. 
For example, an application could perform\n// multi-threaded buffer view creation and then issue the call from a single\n// thread when all inputs are ready. This simple demo just allocates them\n// per-call and throws them away.\n\n// Sets up and calls the simple_mul function and dumps the results:\n// func.func @simple_mul(%arg0: tensor<4xf32>, %arg1: tensor<4xf32>) ->\n// tensor<4xf32>\n//\n// NOTE: this is a demo and as such this performs no memoization; a real\n// application could reuse a lot of these structures and cache lookups of\n// iree_vm_function_t to reduce the amount of per-call overhead.\nstatic iree_status_t iree_runtime_demo_perform_mul(\n iree_runtime_session_t* session) {\n // Initialize the call to the function.\n iree_runtime_call_t call;\n IREE_RETURN_IF_ERROR(iree_runtime_call_initialize_by_name(\n session, iree_make_cstring_view(\"module.simple_mul\"), &call));\n\n // Append the function inputs with the HAL device allocator in use by the\n // session. The buffers will be usable within the session and _may_ be usable\n // in other sessions depending on whether they share a compatible device.\n iree_hal_device_t* device = iree_runtime_session_device(session);\n iree_hal_allocator_t* device_allocator =\n iree_runtime_session_device_allocator(session);\n iree_allocator_t host_allocator =\n iree_runtime_session_host_allocator(session);\n iree_status_t status = iree_ok_status();\n {\n // %arg0: tensor<4xf32>\n iree_hal_buffer_view_t* arg0 = NULL;\n if (iree_status_is_ok(status)) {\n static const iree_hal_dim_t arg0_shape[1] = {4};\n static const float arg0_data[4] = {1.0f, 1.1f, 1.2f, 1.3f};\n status = iree_hal_buffer_view_allocate_buffer_copy(\n device, device_allocator,\n // Shape rank and dimensions:\n IREE_ARRAYSIZE(arg0_shape), arg0_shape,\n // Element type:\n IREE_HAL_ELEMENT_TYPE_FLOAT_32,\n // Encoding type:\n IREE_HAL_ENCODING_TYPE_DENSE_ROW_MAJOR,\n (iree_hal_buffer_params_t){\n // Where to allocate (host or device):\n .type = IREE_HAL_MEMORY_TYPE_DEVICE_LOCAL,\n // Access to allow to this memory:\n .access = IREE_HAL_MEMORY_ACCESS_ALL,\n // Intended usage of the buffer (transfers, dispatches, etc):\n .usage = IREE_HAL_BUFFER_USAGE_DEFAULT,\n },\n // The actual heap buffer to wrap or clone and its allocator:\n iree_make_const_byte_span(arg0_data, sizeof(arg0_data)),\n // Buffer view + storage are returned and owned by the caller:\n &arg0);\n }\n if (iree_status_is_ok(status)) {\n IREE_IGNORE_ERROR(iree_hal_buffer_view_fprint(\n stdout, arg0, /*max_element_count=*/4096, host_allocator));\n // Add to the call inputs list (which retains the buffer view).\n status = iree_runtime_call_inputs_push_back_buffer_view(&call, arg0);\n }\n // Since the call retains the buffer view we can release it here.\n iree_hal_buffer_view_release(arg0);\n\n fprintf(stdout, \"\\n * \\n\");\n\n // %arg1: tensor<4xf32>\n iree_hal_buffer_view_t* arg1 = NULL;\n if (iree_status_is_ok(status)) {\n static const iree_hal_dim_t arg1_shape[1] = {4};\n static const float arg1_data[4] = {10.0f, 100.0f, 1000.0f, 10000.0f};\n status = iree_hal_buffer_view_allocate_buffer_copy(\n device, device_allocator, IREE_ARRAYSIZE(arg1_shape), arg1_shape,\n IREE_HAL_ELEMENT_TYPE_FLOAT_32,\n IREE_HAL_ENCODING_TYPE_DENSE_ROW_MAJOR,\n (iree_hal_buffer_params_t){\n .type = IREE_HAL_MEMORY_TYPE_DEVICE_LOCAL,\n .access = IREE_HAL_MEMORY_ACCESS_ALL,\n .usage = IREE_HAL_BUFFER_USAGE_DEFAULT,\n },\n iree_make_const_byte_span(arg1_data, sizeof(arg1_data)), &arg1);\n }\n if (iree_status_is_ok(status)) {\n 
IREE_IGNORE_ERROR(iree_hal_buffer_view_fprint(\n stdout, arg1, /*max_element_count=*/4096, host_allocator));\n status = iree_runtime_call_inputs_push_back_buffer_view(&call, arg1);\n }\n iree_hal_buffer_view_release(arg1);\n }\n\n // Synchronously perform the call.\n if (iree_status_is_ok(status)) {\n status = iree_runtime_call_invoke(&call, /*flags=*/0);\n }\n\n fprintf(stdout, \"\\n = \\n\");\n\n // Dump the function outputs.\n iree_hal_buffer_view_t* ret0 = NULL;\n if (iree_status_is_ok(status)) {\n // Try to get the first call result as a buffer view.\n status = iree_runtime_call_outputs_pop_front_buffer_view(&call, &ret0);\n }\n if (iree_status_is_ok(status)) {\n // This prints the buffer view out but an application could read its\n // contents, pass it to another call, etc.\n status = iree_hal_buffer_view_fprint(\n stdout, ret0, /*max_element_count=*/4096, host_allocator);\n }\n iree_hal_buffer_view_release(ret0);\n\n iree_runtime_call_deinitialize(&call);\n return status;\n}\n
hello_world.c
runtime/demo/
runtime-library/
libireert.so
simple_embedding.c
The compiler and runtime APIs may be used together to build a \"just in time\" (JIT) execution engine. JIT compilation allows for last-minute specialization with no prior knowledge of target devices and avoids issues with version drift, but it can also constrain deployment options and usage scenarios.
IREE offers Python bindings split into several packages, covering different components:
iree-tools-tf
iree-tools-tflite
iree-jax
Collectively, these packages allow for importing from frontends, compiling towards various targets, and executing compiled code on IREE's backends.
To use IREE's Python bindings, you will first need to install Python 3 and pip, as needed.
python -m pip install \\\n iree-compiler \\\n iree-runtime\n
python -m pip install \\\n --find-links https://iree.dev/pip-release-links.html \\\n --upgrade \\\n iree-compiler \\\n iree-runtime\n
See the Building Python bindings page for instructions on building from source.
Info - API reference pages
API reference pages for IREE's runtime and compiler Python APIs are hosted on readthedocs.
Documentation for the MLIR compiler Python APIs can be found at https://mlir.llvm.org/docs/Bindings/Python/.
from iree import compiler as ireec\n\n# Compile a module.\nINPUT_MLIR = \"\"\"\nmodule @arithmetic {\n func.func @simple_mul(%arg0: tensor<4xf32>, %arg1: tensor<4xf32>) -> tensor<4xf32> {\n %0 = arith.mulf %arg0, %arg1 : tensor<4xf32>\n return %0 : tensor<4xf32>\n }\n}\n\"\"\"\n\n# Compile using the vmvx (reference) target:\ncompiled_flatbuffer = ireec.tools.compile_str(\n INPUT_MLIR,\n target_backends=[\"vmvx\"])\n
from iree import runtime as ireert\nimport numpy as np\n\n# Register the module with a runtime context.\n# Use the \"local-task\" CPU driver, which can load the vmvx executable:\nconfig = ireert.Config(\"local-task\")\nctx = ireert.SystemContext(config=config)\nvm_module = ireert.VmModule.copy_buffer(ctx.instance, compiled_flatbuffer)\nctx.add_vm_module(vm_module)\n\n# Invoke the function and print the result.\nprint(\"INVOKE simple_mul\")\narg0 = np.array([1., 2., 3., 4.], dtype=np.float32)\narg1 = np.array([4., 5., 6., 7.], dtype=np.float32)\nf = ctx.modules.arithmetic[\"simple_mul\"]\nresults = f(arg0, arg1).to_host()\nprint(\"Results:\", results)\n
Check out the samples in IREE's samples/colab/ directory and the iree-experimental repository for examples using the Python APIs.
The Python packages include console scripts for most of IREE's native tools like iree-compile and iree-run-module. After installing a package from pip, these should be added to your path automatically:
$ python -m pip install iree-runtime\n$ which iree-run-module\n\n/projects/.venv/Scripts/iree-run-module\n
The tools in the iree-runtime package are built in multiple variants, including a default build and a Tracy-instrumented build for profiling.
Switch between variants of the installed tools using the IREE_PY_RUNTIME environment variable:
IREE_PY_RUNTIME=tracy iree-run-module ...\n
See the developer documentation page on Profiling with Tracy for information on using Tracy.
Tip - flushing profile data
When writing a Python-based program that you want to profile you may need to insert IREE runtime calls to periodically flush the profile data:
device = ... # HalDevice\ndevice.flush_profiling()\n
These pages contain automatically generated documentation for the MLIR dialects defined in the IREE repository. IREE also makes extensive use of dialects from the upstream MLIR repository, which are documented at https://mlir.llvm.org/docs/Dialects/.
These dialects are an implementation detail of the IREE compiler, though they can be used by plugins and other advanced integrations. The sources for most of these dialects can be found in the iree/compiler/Dialect/ directory.
The ops in these dialects are legal to include in compiler inputs. The sources for these dialects can be found in the llvm-external-projects/iree-dialects/ directory that is designed to be used from other projects via LLVM's external projects mechanism.
Hardware Abstraction Layer\u00a0\u21a9
A dialect implementing test assertions for IREE modules.
check.expect_all_true
Checks that the operand contains only values that are true
Syntax:
operation ::= `check.expect_all_true` (`` `<` $device^ `>`)?\n `` `(` $operand `)` attr-dict `:` type($operand)\n
Verifies that the operand contains true values, which are represented by any non-zero integer.
Issues a non-fatal failure if the verification fails.
check.expect_all_true<%device>(%arg0) : !hal.buffer_view\ncheck.expect_all_true(%arg1) : tensor<2x2xi32>\n
device
operand
check.expect_almost_eq
Checks that the operands are almost equal
operation ::= `check.expect_almost_eq` (`` `<` $device^ `>`)?\n `` `(` $lhs `,` $rhs `)` attr-dict `:` type($lhs)\n
Verifies that the buffer view or tensor operands with float elements are almost equal to within an implementation-defined \"reasonable\" tolerance.
check.expect_almost_eq(%arg0, %arg1) : tensor<5xf32>\n
lhs
rhs
check.expect_almost_eq_const
Checks that the tensor operand is almost equal to some constant
operation ::= `check.expect_almost_eq_const` (`` `<` $device^ `>`)?\n `` `(` $lhs `,` $value `)` attr-dict `:` type($lhs)\n
Verifies that the tensor operand with float elements is almost equal to the constant attribute within an implementation-defined \"reasonable\" tolerance.
This op is just a convenience wrapper around the expect_almost_eq op.
check.expect_almost_eq_const(%const0, dense<[0.999999, 2.0]> : tensor<5xf32>) : tensor<5xf32>\n
check.expect_eq
Checks that the tensor or buffer view operands are equal
operation ::= `check.expect_eq` (`` `<` $device^ `>`)?\n `` `(` $lhs `,` $rhs `)` attr-dict `:` type($lhs)\n
Verifies that the operands are exactly equal.
check.expect_eq(%arg0, %arg1) : tensor<5xi32>\n
check.expect_eq_const
Checks that the tensor operand is equal to some constant
operation ::= `check.expect_eq_const` (`` `<` $device^ `>`)?\n `` `(` $lhs `,` $value `)` attr-dict `:` type($lhs)\n
Verifies that the tensor operand is exactly equal to a constant attribute.
This op is just a convenience wrapper around the expect_eq op.
check.expect_eq_const(%arg0, dense<[1, 2]> : tensor<2xi32>) : tensor<2xi32>
check.expect_false
Checks that the operand is false
operation ::= `check.expect_false` `(` $operand `)` attr-dict `:` type($operand)\n
Verifies that the operand contains a false value, which is represented by zero.
check.expect_false(%arg0) : i32
check.expect_true
Checks that the operand is true
operation ::= `check.expect_true` `(` $operand `)` attr-dict `:` type($operand)\n
Verifies that the operand contains a true value, which is represented by any non-zero integer.
check.expect_true(%arg0) : i32
A dialect designed to model execution data flow and partitioning.
The flow dialect is used to model regions of dense computation and the data flow between them. MLIR value-semantic tensors are used as the primary data type to allow SSA use-def to provide a bulk of the infrastructure required to perform the computation partitioning and outlining.
The dialect is designed to ingest relatively high-level linear algebra via XLA HLO ops (that also operate on the value-semantic tensor types) and optionally MLIR standard ops for control flow and other actions. After conversion of any higher-level ops that have special semantics in the flow dialect, such as global variables, the rest are partitioned into regions containing simple and compatible computations. Finally, outlining moves the computations into executables and leaves only the execution flow encoded via dispatch operations.
The primary unit of interest is a "dispatch region" containing compatible computations that can be scheduled together efficiently (and safely). "Compatible" here means similarly shaped workloads that indicate how many invocations a computation can be parallelized across when running in a SPMD execution model. Though it depends on the particular runtime backends, this more concretely means things like the untiled workload (or tiled workgroups) used in GPU dispatches or similar thread pool executors.
After identification of the dispatchable regions, a set of transformations performs folding and simplification to reduce the total number of dispatches. Heuristics are used in certain cases to more efficiently schedule special ops (such as GEMM) and the design is amenable to profile-guided analysis that can be added in the future.
The resulting outlined executable modules containing the dispatchable code can be translated to one or more backends (such as SPIR-V for Vulkan, or LLVM IR for running on the CPU, etc). The IR that is outlined is untouched and in the input format (such as XLA HLO ops) allowing conversion using any MLIR target that supports ingesting such input. A few special ops are used to communicate statically available information such as the expected workload size, shapes of inputs and outputs, etc.
flow.channel.count
Returns the total number of participants in the group
operation ::= `flow.channel.count` $channel `:` type($result)\n attr-dict-with-keyword\n
Returns the total participant count in the collective communicator group.
Traits: AlwaysSpeculatableImplTrait
AlwaysSpeculatableImplTrait
Interfaces: ConditionallySpeculatable, InferTypeOpInterface, NoMemoryEffect (MemoryEffectOpInterface), OpAsmOpInterface
ConditionallySpeculatable
InferTypeOpInterface
NoMemoryEffect (MemoryEffectOpInterface)
OpAsmOpInterface
Effects: MemoryEffects::Effect{}
MemoryEffects::Effect{}
channel
result
flow.channel.default
Returns a default collective communication channel
operation ::= `flow.channel.default` ($group^)?\n `:` type($result)\n attr-dict-with-keyword\n
Returns a channel initialized using the runtime environment.
group
flow.channel.rank
Returns the rank of the local participant in the group
operation ::= `flow.channel.rank` $channel `:` type($result)\n attr-dict-with-keyword\n
Returns the rank the channel represents as a participant in a collective group in [0, count).
[0, count)
flow.channel.split
Splits a collective communication channel
operation ::= `flow.channel.split` $channel `,` $color `,` $key\n `:` type($channel) `->` type($result)\n attr-dict-with-keyword\n
Partitions the group associated with the given channel into disjoint subgroups for each unique value of color. Each new subgroup contains all participants of the same color and within each subgroup the key argument is used to define the rank order. When multiple participants in a group use the same key the tie will be broken using their rank in the parent group.
Interfaces: InferTypeOpInterface, OpAsmOpInterface
color
key
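As a rough illustration of how these channel ops compose, the sketch below follows the assembly formats above; the SSA names, the index result types, and the use of index values for the split color/key are assumptions rather than taken from an existing example:

// Obtain the default channel provided by the runtime environment.
%channel = flow.channel.default : !flow.channel
// Query this participant's rank and the total participant count.
%rank = flow.channel.rank %channel : index
%count = flow.channel.count %channel : index
// Partition the parent channel; %color and %key are assumed index values.
%subgroup = flow.channel.split %channel, %color, %key : !flow.channel -> !flow.channel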
flow.collective.all_gather
Performs all-gather operation
operation ::= `flow.collective.all_gather` $element_type `,` $target `,` $source `,` $channel `:`\n `(` type($target) `,` type($source) `,` type($channel) `)` `->`\n custom<ShapedTiedResult>(type($result), $target_dims, $tied_operands)\n attr-dict-with-keyword\n
It gathers data from all ranks and concatenates them on the 0-th dimension.
Interfaces: InferTypeOpInterface, TiedOpInterface
TiedOpInterface
tied_operands
target
target_dims
source
flow.collective.all_reduce
Performs all-reduce operation
operation ::= `flow.collective.all_reduce` $reduction_op `,` $element_type `,` $target `,` $source `,` $channel `:`\n `(` type($target) `,` type($source) `,` type($channel) `)` `->`\n custom<ShapedTiedResult>(type($result), $target_dims, $tied_operands)\n attr-dict-with-keyword\n
The operation reduces data across all the ranks in the channel.
Interfaces: InferTypeOpInterface, TiedOpInterface
reduction_op
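A hedged sketch of an all-reduce following the assembly format above; the sum reduction keyword, the f32 element type spelling, the tied-result form (-> %dst as tensor<...>), and the shapes and SSA names are all assumptions:

// %recv is the target buffer and %send the source; both participate in %channel's group.
%result = flow.collective.all_reduce sum, f32, %recv, %send, %channel
    : (tensor<1024xf32>, tensor<1024xf32>, !flow.channel) -> %recv as tensor<1024xf32>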
flow.collective.all_to_all
Performs all-to-all operation
operation ::= `flow.collective.all_to_all` $element_type `,` $target `,` $source `,` $channel `:`\n `(` type($target) `,` type($source) `,` type($channel) `)` `->`\n custom<ShapedTiedResult>(type($result), $target_dims, $tied_operands)\n attr-dict-with-keyword\n
This operation mutually exchanges data across all of the ranks in the channel.
Interfaces: InferTypeOpInterface, TiedOpInterface
flow.collective.reduce_scatter
Performs reduce and scatter operations
operation ::= `flow.collective.reduce_scatter` $reduction_op `,` $element_type `,` $target `,` $source `,` $channel `:`\n `(` type($target) `,` type($source) `,` type($channel) `)` `->`\n custom<ShapedTiedResult>(type($result), $target_dims, $tied_operands)\n attr-dict-with-keyword\n
The operation reduces data across all the ranks in the channel and scatters the result to each rank.
Interfaces: InferTypeOpInterface, TiedOpInterface
flow.collective.send_recv
Performs a grouped send and receive operation
operation ::= `flow.collective.send_recv` $element_type `,` $target `,` $source `,` $channel `,` $send `,` $recv `:`\n `(` type($target) `,` type($source) `,` type($channel) `,` type($send) `,` type($recv) `)` `->`\n custom<ShapedTiedResult>(type($result), $target_dims, $tied_operands)\n attr-dict-with-keyword\n
The operation sends data to the rank specified by send and receives data from the rank specified by recv. If send is -1, this rank will not send any data. If recv is -1, this rank will not receive any data and the output will be all zeros.
Interfaces: InferTypeOpInterface, TiedOpInterface
send
recv
flow.dispatch
A dispatch of workgroups across a grid
operation ::= `flow.dispatch` custom<DispatchEntryPoints>($entry_points)\n (`[` $workload^ `]`)? ``\n `(` $arguments `)` attr-dict `:`\n custom<ShapedFunctionType>(ref($arguments),\n type($arguments), $argument_dims,\n type($results), $result_dims,\n $tied_operands)\n
Dispatches workgroups across a grid defined by the captured workload parameters carrying the information required to compute the workgroup count at runtime. The function for converting the workload into a 3D workgroup count is attached to the dispatch entry point and may contain arbitrary host logic.
Traits: AlwaysSpeculatableImplTrait, AttrSizedOperandSegments
AttrSizedOperandSegments
Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface), SymbolUserOpInterface, TiedOpInterface, Util_ShapeAwareOp
SymbolUserOpInterface
Util_ShapeAwareOp
entry_points
workload
arguments
argument_dims
result_dims
results
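A minimal sketch of a dispatch, assuming a hypothetical @example_ex::@example_entry export and a static workload; the names and shapes are illustrative only:

// Dispatch %input across a grid whose size is computed from the %workload value.
%workload = arith.constant 128 : index
%result = flow.dispatch @example_ex::@example_entry[%workload](%input)
    : (tensor<128xf32>) -> tensor<128xf32>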
Executables for outlined regions.
flow.executable_end
Terminator pseudo-op for the executable op
operation ::= `flow.executable_end` attr-dict\n
Traits: HasParent<IREE::Flow::ExecutableOp>, Terminator
HasParent<IREE::Flow::ExecutableOp>
Terminator
flow.executable.export
Defines an executable entry point for dispatch operations
operation ::= `flow.executable.export` custom<SymbolVisibility>($sym_visibility)\n custom<SymbolAlias>($sym_name, $function_ref)\n custom<WorkgroupCountRegion>($workgroup_count)\n attr-dict-with-keyword\n
Specifies an exported function with an externally-visible alias. Multiple exports can reference the same internal function.
Each entry point can have a unique workgroup count calculation region. This region takes the workload parameters passed to each flow.dispatch and produces an XYZ workgroup count for the 3D grid dispatch.
Traits: HasParent<IREE::Flow::ExecutableOp>, IsolatedFromAbove
IsolatedFromAbove
Interfaces: Symbol
Symbol
sym_visibility
sym_name
function_ref
flow.executable
Generic executable module
operation ::= `flow.executable` custom<SymbolVisibility>($sym_visibility)\n $sym_name\n attr-dict-with-keyword\n regions\n
An executable module containing one or more public functions. The contents of the functions are safe to dispatch and can be lowered further to target-specific backend IR representations.
Traits: IsolatedFromAbove, SingleBlockImplicitTerminator<IREE::Flow::ExecutableEndOp>, SingleBlock, SymbolTable, Util_ObjectLike
SingleBlockImplicitTerminator<IREE::Flow::ExecutableEndOp>
SingleBlock
SymbolTable
Util_ObjectLike
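A rough sketch of the surrounding structure, reusing the hypothetical @example_ex names from the dispatch example above; the optional workgroup count region on the export is omitted and the inner function body is elided:

flow.executable private @example_ex {
  flow.executable.export public @example_entry
  builtin.module {
    func.func @example_entry(%in: !flow.dispatch.tensor<readonly:tensor<128xf32>>,
                             %out: !flow.dispatch.tensor<writeonly:tensor<128xf32>>) {
      // ... loads, dense computation, and stores on the dispatch tensors ...
      return
    }
  }
}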
flow.dispatch.region
A group of ops
This op is a container/grouping of ops. It represents a fusion group before being lowered to a dispatch region. Ops are collected inside of the region body of the op. Values from parent regions can be captured. Results are yielded with a return terminator and returned from this op.
dispatch.region ops are lowered to dispatch.workgroups ops, which are isolated from above. dispatch.region ops are a more lightweight abstraction for implementing fusion heuristics, i.e., the process of deciding which ops should form a dispatch region.
dispatch.region
dispatch.workgroups
This op also has a second region: workload_count. The arguments to that region represent the workload for the dispatch and the region returns the number of workgroups for the dispatch. The region is lowered directly to the workload_count region of dispatch.workgroups.
workload_count
Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface), Util_ShapeAwareOp
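A minimal sketch of a fusion group expressed with this op; the ops placed inside the region and the static result type are assumptions:

%cst = arith.constant 0.0 : f32
%result = flow.dispatch.region -> (tensor<4xf32>) {
  // Ops collected into the future dispatch; values from the parent region may be captured.
  %init = tensor.empty() : tensor<4xf32>
  %fill = linalg.fill ins(%cst : f32) outs(%init : tensor<4xf32>) -> tensor<4xf32>
  flow.return %fill : tensor<4xf32>
}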
flow.dispatch.tensor.load
Loads a tensor from a dispatch input placeholder
operation ::= `flow.dispatch.tensor.load` $source\n `,` `offsets` `=` custom<DynamicIndexList>(\n $offsets, $static_offsets)\n `,` `sizes` `=` custom<DynamicIndexList>(\n $sizes, $static_sizes)\n `,` `strides` `=` custom<DynamicIndexList>(\n $strides, $static_strides)\n attr-dict `:` type($source) (`{` $source_dims^ `}`)? `->` type($result)\n
Loads an input tensor or subtensor from an input placeholder. As each workgroup executes concurrently, all workgroups will receive identical loaded results of regions that may overlap.
Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface), OffsetSizeAndStrideOpInterface, ReifyRankedShapedTypeOpInterface, TiedOpInterface, Util_ShapeAwareOp
OffsetSizeAndStrideOpInterface
ReifyRankedShapedTypeOpInterface
static_offsets
static_sizes
static_strides
source_dims
offsets
sizes
strides
flow.dispatch.tensor.store
Stores a tensor into a dispatch output placeholder
operation ::= `flow.dispatch.tensor.store` $value `,` $target\n `,` `offsets` `=` custom<DynamicIndexList>(\n $offsets, $static_offsets)\n `,` `sizes` `=` custom<DynamicIndexList>(\n $sizes, $static_sizes)\n `,` `strides` `=` custom<DynamicIndexList>(\n $strides, $static_strides)\n attr-dict `:` type($value) `->` type($target) (`{` $target_dims^ `}`)?\n
Stores a tensor or subtensor into an output tensor placeholder. As each workgroup executes concurrently, behavior is undefined if more than one workgroup stores into overlapping regions of the full output tensor.
Traits: AttrSizedOperandSegments
Interfaces: OffsetSizeAndStrideOpInterface, Util_ShapeAwareOp
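A hedged sketch of a workgroup body pairing a load with a store; the binding names, tile sizes, and shapes are illustrative assumptions:

// Load a 16x16 tile from the read-only input binding...
%tile = flow.dispatch.tensor.load %in, offsets = [0, 0], sizes = [16, 16], strides = [1, 1]
    : !flow.dispatch.tensor<readonly:tensor<64x64xf32>> -> tensor<16x16xf32>
// ...and store the (possibly transformed) tile to the write-only output binding.
flow.dispatch.tensor.store %tile, %out, offsets = [0, 0], sizes = [16, 16], strides = [1, 1]
    : tensor<16x16xf32> -> !flow.dispatch.tensor<writeonly:tensor<64x64xf32>>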
flow.dispatch.tie_shape
Ties a runtime shape to a dispatch I/O argument
operation ::= `flow.dispatch.tie_shape` $operand attr-dict\n `:` type($result) (`{` $dynamic_dims^ `}`)?\n
Metadata op used to tie a runtime-computed shape with dynamic dimensions to a dispatch input/output argument. All uses of the argument should use the pass-through result of this op to allow for SSA-based shape resolution.
Interfaces: ConditionallySpeculatable, InferTypeOpInterface, NoMemoryEffect (MemoryEffectOpInterface), ReifyRankedShapedTypeOpInterface, Util_ShapeAwareOp
dynamic_dims
flow.dispatch.workgroup.count
Returns the total workgroup count of the grid
operation ::= `flow.dispatch.workgroup.count` `[` $dimension `]` attr-dict `:` type($result)\n
The total number of workgroups along each dimension in the dispatch grid.
Represented as a 3D grid classically written as XYZ. Corresponds to the NumWorkgroups SPIR-V built-in and the gridDim CUDA built-in variable.
NumWorkgroups
gridDim
%x = flow.dispatch.workgroup.count[0] : index
%y = flow.dispatch.workgroup.count[1] : index
%z = flow.dispatch.workgroup.count[2] : index
dimension
flow.dispatch.workgroup.id
Returns the index of the current workgroup in the grid
operation ::= `flow.dispatch.workgroup.id` `[` $dimension `]` attr-dict `:` type($result)\n
The global workgroup ID of the current workgroup in the range of [0, flow.dispatch.workgroup.count) along each dimension.
[0, flow.dispatch.workgroup.count)
Represented as a 3D grid classically written as XYZ. Corresponds to the WorkgroupId SPIR-V built-in and the blockIdx CUDA built-in variable.
WorkgroupId
blockIdx
%x = flow.dispatch.workgroup.id[0] : index
%y = flow.dispatch.workgroup.id[1] : index
%z = flow.dispatch.workgroup.id[2] : index
flow.dispatch.workgroup.size
Returns the size of each workgroup in invocations
operation ::= `flow.dispatch.workgroup.size` `[` $dimension `]` attr-dict `:` type($result)\n
The number of local invocations within the current workgroup along each dimension. Depending on backend this may map to the SIMT thread count or inner loop nest parameters.
Workgroup sizes are not determined at the flow dialect level as they are dependent on the target backend determined when lowering into the HAL. It's still possible to use the symbolic workgroup size inside of dispatch executables as a placeholder for the resolved value once in the HAL.
Represented as a 3D grid classically written as XYZ. Corresponds to the WorkgroupSize SPIR-V built-in and the blockDim CUDA built-in variable.
WorkgroupSize
blockDim
%x = flow.dispatch.workgroup.size[0] : index
%y = flow.dispatch.workgroup.size[1] : index
%z = flow.dispatch.workgroup.size[2] : index
flow.dispatch.workgroups
A dispatch of workgroups across a 3-dimensional grid
operation ::= `flow.dispatch.workgroups` (`[` $workload^ `]`)? ``\n `(` $arguments `)` `:`\n custom<ShapedFunctionType>(ref($arguments),\n type($arguments), $argument_dims,\n type($results), $result_dims,\n $tied_operands)\n attr-dict-with-keyword\n `=` `\\n` ` ` ` ` ` `\n custom<DispatchWorkgroupBody>(ref(type($arguments)),\n ref(type($results)),\n $workgroup_body)\n `` custom<DispatchWorkgroupsCountRegion>($workgroup_count)\n
Dispatches some number of workgroups across a 3-dimensional grid. The body region will be invoked for each workgroup with a unique flow.dispatch.workgroup.id in the range of [0, flow.dispatch.workgroup.count) (along each dimension XYZ).
From the outside the dispatch operation has value semantics: some tensors (and optionally other primitive types) are consumed and one or more new result tensors are produced. Inside each workgroup, however, the input and output tensors are available for arbitrary loads and stores. In many cases each workgroup will load some particular tile(s) from the input tensors and store some particular tile(s) to the output tensors unique to that workgroup. Though it's possible for multiple workgroups to load the same regions of the input tensors, behavior is undefined if multiple workgroups store to the same regions of the output tensors.
Though the representation is similar to the GPU-style grid dispatch model, at this point we have not yet allocated buffers, determined the target device for execution, or even finished resolving shapes/types/etc. Because of this it's important that the workgroup body use the flow.dispatch.workgroup.* ops to query the workgroup ID/count/size instead of hardcoding them to a particular set of values. Assume that any workgroup dispatch may end up being specialized for several different target devices and even several different variants for a particular target device (differing workgroup sizes, etc).
flow.dispatch.workgroup.*
Because devices have not yet been selected at this point in the layering, the workgroup count cannot be fully evaluated. Instead, workload parameters are captured and later passed to a function that computes the actual workgroup count based on target information. The workload is not limited to the 3D XYZ grid dispatch of the workgroup count and can contain any number of parameters used to compute it.
%r = flow.dispatch.workgroups[%c5, %c5](%0, %1)
    : (tensor<5x5xf32>, tensor<5xf32>) -> tensor<5x5xf32> =
  (%arg0: !flow.dispatch.tensor<readonly:tensor<5x5xf32>>,
   %arg1: !flow.dispatch.tensor<readonly:tensor<5xf32>>,
   %arg2: !flow.dispatch.tensor<writeonly:tensor<5x5xf32>>) {
  ...
}
The number of results of the operation is equal to the number of results in the type signature ((tensor<5x5xf32>, tensor<5xf32>) -> tensor<5x5xf32>). Each tensor argument and result in the type signature has a corresponding block argument of type !flow.dispatch.tensor. Furthermore, each argument has a corresponding arguments operand.
(tensor<5x5xf32>, tensor<5xf32>) -> tensor<5x5xf32>
!flow.dispatch.tensor
There are no argument operands for results, but a result can be tied to an argument by writing the argument operand's SSA value instead of its type: e.g., in the above example, -> %0 would tie the first argument to the result. In that case, there would be no separate block argument for the result.
-> %0
Traits: AlwaysSpeculatableImplTrait, AttrSizedOperandSegments, IsolatedFromAbove
Interfaces: ClosureOpInterface, ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface), TiedOpInterface, Util_ShapeAwareOp
ClosureOpInterface
flow.return
Return from a flow.dispatch_region
operation ::= `flow.return` attr-dict ($operands^ `:` type($operands))?\n
Returns the given values from the region and back to the host code.
Traits: AlwaysSpeculatableImplTrait, ReturnLike, Terminator
ReturnLike
Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface), RegionBranchTerminatorOpInterface
RegionBranchTerminatorOpInterface
operands
flow.call
Calls a streamable external host function
operation ::= `flow.call` $callee\n `(` $arguments `)` attr-dict `:`\n custom<ShapedFunctionType>(ref($arguments),\n type($arguments), $argument_dims,\n type($results), $result_dims,\n $tied_operands)\n
Calls a function taking/returning tensor values with stream semantics. Tensors have their shapes captured and may be tied to denote in-place operations. Asynchronous calls must have no side-effects.
Note that returned tensors must have their shapes declared prior to the call as this is what allows the call to be made on the stream. If external host logic is required to compute the shape (avoid at all costs!) a separate func.call can be used outside of the stream to do so. If shapes are unknowable until the operation is performed it should be made as a normal asynchronous host call with 'coarse-fences' instead.
Interfaces: CallOpInterface, SymbolUserOpInterface, TiedOpInterface, Util_ShapeAwareOp
CallOpInterface
callee
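A small sketch of calling a hypothetical external function with a dynamically shaped tensor; note how the dimension is captured alongside both the argument and result types, per the description above. The callee name and shapes are illustrative:

%c0 = arith.constant 0 : index
%dim = tensor.dim %input, %c0 : tensor<?xf32>
// @example.extern_fn is a hypothetical flow.func declared elsewhere.
%result = flow.call @example.extern_fn(%input) : (tensor<?xf32>{%dim}) -> tensor<?xf32>{%dim}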
flow.func
Streamable function declaration
operation ::= `flow.func` custom<SymbolVisibility>($sym_visibility)\n $sym_name\n ``\n custom<ShapedFunctionSignature>($function_type,\n $tied_operands,\n $arg_attrs,\n $res_attrs)\n attr-dict-with-keyword\n ($body^)?\n
Declares a function that can be called as an asynchronous streaming operation via flow.call. Today only external functions are allowed.
Traits: IsolatedFromAbove
Interfaces: CallableOpInterface, FunctionOpInterface, Symbol
CallableOpInterface
FunctionOpInterface
function_type
arg_attrs
res_attrs
flow.dispatch.workgroup_count_from_dag_root
Workgroup count computed based on iteration range of the root of the DAG for ops within the dispatch.
operation ::= `flow.dispatch.workgroup_count_from_dag_root` attr-dict $operands\n
Used when tiling + distributing the root of the DAG (Directed Acyclic Graph) of ops within the dispatch to split the work amongst workgroups. The workload captured is the size of the iteration space of the root of the DAG. This op represents the computation that, given the workload, returns the number of workgroups to use. The backends are responsible for lowering this op into actual computation (typically based on the tile sizes used to tile and distribute the root of the DAG).
Interfaces: ConditionallySpeculatable, InferTypeOpInterface, NoMemoryEffect (MemoryEffectOpInterface)
y
z
flow.dispatch.workgroup_count_from_slice
Placeholder to signify the default workgroup count calculation.
operation ::= `flow.dispatch.workgroup_count_from_slice` attr-dict $operands\n
The default computation of the number of workgroups (or workgroup count) assumes that the dispatch + captured values is enough to compute the workgroup count. It does so by using a program slice of the values within the dispatch that represent the number of workgroups when available within the dispatch. Currently the arguments of index type captured by the flow.dispatch.workgroups op are treated as the workload for the operation. The slice of the program that computes the number of workgroups must have these captured values as its leaves.
TODO: This could be generalized in future to allow the slices to encompass arbitrary computation. The computation of the workgroup count can then be done on the device itself, if this is data dependent. In such cases the workload could be more than just values of index types.
flow.dispatch.workload.ordinal
Annotates the values captured as workload within the body of a flow.dispatch.workgroups op.
operation ::= `flow.dispatch.workload.ordinal` attr-dict $operand `,` $ordinal `:` type($operand)\n
The arguments that represent the captured/returned values of the flow.dispatch.workgroups op, i.e. the signature of the body of the op, are not preserved during IREE's compilation. Since the workloads are derived from the operands captured by the operation, this op denotes the values captured as workloads. This can be used in the backends to map back to the workload values while materializing the workgroup count computation.
TODO: Find a better way to represent this information, either by somehow propagating the signature of the created dispatch workgroup op through the compilation stack until the codegen backends, or as a separate list/attribute that can be plumbed through without using explicit ops.
ordinal
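A minimal sketch, assuming a captured index value %workload_value inside a flow.dispatch.workgroups body; the ordinal 0 marks it as the first workload operand:

%annotated = flow.dispatch.workload.ordinal %workload_value, 0 : index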
flow.tensor.alloca
An empty tensor allocation with undefined contents
operation ::= `flow.tensor.alloca` `:` type($result) (`{` $result_dims^ `}`)?\n attr-dict-with-keyword\n
Returns a new transient tensor allocation with undefined contents. Subsequent writes must populate any ranges of the tensor that are later read. The resulting tensor may be long-lived and allocated as part of a dedicated allocation. Prefer using flow.tensor.empty whenever possible as this op disables nearly all allocation-related optimizations performed by the compiler. The presence of this op is often an indication of an improper lowering.
flow.tensor.empty
Interfaces: MemoryEffectOpInterface (MemoryEffectOpInterface), Util_ShapeAwareOp
MemoryEffectOpInterface (MemoryEffectOpInterface)
Effects: MemoryEffects::Effect{MemoryEffects::Allocate on ::mlir::SideEffects::DefaultResource}
MemoryEffects::Effect{MemoryEffects::Allocate on ::mlir::SideEffects::DefaultResource}
flow.tensor.bitcast
Bitcasts a tensor
operation ::= `flow.tensor.bitcast` $source `:`\n type($source) (`{` $source_dims^ `}`)? `->`\n type($result) (`{` $result_dims^ `}`)?\n attr-dict-with-keyword\n
Bitcasts a tensor to a new type without modifying the contents.
Interfaces: ConditionallySpeculatable, HoistableOpInterface, NoMemoryEffect (MemoryEffectOpInterface), TiedOpInterface, Util_ShapeAwareOp
HoistableOpInterface
flow.tensor.clone
Performs a full tensor clone operation
operation ::= `flow.tensor.clone` $operand `:` type($result) (`{` $argument_dims^ `}`)?\n attr-dict-with-keyword\n
Clones the input tensor into an identical output tensor.
Interfaces: ConditionallySpeculatable, HoistableOpInterface, InferTypeOpInterface, NoMemoryEffect (MemoryEffectOpInterface), Util_ShapeAwareOp
flow.tensor.constant
Tensor constant that can have dynamic dimensions
operation ::= `flow.tensor.constant` $value attr-dict `->` type($result)\n
Allows specifying a constant where the return value can erase shape information. This operation is declared as having side effects and has no folder, so will not be optimized away by the compiler. The underlying shape information should be hidden from the compiler and resolved at runtime.
%c = flow.tensor.constant tensor<2x2xf32> -> tensor<?x?xf32>
%res = math.absf %c : tensor<?x?xf32>
flow.tensor.empty
An empty tensor carrying metadata but no contents
operation ::= `flow.tensor.empty` `:` type($result) (`{` $result_dims^ `}`)?\n attr-dict-with-keyword\n
Returns a tensor with undefined contents. Subsequent writes must populate any ranges of the tensor that are later read.
Interfaces: ConditionallySpeculatable, HoistableOpInterface, NoMemoryEffect (MemoryEffectOpInterface), Util_ShapeAwareOp
flow.tensor.load
Loads a value from a tensor element
operation ::= `flow.tensor.load` $source (`[` $indices^ `]`)? `:`\n type($source) (`{` $source_dims^ `}`)?\n attr-dict-with-keyword\n
Returns the element at the given location from within the tensor.
Interfaces: ConditionallySpeculatable, InferTypeOpInterface, NoMemoryEffect (MemoryEffectOpInterface), Util_ShapeAwareOp
indices
flow.tensor.reshape
Reshapes a tensor
operation ::= `flow.tensor.reshape` $source `:`\n type($source) (`{` $source_dims^ `}`)? `->`\n type($result) (`{` $result_dims^ `}`)?\n attr-dict-with-keyword\n
Reshapes a tensor to a new shape without modifying the contents.
flow.tensor.slice
Slices out a subregion of a tensor
operation ::= `flow.tensor.slice` $source `[` $start_indices `for` $lengths `]` `:`\n type($source) (`{` $source_dims^ `}`)? `->`\n type($result) (`{` $result_dims^ `}`)?\n attr-dict-with-keyword\n
Clones a subregion of a tensor.
start_indices
lengths
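A hedged sketch of slicing, assuming index constants for the start and length and static shapes:

%c2 = arith.constant 2 : index
%c4 = arith.constant 4 : index
// Clone elements [2, 6) of the 8-element source into a new 4-element tensor.
%sub = flow.tensor.slice %input[%c2 for %c4] : tensor<8xf32> -> tensor<4xf32>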
flow.tensor.splat
Splats a value into a shaped tensor
operation ::= `flow.tensor.splat` $value `:` type($result) (`{` $result_dims^ `}`)?\n attr-dict-with-keyword\n
Returns a tensor initialized to the given primitive value.
flow.tensor.store
Stores a value into a tensor element
operation ::= `flow.tensor.store` $value `,` $target (`[` $indices^ `]`)? `:`\n type($target) (`{` $target_dims^ `}`)?\n attr-dict-with-keyword\n
Returns a tensor with the element at the given index set to the given value.
flow.tensor.tie_shape
Ties a runtime shape to a tensor value
operation ::= `flow.tensor.tie_shape` $operand attr-dict\n `:` type($result) (`{` $dynamic_dims^ `}`)?\n
Metadata op used to tie tensors with their runtime-computed dynamic dimensions. This only exists transiently in the IR as a witness to shape calculations and is removed during lowering.
flow.tensor.trace
Traces one or more tensor values at runtime
operation ::= `flow.tensor.trace` $key `=` `[`\n custom<ShapedOperandList>($values, type($values), $value_dims)\n `]` attr-dict-with-keyword\n
Traces out to a runtime trace sink (console, log file, etc) the given tensors. The key is arbitrary and can be used for identifying the set of values being traced.
Interfaces: ShapeAwareOpInterface
ShapeAwareOpInterface
values
value_dims
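A small sketch of tracing two values under an arbitrary key; the dynamic dimension operand in braces follows the shaped-operand syntax above, and all names are illustrative:

flow.tensor.trace "activations" = [
  %0 : tensor<4xf32>,
  %1 : tensor<?x4xf32>{%d0}
]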
flow.tensor.update
Updates a tensor with the contents of another tensor
operation ::= `flow.tensor.update` $update `,` $target `[` $start_indices `]` `:`\n type($update) (`{` $update_dims^ `}`)? `->`\n custom<ShapedTiedResult>(type($result), $target_dims)\n attr-dict-with-keyword\n
Updates the target tensor with the contents of the update tensor at the given offset indices.
Interfaces: ConditionallySpeculatable, HoistableOpInterface, InferTypeOpInterface, NoMemoryEffect (MemoryEffectOpInterface), TiedOpInterface, Util_ShapeAwareOp
update
update_dims
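A minimal sketch, assuming a 2x2 update written into an 8x8 target at offset (0, 4); the tied-result spelling (-> %dst as ...) is an assumption based on the shaped tied-result syntax above:

%c0 = arith.constant 0 : index
%c4 = arith.constant 4 : index
%updated = flow.tensor.update %patch, %dst[%c0, %c4]
    : tensor<2x2xf32> -> %dst as tensor<8x8xf32>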
Syntax: #flow.dummy
#flow.dummy
A placeholder for a dispatch region input/output operand. This can be used to query the metadata about the tensor (such as its shape) as well as both load and store from the backing tensor representation.
A placeholder for a dispatch region input operand. This can be used to query the metadata about the tensor (such as its shape) as well as load from the backing tensor representation.
A placeholder for a dispatch region output operand. This can be used to query the metadata about the tensor (such as its shape) as well as store to the backing tensor representation.
A collective communication channel
Syntax: !flow.channel
!flow.channel
Represents a single participant in a collective clique. Multiple channels may exist within the same program to allow for partial operations or hierarchical operations.
In programs that have already been partitioned prior to being compiled there will often exist only one channel and flow.channel.default can be used to reference it. In programs that model SPMD behavior internally channels can be created or provided by hosting applications.
Syntax: !flow.dummy
!flow.dummy
A dialect representing operations against the IREE HAL.
This can be thought of as a Vulkan-like model with all of the graphics bits chopped out.
The type set is limited to those that can be represented in the IREE HAL design: buffers and views, synchronization primitives like semaphores, and command buffers. The intent is that if a device could implement the HAL interface the sequencer ops could run on that device, such as being able to run on a GPU via indirect command buffers.
Though this is mostly a 1:1 mapping to the iree::hal API, some methods are omitted as they are not likely to be needed in IR. It's assumed that either sequencer interfaces will encapsulate the logic (such as device resolution) or that certain features are unsafe to expose to user-defined input.
Ops for !hal.allocator / iree_hal_allocator_t.
!hal.allocator
hal.allocator.allocate
Empty buffer allocation operation
operation ::= `hal.allocator.allocate` `<` $allocator `:` type($allocator) `>`\n `affinity` `(` $queue_affinity `)`\n `type` `(` $memory_types `)`\n `usage` `(` $buffer_usage `)`\n `:` custom<SizeAwareType>(type($result), $result_size)\n attr-dict-with-keyword\n
Allocates a buffer of the given size from the allocator. The size of the buffer returned may be larger than the requested size if the allocator has specific alignment requirements or minimum allocation sizes.
Interfaces: OpAsmOpInterface, SizeAwareOpInterface
SizeAwareOpInterface
memory_types
buffer_usage
allocator
queue_affinity
result_size
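A hedged sketch of a host-visible staging allocation; the particular memory type and usage strings, the any-queue affinity value, and the size are all assumptions:

%affinity = arith.constant -1 : i64  // assumed "any queue" affinity
%size = arith.constant 1024 : index
%buffer = hal.allocator.allocate<%allocator : !hal.allocator>
    affinity(%affinity) type("HostVisible|HostCoherent") usage("Transfer")
    : !hal.buffer{%size}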
hal.allocator.import
Allocator-supported host buffer import operation
operation ::= `hal.allocator.import` `<` $allocator `:` type($allocator) `>`\n `source` `(` $source `:` type($source) `)` `` `[` $offset `,` $length `]`\n `affinity` `(` $queue_affinity `)`\n `type` `(` $memory_types `)`\n `usage` `(` $buffer_usage `)`\n `:` type($did_import) `,` type($result)\n attr-dict-with-keyword\n
Tries importing host memory backed by the given byte buffer into a device accessible !hal.buffer. The returned buffer may be host-only and not directly usable on devices. If the mapping cannot be completed (such as trying to map the host memory as device-local on devices with discrete memory) then did_import will indicate that the returned buffer is null.
!hal.buffer
did_import
offset
length
Ops for !hal.buffer / iree_hal_buffer_t.
hal.buffer.assert
Buffer compatibility assertion
operation ::= `hal.buffer.assert` `<` $buffer `:` type($buffer) `>`\n `message` `(` $message `)`\n `allocator` `(` $allocator `:` type($allocator) `)`\n `minimum_length` `(` $minimum_length `)`\n `type` `(` $memory_types `)`\n `usage` `(` $buffer_usage `)`\n attr-dict-with-keyword\n
Asserts that the buffer is compatible with the given allocator and usage. Program execution will abort as if std.assert had been used.
This only checks that the buffer can be used and not that it matches the given parameters exactly. Buffers may be from other allocators so long as the allocators are compatible (devices can address each other's memory), the type and usage contain all the requested bits (having more bits is ok), and the length is at least the requested minimum (as padding may be ignored).
message
buffer
minimum_length
hal.buffer.length
Buffer byte length accessor
operation ::= `hal.buffer.length` `<` $buffer `:` type($buffer) `>`\n `:` type($result)\n attr-dict-with-keyword\n
Returns the allocated size of a buffer in bytes. May be less than the underlying buffer allocation if this is a subspan or view into another buffer.
Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface), OpAsmOpInterface
hal.buffer.load
Buffer element load operation
operation ::= `hal.buffer.load` `<` $source_buffer `:` type($source_buffer) `>`\n `` `[` $source_offset `]`\n `:` type($result)\n attr-dict-with-keyword\n
Loads a value from a buffer by mapping it.
Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface)
source_buffer
source_offset
hal.buffer.store
Buffer element store operation
operation ::= `hal.buffer.store` `<` $target_buffer `:` type($target_buffer) `>`\n `` `[` $target_offset `]`\n `value` `(` $value `:` type($value) `)`\n attr-dict-with-keyword\n
Stores a value into a buffer by mapping it.
target_buffer
target_offset
hal.buffer.subspan
Buffer subspan operation
operation ::= `hal.buffer.subspan` `<` $source_buffer `:` type($source_buffer) `>`\n `` `[` $source_offset `,` $length `]`\n `:` type($result)\n attr-dict-with-keyword\n
Returns a reference to a subspan of the buffer.
Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface), OpAsmOpInterface, SizeAwareOpInterface
Ops for !hal.buffer_view / iree_hal_buffer_view_t.
!hal.buffer_view
iree_hal_buffer_view_t
hal.buffer_view.assert
Buffer view contents assertion
operation ::= `hal.buffer_view.assert` `<` $buffer_view `:` type($buffer_view) `>`\n `message` `(` $message `)`\n `shape` `(` `[` $shape `]` `)`\n `type` `(` $element_type `)`\n `encoding` `(` $encoding_type `)`\n attr-dict-with-keyword\n
Asserts that the buffer view contains a data compatible tensor with the given encoding. Program execution will abort as if std.assert had been used.
buffer_view
encoding_type
shape
hal.buffer_view.buffer
Buffer view buffer accessor
operation ::= `hal.buffer_view.buffer` `<` $buffer_view `:` type($buffer_view) `>`\n `:` type($result)\n attr-dict-with-keyword\n
Returns the buffer backing this view's contents.
hal.buffer_view.create
Buffer view reference initializer
operation ::= `hal.buffer_view.create` `buffer` `(` $source_buffer `:` type($source_buffer) `)`\n `` `[` $source_offset `,` $source_length `]`\n `shape` `(` `[` $shape `]` `)`\n `type` `(` $element_type `)`\n `encoding` `(` $encoding_type `)`\n `:` type($result)\n attr-dict-with-keyword\n
Creates a reference to a buffer with a particular shape and element type. The buffer is not copied and both the original and view references must be synchronized. This makes it easier to associate commonly-carried metadata along with the contents.
source_length
hal.buffer_view.dim
Buffer view dimension value query
operation ::= `hal.buffer_view.dim` `<` $buffer_view `:` type($buffer_view) `>`\n `` `[` $index `]`\n `:` type($result)\n attr-dict-with-keyword\n
Returns the value of the given dimension.
index
hal.buffer_view.element_type
Buffer view element type query
operation ::= `hal.buffer_view.element_type` `<` $buffer_view `:` type($buffer_view) `>`\n `:` type($result)\n attr-dict-with-keyword\n
Returns the element type of the buffer view.
hal.buffer_view.encoding_type
Buffer view encoding type query
operation ::= `hal.buffer_view.encoding_type` `<` $buffer_view `:` type($buffer_view) `>`\n `:` type($result)\n attr-dict-with-keyword\n
Returns the encoding type of the buffer view.
hal.buffer_view.rank
Buffer view rank query
operation ::= `hal.buffer_view.rank` `<` $buffer_view `:` type($buffer_view) `>`\n `:` type($result)\n attr-dict-with-keyword\n
Returns the rank of the buffer view.
hal.buffer_view.trace
Trace value(s) operation
operation ::= `hal.buffer_view.trace` $key `=`\n $operands `:` type($operands)\n attr-dict-with-keyword\n
Traces out to a runtime trace sink (console, log file, etc) the given buffer views and titles them with the given key. The key is informational only and useful for titling/marking specific sets of buffers for easier searching.
hal.element_type
An iree_hal_element_type_t for the given MLIR type
operation ::= `hal.element_type` `<` $type `>`\n attr-dict\n `:` type($result)\n
Maps an MLIR type to a runtime iree_hal_element_type_t value for all types that are convertible.
iree_hal_element_type_t
type
hal.encoding_type
An iree_hal_encoding_type_t for the given MLIR encoding
operation ::= `hal.encoding_type` `<` ($encoding^):( `` `dense_row_major`)? `>`\n attr-dict\n `:` type($result)\n
Maps an MLIR encoding to a runtime iree_hal_encoding_type_t value for all encodings that are convertible.
iree_hal_encoding_type_t
encoding
Ops for !hal.channel / iree_hal_channel_t.
!hal.channel
iree_hal_channel_t
hal.channel.create
Creates a new channel for collective communication
operation ::= `hal.channel.create` `device` `(` $device `:` type($device) `)`\n `affinity` `(` $queue_affinity `)`\n `flags` `(` $flags `)`\n `id` `(` $id `)`\n `group` `(` $group `)`\n `rank` `(` $rank `)`\n `count` `(` $count `)`\n `:` type($result)\n attr-dict-with-keyword\n
Returns a new channel with the given rank associated with the given device queue. Collective operations using this channel must only be submitted on compatible queues.
The group and ID are optional and may be null. A rank or count of -1 can be used to indicate a default inherited from the environment or device configuration.
Interfaces: OpAsmOpInterface
flags
id
count
hal.channel.rank_and_count
operation ::= `hal.channel.rank_and_count` `<` $channel `:` type($channel) `>`\n `:` type($rank) `,` type($count)\n attr-dict-with-keyword\n
Returns the rank the channel represents as a participant in a collective group in [0, count) and the total participant count.
hal.channel.split
operation ::= `hal.channel.split` `<` $channel `:` type($channel) `>`\n `color` `(` $color `)`\n `key` `(` $key `)`\n `flags` `(` $flags `)`\n `:` type($result)\n attr-dict-with-keyword\n
Partitions the group associated with the given channel into disjoint subgroups for each unique value of color. Each new subgroup contains all participants of the same color and within each subgroup the key argument is used to define the rank order. When multiple participants in a group use the same key the tie will be broken using their rank in the parent group. A color of -1 indicates that the rank does not participate in any subgroup and will return a null channel.
Ops for !hal.command_buffer / iree_hal_command_buffer_t.
!hal.command_buffer
hal.command_buffer.begin_debug_group
Pushes a command buffer debug group label
operation ::= `hal.command_buffer.begin_debug_group` `<` $command_buffer `:` type($command_buffer) `>`\n `label` `(` $label `)`\n attr-dict-with-keyword\n
Pushes a new debug group with the given label. All commands between this and a mandatory matching call to hal.command_buffer.end_debug_group will be grouped together with the given label.
hal.command_buffer.end_debug_group
label
command_buffer
hal.command_buffer.collective
Command buffer collective dispatch recording operation
operation ::= `hal.command_buffer.collective` `<` $command_buffer `:` type($command_buffer) `>`\n `channel` `(` $channel `:` type($channel) `)`\n `op` `(` $op `)`\n (`param` `(` $param^ `:` type($param) `)`)?\n (`send` `(` $send_buffer^ `:` type($send_buffer) `)`\n `` `[` $send_offset `,` $send_length `]`)?\n (`recv` `(` $recv_buffer^ `:` type($recv_buffer) `)`\n `` `[` $recv_offset `,` $recv_length `]`)?\n `count` `(` $element_count `)`\n attr-dict-with-keyword\n
Dispatches a collective operation defined by op using the given buffers.
op
element_count
param
send_buffer
send_offset
send_length
recv_buffer
recv_offset
recv_length
hal.command_buffer.copy_buffer
Command buffer buffer copy recording operation
operation ::= `hal.command_buffer.copy_buffer` `<` $command_buffer `:` type($command_buffer) `>`\n `source` `(` $source_buffer `:` type($source_buffer) `)`\n `` `[` $source_offset `]`\n `target` `(` $target_buffer `:` type($target_buffer) `)`\n `` `[` $target_offset `]`\n `length` `(` $length `)`\n attr-dict-with-keyword\n
Copies a range of one buffer to another.
hal.command_buffer.create
Command buffer allocation operation
operation ::= `hal.command_buffer.create` `device` `(` $device `:` type($device) `)`\n `mode` `(` $modes `)`\n `categories` `(` $command_categories `)`\n (`bindings` `(` $binding_capacity^ `)`)?\n `:` type($result)\n attr-dict-with-keyword\n
Returns a command buffer from the device pool ready to begin recording.
modes
command_categories
binding_capacity
hal.command_buffer.device
Command buffer device query operation
operation ::= `hal.command_buffer.device` `<` $command_buffer `:` type($command_buffer) `>`\n `:` type($device)\n attr-dict-with-keyword\n
Used during conversion to access the device used to create a command buffer.
hal.command_buffer.dispatch.indirect
Command buffer indirect dispatch recording operation
operation ::= `hal.command_buffer.dispatch.indirect` `<` $command_buffer `:` type($command_buffer) `>`\n `target` `(` $executable `:` type($executable) `)`\n `` `[` $entry_point `]`\n `workgroups` `(` $workgroups_buffer `:` type($workgroups_buffer) `)`\n `` `[` $workgroups_offset `]`\n attr-dict-with-keyword\n
Dispatches an execution request with the dispatch parameters loaded from the given buffer.
entry_point
executable
workgroups_buffer
workgroups_offset
hal.command_buffer.dispatch.indirect.symbol
Command buffer indirect dispatch recording operation, using symbolref
operation ::= `hal.command_buffer.dispatch.indirect.symbol` `<` $command_buffer `:` type($command_buffer) `>`\n `target` `(` $entry_point `)`\n `workgroups` `(` $workgroups_buffer `:` type($workgroups_buffer) `)`\n `` `[` $workgroups_offset `]`\n attr-dict-with-keyword\n
Dispatches an execution request with the dispatch parameters loaded from the given buffer, using a nested symbol reference to the entry point.
hal.command_buffer.dispatch.indirect.symbol %cmd, @executable::@target::@entry,
                                            workgroups = %buffer[%offset]
hal.command_buffer.dispatch
Command buffer dispatch recording operation
operation ::= `hal.command_buffer.dispatch` `<` $command_buffer `:` type($command_buffer) `>`\n `target` `(` $executable `:` type($executable) `)`\n `` `[` $entry_point `]`\n `workgroups` `(` `[`\n $workgroup_x `,`\n $workgroup_y `,`\n $workgroup_z\n `]` `)`\n attr-dict-with-keyword\n
Dispatches an execution request.
workgroup_x
workgroup_y
workgroup_z
hal.command_buffer.dispatch.symbol
Command buffer dispatch recording operation, using symbolref
operation ::= `hal.command_buffer.dispatch.symbol` `<` $command_buffer `:` type($command_buffer) `>`\n `target` `(` $entry_point `)`\n `workgroups` `(` `[`\n $workgroup_x `,`\n $workgroup_y `,`\n $workgroup_z\n `]` `)`\n attr-dict-with-keyword\n
Dispatches an execution request, using a nested symbol reference to the entry point.
hal.command_buffer.end_debug_group
Pops a command buffer debug group label
operation ::= `hal.command_buffer.end_debug_group` `<` $command_buffer `:` type($command_buffer) `>`\n attr-dict-with-keyword\n
Pops a debug group from the stack.
hal.command_buffer.execution_barrier
Command buffer execution barrier recording operation
operation ::= `hal.command_buffer.execution_barrier` `<` $command_buffer `:` type($command_buffer) `>`\n `source` `(` $source_stage_mask `)`\n `target` `(` $target_stage_mask `)`\n `flags` `(` $flags `)`\n attr-dict-with-keyword\n
Defines an execution dependency between all commands recorded before the barrier and all commands recorded after the barrier. Only the stages provided will be affected.
source_stage_mask
target_stage_mask
hal.command_buffer.fill_buffer
Command buffer buffer fill recording operation
operation ::= `hal.command_buffer.fill_buffer` `<` $command_buffer `:` type($command_buffer) `>`\n `target` `(` $target_buffer `:` type($target_buffer) `)`\n `` `[` $target_offset `,` $length `]`\n `pattern` `(` $pattern `:` type($pattern) `)`\n attr-dict-with-keyword\n
Fills the target buffer with the given repeating value.
pattern
hal.command_buffer.finalize
Finalizes command buffer recording
operation ::= `hal.command_buffer.finalize` `<` $command_buffer `:` type($command_buffer) `>`\n attr-dict-with-keyword\n
Ends recording into the command buffer and prepares it for submission. No more commands may be recorded into the command buffer.
hal.command_buffer.push_constants
Command buffer push constants operation
operation ::= `hal.command_buffer.push_constants` `<` $command_buffer `:` type($command_buffer) `>`\n `layout` `(` $pipeline_layout `:` type($pipeline_layout) `)`\n `offset` `(` $offset `)`\n `values` `(` `[` $values `]` `)`\n `:` type($values)\n attr-dict-with-keyword\n
Pushes an inline set of constants that can be accessed by subsequent dispatches using a compatible pipeline layout.
Push constants are always 4-byte values and treated as opaque, meaning that they may be bit-casted floats, bit-packed booleans, etc.
pipeline_layout
hal.command_buffer.push_descriptor_set
Command buffer descriptor set push binding operation
operation ::= `hal.command_buffer.push_descriptor_set` `<` $command_buffer `:` type($command_buffer) `>`\n `layout` `(` $pipeline_layout `:` type($pipeline_layout) `)`\n `` `[` $set `]`\n `bindings` `(` `[`\n custom<DescriptorSetBindings>($binding_ordinals,\n $binding_buffers,\n type($binding_buffers),\n $binding_offsets,\n $binding_lengths)\n `]` `)`\n attr-dict-with-keyword\n
Pushes an inline-defined descriptor set to the command buffer. The provided buffers may either be HAL buffers or indirect references into the command buffer binding table.
Traits: SameVariadicOperandSize
SameVariadicOperandSize
binding_ordinals
binding_buffers
binding_offsets
binding_lengths
Ops for !hal.descriptor_set_layout / iree_hal_descriptor_set_layout_t.
!hal.descriptor_set_layout
hal.descriptor_set_layout.create
Creates a descriptor set layout
operation ::= `hal.descriptor_set_layout.create` `device` `(` $device `:` type($device) `)`\n `flags` `(` $flags `)`\n `bindings` `(` $bindings `)`\n `:` type($result)\n attr-dict-with-keyword\n
Creates a descriptor set layout that defines the bindings used within a set. The same descriptor set layout may be shared with many different executable layouts, and doing so can reduce runtime binding overhead when switching between executables that use the same set layouts.
bindings
Device availability and selection support.
hal.devices.count
Returns the number of available devices
operation ::= `hal.devices.count` attr-dict `:` type($result)\n
Returns the total number of available devices registered at runtime.
hal.devices.get
Returns the device with the given index
operation ::= `hal.devices.get` $index attr-dict `:` type($result)\n
Returns the device with the given index in the [0, hal.devices.count) range. Devices may be lazily initialized upon first use.
Ops for !hal.device / iree_hal_device_t.
!hal.device
hal.device.allocator
Device allocator accessor operation
operation ::= `hal.device.allocator` `<` $device `:` type($device) `>` `:` type($result) attr-dict-with-keyword\n
Returns the allocator that can be used to allocate buffers compatible with the device.
hal.device.query
Returns a runtime configuration parameter from the device
operation ::= `hal.device.query` `<` $device `:` type($device) `>`\n `key` `(` $category `:` `` `:` $key `)`\n `:` type($ok) `,` type($value)\n (`=` $default_value^)?\n attr-dict-with-keyword\n
Queries a device configuration parameter with the given key. Returns a status indicating whether the pair was recognized/available and, if it was, the value converted to the specified type. Queries must return the same value for the lifetime of the module, though they may vary from run to run.
This is roughly equivalent to the sysconf linux syscall (https://man7.org/linux/man-pages/man3/sysconf.3.html) in that the exact set of keys available and their interpretation is target-dependent.
sysconf
Users of the op must check the ok result before using the value, as the set of available keys may change over time. If in doubt: don't use this. Each key used adds additional versioning and testing complexity, as runtime code path changes will explode combinatorially and should be treated with as much care as a binary file format change. Keys should be prefixed with ex. when experimental, indicating that they are not expected to be present forever; all non-experimental keys should be vetted.
ok
ex.
Well-known keys:
hal.device.id :: {some id pattern}: returns 1 if the device identifier matches the given pattern string.
hal.executable.format :: {some format pattern}: returns 1 if the given format is supported by the device loader.
hal.device :: concurrency: the maximum concurrently executable submissions, mapping roughly to the queue count. The actual concurrency available may be less than this based on dynamic runtime parameters such as power/thermal modes, quota limits, or user choice.
hal.dispatch :: concurrency: the maximum concurrently executable workgroups for a particular dispatch. The actual concurrency available may be less depending on device state.
category
default_value
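A sketch of querying one of the well-known keys above; treating the value as an i1 with a false default matches the "returns 1 if supported" semantics, but the exact executable format string is illustrative:

%ok, %value = hal.device.query<%device : !hal.device>
    key("hal.executable.format" :: "embedded-elf-x86_64") : i1, i1 = false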
hal.device.queue.alloca
Allocates a queue-ordered transient buffer
operation ::= `hal.device.queue.alloca` `<` $device `:` type($device) `>`\n `affinity` `(` $queue_affinity `)`\n `wait` `(` $wait_fence `)`\n `signal` `(` $signal_fence `)`\n `pool` `(` $pool `)`\n `type` `(` $memory_types `)`\n `usage` `(` $buffer_usage `)`\n `:` custom<SizeAwareType>(type($result), $result_size)\n attr-dict-with-keyword\n
Returns a queue-ordered transient buffer that will be available for use when the signal fence is reached. The allocation will not be made until the wait fence has been reached.
The size of the buffer returned may be larger than the requested size if the allocator has specific alignment requirements or minimum allocation sizes.
The buffer handle will remain live so long as there are retainers but the contents are undefined before the allocation signal fence has been signaled and after the deallocation wait fence has been reached.
wait_fence
signal_fence
pool
hal.device.queue.dealloca
Deallocates a queue-ordered transient buffer
operation ::= `hal.device.queue.dealloca` `<` $device `:` type($device) `>`\n `affinity` `(` $queue_affinity `)`\n `wait` `(` $wait_fence `)`\n `signal` `(` $signal_fence `)`\n `buffer` `(` $buffer `:` type($buffer) `)`\n attr-dict-with-keyword\n
Deallocates a queue-ordered transient buffer. The deallocation will not be made until the wait fence has been reached and once the storage is available for reuse the signal fence will be signaled.
After deallocation the contents of the buffer may still be accessible but will have undefined contents as other operations reuse the memory.
hal.device.queue.execute
Enqueues command buffer execution
operation ::= `hal.device.queue.execute` `<` $device `:` type($device) `>`\n `affinity` `(` $queue_affinity `)`\n `wait` `(` $wait_fence `)`\n `signal` `(` $signal_fence `)`\n (`commands` `(` `[` $command_buffers^ `]` `)`)?\n attr-dict-with-keyword\n
Executes one or more command buffers on a device queue. The command buffers are executed in order as if they were recorded as one. No commands will execute until the wait fence has been reached and the signal fence will be signaled when all commands have completed.
command_buffers
hal.device.queue.flush
Flushes locally-pending submissions to the queue
operation ::= `hal.device.queue.flush` `<` $device `:` type($device) `>`\n `affinity` `(` $queue_affinity `)`\n attr-dict-with-keyword\n
Flushes any locally-pending submissions in the queue. When submitting many queue operations this can be used to eagerly flush earlier submissions while later ones are still being constructed. This may be a no-op.
hal.device.queue.read
Reads a segment from a file into a device buffer
operation ::= `hal.device.queue.read` `<` $device `:` type($device) `>`\n `affinity` `(` $queue_affinity `)`\n `wait` `(` $wait_fence `)`\n `signal` `(` $signal_fence `)`\n `source` `(` $source_file `:` type($source_file) `)`\n `` `[` $source_offset `]`\n `target` `(` $target_buffer `:` type($target_buffer) `)`\n `` `[` $target_offset `]`\n `length` `(` $length `)`\n `flags` `(` $flags `)`\n attr-dict-with-keyword\n
Enqueues a file read operation that streams a segment of the source file defined by the source offset and length into the target HAL buffer at the specified target offset. The queue affinity should be set to where the target buffer will be consumed. The source file must have read permission and the target buffer must have transfer-target usage. Read failure will result in propagated semaphore failure or device loss.
source_file
hal.device.queue.write
Writes a segment from a device buffer into a file
operation ::= `hal.device.queue.write` `<` $device `:` type($device) `>`\n `affinity` `(` $queue_affinity `)`\n `wait` `(` $wait_fence `)`\n `signal` `(` $signal_fence `)`\n `source` `(` $source_buffer `:` type($source_buffer) `)`\n `` `[` $source_offset `]`\n `target` `(` $target_file `:` type($target_file) `)`\n `` `[` $target_offset `]`\n `length` `(` $length `)`\n `flags` `(` $flags `)`\n attr-dict-with-keyword\n
Enqueues a file write operation that streams a segment of the source HAL buffer defined by the source offset and length into the target file at the specified target offset. The queue affinity should be set to where the source buffer was produced. The source buffer must have transfer-source usage and the target file must have write permission. Write failure will result in propagated semaphore failure or device loss.
target_file
hal.return
Return from a hal.* region
operation ::= `hal.return` ($operands^ `:` type($operands))? attr-dict\n
Traits: Terminator
Ops for !hal.executable / iree_hal_executable_t.
!hal.executable
hal.executable.binary
Compiled executable binary data
operation ::= `hal.executable.binary` custom<SymbolVisibility>($sym_visibility)\n $sym_name\n attr-dict-with-keyword\n
A compiled executable binary with an optional nested module containing the IR prior to serialization (for debugging).
Traits: HasParent<IREE::HAL::ExecutableOp>
HasParent<IREE::HAL::ExecutableOp>
format
data
mime_type
hal.executable.calculate_workgroups
Calculates workgroup count from workload for an exported function
operation ::= `hal.executable.calculate_workgroups` `device` `(` $device `:` type($device) `)`\n `target` `(` $entry_point `)`\n (`workload` `(` `[` $workload^ `]` `)`)?\n `:` type($workgroup_x) `,` type($workgroup_y) `,` type($workgroup_z)\n attr-dict-with-keyword\n
Calculates the workgroup count (grid XYZ) based on the given workload using the workgroup count calculation region of the target hal.executable.export op.
hal.executable.export
hal.executable.condition
Host code to determine if the executable is enabled
Variants are selected based on their target and this optional condition op that returns true if the variant is valid for use on the provided runtime !hal.device. If no variants within an executable are valid then loading will fail at runtime. If multiple variants are valid the first valid one found will be loaded and used for execution.
hal.executable.constant.block
Executable constant block initializer
Initializes one or more constants in the executable constant block by returning one value per identified constant. Each constant block is evaluated on the host prior to instantiating the executable for a given device and allows for the executable to be specialized based on device capabilities and limits.
The keys specified are unique per variant and will be deduplicated across multiple constant blocks when present. They are only used during lowering and will not survive to runtime so they need only have descriptive enough names to avoid collisions and represent the semantics of the value.
Constant values can be loaded in the device code with the hal.executable.constant.load op:
hal.executable.constant.load
hal.executable.variant public @target {\n hal.executable.constant.block(%device: !hal.device) -> (i32, i32) as (\"foo\", \"bar\") {\n %0 = hal.device.query<%device> key(\"some.device.prop\")...\n %1 = hal.device.query<%device> key(\"another.device.prop\")...\n hal.return %0, %1 : i32, i32\n }\n builtin.module {\n func @dispatch0() {\n %0 = hal.executable.constant.load \"foo\" : i32\n %1 = hal.executable.constant.load \"bar\" : i32\n return\n }\n }\n}\n
Each target backend will implement the constant initialization and access in a way compatible with its execution model. Examples:
- CPU: read-only buffer initialized on load and passed to each dispatch
- CUDA: read-only buffer initialized on load and passed to each dispatch
- SPIR-V: specialization constants
- Metal: function constants
- WebGPU: pipeline-overridable constants
Traits: HasParent<IREE::HAL::ExecutableSourceOp, IREE::HAL::ExecutableVariantOp>, IsolatedFromAbove
HasParent<IREE::HAL::ExecutableSourceOp, IREE::HAL::ExecutableVariantOp>
keys
Loads a constant value from the executable constant block
operation ::= `hal.executable.constant.load` $key attr-dict `:` type($result)\n
Loads a scalar constant value from the static executable constant block. The value provided by a constant block with the given key will be loaded and bitcast (possibly with truncation or zero-extension) to the result type.
Note that backends are allowed to implement their own mechanisms for referencing constant block values and this is provided only as a default for those not needing special behavior.
hal.executable.create
Creates an executable
operation ::= `hal.executable.create` `device` `(` $device `:` type($device) `)`\n `target` `(` $executable_target `)`\n `layouts` `(` `[` $layouts `]` `)`\n (`constants` `(` `[` $constants^ `]` `)`)?\n `:` type($result)\n attr-dict-with-keyword\n
Creates a target-dependent executable cached on the provided device. Entry points contained within the executable can be dispatched using the resulting executable handle.
Depending on the driver, creation may take a non-trivial amount of time (such as when JITing). As the cache is internally synchronized, callers can issue preparation requests from multiple threads - even for the same executables - and calls will block until preparation completes.
Optional constants provide for specialization of the executable based on runtime-derived parameters.
executable_target
layouts
constants
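A sketch of the op following the format above; the symbol names, layout value, and constant operands are placeholders:
// Create (or fetch from the device cache) the compiled executable.
%executable = hal.executable.create
    device(%device : !hal.device)
    target(@my_executable::@embedded_elf_x86_64)
    layouts([%pipeline_layout])
    constants([%c0, %c1])
    : !hal.executable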
hal.executable_end
operation ::= `hal.executable_end` attr-dict\n
Traits: HasParent<IREE::HAL::ExecutableOp>, Terminator
Executable entry point declaration
An entry point exported by the executable with statically-available information describing the IO interface it uses and other dispatch metadata.
The workgroup_count region represents the computation that returns the number of workgroups to use in the 3D grid dispatch. The arguments to the region represents the workload as captured by each dispatch. It returns the number of workgroups along x, y, and z.
workgroup_count
layout
workgroup_size
subgroup_size
workgroup_local_memory
hal.executable.lookup
Executable cache lookup pseudo-op
operation ::= `hal.executable.lookup` `device` `(` $device `:` type($device) `)`\n `executable` `(` $executable `)`\n `:` type($result)\n attr-dict-with-keyword\n
Used during conversion to provide a placeholder for a globally cached and possibly lazy-initialized executable.
Target-specific executable module
operation ::= `hal.executable` custom<SymbolVisibility>($sym_visibility)\n $sym_name\n attr-dict-with-keyword\n regions\n
An executable module representing a target-specific compiled kernel/shader/etc.
Traits: IsolatedFromAbove, SingleBlockImplicitTerminator<IREE::HAL::ExecutableEndOp>, SingleBlock, SymbolTable, Util_ObjectLike
SingleBlockImplicitTerminator<IREE::HAL::ExecutableEndOp>
hal.executable.source_end
Terminator pseudo-op for the executable source op
operation ::= `hal.executable.source_end` attr-dict\n
Traits: HasParent<IREE::HAL::ExecutableSourceOp>, Terminator
HasParent<IREE::HAL::ExecutableSourceOp>
hal.executable.source
Generic source contents of an executable op
operation ::= `hal.executable.source` custom<SymbolVisibility>($sym_visibility)\n $sym_name\n attr-dict-with-keyword\n ``\n $body\n
This is an unspecialized source representation of an executable module without an assigned target. This is useful for hand-authoring executables prior to device specification.
Traits: IsolatedFromAbove, SingleBlockImplicitTerminator<IREE::HAL::ExecutableSourceEndOp>, SingleBlock, SymbolTable
SingleBlockImplicitTerminator<IREE::HAL::ExecutableSourceEndOp>
objects
hal.executable.variant_end
Terminator pseudo-op for the executable variant op
operation ::= `hal.executable.variant_end` attr-dict\n
Traits: HasParent<IREE::HAL::ExecutableVariantOp>, Terminator
HasParent<IREE::HAL::ExecutableVariantOp>
hal.executable.variant
Target-specific variant of an executable op
operation ::= `hal.executable.variant` custom<SymbolVisibility>($sym_visibility)\n $sym_name\n `target` `(` $target `)`\n (`objects` `(` $objects^ `)` )?\n attr-dict-with-keyword\n $body\n
The target IR for the executable. This can be preserved for debugging but is usually removed during transformation.
Variants are selected based on their target and an optional condition op that returns true if the variant is valid for use on the provided runtime !hal.device. If no variants within an executable are valid then loading will fail at runtime. If multiple variants are valid the first valid one found will be loaded and used for execution.
Traits: HasParent<IREE::HAL::ExecutableOp>, IsolatedFromAbove, SingleBlockImplicitTerminator<IREE::HAL::ExecutableVariantEndOp>, SingleBlock, SymbolTable
SingleBlockImplicitTerminator<IREE::HAL::ExecutableVariantEndOp>
Temporary hack ops expected to be removed in the future.
hal.ex.file.from_memory
Creates a file mapped into a byte range of a host buffer
operation ::= `hal.ex.file.from_memory` `device` `(` $device `:` type($device) `)`\n `affinity` `(` $queue_affinity `)`\n `access` `(` $access `)`\n `buffer` `(` $buffer `:` type($buffer) `)`\n `` `[` $offset `for` $length `]`\n `flags` `(` $flags `)`\n `:` type($result)\n attr-dict-with-keyword\n
Returns a file handle that is backed by the given buffer contents. Behavior is undefined if the buffer contents change while the accesses are in-flight.
Experimental as the exact interface for getting files from module contents still needs iteration. Most hardware APIs require a file descriptor or native platform handle but here we only have host pointers. When memory-mapped, some systems allow for retrieval of the platform handle from a virtual address (GetMappedFileNameA/posix_mem_offset) but the APIs are sketchy and likely slow. Instead we should probably have a way to query for a file handle derived from the calling module by stack-walking and asking the VM module for its handle. Until we can figure this out this method will be marked experimental.
access
Ops for !hal.fence / iree_hal_fence_t.
!hal.fence
iree_hal_fence_t
hal.fence.await
Asynchronous fence wait operation
operation ::= `hal.fence.await` `until` `(` `[` $fences `]` `)`\n `timeout_millis` `(` $timeout_millis `)`\n `:` type($status)\n attr-dict-with-keyword\n
Yields the caller until all fences are reached. Returns the status of the fence after the wait, with a non-zero value indicating failure.
status
Traits: Util_YieldPoint
Util_YieldPoint
timeout_millis
fences
hal.fence.create
Creates an unsignaled fence
operation ::= `hal.fence.create` `device` `(` $device `:` type($device) `)`\n `flags` `(` $flags `)`\n `:` type($result)\n attr-dict-with-keyword\n
Returns a fence that defines a point in time. By default fences will remain unsignaled unless they are explicitly signaled with hal.fence.signal or asynchronously signaled by the device by passing them as an operand to queue submission ops.
hal.fence.signal
Interfaces: MemoryEffectOpInterface (MemoryEffectOpInterface), OpAsmOpInterface
hal.fence.fail
Fence failure operation
operation ::= `hal.fence.fail` `<` $fence `:` type($fence) `>`\n `status` `(` $status `)`\n attr-dict-with-keyword\n
Signals the fence with a failure. The status will be returned from each timepoint semaphore's hal.semaphore.query and hal.semaphore.signal for the lifetime of each semaphore.
hal.semaphore.query
hal.semaphore.signal
fence
hal.fence.join
Creates a fence from the given timepoints
operation ::= `hal.fence.join` `at` `(` `[` $fences `]` `)`\n `->` type($result)\n attr-dict-with-keyword\n
Returns a fence that joins the input fences as a wait-all operation.
hal.fence.query
Fence query operation
operation ::= `hal.fence.query` `<` $fence `:` type($fence) `>`\n `:` type($status)\n attr-dict-with-keyword\n
Queries whether the fence has been reached and its status. Returns OK if the fence has been signaled successfully, DEFERRED if it is unsignaled, and otherwise an error indicating the failure.
Fence signal operation
operation ::= `hal.fence.signal` `<` $fence `:` type($fence) `>`\n attr-dict-with-keyword\n
Signals the fence to indicate that the timepoints contained have been reached. Waiting work may begin immediately.
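Putting the fence ops above together, a sketch of a host-side lifecycle; the flags spelling, timeout value, and SSA names are placeholders:
// Create an unsignaled fence, signal it from the host, then wait on it.
%fence = hal.fence.create device(%device : !hal.device) flags("None") : !hal.fence
hal.fence.signal<%fence : !hal.fence>
%status = hal.fence.await until([%fence]) timeout_millis(%timeout_millis) : i32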
Ops for !hal.instrument.*.
!hal.instrument.*
hal.instrument.memory.load
Emits a memory load instrumentation event
operation ::= `hal.instrument.memory.load` `` `[` $buffer `:` type($buffer) `for` $workgroupKey `]`\n $base `[` $indices `]` `,` $loadValue\n attr-dict `:` type($base) `,` type($result)\n
Emits a workgroup-specific memory load event indicating that a number of bytes from the given resolved pointer have been loaded by the workgroup.
workgroupKey
loadValue
base
hal.instrument.memory.store
Emits a memory store instrumentation event
operation ::= `hal.instrument.memory.store` `` `[` $buffer `:` type($buffer) `for` $workgroupKey `]`\n $base `[` $indices `]` `,` $storeValue\n attr-dict `:` type($base) `,` type($result)\n
Emits a workgroup-specific memory store event indicating that a number of bytes have been stored to the given resolved pointer by the workgroup.
storeValue
hal.instrument.print
Emits a human-readable printf-style string event
operation ::= `hal.instrument.print` `` `[` $buffer `:` type($buffer) `for` $workgroupKey `]`\n $format (`*` `(` $values^ `:` type($values) `)`)?\n attr-dict\n
Formats a string using a limited subset of printf format specifiers and the provided values and then emits an iree_instrument_dispatch_print_t event. Final formatted string lengths may be limited to as much as 1024 characters and should be kept as small as possible to avoid easily exceeding the instrumentation storage buffers with redundant strings.
iree_instrument_dispatch_print_t
hal.instrument.value
Emits a scalar value instrumentation event
operation ::= `hal.instrument.value` `` `[` $buffer `:` type($buffer) `for` $workgroupKey `]`\n $ordinal `=` $operand attr-dict `:` type($operand)\n
Emits a workgroup-specific typed value with the given workgroup-relative ordinal.
This op will be preserved even if the output is not used as it is only for debugging purposes.
hal.instrument.workgroup
Emits a dispatch workgroup instrumentation event
operation ::= `hal.instrument.workgroup` `` `[` $buffer `:` type($buffer) `]`\n `dispatch` `(` $dispatchId `)`\n attr-dict `:` type($workgroupKey)\n
Emits an iree_instrument_dispatch_workgroup_t event into the instrumentation stream. The workgroup event identifies the unique dispatch, its workgroup count, and the ID of the emitting workgroup within the dispatch. Optionally, targets that support querying the processor ID executing the workgroup can attach that information for tracking purposes.
iree_instrument_dispatch_workgroup_t
On targets such as CPUs where entire workgroups execute as atomic units only one workgroup event should be emitted. On targets such as GPUs where there may be multiple invocations executing as part of a single workgroup only the first invocation within the workgroup should emit the workgroup event (by checking if the LocalInvocationIndex or threadIdx == 0, etc).
The resulting workgroup key is used by subsequent workgroup-specific instrumentation events.
dispatchId
Ops for !hal.interface.*.
!hal.interface.*
hal.interface.binding.subspan
Returns an alias to a subspan of interface binding data
operation ::= `hal.interface.binding.subspan` `set` `(` $set `)`\n `binding` `(` $binding `)`\n `type` `(` custom<DescriptorType>($descriptor_type) `)`\n (`alignment` `(` $alignment^ `)`)?\n (`offset` `(` $byte_offset^ `)`)?\n (`flags` `(` $descriptor_flags^ `)`)?\n attr-dict `:` type($result) (`{` $dynamic_dims^ `}`)?\n
Returns a subspan of an interface binding storage buffer in a generic type. The exact shape, type, and alignment of the returned type are defined by the result type (tensor, memref, etc).
An optional alignment indicates the byte alignment of the base binding resource. Note that the byte offset is added to the base and the alignment will be the minimum of the two.
binding
descriptor_type
alignment
descriptor_flags
byte_offset
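A sketch of a typical use inside device code, assuming a storage buffer binding and a placeholder memref result type:
// Alias set 0 / binding 1 as a dynamically sized memref starting at %byte_offset.
%subspan = hal.interface.binding.subspan set(0) binding(1) type(storage_buffer)
    alignment(64) offset(%byte_offset)
    : memref<?x16xf32>{%dim}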
hal.interface.constant.load
Loads a constant value from the interface constant block
operation ::= `hal.interface.constant.load` `` `[` $index `]`\n (`alignment` `(` $alignment^ `)`)?\n (`values` `(` $values^ `)`)?\n attr-dict `:` type($result)\n
Loads a scalar constant value from an executable IO push constant block. The value will be loaded from the given constant offset and will be bitcast (possibly with truncation or zero-extension) to the result type.
An optional alignment indicates the byte alignment of potential values for the constant when it could be determined from analysis. If omitted the value may be anything and its interpretation is up to the usage. This is intended to provide pointer alignment-like semantics to constants that are used to index into binding resources.
An optional set of values indicates all possible values that can be passed to the constant from all dispatch sites in the program. If omitted the value may be from an unanalyzable source (outside of the program, indirect, etc) and must be assumed to have any value.
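For example, a sketch loading two push constants; the ordinals, alignment, and result types are placeholders:
// Load push constants 0 and 1 from the dispatch's constant block.
%count = hal.interface.constant.load[0] : i32
%offset = hal.interface.constant.load[1] alignment(16) : index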
hal.interface.workgroup.count
operation ::= `hal.interface.workgroup.count` `[` $dimension `]` attr-dict `:` type($result)\n
The total number of workgroups along each dimension in the dispatch grid. Matches what was passed to the hal.command_buffer.dispatch command (or what was indirectly specified).
Corresponds to the NumWorkgroups SPIR-V built-in and the gridDim CUDA built-in variable.
%x = hal.interface.workgroup.count[0] : index\n%y = hal.interface.workgroup.count[1] : index\n%z = hal.interface.workgroup.count[2] : index\n
hal.interface.workgroup.id
operation ::= `hal.interface.workgroup.id` `[` $dimension `]` attr-dict `:` type($result)\n
The global workgroup ID of the current tile in the range of [0, hal.interface.workgroup.count) along each XYZ dimension.
[0, hal.interface.workgroup.count)
Corresponds to the WorkgroupId SPIR-V built-in and the blockIdx CUDA built-in variable.
%x = hal.interface.workgroup.id[0] : index\n%y = hal.interface.workgroup.id[1] : index\n%z = hal.interface.workgroup.id[2] : index\n
hal.interface.workgroup.size
operation ::= `hal.interface.workgroup.size` `[` $dimension `]` attr-dict `:` type($result)\n
Corresponds to the WorkgroupSize SPIR-V built-in and the blockDim CUDA built-in variable.
%x = hal.interface.workgroup.size[0] : index\n%y = hal.interface.workgroup.size[1] : index\n%z = hal.interface.workgroup.size[2] : index\n
Ops for !hal.pipeline_layout / iree_hal_pipeline_layout_t.
!hal.pipeline_layout
hal.pipeline_layout.create
Creates a pipeline layout
operation ::= `hal.pipeline_layout.create` `device` `(` $device `:` type($device) `)`\n `push_constants` `(` $push_constants `)`\n `layouts` `(` `[` $set_layouts `]` `)`\n `:` type($result)\n attr-dict-with-keyword\n
Creates a pipeline layout from the given descriptor set layouts and required push constant size. Pipeline layouts can be shared across any executable that uses the same layout and push constant information. Sharing the layout between executables will reduce runtime binding overhead, and it is often worth the cost to allow a small number of unused bindings in one executable such that it can share layouts with others that will be scheduled adjacent to it.
push_constants
set_layouts
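A sketch assuming one push constant and two placeholder descriptor set layout values:
// Create (or look up) a pipeline layout shared by compatible executables.
%pipeline_layout = hal.pipeline_layout.create
    device(%device : !hal.device)
    push_constants(1)
    layouts([%set_layout_0, %set_layout_1])
    : !hal.pipeline_layout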
hal.pipeline_layout.lookup
Pipeline layout cache lookup pseudo-op
operation ::= `hal.pipeline_layout.lookup` `device` `(` $device `:` type($device) `)`\n `layout` `(` $layout `)`\n `:` type($result)\n attr-dict-with-keyword\n
Used during conversion to provide a placeholder for a globally cached and possibly lazy-initialized pipeline layout.
Pseudo ops for conversion support.
hal.dispatch.extern
operation ::= `hal.dispatch.extern` $export\n (`[` $workload^ `]`)? ``\n `(` $arguments `)` `:`\n custom<ShapedFunctionType>(ref($arguments),\n type($arguments), $argument_dims,\n type($results), $result_dims,\n $tied_operands)\n `count` `` custom<WorkgroupCountRegion>($workgroup_count)\n `layout` `(` $layout `)`\n (`bindings` `(` $bindings^ `)`)?\n `objects` `(` `{` custom<TargetConditionObjects>($targets,\n $target_ordinals,\n $target_objects,\n $target_regions) `}` `)`\n attr-dict-with-keyword\n
Dispatches some number of workgroups across a 3-dimensional grid using a function defined externally in one or more referenced objects. Objects are declared per executable target and selected automatically during linking based on where the dispatch is used. Semantically this is equivalent to a flow.dispatch.workgroups but with the workgroup region invisible to the compiler. See hal.executable for more information about object linkage.
Note that since this happens at tensor level the dispatch operation has value semantics: some tensors (and optionally other primitive types) are consumed and one or more new result tensors are produced. Inside each workgroup, however, the input and output tensors are available for arbitrary loads and stores. In many cases each workgroup will load some particular tile(s) from the input tensors and store some particular tile(s) to the output tensors unique to that workgroup. Though it's possible for multiple workgroups to load the same regions of the input tensors, behavior is undefined if multiple workgroups store to the same regions of the output tensors. Codegen guarantees this behavior, but when sourcing externally authored dispatch functions it's critical that this behavior is observed.
Though the representation is similar to the GPU-style grid dispatch model, here we still have not yet allocated buffers, determined the target device for execution, or even completed fully resolving shapes/types/etc. Because of this it's important that the workgroup body use the platform-dependent primitives for accessing workgroup ID, size, and count intrinsics instead of hardcoding them to a particular set of values. Assume that any workgroup dispatch may end up being specialized for several different target devices and even several different variants for a particular target device (differing workgroup sizes, etc). To aid deduplication, code producing these external dispatches should try not to specialize early for particular shapes and instead emit the most generic code possible, as having 500 slightly different hal.dispatch.extern ops pointing at the same object file is likely to require 500 copies of the object instead of 500 calls to the same object.
Because at this point in the layering devices have not yet been selected the workgroup count cannot be fully evaluated. Instead workload parameters are captured that are then passed to a function that when later evaluated computes the actual workgroup count based on target information. The workload is not limited to the 3D XYZ grid dispatch of the workgroup count and can contain any number of parameters used to compute it. If workgroup size or distribution varies based on the target device a !hal.device argument can be used by the workgroup count calculation region to factor in device parameters. See hal.device.query for more information on how to query information.
%r = hal.dispatch.extern \"some_function\"[%c5, %c5](%0, %1)\n : (tensor<5x5xf32>, tensor<5xf32>) -> tensor<5x5xf32>\n ...\n
The number of results of the operation is equal to the number of results in the type signature ((tensor<5x5xf32>, tensor<5xf32>) -> tensor<5x5xf32>). Each tensor argument and result in the type signature has a corresponding pipeline layout slot and must be declared. If multiple arguments or results share the same layout slot they can be aliased using the bindings attribute and otherwise each is assumed unique.
Objects for multiple targets can be specified and the ones used are selected based on their target and an optional condition region that returns true if the variant is valid for use on the provided runtime !hal.device. If no variants within an executable are valid then loading will fail at runtime. If multiple variants are valid the first valid one found will be loaded and used for execution.
Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface), TiedOpInterface, Util_ShapeAwareOp
export
targets
target_ordinals
target_objects
hal.tensor.barrier
Signals a fence when all tensors are available
operation ::= `hal.tensor.barrier` `join` `` `(` $sources `:` type($sources) `)`\n `=` `` `>`\n $signal_fence `:` type($signal_fence)\n attr-dict-with-keyword\n
Defines a barrier that is used to indicate availability of an entire set of tensors by signaling a fence. The source tensors are returned for chaining.
Interfaces: TiedOpInterface
sources
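A sketch joining two placeholder tensors on a single signal fence; the returned values are the sources tied through for chaining:
// %ready_a/%ready_b alias %a/%b and may only be used once %signal_fence is reached.
%ready_a, %ready_b = hal.tensor.barrier join(%a, %b : tensor<4xf32>, tensor<8xi32>)
    => %signal_fence : !hal.fence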
hal.tensor.export
Exports a tensor to a HAL buffer view
operation ::= `hal.tensor.export` $source\n ($name^)?\n (`into` `(` $target_storage^ `:` type($target_storage) `)`)?\n `:`\n custom<TypeAlias>($source_encoding, type($source)) (`{` $source_dims^ `}`)?\n `->`\n type($target)\n attr-dict\n
Defines an export of an SSA-form tensor to an external HAL buffer view.
The provided source_encoding, if different from the source type, indicates that the ABI-facing type may differ from the internal representation. The types must be bitcastable (same storage size) and dynamically shaped values must have the same number of dynamic dimensions. This allows for casting between rank-0 and rank-N types, different element types, etc.
source_encoding
An optional target_storage buffer can be provided to hold the exported result. The export will fail at runtime if the storage is null or if it has insufficient capacity to store the output. The storage must be device-visible and defined for transfer-target and dispatch usage.
target_storage
name
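A sketch of an export at an ABI boundary; the "output0" name, dynamic dimension, and types are placeholders:
// Export a dynamically shaped tensor as a buffer view for external callers.
%buffer_view = hal.tensor.export %tensor "output0"
    : tensor<?x4xf32>{%dim} -> !hal.buffer_view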
hal.tensor.import
Imports a tensor from a HAL buffer view
operation ::= `hal.tensor.import` (`wait` `(` $wait_fence^ `)` `=` `` `>`)?\n $source\n ($name^)?\n `:` type($source) `->`\n custom<TypeAlias>($target_encoding, type($target)) (`{` $target_dims^ `}`)?\n attr-dict\n
Defines an import of an external HAL buffer view into an SSA-form tensor. An optional semaphore timepoint can be specified indicating when the buffer view is available for use. If no semaphore timepoint is provided it is assumed the buffer view is immediately available.
The provided target_encoding, if different from the target type, indicates that the ABI-facing type may differ from the internal representation. The types must be bitcastable (same storage size) and dynamically shaped values must have the same number of dynamic dimensions. This allows for casting between rank-0 and rank-N types, different element types, etc.
target_encoding
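A sketch of the matching import, assuming a placeholder wait fence and dynamic dimension:
// Import an external buffer view as a tensor once %wait_fence is reached.
%tensor = hal.tensor.import wait(%wait_fence) => %buffer_view "input0"
    : !hal.buffer_view -> tensor<?x4xf32>{%dim}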
specifies a set of allowed queues for an operation
WIP; see #10765. This may change in the future to either be a nested attribute on a larger affinity struct or be defined by an implementation of the affinity attr interface. For now this allows higher levels of the stack to specify queues such that the stream dialect can understand them and they can be lowered into the HAL dialect.
Specifies that an annotated operation or scope is only allowed to execute on the set of queues (0-64) provided. Operations will not run on other queues.
Example:
// any queue\n#hal.affinity.queue<*>\n// queues 4 and 5\n#hal.affinity.queue<[4, 5]>\n
int64_t
collective operation and specification
#hal.collective<\n CollectiveKind, # kind\n std::optional<CollectiveReductionOp>, # reduction\n CollectiveElementType # element_type\n>\n
Specifies the collective operation to perform and any mode bits required.
CollectiveKind
std::optional<CollectiveReductionOp>
CollectiveElementType
descriptor set binding specification
#hal.descriptor_set.binding<\n int64_t, # ordinal\n DescriptorType, # type\n std::optional<DescriptorFlags> # flags\n>\n
Specifies a single binding within a descriptor set layout.
DescriptorType
std::optional<DescriptorFlags>
descriptor set layout specification
#hal.descriptor_set.layout<\n int64_t, # ordinal\n ::llvm::ArrayRef<DescriptorSetBindingAttr>, # bindings\n std::optional<DescriptorSetLayoutFlags> # flags\n>\n
Specifies the layout information of a single set of descriptors used within a pipeline layout. Multiple of these sets may be used by a single entry point to allow for bindings with similar update frequencies to be grouped.
::llvm::ArrayRef<DescriptorSetBindingAttr>
std::optional<DescriptorSetLayoutFlags>
valid DescriptorType
#hal.descriptor_type<\n ::mlir::iree_compiler::IREE::HAL::DescriptorType # value\n>\n
Enum cases: uniform_buffer (UniformBuffer), storage_buffer (StorageBuffer).
UniformBuffer
StorageBuffer
::mlir::iree_compiler::IREE::HAL::DescriptorType
generic device target specification
Specifies the properties of a target runtime device. Target devices are specified with a canonical identifier matching those used by the runtime (such as cpu, vulkan, etc). Target devices may support several target executable formats specified with #hal.executable.target. An optional configuration dictionary allows for overriding backend defaults.
cpu
#hal.executable.target
#hal.device.target<\"llvm-cpu\", {\n executable_targets = [\n #hal.executable.target<\"llvm-cpu\", \"embedded-elf-arm_32\">,\n #hal.executable.target<\"llvm-cpu\", \"embedded-elf-arm_64\">,\n ]\n}>\n
StringAttr
DictionaryAttr
object file reference
Defines an object file that can be linked into executables. Today this is only supported for external file references with paths the compiler can successfully resolve from its current working directory. Inlined data can optionally be provided to avoid the need for file system access and ensure the data source is attached to the IR as it makes its way through multiple compiler stages or reproducers.
Future revisions may change this to an interface that allows both internal and external resources to define the object contents. Linking needs to be updated to support various object compositions and certain backends may require additional infrastructure support.
In the long term the goal is to allow combinations of declared objects and generated code in order to give control of linking behavior to frontends. Instead of needing global command line flags to link in additional blobs the frontend can emit executables with the dependencies already defined per variant without needing to reach into the IREE compiler code.
#hal.executable.object<{path = \"some/file.obj\"}>\n#hal.executable.object<{\n path = \"some/embedded/file.obj\",\n data = dense<[...]> : vector<2048xi8>\n}>\n
DenseIntElementsAttr
target-specific object file references
A dictionary mapping executable target specifications to a list of objects. This is used to allow layers of the stack that support multi-targeting to specify information used during lowering into each particular target.
The key attributes are matched against each target variant based on the backend and format as well as any configuration data provided. When comparing the configuration only fields present in both the key and target variant will be checked and must match. This allows specification of generic sets ("all x86_64 targets get these objects") as well as specific ones ("only x86_64 targets with vector_size = 64 get these objects").
#hal.executable.objects<{\n #hal.executable.target<\"llvm-cpu\", \"embedded-elf-arm_64\"> = [\n #hal.executable.object<{path = \"some/file_arm_64.obj\"}>\n ],\n #hal.executable.target<\"llvm-cpu\", \"embedded-elf-x86_64\"> = [\n #hal.executable.object<{path = \"some/file_x86_64.obj\"}>\n ]\n}>\n
ArrayAttr
generic executable target specification
Specifies how to compile an executable for a specific target backend. A backend is used to translate and serialize the executable into the final form passed to the runtime. The format of the executable is a target-specific value indicating the required runtime support to load the deployed artifact. An optionally provided configuration dictionary overrides backend-specific defaults.
// Produce a system-native ELF for x86-64 systems using the LLVM backend:\n #hal.executable.target<\"llvm-cpu\", \"system-elf-x86_64\", {\n triple = \"x86_64-unknown-linux-elf\",\n cpu = \"host\",\n cpu_features = \"host\",\n abi = \"lp32\",\n ...\n }>\n
The same compilation backend may be used to translate executables for several different runtime devices. Likewise the same runtime device may use one of many different executable targets. Assume an N:M mapping between the two in all cases.
interface binding specification
#hal.interface.binding<\n int64_t, # set\n int64_t # binding\n>\n
Specifies the descriptor set and binding ordinal of a particular layout binding.
#hal.interface.binding<0, 1>\n
executable entry point layout specification
#hal.pipeline.layout<\n int64_t, # pushConstants\n ::llvm::ArrayRef<DescriptorSetLayoutAttr> # setLayouts\n>\n
Specifies the layout information used for interacting with executable functions. This allows host code to correctly map parameters to the lower-level target-specific argument passing behavior.
::llvm::ArrayRef<DescriptorSetLayoutAttr>
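One possible rendering that composes the binding, set layout, and pipeline layout attributes above; the ordinals and binding count are placeholders and the exact printed form may differ:
#hal.pipeline.layout<push_constants = 1, sets = [
  #hal.descriptor_set.layout<0, bindings = [
    #hal.descriptor_set.binding<0, storage_buffer>,
    #hal.descriptor_set.binding<1, storage_buffer>
  ]>
]>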
Allocates buffers for a particular device memory space.
A memory buffer with a specific memory_type that is used to describe the capabilities and behavior of the backing memory of the buffer. Buffers may be any mix of host-accessible, host-coherent, or device-accessible for various usages. Depending on these memory types the buffers may be mapped for access on the host as memory though certain restrictions may be imposed.
A shaped and typed buffer reference. This just wraps an existing hal.buffer with its associated metadata to make it easier to pass across ABI boundaries. In most cases buffer views can be elided entirely by the compiler and they'll only be seen when calling external functions.
Channel identifier used to allow for participation in multiple collective groups.
Asynchronous command buffer recording interface. Commands are recorded by the implementation for later submission to command queues.
Descriptor set layout.
Logical device instance.
Events are used for defining synchronization scopes within CommandBuffers. An event only exists within a single CommandBuffer and must not be used across CommandBuffers from the same device or others.
A prepared and ready-to-dispatch executable.
A set of semaphore timepoints defining a common point in time across multiple timelines.
A stateless file handle that can be read/written using queue-ordered transfer operations.
A pipeline layout describing the descriptor sets and push constants used.
IREE inline HAL interop runtime module dialect.
Low-level dialect for limited in-process ABI interop with the full HAL. Only operates synchronously, single-threaded, and on host-local buffers. Use the full HAL for all other cases.
This dialect can be used alongside the full HAL but is intended for use in standalone configurations or paired with the hal_loader dialect which also carries the same usage restrictions.
hal_loader
See hal_inline.imports.mlir for the full list of exported functions.
hal_inline.imports.mlir
hal_inline.buffer.allocate.initialized
Buffer allocation with cloning
operation ::= `hal_inline.buffer.allocate.initialized` `source` `(` $source `:` type($source) `)` `` `[` $offset `,` $length `]`\n `alignment` `(` $minimum_alignment `)`\n `:` custom<SizeAwareType>(type($result), ref($length)) `in` type($storage)\n attr-dict-with-keyword\n
Allocates a buffer with a copy of the provided contents.
minimum_alignment
storage
hal_inline.buffer.allocate
operation ::= `hal_inline.buffer.allocate` `alignment` `(` $minimum_alignment `)`\n `:` custom<SizeAwareType>(type($result), $allocation_size) `in` type($storage)\n attr-dict-with-keyword\n
Allocates a buffer of the given size. The size of the buffer returned may be larger than the requested size if the allocator has specific alignment requirements or minimum allocation sizes.
allocation_size
hal_inline.buffer.length
operation ::= `hal_inline.buffer.length` `<` $buffer `:` type($buffer) `>`\n `:` type($result)\n attr-dict-with-keyword\n
hal_inline.buffer.storage
Buffer backing storage accessor
operation ::= `hal_inline.buffer.storage` `<` $buffer `:` type($buffer) `>`\n `:` type($storage)\n attr-dict-with-keyword\n
Returns the host backing storage of the HAL buffer as a subspan limited to the buffer's logical range (meaning that byte 0 of the returned buffer is byte 0 of the HAL buffer).
hal_inline.buffer.subspan
operation ::= `hal_inline.buffer.subspan` `<` $source_buffer `:` type($source_buffer) `>`\n `` `[` $source_offset `,` $length `]`\n `:` type($result)\n attr-dict-with-keyword\n
hal_inline.buffer.wrap
Host buffer wrapping operation
operation ::= `hal_inline.buffer.wrap` `source` `(` $source `:` type($source) `)` `` `[` $offset `,` $length `]`\n `:` type($result)\n attr-dict-with-keyword\n
Tries wrapping a !hal.buffer around host memory backed by the given byte buffer.
hal_inline.buffer_view.assert
operation ::= `hal_inline.buffer_view.assert` `<` $buffer_view `:` type($buffer_view) `>`\n `message` `(` $message `)`\n `shape` `(` `[` $shape `]` `)`\n `type` `(` $element_type `)`\n `encoding` `(` $encoding_type `)`\n attr-dict-with-keyword\n
hal_inline.buffer_view.buffer
operation ::= `hal_inline.buffer_view.buffer` `<` $buffer_view `:` type($buffer_view) `>`\n `:` type($result)\n attr-dict-with-keyword\n
hal_inline.buffer_view.create
operation ::= `hal_inline.buffer_view.create` `buffer` `(` $source_buffer `:` type($source_buffer) `)`\n `` `[` $source_offset `,` $source_length `]`\n `shape` `(` `[` $shape `]` `)`\n `type` `(` $element_type `)`\n `encoding` `(` $encoding_type `)`\n `:` type($result)\n attr-dict-with-keyword\n
hal_inline.buffer_view.dim
operation ::= `hal_inline.buffer_view.dim` `<` $buffer_view `:` type($buffer_view) `>`\n `` `[` $index `]`\n `:` type($result)\n attr-dict-with-keyword\n
hal_inline.buffer_view.element_type
operation ::= `hal_inline.buffer_view.element_type` `<` $buffer_view `:` type($buffer_view) `>`\n `:` type($result)\n attr-dict-with-keyword\n
hal_inline.buffer_view.encoding_type
operation ::= `hal_inline.buffer_view.encoding_type` `<` $buffer_view `:` type($buffer_view) `>`\n `:` type($result)\n attr-dict-with-keyword\n
hal_inline.buffer_view.rank
operation ::= `hal_inline.buffer_view.rank` `<` $buffer_view `:` type($buffer_view) `>`\n `:` type($result)\n attr-dict-with-keyword\n
hal_inline.buffer_view.trace
operation ::= `hal_inline.buffer_view.trace` $key `=`\n $operands `:` type($operands)\n attr-dict-with-keyword\n
hal_inline.device.query
operation ::= `hal_inline.device.query` `key` `(` $category `:` `` `:` $key `)`\n `:` type($ok) `,` type($value)\n (`=` $default_value^)?\n attr-dict-with-keyword\n
Well-known keys: (none yet)
IREE HAL inline executable loader runtime module dialect.
Low-level dialect for dynamically loading executables and dispatching work. Only operates synchronously, single-threaded, and on host-local buffers. Use the full HAL for all other cases.
This dialect can be used alongside the full HAL but is intended for use in conjunction with the hal_inline dialect which also carries the same usage restrictions.
hal_inline
See hal_loader.imports.mlir for the full list of exported functions.
hal_loader.imports.mlir
hal_loader.executable.dispatch
Inline executable dispatch operation
operation ::= `hal_loader.executable.dispatch` `executable` `(` $executable `:` type($executable) `)`\n `` `[` $entry_point `]`\n `workgroups` `(` `[`\n $workgroup_x `,`\n $workgroup_y `,`\n $workgroup_z\n `]` `)`\n (`constants` `(` `[` $push_constants^ `]` `)`)?\n `bindings` `(` `[`\n custom<DispatchBindings>($binding_buffers,\n type($binding_buffers),\n $binding_offsets,\n $binding_lengths)\n `]` `)`\n attr-dict-with-keyword\n
Dispatches execution to an executable entry point with the given parameters.
hal_loader.executable.dispatch.symbol
operation ::= `hal_loader.executable.dispatch.symbol` `executable` `(` $executable `:` type($executable) `)`\n `target` `(` $entry_point `)`\n `workgroups` `(` `[`\n $workgroup_x `,`\n $workgroup_y `,`\n $workgroup_z\n `]` `)`\n (`constants` `(` `[` $push_constants^ `]` `)`)?\n `bindings` `(` `[`\n custom<DispatchBindings>($binding_buffers,\n type($binding_buffers),\n $binding_offsets,\n $binding_lengths)\n `]` `)`\n attr-dict-with-keyword\n
Dispatches execution to an executable entry point with the given parameters. The entry point is a symbolic reference to an exported entry point.
Interfaces: SymbolUserOpInterface
hal_loader.executable.load
Dynamically loads an executable
operation ::= `hal_loader.executable.load` `format` `(` $format `)`\n `data` `(` $data `)`\n (`constants` `(` `[` $constants^ `]` `)`)?\n `:` type($result)\n attr-dict-with-keyword\n
Creates, loads, and dynamically links an executable.
hal_loader.executable.lookup
operation ::= `hal_loader.executable.lookup` `executable` `(` $executable `)`\n `:` type($result)\n attr-dict-with-keyword\n
Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface), OpAsmOpInterface, SymbolUserOpInterface
hal_loader.executable.query_support
Queries whether an executable format is supported
operation ::= `hal_loader.executable.query_support` `format` `(` $executable_format `)`\n `:` type($supported)\n attr-dict-with-keyword\n
Returns true if the given format is supported by the device loader. This does not guarantee that loading will succeed as the executable may require functionality that cannot be met by the hosting runtime environment.
executable_format
supported
External parameter resource management APIs.
Parameters are externalized storage for resources that are asynchronously accessible and device-aware. Parameters can be read or written on the same device timelines as the operations that consume or produce them and with locality pinning to ensure memory doesn't need to move. Parameters are referenced by a scope and a key, with the scope being optional but strongly recommended as a way to distinguish sets of parameters that may exist when multiple model parts are compiled together and would otherwise collide.
Parameters are provided by a few operations implementing a virtual interface and can support shared parameters (same storage used in multiple contexts, or outliving a single instantiation in a context), in-memory caches, memory-mapped files (including directly using the mapped memory for execution when devices support it), iree_hal_file_t usage for device-supported I/O, and parameter subsetting for things like runtime sharding.
iree_hal_file_t
Alongside read(+load) and write operations, gather and scatter allow for batching of large numbers of reads and writes into/from single buffers. For parameter providers that can batch operations this allows a handful (~1-4) of calls out to perform many more operations (~thousands). Modeling the gather/scatter also gives us a point where we could extract the mapping and use it to repack files/defrag memory in the future.
See io_parameters.imports.mlir for the full list of exported functions.
io_parameters.imports.mlir
Ops parameter I/O.
io_parameters.gather
Gathers multiple parameters from a parameter scope
operation ::= `io_parameters.gather` `<` $device `:` type($device) `>`\n `affinity` `(` $queue_affinity `)`\n `wait` `(` $wait_fence `)`\n `signal` `(` $signal_fence `)`\n `{`\n custom<ParameterGatherOperations>(\n $source_scope, $source_keys, $source_offsets,\n $target_buffer, type($target_buffer), $target_offsets, $target_lengths)\n `}`\n attr-dict-with-keyword\n
Asynchronously gathers one or more parameters into a single target buffer. This is equivalent to one read per parameter but allows implementations that can batch operations to do so without additional overhead.
source_scope
source_keys
source_offsets
target_offsets
target_lengths
io_parameters.load
Reads one or more parameters from a parameter scope
operation ::= `io_parameters.load` `<` $device `:` type($device) `>`\n `affinity` `(` $queue_affinity `)`\n `wait` `(` $wait_fence `)`\n `signal` `(` $signal_fence `)`\n `type` `(` $memory_types `)`\n `usage` `(` $buffer_usage `)`\n `{`\n custom<ParameterLoadOperations>(\n $source_scope, $source_keys, $source_offsets,\n type($results), $lengths)\n `}`\n attr-dict-with-keyword\n
Asynchronously reads one or more parameters from an external parameter provider and returns the resulting buffers. Depending on the parameter and buffer types this may alias existing cached storage or be directly mapped to the parameter origin or result in a copy as if an allocate + read had been used.
Interfaces: Util_SizeAwareOp
Util_SizeAwareOp
io_parameters.scatter
Scatters multiple parameters to a parameter scope
operation ::= `io_parameters.scatter` `<` $device `:` type($device) `>`\n `affinity` `(` $queue_affinity `)`\n `wait` `(` $wait_fence `)`\n `signal` `(` $signal_fence `)`\n `{`\n custom<ParameterScatterOperations>(\n $source_buffer, type($source_buffer), $source_offsets, $source_lengths,\n $target_scope, $target_keys, $target_offsets)\n `}`\n attr-dict-with-keyword\n
Asynchronously scatters one or more parameters from a single source buffer into one or more parameters. This is equivalent to one write per parameter but allows implementations that can batch operations to do so without additional overhead.
target_scope
target_keys
source_lengths
Public ops/type/attributes legal for input to IREE's compiler.
IREE's compiler allows as input a number of common dialects. This dialect contains structural and unique ops that do not exist elsewhere or that IREE has an interest in maintaining as a stable set.
The contents of this dialect often mirror various constructs in IREE's internal implementation. The focus here is on simplicity and stability over time. Generally, this dialect does not use "advanced" features and should be broadly source compatible over a range of LLVM versions. There are, of course, limits, and source compatibility is not guaranteed, since LLVM/MLIR's API surface is itself unstable.
iree_input.buffer.subspan
operation ::= `iree_input.buffer.subspan` `<` $source_buffer `:` type($source_buffer) `>`\n `` `[` $source_offset `,` $length `]`\n `:` type($result)\n attr-dict-with-keyword\n
iree_input.buffer_view.create
operation ::= `iree_input.buffer_view.create` `buffer` `(` $source_buffer `:` type($source_buffer) `)`\n `` `[` $source_offset `,` $source_length `]`\n `shape` `(` `[` $shape `]` `)`\n `type` `(` $element_type `)`\n `encoding` `(` $encoding_type `)`\n `:` type($result)\n attr-dict-with-keyword\n
iree_input.buffer_view.dim
operation ::= `iree_input.buffer_view.dim` $buffer_view `,` $index attr-dict `:` type($result)\n
iree_input.buffer_view.rank
operation ::= `iree_input.buffer_view.rank` $buffer_view attr-dict `:` type($result)\n
iree_input.byte_buffer.constant
Constant host-side byte buffer
operation ::= `iree_input.byte_buffer.constant` ($name^)? attr-dict `:` type($result) `=` $value\n
Defines a compile-time byte buffer based on the given attribute value. The attribute will be serialized into the canonical IREE format for the chosen host target.
iree_input.optimization_barrier
Prevents compiler optimizations across a value.
operation ::= `iree_input.optimization_barrier` attr-dict\n ($operands^ `:` type($operands))?\n
Wraps any operands in an unoptimizable identity to prevent its results from being folded. It will be dropped during the final step in compilation and has no effect at runtime.
Traits: SameOperandsAndResultType
SameOperandsAndResultType
iree_input.dispatch
A dispatch of an executable across a grid
operation ::= `iree_input.dispatch` $entry_point\n (`[` $workload^ `]`)? ``\n `(` $arguments `)` attr-dict `:`\n custom<ShapedFunctionType>(ref($arguments),\n type($arguments), $argument_dims,\n type($results), $result_dims,\n $tied_operands)\n
Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface), SymbolUserOpInterface, TiedOpInterface
iree_input.executable.export
operation ::= `iree_input.executable.export` custom<SymbolVisibility>($sym_visibility)\n $sym_name\n `ordinal` `(` $ordinal `)`\n `layout` `(` $layout `)`\n attr-dict-with-keyword\n
Traits: HasParent<IREE::Input::ExecutableSourceOp>, IsolatedFromAbove
HasParent<IREE::Input::ExecutableSourceOp>
iree_input.executable.source_end
operation ::= `iree_input.executable.source_end` attr-dict\n
Traits: HasParent<IREE::Input::ExecutableSourceOp>, Terminator
iree_input.executable.source
operation ::= `iree_input.executable.source` custom<SymbolVisibility>($sym_visibility)\n $sym_name\n attr-dict-with-keyword\n ``\n regions\n
Traits: IsolatedFromAbove, SingleBlockImplicitTerminator<IREE::Input::ExecutableSourceEndOp>, SingleBlock, SymbolTable
SingleBlockImplicitTerminator<IREE::Input::ExecutableSourceEndOp>
iree_input.global.address
Returns an address reference to a global
operation ::= `iree_input.global.address` $global attr-dict `:` type($result)\n
Returns the address of a global as a typed reference. Can be used with the global load and store indirect ops.
global
iree_input.global.load.indirect
Loads a value from a global variable
operation ::= `iree_input.global.load.indirect` $global attr-dict `:` type($global) `->` type($result)\n
Returns a copy of the global value.
iree_input.global.load
operation ::= `iree_input.global.load` $global attr-dict `:` type($result)\n
iree_input.global
Stateful global variable declaration
operation ::= `iree_input.global` custom<SymbolVisibility>($sym_visibility)\n (`mutable` $is_mutable^)?\n $sym_name\n attr-dict\n (`initializer` `(` $initializer^ `)`)?\n custom<TypeOrAttr>($type, $initial_value)\n
Declares a global variable that maintains its value across invocations. The value is tied to the execution context of the module and different contexts will have different global storage.
is_mutable
initializer
initial_value
iree_input.global.store.indirect
Stores a value into a global variable
operation ::= `iree_input.global.store.indirect` $value `,` $global attr-dict `:` type($value) `->` type($global)\n
Stores a copy of the value into a global.
iree_input.global.store
operation ::= `iree_input.global.store` $value `,` $global attr-dict `:` type($value)\n
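Tying the global declaration, load, and store ops together, a sketch with a placeholder @counter global:
// Declare a mutable global and bump it by one per invocation.
iree_input.global private mutable @counter = 0 : i32

func.func @increment() {
  %0 = iree_input.global.load @counter : i32
  %c1 = arith.constant 1 : i32
  %1 = arith.addi %0, %c1 : i32
  iree_input.global.store %1, @counter : i32
  return
}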
iree_input.list.create
Creates a new empty list
operation ::= `iree_input.list.create` ($initial_capacity^)? attr-dict `:` type($result)\n
Creates a new empty list with an optional initial capacity.
Interfaces: ConditionallySpeculatable, MemoryEffectOpInterface (MemoryEffectOpInterface), NoMemoryEffect (MemoryEffectOpInterface)
Effects: MemoryEffects::Effect{MemoryEffects::Allocate on ::mlir::SideEffects::DefaultResource}, MemoryEffects::Effect{}
initial_capacity
iree_input.list.get
Element accessor
operation ::= `iree_input.list.get` $list `[` $index `]` attr-dict `:` type($list) `->` type($result)\n
Returns the value of the element at the given index. Note that the value may be null if the element is null or the type does not match.
Interfaces: MemoryEffectOpInterface (MemoryEffectOpInterface)
Effects: MemoryEffects::Effect{MemoryEffects::Read on ::mlir::SideEffects::DefaultResource}
MemoryEffects::Effect{MemoryEffects::Read on ::mlir::SideEffects::DefaultResource}
iree_input.list.resize
Resizes the list to a new count in elements
operation ::= `iree_input.list.resize` operands attr-dict `:` type($list)\n
Resizes the list to contain new_size elements. This will either truncate the list if the existing size is greater than new_size or extend the list with the default list value of the element type.
new_size
Effects: MemoryEffects::Effect{MemoryEffects::Write on ::mlir::SideEffects::DefaultResource}
MemoryEffects::Effect{MemoryEffects::Write on ::mlir::SideEffects::DefaultResource}
iree_input.list.set
Element mutator
operation ::= `iree_input.list.set` $list `[` $index `]` `,` $value attr-dict `:` type($list) `,` type($value)\n
Sets the element at the given index to the new value.
iree_input.list.size
The size of the list in elements
operation ::= `iree_input.list.size` operands attr-dict `:` type($list)\n
Returns the current size of the list in elements.
Interfaces: InferTypeOpInterface, MemoryEffectOpInterface (MemoryEffectOpInterface)
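A sketch exercising the list ops above with a placeholder i32 element type:
// Create a list, size it, write one element, then read it back.
%c0 = arith.constant 0 : index
%c4 = arith.constant 4 : index
%value = arith.constant 42 : i32
%list = iree_input.list.create %c4 : !iree_input.list<i32>
iree_input.list.resize %list, %c4 : !iree_input.list<i32>
iree_input.list.set %list[%c0], %value : !iree_input.list<i32>, i32
%element = iree_input.list.get %list[%c0] : !iree_input.list<i32> -> i32
%size = iree_input.list.size %list : !iree_input.list<i32>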
iree_input.tensor.export
Exports a tensor to a Buffer(View), capturing dynamic dims
operation ::= `iree_input.tensor.export` $source `:` type($source) (`{` $source_dims^ `}`)? `->` type($target)\n attr-dict-with-keyword\n
iree_input.tensor.import
Imports a Buffer(View) to a tensor, providing dynamic dims
operation ::= `iree_input.tensor.import` $source `:` type($source) `->` type($target) (`{` $target_dims^ `}`)?\n attr-dict-with-keyword\n
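A sketch of a round trip through the ABI types; the dynamic dimension and value names are placeholders:
// Export a tensor to a buffer view and import it back, capturing the dynamic dim.
%view = iree_input.tensor.export %tensor : tensor<?x4xf32>{%dim} -> !iree_input.buffer_view
%imported = iree_input.tensor.import %view : !iree_input.buffer_view -> tensor<?x4xf32>{%dim}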
iree_input.tensor.bitcast
operation ::= `iree_input.tensor.bitcast` $source `:`\n type($source) (`{` $source_dims^ `}`)? `->`\n type($result) (`{` $result_dims^ `}`)?\n attr-dict-with-keyword\n
Bitcasts a tensor to a new shape without modifying the contents.
iree_input.tensor.clone
operation ::= `iree_input.tensor.clone` $operand `:` type($result) (`{` $operand_dims^ `}`)?\n attr-dict-with-keyword\n
operand_dims
iree_input.tensor.load
operation ::= `iree_input.tensor.load` $source (`[` $indices^ `]`)? `:`\n type($source) (`{` $source_dims^ `}`)?\n attr-dict-with-keyword\n
iree_input.tensor.reshape
operation ::= `iree_input.tensor.reshape` $source `:`\n type($source) (`{` $source_dims^ `}`)? `->`\n type($result) (`{` $result_dims^ `}`)?\n attr-dict-with-keyword\n
iree_input.tensor.slice
operation ::= `iree_input.tensor.slice` $source `[` $start_indices `for` $lengths `]` `:`\n type($source) (`{` $source_dims^ `}`)? `->`\n type($result) (`{` $result_dims^ `}`)?\n attr-dict-with-keyword\n
iree_input.tensor.splat
operation ::= `iree_input.tensor.splat` $value `:` type($result) (`{` $result_dims^ `}`)?\n attr-dict-with-keyword\n
iree_input.tensor.store
operation ::= `iree_input.tensor.store` $value `,` $target (`[` $indices^ `]`)? `:`\n type($target) (`{` $target_dims^ `}`)?\n attr-dict-with-keyword\n
iree_input.tensor.trace
operation ::= `iree_input.tensor.trace` $key `=` `[`\n custom<ShapedOperandList>($values, type($values), $value_dims)\n `]` attr-dict-with-keyword\n
iree_input.tensor.update
operation ::= `iree_input.tensor.update` $update `,` $target `[` $start_indices `]` `:`\n type($update) (`{` $update_dims^ `}`)? `->`\n custom<ShapedTiedResult>(type($result), $target_dims)\n attr-dict-with-keyword\n
Interfaces: ConditionallySpeculatable, InferTypeOpInterface, NoMemoryEffect (MemoryEffectOpInterface), TiedOpInterface
iree_input.align
Aligns up to a power-of-two alignment if required
operation ::= `iree_input.align` $value `,` $alignment attr-dict `:` type($result)\n
Aligns |value| up to the given power-of-two |alignment| if required.
Traits: AlwaysSpeculatableImplTrait, SameOperandsAndResultType
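For example, aligning a byte size up to a 64-byte boundary (the values are placeholders):
// %aligned_size is %size rounded up to the next multiple of 64.
%c64 = arith.constant 64 : index
%aligned_size = iree_input.align %size, %c64 : index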
iree_input.null
A null value
operation ::= `iree_input.null` attr-dict `:` type($result)\n
Initializes reference and variant types with a null value.
iree_input.dispatch.workgroup.count
operation ::= `iree_input.dispatch.workgroup.count` `[` $dimension `]` attr-dict `:` type($result)\n
Corresponds to the NumWorkgroups SPIR-V built-in and the gridDim CUDA built-in variable, except that in this dialect the number of dimensions is not restricted to 3 (XYZ).
%x = iree_input.dispatch.workgroup.count[0] : index\n%y = iree_input.dispatch.workgroup.count[1] : index\n
iree_input.dispatch.workgroup.id
operation ::= `iree_input.dispatch.workgroup.id` `[` $dimension `]` attr-dict `:` type($result)\n
The global workgroup ID of the current workgroup in the range of [0, iree_input.dispatch.workgroup.count) along each dimension.
[0, iree_input.dispatch.workgroup.count)
Corresponds to the WorkgroupId SPIR-V built-in and the blockIdx CUDA built-in variable, except that in this dialect the number of dimensions is not restricted to 3 (XYZ).
%x = iree_input.dispatch.workgroup.id[0] : index\n%y = iree_input.dispatch.workgroup.id[1] : index\n
iree_input.dispatch.workgroup.size
operation ::= `iree_input.dispatch.workgroup.size` `[` $dimension `]` attr-dict `:` type($result)\n
Workgroup sizes are not determined at the iree dialect level as they are dependent on the target backend determined when lowering into the HAL. It's still possible to use the symbolic workgroup size inside of dispatch executables as a placeholder for the resolved value once in the HAL.
Corresponds to the WorkgroupSize SPIR-V built-in and the blockDim CUDA built-in variable, except that in this dialect the number of dimensions is not restricted to 3 (XYZ).
%x = iree_input.dispatch.workgroup.size[0] : index\n%y = iree_input.dispatch.workgroup.size[1] : index\n
#iree_input.descriptor_set.binding<\n int64_t, # ordinal\n DescriptorType, # type\n std::optional<DescriptorFlags> # flags\n>\n
#iree_input.descriptor_set.layout<\n int64_t, # ordinal\n ::llvm::ArrayRef<DescriptorSetBindingAttr>, # bindings\n std::optional<DescriptorSetLayoutFlags> # flags\n>\n
#iree_input.descriptor_type<\n ::mlir::iree_compiler::IREE::Input::DescriptorType # value\n>\n
::mlir::iree_compiler::IREE::Input::DescriptorType
executable object reference
#iree_input.pipeline.layout<\n int64_t, # pushConstants\n ::llvm::ArrayRef<DescriptorSetLayoutAttr> # setLayouts\n>\n
A mutable, resizable list of some type.
Buffer is an untyped bag of bits with no shape or dtype
Syntax: !iree_input.buffer
!iree_input.buffer
Buffers represent an untyped bag of bits that can be reinterpreted depending on the use case using buffer_view operations. Buffers can be used for packing multiple tensors into the same underlying storage. It is left to higher level code to decide how exactly tensors are laid out in the buffer.
View into a buffer, with runtime shape and element type
Syntax: !iree_input.buffer_view
!iree_input.buffer_view
BufferViews represent views onto backing IREE runtime Buffer objects, adding runtime shape and element type parameters to the backing buffer. BufferViews are typically accepted and returned at boundaries with external code.
In the runtime and lower levels of the compiler, BufferViews are fully modeled; however, as boundary types, not all features are exposed publicly. Since within compiled tensor programs it is typical to operate in terms of fully typed tensors, the primary mechanism for getting or using a BufferView at the high level is by casting to/from a tensor. It is left to higher level code to ensure that aliasing rules are enforced at such boundaries.
a reference counted byte buffer
Syntax: !iree_input.byte_buffer
!iree_input.byte_buffer
A reference counted byte buffer that models a pointer, offset, and length.
A one dimensional list of runtime values
Represents a list of arbitrary type. Primitive types can be expected to be efficiently stored in an unboxed form. Reference types and variants are permitted.
Lists can either be homogeneous, with a fixed element type, or heterogeneous by parameterizing them with a VariantType.
::mlir::Type
Pointer to a concrete type
Represents any legal or reference type in the IREE runtime
Syntax: !iree_input.variant
!iree_input.variant
The variant type is typically used to parameterize container types that can contain any legal primitive, reference or null in the IREE type system.
IREE Vector Extensions.
A dialect designed for experimenting with vector operations beyond what is currently available in the Vector Dialect.
iree_vector_ext.layout_conflict_resolution
Layout Conflict Resolution operator
operation ::= `iree_vector_ext.layout_conflict_resolution` $input attr-dict `:` type($input) `->` type($output)\n
The layout conflict resolution operator takes a vector and a desired layout and transforms the vector to one with the desired layout.
sourceLayout
desiredLayout
output
iree_vector_ext.to_simd
SIMT to SIMD conversion operation
operation ::= `iree_vector_ext.to_simd` $input attr-dict `:` type($input) `->` type($output)\n
This operation is a temporary operation useful for source/target materializations when doing type conversions between distributed and not distributed vectors.
Traits: AlwaysSpeculatableImplTrait, SameOperandsAndResultElementType
SameOperandsAndResultElementType
iree_vector_ext.to_simt
SIMD to SIMT conversion operation
operation ::= `iree_vector_ext.to_simt` $input attr-dict `:` type($input) `->` type($output)\n
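A minimal sketch following the declared assembly formats; the vector shapes are placeholders for a per-thread (SIMT) value and its thread-collective (SIMD) counterpart:
%simd = iree_vector_ext.to_simd %thread_vec : vector<4x4xf16> -> vector<64x64xf16>\n%thread_vec_2 = iree_vector_ext.to_simt %simd : vector<64x64xf16> -> vector<4x4xf16>\n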
IREE Linalg Extensions.
A dialect designed for experimenting with non-structured operations that cannot be represented efficiently/directly by the Linalg dialect.
Operations for working with data layouts, padding, encodings, and other properties useful for tiling computations across iteration space dimensions.
iree_linalg_ext.pack
Pack operation
operation ::= `iree_linalg_ext.pack` attr-dict\n $inputs\n (`padding_value` `(` $padding_value^ `:` type($padding_value) `)`)?\n (`outer_dims_perm` `=` $outer_dims_perm^)?\n `inner_dims_pos` `=` $inner_dims_pos\n `inner_tiles` `=`\n custom<DynamicIndexList>($inner_tiles, $static_inner_tiles)\n `into` $outputs `:` `(` type($inputs) type($outputs) `)`\n (`->` type($results)^)?\n
The pack operation converts an input into a tiled and packed layout. The dimensions to be tiled are obtained from inner_dims_pos and the size of each tile is obtained from inner_tiles. The dimensions listed in inner_dims_pos do not need to be contiguous, in which case the tile will get transposed. Only full tiles are handled if padding_value is not set; it is UB if a tile does not perfectly divide the dimension. If padding_value is set, it will pad along the high dimensions, i.e. it pads at the bottom and on the right if the input has rank 2, and a dimension of the result type shape will be dynamic if and only if the corresponding input dimension is. As an optional input, the operation takes outer_dims_perm, which allows permuting the tiled loops.
inner_dims_pos
inner_tiles
padding_value
outer_dims_perm
Example KC_to_KCck:
iree_linalg_ext.pack %arg0 inner_dims_pos = [1, 0]\n inner_tiles = [32, 8] into %arg1 : (memref<128x256xf32> memref<16x8x32x8xf32>)\n
Example NC_to_NCnc:
iree_linalg_ext.pack %arg0 inner_dims_pos = [0, 1]\n inner_tiles = [8, 32] into %arg1 : (memref<128x256xf32> memref<16x8x8x32xf32>)\n
iree_linalg_ext.pack %arg0 outer_dims_perm = [1, 0] inner_dims_pos = [0, 1]\n inner_tiles = [32, 8] into %arg1 : (memref<128x256xf32> memref<32x4x32x8xf32>)\n
In all cases, the dimension at position 0 in the input memref (128) is tiled with a factor of 8, while the dimension at position 1 (256) is tiled with a factor of 32. In the KC_to_KCck example the point loops are interchanged, while in the KC_to_CKkc example the tiled loops are interchanged.
Example NC_to_NCnc with padding:
iree_linalg_ext.pack %arg padding_value(%pad : f32) inner_dims_pos = [0, 1]\n inner_tiles = [8, 2] into %arg1 : (memref<13x15xf32> memref<2x8x8x2xf32>)\n
Traits: AttrSizedOperandSegments, SingleBlockImplicitTerminator<::mlir::iree_compiler::IREE::LinalgExt::YieldOp>, SingleBlock
SingleBlockImplicitTerminator<::mlir::iree_compiler::IREE::LinalgExt::YieldOp>
Interfaces: DestinationStyleOpInterface, LinalgExtInterface, LinalgExtOp, MemoryEffectOpInterface, ReifyRankedShapedTypeOpInterface, TilingInterface
DestinationStyleOpInterface
LinalgExtInterface
LinalgExtOp
MemoryEffectOpInterface
TilingInterface
static_inner_tiles
inputs
outputs
iree_linalg_ext.set_encoding
Perform pack and pad operation on source
operation ::= `iree_linalg_ext.set_encoding` attr-dict $source `:` type($source) `->` type($result)\n
Operation to assign an encoding to a tensor. The operation does not change the rank or extent of a tensor. Instead it adds an encoding attribute to the tensor type to represent a change in layout.
Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface), ReifyRankedShapedTypeOpInterface
iree_linalg_ext.unpack
Unpack operation
operation ::= `iree_linalg_ext.unpack` attr-dict\n $inputs\n (`outer_dims_perm` `=` $outer_dims_perm^)?\n `inner_dims_pos` `=` $inner_dims_pos\n `inner_tiles` `=`\n custom<DynamicIndexList>($inner_tiles, $static_inner_tiles)\n `into` $outputs `:` `(` type($inputs) type($outputs) `)`\n (`->` type($results)^)?\n
The unpack operation converts a tiled and packed input to an unpacked output. See pack for more details on inner_tiles and dims_pos; it is UB if the tile does not perfectly divide the dimension. Optionally, the operation also supports permuting the tiled loops.
pack
dims_pos
Example KCck_to_KC:
iree_linalg_ext.unpack %arg0 dims_pos = [1, 0]\n inner_tiles = [32, 8] into %arg1 : (memref<16x8x32x8xf32> memref<128x256xf32>)\n
Example NCnc_to_NC:
iree_linalg_ext.unpack %arg0 dims_pos = [0, 1]\n inner_tiles = [8, 32] into %arg1 : (memref<16x8x8x32xf32> memref<128x256xf32>)\n
Example CKkc_to_KC:
iree_linalg_ext.unpack %arg1 outer_dims_perm = [1, 0] inner_dims_pos = [0, 1]\n inner_tiles = [32, 8] into %arg0 : (memref<32x4x32x8xf32> memref<128x256xf32>)\n
iree_linalg_ext.unset_encoding
Perform unpack and extract operation on source
operation ::= `iree_linalg_ext.unset_encoding` attr-dict $source `:` type($source) `->` type($result)\n
Operation to convert a tensor whose encoding represents its data layout into a tensor with the default layout (i.e. no encoding). For now the default layout in IREE is row-major.
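A schematic round trip through set_encoding and unset_encoding following the declared assembly formats; #encoding stands in for an #iree_linalg_ext.encoding attribute whose parameters are elided here:
%packed = iree_linalg_ext.set_encoding %src : tensor<?x?xf32> -> tensor<?x?xf32, #encoding>\n%plain = iree_linalg_ext.unset_encoding %packed : tensor<?x?xf32, #encoding> -> tensor<?x?xf32>\n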
iree_linalg_ext.upper_bound_tile_size
Returns an upper bound on tile sizes
operation ::= `iree_linalg_ext.upper_bound_tile_size` attr-dict $tensorType `->` type($results)\n
This returns the largest tile sizes that might result from materialization of the given encoding. This can be used outside of target-specific code, so there may be multiple targets, and this will return the maximum tile size from iterating over all of them. The evaluation happens in the MaterializeUpperBoundTileSize pass.
tensorType
iree_linalg_ext.attention
Attention operator
operation ::= `iree_linalg_ext.attention` attr-dict\n `ins` `(` $inputs `:` type($inputs) `)`\n `outs` `(` $outputs `:` type($outputs) `)`\n (`->` type($results)^)?\n
This operator takes in 3 tensors: query (Q), key (K), and value (V), and computes the attention. For self-attention, all inputs have the same shape BxNxd, where B is the batch dimension, N is the sequence length, and d is the head dimension. Typically N >> d. Mathematically, the attention is defined as matmul(softmax(matmul(Q, transpose(K))), V) and has shape BxNxd. Usually, this operator also performs scaling, masking, and dropout, but we leave that out of the current implementation. For cross-attention, the query and output have the same shape (BxNxd), while the key and value differ in sequence length (they have shape BxLxd, where L != N). After tiling, this operator produces a tiled result as per flash attention, along with the running max and sum statistics computed while processing the current tile.
max
sum
Interfaces: DestinationStyleOpInterface, LinalgExtInterface, MemoryEffectOpInterface, ReifyRankedShapedTypeOpInterface, TilingInterface
transpose_v
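A minimal sketch following the declared assembly format, assuming illustrative self-attention shapes (B=1, N=128, d=64); scaling, masking, and dropout are not modeled:
%attn = iree_linalg_ext.attention\n    ins(%query, %key, %value : tensor<1x128x64xf32>, tensor<1x128x64xf32>, tensor<1x128x64xf32>)\n    outs(%init : tensor<1x128x64xf32>) -> tensor<1x128x64xf32>\n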
iree_linalg_ext.fft
Fft operator
operation ::= `iree_linalg_ext.fft` attr-dict (`ins` `(` $inputs^ `:` type($inputs) `)`)?\n `outs` `(` $outputs `:` type($outputs) `)`\n (`:` type($results)^)?\n
Applies a 1D FFT to the innermost dim. This is an iterative FFT, not recursive; thus, the bit reversal is assumed to have already been applied to the input. The op carries an input, stage, which indicates the level of the reduction loop in the algorithm and represents the computation body. For more details, see the \"Data reordering, bit reversal, and in-place algorithms\" section in https://en.wikipedia.org/wiki/Cooley%E2%80%93Tukey_FFT_algorithm
The size of innermost dim is expected to be a power of 2.
It is optional to carry coefficient tensors/buffers as inputs. In this context, they will be the second and third inputs.
iree_linalg_ext.reverse
Reverse operator
operation ::= `iree_linalg_ext.reverse` attr-dict `dimensions` `(` $dimensions `)`\n (`ins` `(` $inputs^ `:` type($inputs) `)`)?\n (`outs` `(` $outputs^ `:` type($outputs) `)`)?\n (`:` type($results)^)?\n
A temporary solution for lowering reverse ops into IREE, allowing IREE to tile and distribute them.
dimensions
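A sketch following the declared assembly format, reversing a tensor along its first dimension; the printed form of the dimensions attribute and the shapes are assumptions:
%rev = iree_linalg_ext.reverse\n    dimensions(dense<0> : tensor<1xi64>)\n    ins(%input : tensor<4x8xf32>)\n    outs(%init : tensor<4x8xf32>) : tensor<4x8xf32>\n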
iree_linalg_ext.scan
Scan operator
operation ::= `iree_linalg_ext.scan` attr-dict\n `dimension` `(` $dimension `)`\n `inclusive` `(` $inclusive `)`\n `ins` `(` $inputs `:` type($inputs) `)`\n `outs` `(` $outputs `:` type($outputs) `)`\n $region (`->` type($results)^)?\n
Computes the inclusive/exclusive scan along a given dimension.
inclusive
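A sketch of an inclusive prefix-sum along dimension 0 following the declared assembly format; the accumulator shape and operand order are assumptions:
%result, %acc = iree_linalg_ext.scan\n    dimension(0) inclusive(true)\n    ins(%input : tensor<128xi32>)\n    outs(%output_init, %acc_init : tensor<128xi32>, tensor<i32>) {\n^bb0(%in: i32, %cur: i32):\n  %sum = arith.addi %in, %cur : i32\n  iree_linalg_ext.yield %sum : i32\n} -> tensor<128xi32>, tensor<i32>\n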
iree_linalg_ext.scatter
Scatter operator
operation ::= `iree_linalg_ext.scatter` attr-dict `dimension_map` `=` $dimension_map\n `unique_indices` `(` $unique_indices `)`\n (`ins` `(` $inputs^ `:` type($inputs) `)`)?\n `outs` `(` $outputs `:` type($outputs) `)`\n $region (`->` type($results)^)?\n
Based on XLA operation semantics, takes two inputs (update and indices) and one output value (original). The operation updates the value at the slices specified by indices by combining the current value with the value in updates using the computation specified in region. The region specifies a binary operation of signature (T, T) -> T, where T is the element type of updates (and original). The first argument corresponds to the value to be updated (i.e. from updates), and the second to the current value (i.e. the value from original).
original
updates
region
The indices operand is a 2D tensor/memref type. The first dim is the number of updates, and the second dim is the index depth. The index depth should always be static.
The first dim of updates and indices is identical, since they represent the number of updates.
The rank of the original/result is at least index_depth + rank(%updates) - 1. The first index_depth indices are derived from indices, and the last rank(%original) - index_depth dimensions of the update value shape must match the last dimensions of %original, with the preceding dims extending from the index offsets.
index_depth + rank(%updates) - 1
index_depth
The dimension_map attribute describes which index value maps to which dimension in the destination. It cannot contain duplicate values, must have as many entries as the index depth, and its values must be within the rank of the destination.
The unique_indices attribute carries the information whether all the indices are unique. If there are repeated indices, the first iteration loop will be marked as reduction.
The shapes definition follows TensorFlow operations, except that it forces batch dims to be 1D. See https://www.tensorflow.org/api_docs/python/tf/tensor_scatter_nd_update for more information.
dimension_map
unique_indices
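A sketch following the declared assembly format: each row of %updates is added into the row of %original selected by %indices. Shapes and the dimension_map value are assumptions:
%result = iree_linalg_ext.scatter\n    dimension_map = [0] unique_indices(true)\n    ins(%updates, %indices : tensor<4x8xf32>, tensor<4x1xi32>)\n    outs(%original : tensor<16x8xf32>) {\n^bb0(%update: f32, %current: f32):\n  %combined = arith.addf %update, %current : f32\n  iree_linalg_ext.yield %combined : f32\n} -> tensor<16x8xf32>\n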
iree_linalg_ext.sort
Sort operator
operation ::= `iree_linalg_ext.sort` attr-dict\n `dimension` `(` $dimension `)`\n (`ins` `(` $inputs^ `:` type($inputs) `)`)?\n `outs` `(` $outputs `:` type($outputs) `)`\n $region (`->` type($results)^)?\n
Based on XLA operation semantics, sorts the given operands at the given dimension with the given comparator.
comparator
See https://www.tensorflow.org/xla/operation_semantics#sort.
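A sketch of an ascending sort along dimension 0 following the declared assembly format; the comparator region yields true when its operands are already in order, and the shape is illustrative:
%sorted = iree_linalg_ext.sort\n    dimension(0)\n    outs(%input : tensor<128xf32>) {\n^bb0(%lhs: f32, %rhs: f32):\n  %lt = arith.cmpf olt, %lhs, %rhs : f32\n  iree_linalg_ext.yield %lt : i1\n} -> tensor<128xf32>\n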
iree_linalg_ext.topk
Top-K operator
operation ::= `iree_linalg_ext.topk` attr-dict\n `dimension` `(` $dimension `)`\n `ins` `(` $inputs `:` type($inputs) `)`\n `outs` `(` $outputs `:` type($outputs) `)`\n $region (`->` type($results)^)?\n
A Top-K operation for N-D tensors. Reduces the target dimension from the input size N down to K elements based on the supplied binary region.
Accepts an N-D tensor input consisting of values and an optional N-D tensor for indices of those values (i32 type). If input indices aren't provided, the index mapping is inferred based on the k dim. Both input values/indices tensors and output values/indices tensors must have the same shape. Top-K is computed along the target dimension (from dimension()). Returns two output tensors of values and the indices of Top-K results. The output dimensions must match the input save for the dimension that is reduced to K results.
The region accepts lhs=[next N input] and rhs=[existing K output] and yields an i1. If true, the two values are swapped: use > for a Top-K comparison and < for a Min-K comparison. Note: when the two values are equal, the first occurrence is always selected.
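A sketch following the declared assembly format (K = 3 along dimension 1), using > in the comparator for Top-K; the shapes are illustrative:
%values, %indices = iree_linalg_ext.topk\n    dimension(1)\n    ins(%input_values : tensor<2x10xf32>)\n    outs(%out_values, %out_indices : tensor<2x3xf32>, tensor<2x3xi32>) {\n^bb0(%lhs: f32, %rhs: f32):\n  %gt = arith.cmpf ogt, %lhs, %rhs : f32\n  iree_linalg_ext.yield %gt : i1\n} -> tensor<2x3xf32>, tensor<2x3xi32>\n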
iree_linalg_ext.yield
LinalgExt yield op
operation ::= `iree_linalg_ext.yield` attr-dict ($operands^ `:` type($operands))?\n
iree_linalg_ext.yield is a special terminator operation for blocks inside regions in iree_linalg_ext ops.
iree_linalg_ext
iree_linalg_ext.winograd.input_transform
Winograd Input Transform operator
operation ::= `iree_linalg_ext.winograd.input_transform` attr-dict\n `output_tile_size` `(` $output_tile_size `)`\n `kernel_size` `(` $kernel_size `)`\n `image_dimensions` `(` $image_dimensions `)`\n `ins` `(` $inputs `:` type($inputs) `)`\n `outs` `(` $outputs `:` type($outputs) `)`\n (`->` type($result)^)?\n
This operator is the first step in converting a convolution to its Winograd equivalent. Given a tile of an input image (I), this operator computes matmul(transpose(B), matmul(I, B)). The input tile is assumed to be square with each side of size m + r - 1, where the convolutional kernel is m x m and the output tile size is r x r. B is a constant 2-d square matrix of the same shape as the input tile I. The input to the operator is an image of shape (N, H, W, C) or (N, C, H, W) and the output is a tensor of shape (m + r - 1, m + r - 1, N, H', W', C) where H' = ceil((H - m + 1)/r) and W' = ceil((W - m + 1)/r). The result of this operator is first collapsed and then fed to a batch matmul op.
output_tile_size
kernel_size
image_dimensions
iree_linalg_ext.winograd.output_transform
Winograd Output Transform operator
operation ::= `iree_linalg_ext.winograd.output_transform` attr-dict\n `output_tile_size` `(` $output_tile_size `)`\n `kernel_size` `(` $kernel_size `)`\n `image_dimensions` `(` $image_dimensions `)`\n `ins` `(` $inputs `:` type($inputs) `)`\n `outs` `(` $outputs `:` type($outputs) `)`\n (`->` type($result)^)?\n
This operator is the last transform in converting a convolution to its Winograd equivalent. After convolution in the Winograd domain (which turns into an elementwise product for a single channel and batch matrix multiplication for many channels), this operator converts the output back into the original domain. Given a tile of the output (O) in the Winograd domain, this operator computes matmul(transpose(A), matmul(O, A)). The output tile is square with each side of size m + r - 1, where the convolutional kernel is m x m and the output tile size is r x r. A is a constant 2-d matrix of shape (m + r - 1) x r. The input to the operator is a tensor of shape (m + r - 1, m + r - 1, N, H', W', C) and the output is a tensor of shape (N, H, W, C) or (N, C, H, W) where H = r H' and W = r W'. This operator is followed by a tensor.extract_slice which extracts only the non-padded part of the output.
information to decide how to data-tile a tensor
#iree_linalg_ext.encoding<\n EncodingRoleAttr, # role\n ArrayAttr, # element_types\n TypeAttr, # original_type\n IntegerAttr, # matmul_narrow_M\n IntegerAttr, # matmul_narrow_N\n ArrayAttr # user_indexing_maps\n>\n
This attribute describes the change in the layout for a given tensor to execute subsequent operations on the tiled layout. The encoding serves as a way to represent the change in the way the data is laid out in memory without changing the logical rank/extent of the tensor itself. When required, the encoding can be used to explicitly manifest the layout change through operations like pack/unpack.
EncodingRoleAttr
TypeAttr
IntegerAttr
Describes the role of the tensor as an operand or a result of an operation.
#iree_linalg_ext.role<\n ::mlir::iree_compiler::IREE::LinalgExt::EncodingRole # value\n>\n
Enum cases: LHS, RHS, RESULT.
LHS
RHS
RESULT
::mlir::iree_compiler::IREE::LinalgExt::EncodingRole
A dialect designed to model execution partitioning and scheduling.
The stream dialect is designed to take tensor programs and convert them to explicitly scheduled asynchronous programs. This includes placing ops on specific targets, partitioning the work between the targets, scheduling the work for concurrency, and encoding tensors into target-specific resources.
+--------+ +----------+ +-------+\n| flow.* | -> | stream.* | -> | hal.* |\n+--------+ +----------+ +-------+\n
This sits in-between the flow and hal dialects.
flow models tensor programs by separating work into dispatchable functions in order to isolate the main host program data flow and the dense tensor compute operations.
stream models explicitly scheduled asynchronous programs by partitioning the dispatchable work, specifying target affinities, encoding tensors into target-specific forms, and scheduling the work to run concurrently.
hal models a low-level hardware abstraction layer used to manage buffers and issue asynchronous work across a variety of device types. The dialect is largely 1:1 with the IREE HAL C API.
Transforms in the dialect lower tensor values into opaque resources with the goal of ensuring no tensors survive in the IR. At entry stream.tensor.* ops are used to capture the source tensor encoding information (data type, shapes, etc) and then lowered into stream.async.* ops that model the asynchronous workloads on the opaque resources. The asynchronous operations are then partitioned, allocated, and scheduled for execution using the stream.cmd.* ops.
stream.tensor.*
stream.async.*
stream.cmd.*
It's intended that after transformation through the stream dialect the program is ready for execution on an abstract machine. At this level of representation buffers have still not been allocated and devices are not yet resolved, however the information captured in the stream IR allows such operations to be done trivially. To this end all ops carry the symbolic size of the resources on which they operate as well as the lifetime of the resources they are acting upon. This manifests in the usage of the !stream.resource type:
!stream.resource
// Unresolved lifetime (resolved during the iree-stream-refine-usage pass):\n!stream.resource<*>\n// An externally managed value (passed in via the program API).\n!stream.resource<external>\n// A staging buffer for uploads/downloads.\n!stream.resource<staging>\n// A short-lived value that is used across streams.\n!stream.resource<transient>\n// A long-lived value that persists across streams in globals.\n!stream.resource<variable>\n// An immutable value that persists for the duration of the program.\n!stream.resource<constant>\n
Operations using resources carry the size of all operand and result resources:
// %update (40 bytes) is being inserted into %target (296 bytes).\n// Can be dynamic values such as those originating from dynamic dimensions.\n%13 = stream.async.update %update, %target[%c256 to %c296] :\n !stream.resource<transient>{%c40} ->\n %target as !stream.resource<transient>{%c296}\n
Once all stream.async.* work is moved into executable regions (such as stream.async.execute) !stream.timepoint values are used to sequence the execution. These timepoints represent some point in time where all execution up to that timepoint has completed and any results that were produced by the execution are available for use. Attempting to use the resources before their corresponding timepoint has been reached will lead to undefined behavior. The benefit of this is that after timepoints are established in the IR it's possible to induce aliasing of resources without breaking execution correctness.
stream.async.execute
!stream.timepoint
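As a rough sketch of how timepoints sequence execution: a stream.async.execute region waits on an incoming timepoint and yields a new one. The custom syntax of the resource region is abbreviated here and should be treated as an assumption:
%result, %timepoint = stream.async.execute\n    await(%await_timepoint) => with(%operand as %arg0: !stream.resource<transient>{%size})\n    -> !stream.resource<transient>{%size} {\n  // ... stream.async.* ops operating on %arg0 ...\n  stream.yield %arg0 : !stream.resource<transient>{%size}\n} => !stream.timepoint\n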
stream.async.call
operation ::= `stream.async.call` (`on` `(` $affinity^ `)`)?\n $callee ``\n custom<DispatchOperands>($resource_operands,\n $resource_operand_offsets,\n $resource_operand_ends,\n $resource_operand_lengths) attr-dict `:`\n custom<ShapedFunctionType>(ref($resource_operands),\n type($resource_operands), $resource_operand_sizes,\n type($results), $result_sizes,\n $tied_operands)\n
Calls a function taking/returning resource values with stream semantics. Asynchronous calls must have no side-effects.
Note that returned resources must have their sizes declared prior to the call as this is what allows the call to be made on the stream. If external host logic is required to compute the size (avoid this at all costs!) a separate func.call can be used outside of the stream to do so. If sizes are unknowable until the operation is performed, it should instead be made as a normal asynchronous host call with 'coarse-fences'.
Traits: AttrSizedOperandSegments, Stream_AsyncPhaseOp
Stream_AsyncPhaseOp
Interfaces: AsyncAccessOpInterface, CallOpInterface, Stream_AffinityOp, Stream_StreamableOp, SymbolUserOpInterface, TiedOpInterface, Util_SizeAwareOp
AsyncAccessOpInterface
Stream_AffinityOp
Stream_StreamableOp
affinity
resource_operands
resource_operand_sizes
resource_operand_offsets
resource_operand_ends
resource_operand_lengths
result_sizes
stream.async.concurrent
Executes all ops concurrently
operation ::= `stream.async.concurrent` (`on` `(` $affinity^ `)`)?\n `with` ``\n custom<ResourceRegion>($resource_operands,\n type($resource_operands), $resource_operand_sizes,\n type($results), $result_sizes,\n $tied_operands, $body)\n attr-dict-with-keyword\n
Represents a wave of work scheduled concurrently (each op executing at the same time). All resource inputs must be captured explicitly. All results are only ready once all nested ops complete execution.
Waves can be nested to create a DAG. For example, take the following graph:
|\n v---------+---------v\n+-------|-------+ +-------|-------+\n| v--+--v | | v--+--v |\n| +----+ +----+ | | +----+ +----+ |\n| | %a | | %b | | | | %c | | %d | |\n| +----+ +----+ | | +----+ +----+ |\n| +--v--+ | | +--v--+ |\n+-------|-------+ +-------|-------+\n +---------v---------+\n |\n
Represented with nested waves:
%0 = stream.async.concurrent with(%arg) -> ... {\n %1 = stream.async.concurrent with(%arg as %arg0) -> ... {\n %a = ...\n %b = ...\n stream.yield %a, %b\n }\n %2 = stream.async.concurrent with(%arg as %arg1) -> ... {\n %c = ...\n %d = ...\n stream.yield %c, %d\n }\n stream.yield %1, %2\n }\n
Traits: AttrSizedOperandSegments, HasParent<IREE::Stream::AsyncExecuteOp, IREE::Stream::AsyncConcurrentOp>, RecursiveMemoryEffects, SingleBlockImplicitTerminator<IREE::Stream::YieldOp>, SingleBlock, Stream_AsyncPhaseOp
HasParent<IREE::Stream::AsyncExecuteOp, IREE::Stream::AsyncConcurrentOp>
RecursiveMemoryEffects
SingleBlockImplicitTerminator<IREE::Stream::YieldOp>
Interfaces: AsyncAccessOpInterface, ClosureOpInterface, RegionBranchOpInterface, Stream_AffinityOp, Stream_StreamableOp, TiedOpInterface, Util_SizeAwareOp
RegionBranchOpInterface
Executes a dependency-aware sequence of streamable ops
operation ::= `stream.async.execute` (`on` `(` $affinity^ `)`)?\n (`await` `(` $await_timepoint^ `)` `=` `` `>`)?\n `with` ``\n custom<ResourceRegion>($resource_operands,\n type($resource_operands), $resource_operand_sizes,\n type($results), $result_sizes,\n $tied_operands, $body)\n `=` `` `>` type($result_timepoint)\n attr-dict-with-keyword\n
Evaluates the operations within the region by dependency order while obeying ties when present. Nested ops execute serially in block order and nested stream.async.concurrent ops can be used to run multiple ops concurrently within the stream. All resource inputs must be captured explicitly. All results are only ready once all nested ops complete execution and the returned timepoint is reached. Zero or more timepoints may be provided to block execution until they are all reached; zero timepoints indicates that execution may begin immediately.
Traits: AttrSizedOperandSegments, RecursiveMemoryEffects, SingleBlockImplicitTerminator<IREE::Stream::YieldOp>, SingleBlock, Stream_AsyncPhaseOp
Interfaces: AsyncAccessOpInterface, ClosureOpInterface, RegionBranchOpInterface, Stream_AffinityOp, Stream_TimelineOp, TiedOpInterface, Util_SizeAwareOp
Stream_TimelineOp
await_timepoint
result_timepoint
stream.async.func
operation ::= `stream.async.func` custom<SymbolVisibility>($sym_visibility)\n $sym_name\n ``\n custom<ShapedFunctionSignature>($function_type,\n $tied_operands,\n $arg_attrs,\n $res_attrs)\n attr-dict-with-keyword\n ($body^)?\n
Declares a function that can be called as an asynchronous streaming operation via stream.async.call. Today only external functions are allowed.
Traits: IsolatedFromAbove, Stream_AsyncPhaseOp
stream.channel.count
operation ::= `stream.channel.count` $channel `:` type($result)\n attr-dict-with-keyword\n
stream.channel.create
operation ::= `stream.channel.create` (`on` `(` $affinity^ `)`)?\n (`id` `(` $id^ `)`)?\n (`group` `(` $group^ `)`)?\n (`rank` `(` $rank^ `)`)?\n (`count` `(` $count^ `)`)?\n `:` type($result)\n attr-dict-with-keyword\n
Returns a new channel with the given rank associated with the specified affinity. Collective operations using this channel must only be submitted on compatible affinities.
The group and ID are optional and may be null. The rank and count can be omitted to indicate a default inherited from the environment or device configuration at runtime.
Interfaces: ConditionallySpeculatable, InferTypeOpInterface, NoMemoryEffect (MemoryEffectOpInterface), OpAsmOpInterface, Stream_AffinityOp
stream.channel.rank
operation ::= `stream.channel.rank` $channel `:` type($result)\n attr-dict-with-keyword\n
stream.channel.split
operation ::= `stream.channel.split` $channel `,` $color `,` $key\n `:` type($channel) `->` type($result)\n attr-dict-with-keyword\n
stream.binding.subspan
operation ::= `stream.binding.subspan` $binding `` `[` $byte_offset `]`\n attr-dict `:` type($binding) `->` type($result) (`{` $dynamic_dims^ `}`)?\n
Returns a subview to a tensor or memref-like type from a binding. The same binding may have multiple subviews at different byte offsets.
stream.dispatch.workgroup.count
operation ::= `stream.dispatch.workgroup.count` `[` $dimension `]` attr-dict `:` type($result)\n
%x = stream.dispatch.workgroup.count[0] : index\n%y = stream.dispatch.workgroup.count[1] : index\n%z = stream.dispatch.workgroup.count[2] : index\n
stream.dispatch.workgroup.id
operation ::= `stream.dispatch.workgroup.id` `[` $dimension `]` attr-dict `:` type($result)\n
The global workgroup ID of the current workgroup in the range of [0, stream.dispatch.workgroup.count) along each dimension.
[0, stream.dispatch.workgroup.count)
%x = stream.dispatch.workgroup.id[0] : index\n%y = stream.dispatch.workgroup.id[1] : index\n%z = stream.dispatch.workgroup.id[2] : index\n
stream.dispatch.workgroup.size
operation ::= `stream.dispatch.workgroup.size` `[` $dimension `]` attr-dict `:` type($result)\n
Workgroup sizes are not determined at the stream dialect level as they are dependent on the target backend determined when lowering into the HAL. It's still possible to use the symbolic workgroup size inside of dispatch executables as a placeholder for the resolved value once in the HAL.
%x = stream.dispatch.workgroup.size[0] : index\n%y = stream.dispatch.workgroup.size[1] : index\n%z = stream.dispatch.workgroup.size[2] : index\n
stream.executable.end
operation ::= `stream.executable.end` attr-dict\n
Traits: HasParent<IREE::Stream::ExecutableOp>, Terminator
HasParent<IREE::Stream::ExecutableOp>
stream.executable.export
operation ::= `stream.executable.export` custom<SymbolVisibility>($sym_visibility)\n custom<SymbolAlias>($sym_name, $function_ref)\n custom<WorkgroupCountRegion>($workgroup_count)\n attr-dict-with-keyword\n
Traits: HasParent<IREE::Stream::ExecutableOp>, IsolatedFromAbove
stream.executable
operation ::= `stream.executable` custom<SymbolVisibility>($sym_visibility)\n $sym_name\n attr-dict-with-keyword\n regions\n
Traits: IsolatedFromAbove, SingleBlockImplicitTerminator<IREE::Stream::ExecutableEndOp>, SingleBlock, SymbolTable, Util_ObjectLike
SingleBlockImplicitTerminator<IREE::Stream::ExecutableEndOp>
Operations for interacting with the execution context that stream operations execute within.
stream.context.resolve
Resolves low-level context resources based on type
operation ::= `stream.context.resolve` (`on` `(` $affinity^ `)`)?\n attr-dict `:` type($results)\n
WIP; allows for accessing the implementation details of lower-level dialects such as the HAL. This will likely be reworked in the future to either live inside other dialects, use some op interface instead of having a dedicated op here, or remove the op entirely and make resolution happen explicitly.
Examples:
// Returns a HAL device.\n= stream.context.resolve on(#something) : !hal.device\n// Returns a HAL device and (optional) queue affinity.\n= stream.context.resolve on(#something) : !hal.device, i64\n// Returns a HAL allocator and (optional) queue affinity.\n= stream.context.resolve on(#something) : !hal.allocator, i64\n
Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface), Stream_AffinityOp
stream.cmd.call
operation ::= `stream.cmd.call` $callee ``\n custom<CmdCallOperands>($resource_operands,\n $resource_operand_offsets,\n $resource_operand_lengths,\n $resource_operand_accesses) attr-dict `:`\n custom<ShapedFunctionType>(ref($resource_operands),\n type($resource_operands),\n $resource_operand_sizes,\n type($results),\n $result_sizes,\n $tied_operands)\n
Calls a function operating on resource values with stream semantics. Asynchronous calls must have no side-effects.
Traits: AttrSizedOperandSegments, Stream_CmdPhaseOp
Stream_CmdPhaseOp
Interfaces: CallOpInterface, Stream_StreamableOp, Stream_SubviewEffectOp, SymbolUserOpInterface, Util_SizeAwareOp
Stream_SubviewEffectOp
resource_operand_accesses
stream.cmd.collective
Dispatches a collective operation
operation ::= `stream.cmd.collective` `` $op `` `[` $element_count `]`\n `channel` `(` $channel `)`\n (`param` `(` $param^ `:` type($param) `)`)? `{`\n custom<DispatchResources>($resources, type($resources), $resource_sizes,\n $resource_offsets, $resource_lengths,\n $resource_accesses)\n `\\n` `}`\n attr-dict-with-keyword\n
Dispatches a collective operation specified against the device. If grouped with other collectives in a stream.cmd.concurrent region the collective operations may fuse and execute more efficiently.
stream.cmd.concurrent
Interfaces: Stream_StreamableOp, Stream_SubviewEffectOp, Util_SizeAwareOp
resource_accesses
resources
resource_sizes
resource_offsets
resource_lengths
operation ::= `stream.cmd.concurrent` $body\n attr-dict-with-keyword\n
Represents a wave of work scheduled concurrently (each op executing at the same time).
|\n v---------+---------v\n+-------|-------+ +-------|-------+\n| v--+--v | | v--+--v |\n| +----+ +----+ | | +----+ +----+ |\n| | @a | | @b | | | | @c | | @d | |\n| +----+ +----+ | | +----+ +----+ |\n| +--v--+ | | +--v--+ |\n+-------|-------+ +-------|-------+\n +---------v---------+\n |\n
stream.cmd.concurrent {\n stream.cmd.concurrent {\n stream.cmd.dispatch @a\n stream.cmd.dispatch @b\n }\n stream.cmd.concurrent {\n stream.cmd.dispatch @c\n stream.cmd.dispatch @d\n }\n }\n
Traits: HasParent<IREE::Stream::CmdExecuteOp, IREE::Stream::CmdSerialOp, IREE::Stream::CmdConcurrentOp>, RecursiveMemoryEffects, SingleBlockImplicitTerminator<IREE::Stream::YieldOp>, SingleBlock, Stream_CmdPhaseOp
HasParent<IREE::Stream::CmdExecuteOp, IREE::Stream::CmdSerialOp, IREE::Stream::CmdConcurrentOp>
Interfaces: RegionBranchOpInterface, Stream_StreamableOp
stream.cmd.copy
Copies a subview of a stream resource to another
operation ::= `stream.cmd.copy` $source `[` $source_offset `]` `,`\n $target `[` $target_offset `]` `,`\n $length `:`\n type($source) `` `{` $source_size `}` `->`\n type($target) `` `{` $target_size `}`\n attr-dict-with-keyword\n
Copies a subview of a resource into a subview of another. As with memcpy this does not support overlapping updates into the same resource.
Traits: Stream_CmdPhaseOp
source_size
target_size
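A sketch following the declared assembly format, copying %length bytes from offset %c0 of %src to offset %c128 of %dst inside a command region; the lifetimes and size SSA values are assumptions:
stream.cmd.copy %src[%c0], %dst[%c128], %length :\n    !stream.resource<staging>{%src_size} -> !stream.resource<transient>{%dst_size}\n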
stream.cmd.discard
Discards a subview of a resource
operation ::= `stream.cmd.discard` $target `[` $target_offset `for` $target_length `]` `:`\n type($target) `` `{` $target_size `}`\n attr-dict-with-keyword\n
Discards a subview of a resource, indicating that after this command the specified contents are no longer needed. This can be used to trim memory or invalidate caches.
target_length
stream.cmd.dispatch
Dispatches a parallelized grid of work
operation ::= `stream.cmd.dispatch` custom<DispatchEntryPoints>($entry_points)\n (`[` $workload^ `]`)? ``\n (`(` $uniform_operands^ `:` type($uniform_operands) `)`)? `{`\n custom<DispatchResources>($resources, type($resources), $resource_sizes,\n $resource_offsets, $resource_lengths,\n $resource_accesses)\n `\\n` `}`\n attr-dict-with-keyword\n
Calls the specified entry point function once for each element in the specified workgroup count. Each workgroup has access to the same operands and results and is able to load/store at will.
Interfaces: Stream_StreamableOp, Stream_SubviewEffectOp, SymbolUserOpInterface, Util_SizeAwareOp
uniform_operands
stream.cmd.execute
operation ::= `stream.cmd.execute` (`on` `(` $affinity^ `)`)?\n (`await` `(` $await_timepoint^ `)` `=` `` `>`)?\n `with` ``\n custom<ExplicitResourceRegion>($resource_operands,\n type($resource_operands), $resource_operand_sizes,\n $body)\n `=` `` `>` type($result_timepoint)\n attr-dict-with-keyword\n
Evaluates the operations within the region by dependency order while obeying ties when present. Nested ops execute serially in block order and nested stream.cmd.concurrent ops can be used to run multiple ops concurrently within the stream. All resource inputs must be captured explicitly. All results are only ready once all nested ops complete execution and the returned timepoint is reached. Zero or more timepoints may be provided to block execution until they are all reached; zero timepoints indicates that execution may begin immediately.
Traits: AttrSizedOperandSegments, RecursiveMemoryEffects, SingleBlockImplicitTerminator<IREE::Stream::YieldOp>, SingleBlock, Stream_CmdPhaseOp
Interfaces: ClosureOpInterface, InferTypeOpInterface, RegionBranchOpInterface, Stream_AffinityOp, Stream_TimelineOp, Util_SizeAwareOp
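A schematic example combining stream.cmd.execute with a nested stream.cmd.fill; the custom region syntax is abbreviated and should be treated as an assumption:
%tp = stream.cmd.execute\n    await(%await_tp) => with(%buffer as %arg0: !stream.resource<transient>{%size}) {\n  stream.cmd.fill %c255_i8, %arg0[%c0 for %size] : i8 -> !stream.resource<transient>{%size}\n} => !stream.timepoint\n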
stream.cmd.fill
Fills a subview of a stream resource with a value
operation ::= `stream.cmd.fill` $value `,`\n $target `[` $target_offset `for` $target_length `]` `:`\n type($value) `->`\n type($target) `` `{` $target_size `}`\n attr-dict-with-keyword\n
Splats a value into a subview of the given stream resource and returns the resource with the update applied.
stream.cmd.flush
Flushes a subview of a resource
operation ::= `stream.cmd.flush` (`to` `(` $source_affinity^ `)`)?\n $target `[` $target_offset `for` $target_length `]` `:`\n type($target) `` `{` $target_size `}`\n attr-dict-with-keyword\n
Transfers a resource to an external target. The resource memory is made available to the target and can be made visible there using stream.cmd.invalidate.
stream.cmd.invalidate
source_affinity
stream.cmd.func
operation ::= `stream.cmd.func` custom<SymbolVisibility>($sym_visibility)\n $sym_name ``\n custom<DispatchFunctionSignature>($function_type,\n $arg_attrs,\n $res_attrs)\n attr-dict-with-keyword\n ($body^)?\n
Declares a function that can be called as an asynchronous streaming operation via stream.cmd.call. Today only external functions are allowed.
Traits: IsolatedFromAbove, Stream_CmdPhaseOp
Invalidates a subview of a resource
operation ::= `stream.cmd.invalidate` (`from` `(` $source_affinity^ `)`)?\n $target `[` $target_offset `for` $target_length `]` `:`\n type($target) `` `{` $target_size `}`\n attr-dict-with-keyword\n
Transfers a resource from an external source into the current target. The resource memory is assumed to have been made available at the source via stream.cmd.flush.
stream.cmd.serial
Executes all ops serially (in-order)
operation ::= `stream.cmd.serial` $body\n attr-dict-with-keyword\n
Represents a sequence of work scheduled serially (each op executing one after the other).
Regions can be nested to create a DAG. For example, take the following graph:
|\n v---------+-----v\n+-------|-------+ +---|----+\n| v--+--v | | v |\n| +----+ +----+ | | +----+ |\n| | @a | | @b | | | | @c | |\n| +----+ +----+ | | +----+ |\n| | | | | | |\n| | | | | +-v--+ |\n| | | | | | @d | |\n| | | | | +----+ |\n| +--v--+ | | | |\n+-------|-------+ +---|----+\n +---------v-----+\n |\n
Represented with nested regions:
stream.cmd.concurrent {\n stream.cmd.concurrent {\n stream.cmd.dispatch @a\n stream.cmd.dispatch @b\n }\n stream.cmd.serial {\n stream.cmd.dispatch @c\n stream.cmd.dispatch @d\n }\n }\n
File ops.
stream.file.constant
Creates a file backed by the provided constant host memory
operation ::= `stream.file.constant` (`on` `(` $affinity^ `)`)?\n $source `[` $source_offset `for` $source_length `]` `:`\n type($source) `` `{` $source_size `}`\n `->`\n type($result)\n attr-dict-with-keyword\n
Synchronously wraps a host heap buffer into a stream-accessible file handle. Changing the source buffer after definition results in undefined behavior.
Interfaces: ConditionallySpeculatable, InferTypeOpInterface, MemoryEffectOpInterface (MemoryEffectOpInterface), NoMemoryEffect (MemoryEffectOpInterface), OpAsmOpInterface, Stream_AffinityOp, SubrangeOperandOpInterface, Util_SizeAwareOp
SubrangeOperandOpInterface
stream.file.read
Reads a segment of a file into a resource
operation ::= `stream.file.read` (`on` `(` $affinity^ `)`)?\n (`await` `(` $await_timepoint^ `)` `=` `` `>`):(`:`)?\n $source `[` $source_offset `]` `,`\n $target `[` $target_offset `]` `,`\n $length `:`\n type($source) `->`\n type($target) `` `{` $target_size `}`\n `=` `` `>`\n type($result_timepoint)\n attr-dict-with-keyword\n
Asynchronously reads a segment of a file into a resource.
Some implementations can stream directly from the source file into device-local memory and file ops should be preferred to manually staging memory through host buffers.
Interfaces: AffinityOpInterface, InferTypeOpInterface, Stream_TimelineOp, Util_SizeAwareOp
AffinityOpInterface
stream.file.write
Writes a segment of a file from a resource
operation ::= `stream.file.write` (`on` `(` $affinity^ `)`)?\n (`await` `(` $await_timepoint^ `)` `=` `` `>`):(`:`)?\n $source `[` $source_offset `]` `,`\n $target `[` $target_offset `]` `,`\n $length `:`\n type($source) `` `{` $source_size `}` `->`\n type($target)\n `=` `` `>`\n type($result_timepoint)\n attr-dict-with-keyword\n
Asynchronously writes a segment of a resource into a file. The file range must be valid within the file as this operation cannot grow the underlying file storage.
Some implementations can stream directly from device-local memory into the target file and file ops should be preferred to manually staging memory through host buffers.
stream.return
Returns results from a region
operation ::= `stream.return` attr-dict\n $operands `:` type($operands)\n
The values returned are copied by-value.
Traits: AlwaysSpeculatableImplTrait, HasParent<IREE::Stream::ExecutableExportOp>, ReturnLike, Terminator
HasParent<IREE::Stream::ExecutableExportOp>
stream.yield
Yields stream values from an execution region
operation ::= `stream.yield` attr-dict\n ($resource_operands^ `:`\n custom<ShapedTypeList>(type($resource_operands),\n $resource_operand_sizes))?\n
The values returned represent the asynchronous value at the point in time the SSA value is defined (or tied).
Traits: AlwaysSpeculatableImplTrait, HasParent<IREE::Stream::AsyncExecuteOp, IREE::Stream::AsyncConcurrentOp, IREE::Stream::CmdExecuteOp, IREE::Stream::CmdSerialOp, IREE::Stream::CmdConcurrentOp>, SameVariadicOperandSize, Terminator
HasParent<IREE::Stream::AsyncExecuteOp, IREE::Stream::AsyncConcurrentOp, IREE::Stream::CmdExecuteOp, IREE::Stream::CmdSerialOp, IREE::Stream::CmdConcurrentOp>
Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface), RegionBranchTerminatorOpInterface, Util_SizeAwareOp
stream.tensor.export
Conversion placeholder for stream->other type conversion
operation ::= `stream.tensor.export` (`on` `(` $affinity^ `)`)?\n $source `:`\n $source_encoding (`` `{` $source_encoding_dims^ `}`)?\n `in`\n type($source) `` `{` $source_size `}`\n `->`\n type($result)\n attr-dict-with-keyword\n
Defines a conversion to a higher-level dialect type such as tensor that is resolved during lowering into the stream dialect. This can be used to interoperate between levels of the stack that require specifying stream types and those that prior to lowering do not handle them.
Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface), Stream_AffinityOp, TiedOpInterface, Util_ShapeAwareOp, Util_SizeAwareOp
source_encoding_dims
stream.tensor.import
Conversion placeholder for other->stream type conversion
operation ::= `stream.tensor.import` (`on` `(` $affinity^ `)`)?\n $source `:`\n type($source)\n `->`\n $result_encoding (`` `{` $result_encoding_dims^ `}`)?\n `in`\n type($result) `{` $result_size `}`\n attr-dict-with-keyword\n
Defines a conversion from a higher-level dialect type such as tensor that is resolved during lowering into the stream dialect. This can be used to interoperate between levels of the stack that require specifying stream types and those that prior to lowering do not handle them.
result_encoding
result_encoding_dims
Generic resource ops.
stream.resource.alloc
Allocates a persistent resource
operation ::= `stream.resource.alloc` (`on` `(` $affinity^ `)`)?\n (`uninitialized` $uninitialized^)?\n attr-dict `:`\n type($result) `{` $storage_size `}`\n
Allocates a persistent value (one that is long-lived and possibly external to the program) with undefined contents. Consumers of the allocated result must assume nothing of the contents and use discard access.
discard
Uninitialized allocations will have undefined contents and must only be used when all bytes are discarded prior to any reads. Runtimes decide what \"undefined contents\" means and here it only indicates that execution will be correct even if the memory starts with non-zero values.
If multiple values are allocated from the same operation it implies that they have matching lifetimes. When lowering to execution environments the separate allocations may be fused into one or more slab allocations in order to reduce overheads. How many allocations can be fused is based on the size of the individual resources and the target constraints (how large any single buffer may be, etc).
Interfaces: AffinityOpInterface, ConditionallySpeculatable, MemoryEffectOpInterface (MemoryEffectOpInterface), Util_SizeAwareOp
uninitialized
storage_size
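A minimal sketch following the declared assembly format; the variable lifetime and the size SSA value are illustrative:
%buffer = stream.resource.alloc uninitialized : !stream.resource<variable>{%byte_size}\n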
stream.resource.alloca
Allocates a transient value with undefined contents
operation ::= `stream.resource.alloca` `uninitialized`\n (`on` `(` $affinity^ `)`)?\n (`await` `(` $await_timepoint^ `)` `=` `` `>`):(`:`)?\n attr-dict\n type($result) `{` $storage_size `}`\n `=` `` `>`\n type($result_timepoint)\n
Allocates a transient value (one that is short-lived and local to the current computation) with undefined contents. Consumers of the allocated result must assume nothing of the contents and use discard access.
The resource returned is not valid for use until the timepoint is reached; execution using this resource must await on the timepoint.
Interfaces: AffinityOpInterface, ConditionallySpeculatable, MemoryEffectOpInterface (MemoryEffectOpInterface), Stream_TimelineOp, Util_SizeAwareOp
stream.resource.constants
Asynchronously uploads or maps constant values
operation ::= `stream.resource.constants` (`on` `(` $affinity^ `)`)?\n attr-dict `:`\n custom<ConstantValueList>(type($results),\n $result_sizes,\n $values)\n `\\n` ` ` ` ` `=` `` `>` type($result_timepoint)\n
Represents an upload of constant resources that may be packed, suballocated, and mapped depending on the final lowering target.
In runtime environments where memory is shared between host and device this turns into a mapping operation that avoids additional memory allocation and copies. When memory cannot be shared an asynchronous stream will be created to allocate and copy all of the constant values.
Though this op returns a unique resource for each constant value it's expected that almost all end up aliasing into the same storage. The exact packing and number of storage resources that are needed are not known until lowering to a particular backend, though, so they are separate here for proper usage tracking.
Both constant and variable resources can be produced; a constant is immutable while a variable will be treated as a constant-value initializer for a mutable resource. By modeling these together it's not required that variable initializers first be allocated, copied to the target, and then copied into the variable storage if the target is capable of doing a direct upload or mapping.
Traits: AlwaysSpeculatableImplTrait, SameVariadicResultSize
SameVariadicResultSize
Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface), Stream_AffinityOp, Stream_TimelineOp, Util_SizeAwareOp
stream.resource.dealloca
Frees a transient value when available
operation ::= `stream.resource.dealloca` (`on` `(` $affinity^ `)`)?\n (`await` `(` $await_timepoint^ `)` `=` `` `>`)?\n $operand `:` type($operand) `{` $operand_size `}`\n `=` `` `>` type($result_timepoint)\n attr-dict\n
Deallocates a transient value (one that is short-lived and local to the current computation) previously allocated using stream.resource.alloca.
The resource is considered live and valid until the provided timepoint is reached and the memory is only made available for future requests after the result timepoint is reached.
Interfaces: AffinityOpInterface, InferTypeOpInterface, MemoryEffectOpInterface (MemoryEffectOpInterface), Stream_TimelineOp, Util_SizeAwareOp
Effects: MemoryEffects::Effect{MemoryEffects::Free on ::mlir::SideEffects::DefaultResource}
MemoryEffects::Effect{MemoryEffects::Free on ::mlir::SideEffects::DefaultResource}
operand_size
stream.resource.load
Loads a value from a staging resource
operation ::= `stream.resource.load` $source `[` $source_offset `]` `:`\n type($source) `` `{` $source_size `}`\n `->`\n type($result)\n attr-dict-with-keyword\n
Returns the element(s) at the given offset in the staging resource. The operation will complete synchronously against the resource though it may introduce a yield point if the staging resource needs to be transferred.
stream.resource.pack
Packs variable-sized slices into a single slab
operation ::= `stream.resource.pack` (`on` `(` $affinity^ `)`)?\n (`offset` `(` $offset^ `)`)?\n `slices` `(` `{`\n custom<PackSliceRanges>($lifetime_intervals,\n $dynamic_slice_sizes,\n type($packed_offsets))\n `}` `)`\n `:` type($total_length)\n attr-dict-with-keyword\n
Performs a greedy packing of one or more sized slices with specified lifetimes and returns their relative offsets in an aliased linear space.
Slices are [start, end] = %slice_byte_size, where the start and end values define an inclusive lifetime range and the size is the total number of bytes required to be live for that range.
[start, end] = %slice_byte_size
// Computes the total length required for the packed values and the offsets\n// of the 3 slices requested relative to the base of the packed memory:\n%total_length, %offset_0, %offset_1, %offset_2 =\n stream.resource.pack\n // Each slice gets one result offset:\n slices({\n // 3 slices where A and B overlap and will get unique offsets\n // while B and C do not overlap and are allowed to alias.\n [0, 10] = %size_0, // A => %offset_0\n [3, 8] = %size_1, // B => %offset_1\n [9, 10] = %size_2, // C => %offset_2\n ...\n }) : index\n
The lifetime start and end points (inclusive) are only used for relative comparisons and may originate with any meaning (op order in block, epoch, phase of the moon, etc). The packing algorithm uses the intervals to determine slice liveness and when aliasing is safe.
The size of each slice may either be a constant or a runtime-computed dynamic value. Constant slices can achieve denser packing than dynamic values, and CSE/canonicalization should be applied to ensure that as many of the dynamic values as possible are equivalent.
The total length required to pack all slices is returned and can be used to acquire storage. The individual slice offsets are 0-based and as such may need additional offsetting if used directly as buffer offsets. This can be applied either via the optional offset operand or by slicing the underlying allocation buffer.
Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface), OpAsmOpInterface, Stream_AffinityOp
lifetime_intervals
dynamic_slice_sizes
total_length
packed_offsets
stream.resource.size
Returns the size of the resource storage in bytes
operation ::= `stream.resource.size` (`on` `(` $affinity^ `)`)?\n $operand\n attr-dict `:` type($operand)\n
Returns a possibly runtime-dynamic byte size of the resource backing storage. This may differ from the logical storage size of a value based on the alignment requirements of the target as well as encoding of higher level values such as sparse tensor formats.
Interfaces: ConditionallySpeculatable, InferTypeOpInterface, NoMemoryEffect (MemoryEffectOpInterface), Stream_AffinityOp, Util_SizeAwareOp
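A minimal sketch following the declared assembly format:
%size = stream.resource.size %resource : !stream.resource<transient>\n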
stream.resource.store
Stores a value into a staging resource
operation ::= `stream.resource.store` $value `,`\n $target `[` $target_offset `]` `:`\n type($value)\n `->`\n type($target) `{` $target_size `}`\n attr-dict-with-keyword\n
The operation will complete synchronously against the resource though it may introduce a yield point if the staging resource needs to be acquired.
Interfaces: MemoryEffectOpInterface (MemoryEffectOpInterface), Util_SizeAwareOp
stream.resource.subview
Slices out a cloned subview of a value
operation ::= `stream.resource.subview` $source `[` $source_offset `]` `:`\n type($source) `` `{` $source_size `}` `->`\n type($result) `` `{` $result_size `}`\n attr-dict-with-keyword\n
Aliases a byte subrange of a resource.
Interfaces: ConditionallySpeculatable, InferTypeOpInterface, NoMemoryEffect (MemoryEffectOpInterface), StreamableOpInterface, TiedOpInterface, Util_SizeAwareOp, Util_SubrangeOp, ViewLikeOpInterface
StreamableOpInterface
Util_SubrangeOp
ViewLikeOpInterface
stream.resource.try_map
Maps read-only memory into a resource
operation ::= `stream.resource.try_map` (`on` `(` $affinity^ `)`)?\n $source `[` $source_offset `]` `:`\n type($source)\n `->`\n type($did_map) `,` type($result) `` `{` $result_size `}`\n attr-dict-with-keyword\n
Synchronously maps a host heap buffer into a stream-accessible resource with the requested lifetime. If the given source cannot be mapped the did_map result will be 0 and users must find another route into memory (such as file I/O). The resulting resource is not coherent with the source and behavior is undefined if the underlying contents change.
did_map
Interfaces: ConditionallySpeculatable, MemoryEffectOpInterface (MemoryEffectOpInterface), NoMemoryEffect (MemoryEffectOpInterface), Stream_AffinityOp, Util_SizeAwareOp
Resource parameter I/O ops.
stream.parameter.gather
Gathers multiple resources from a parameter scope
operation ::= `stream.parameter.gather` (`on` `(` $affinity^ `)`)?\n (`await` `(` $await_timepoint^ `)` `=` `` `>`)?\n `{`\n custom<ParameterGatherOperations>(\n $source_scope, $source_keys, $source_offsets,\n $target, type($target), $target_size, $target_offsets, $target_lengths)\n `}`\n `=` `` `>`\n type($result_timepoint)\n attr-dict-with-keyword\n
Asynchronously gathers one or more resources into a single target stream resource. This is equivalent to one stream.parameter.read per parameter but allows implementations that can batch operations to do so without additional timeline overhead.
stream.parameter.read
stream.parameter.load
Reads one or more resources from a parameter scope
operation ::= `stream.parameter.load` (`on` `(` $affinity^ `)`)?\n (`await` `(` $await_timepoint^ `)` `=` `` `>`)?\n `{`\n custom<ParameterLoadOperations>(\n $source_scope, $source_keys, $source_offsets,\n type($results), $result_sizes)\n `}`\n `=` `` `>`\n type($result_timepoint)\n attr-dict-with-keyword\n
Asynchronously reads one or more resources from an external parameter provider and returns the resulting stream resources. Depending on the resource type this may alias existing cached storage or be directly mapped to the parameter origin or result in a copy as if stream.resource.alloca and stream.parameter.read had been used per parameter.
Traits: AlwaysSpeculatableImplTrait, AttrSizedOperandSegments, Stream_CmdPhaseOp
Interfaces: AffinityOpInterface, ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface), Stream_TimelineOp, Util_SizeAwareOp
Reads a resource from a parameter scope
operation ::= `stream.parameter.read` (`on` `(` $affinity^ `)`)?\n (`await` `(` $await_timepoint^ `)` `=` `` `>`)?\n custom<ParameterReference>($source_scope, $source_key)\n `` `[` $source_offset `]` `->`\n $target `[` $target_offset `for` $target_length `]` `:`\n type($target) `` `{` $target_size `}`\n `=` `` `>`\n type($result_timepoint)\n attr-dict-with-keyword\n
Asynchronously reads a resource from an external parameter provider into the provided target resource range.
source_key
stream.parameter.scatter
Scatters multiple resources to a parameter scope
operation ::= `stream.parameter.scatter` (`on` `(` $affinity^ `)`)?\n (`await` `(` $await_timepoint^ `)` `=` `` `>`)?\n `{`\n custom<ParameterScatterOperations>(\n $source, type($source), $source_size, $source_offsets, $source_lengths,\n $target_scope, $target_keys, $target_offsets)\n `}`\n `=` `` `>`\n type($result_timepoint)\n attr-dict-with-keyword\n
Asynchronously scatters one or more resources from a single source resource into one or more parameters. This is equivalent to one stream.parameter.write per parameter but allows implementations that can batch operations to do so without additional overhead.
stream.parameter.write
Writes a resource to a parameter scope
operation ::= `stream.parameter.write` (`on` `(` $affinity^ `)`)?\n (`await` `(` $await_timepoint^ `)` `=` `` `>`)?\n $source `[` $source_offset `for` $source_length `]` `:`\n type($source) `` `{` $source_size `}` `->`\n custom<ParameterReference>($target_scope, $target_key)\n `` `[` $target_offset `]`\n `=` `` `>`\n type($result_timepoint)\n attr-dict-with-keyword\n
Asynchronously writes a resource to an external parameter provider from the provided source resource range.
target_key
stream.async.alloca
operation ::= `stream.async.alloca` (`on` `(` $affinity^ `)`)?\n attr-dict `:` type($result) `{` $storage_size `}`\n
Traits: AlwaysSpeculatableImplTrait, Stream_AsyncPhaseOp
Interfaces: AffinityOpInterface, ConditionallySpeculatable, MemoryEffectOpInterface (MemoryEffectOpInterface), StreamableOpInterface, Util_SizeAwareOp
stream.async.clone
Clones the contents of a value
operation ::= `stream.async.clone` (`on` `(` $affinity^ `)`)?\n $source `:`\n type($source) `` `{` $source_size `}` `->`\n type($result) `` `{` $result_size `}`\n attr-dict-with-keyword\n
Clones the contents of a value at a snapshot in time. Future changes to the cloned value will not affect the result. Acts as a copy-on-write operation.
Traits: Stream_AsyncPhaseOp
Interfaces: AsyncAccessOpInterface, Stream_AffinityOp, StreamableOpInterface, Util_SizeAwareOp
stream.async.collective
Performs a collective operation
operation ::= `stream.async.collective` `` $op `` `[` $element_count `]`\n (`on` `(` $affinity^ `)`)?\n `channel` `(` $channel `)`\n custom<CollectiveParam>(ref($op), $param) ``\n $source `[` $source_offset `to` $source_end `for` $source_length `]` `,`\n $target `[` $target_offset `to` $target_end `for` $target_length `]` `:`\n type($source) `` `{` $source_size `}` `->`\n custom<ShapedTiedResult>(type($target), $target_size)\n attr-dict-with-keyword\n
TODO: document different usage. For now this should be considered a prototype; the modeling of collective operations may change in the future to better ensure in-place operations (where send/recv is a subset of recv/send). We may have dedicated operations for the send and recv verbs as they have sequencing implications - or we could add optional sequencing to this base op.
Interfaces: AsyncAccessOpInterface, InferTypeOpInterface, Stream_AffinityOp, Stream_StreamableOp, TiedOpInterface, Util_SizeAwareOp
target_end
source_end
stream.async.constant
Defines a constant resource
operation ::= `stream.async.constant` (`on` `(` $affinity^ `)`)?\n `:`\n type($result) `` `{` $result_size `}`\n `=`\n $value\n attr-dict-with-keyword\n
Returns a new resource with the given constant value.
Interfaces: AsyncAccessOpInterface, ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface), OpAsmOpInterface, Stream_AffinityOp, StreamableOpInterface, Util_SizeAwareOp
stream.async.copy
operation ::= `stream.async.copy` (`on` `(` $affinity^ `)`)?\n $source `[` $source_offset `to` $source_end `]` `,`\n $target `[` $target_offset `to` $target_end `]` `,`\n $length `:`\n type($source) `` `{` $source_size `}` `->`\n custom<ShapedTiedResult>(type($target), $target_size)\n attr-dict-with-keyword\n
Copies a subview of a resource into a subview of another. As with memcpy this does not support overlapping updates into the same resource. Unlike stream.async.update, copy sources cannot be allocated in-place.
stream.async.update
Equivalent to a stream.async.slice + stream.async.update.
stream.async.dispatch
operation ::= `stream.async.dispatch` (`on` `(` $affinity^ `)`)?\n custom<DispatchEntryPoints>($entry_points)\n (`[` $workload^ `]`)? ``\n custom<DispatchOperands>($resource_operands,\n $resource_operand_offsets,\n $resource_operand_ends,\n $resource_operand_lengths) attr-dict `:`\n custom<ShapedFunctionType>(ref($resource_operands),\n type($resource_operands), $resource_operand_sizes,\n type($results), $result_sizes,\n $tied_operands)\n
Interfaces: AsyncAccessOpInterface, Stream_AffinityOp, Stream_StreamableOp, SymbolUserOpInterface, TiedOpInterface, Util_SizeAwareOp
stream.async.fill
operation ::= `stream.async.fill` (`on` `(` $affinity^ `)`)?\n $value `,`\n $target `[` $target_offset `to` $target_end `for` $target_length `]` `:`\n type($value) `->`\n custom<ShapedTiedResult>(type($target), $target_size)\n attr-dict-with-keyword\n
Equivalent to a stream.async.splat + stream.async.update.
stream.async.load
Loads a value from a resource
operation ::= `stream.async.load` $source `[` $source_offset `]` `:`\n type($source) `` `{` $source_size `}`\n `->`\n type($result)\n attr-dict-with-keyword\n
Returns the element at the given location from within the resource.
Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface), Util_SizeAwareOp
stream.async.slice
operation ::= `stream.async.slice` (`on` `(` $affinity^ `)`)?\n $source `[` $source_offset `to` $source_end `]` `:`\n type($source) `` `{` $source_size `}` `->`\n type($result) `` `{` $result_size `}`\n attr-dict-with-keyword\n
Slices a subrange of a stream resource based on a byte range. Acts as a copy-on-write operation.
Interfaces: AsyncAccessOpInterface, ConditionallySpeculatable, InferTypeOpInterface, NoMemoryEffect (MemoryEffectOpInterface), Stream_AffinityOp, Stream_StreamableOp, Util_SizeAwareOp
stream.async.splat
Splats a value into a resource
operation ::= `stream.async.splat` (`on` `(` $affinity^ `)`)?\n $value `:` type($value) `->` type($result) `` `{` $result_size `}`\n attr-dict-with-keyword\n
Returns a new resource with the given primitive value splatted out to fill the entire contents.
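A minimal sketch of the printed form, following the assembly format above (the transient lifetime, %c0_i8, and %size values are illustrative placeholders):
// Illustrative sketch: fill a transient resource of %size bytes with zeros.
%zeros = stream.async.splat %c0_i8 : i8 -> !stream.resource<transient>{%size}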
stream.async.store
Stores a value into a resource
operation ::= `stream.async.store` $value `,`\n $target `[` $target_offset `]` `:`\n type($value)\n `->`\n custom<ShapedTiedResult>(type($target), $target_size)\n attr-dict-with-keyword\n
Returns a resource with the element at the given offset set to the given value.
Interfaces: ConditionallySpeculatable, InferTypeOpInterface, NoMemoryEffect (MemoryEffectOpInterface), TiedOpInterface, Util_SizeAwareOp
stream.async.transfer
Transfers a resource from one location/state to another
operation ::= `stream.async.transfer` (`from` `(` $source_affinity^ `)`)?\n $source `:`\n type($source) `` `{` $source_size `}` `->`\n (`to` `(` $result_affinity^ `)`)?\n type($result) `` `{` $result_size `}`\n attr-dict-with-keyword\n
Transfers a resource between different states (such as a staging lifetime to a local lifetime) or different affinities. This is roughly equivalent to a cast but may have special semantics when later lowered to one or more devices with discrete memory spaces or pools.
staging
local
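A hedged sketch of the printed form, assuming a staging-to-stream-local transfer (%staged and %size are placeholders, not part of the op definition):
// Illustrative sketch: move a staging resource into stream-usable storage.
%local = stream.async.transfer %staged : !stream.resource<staging>{%size} -> !stream.resource<*>{%size}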
Interfaces: AsyncAccessOpInterface, Stream_AffinityOp, Stream_StreamableOp, Util_SizeAwareOp
result_affinity
Updates a slice of a subview of a resource in-place
operation ::= `stream.async.update` (`on` `(` $affinity^ `)`)?\n $update `,`\n $target `[` $target_offset `to` $target_end `]` `:`\n type($update) `` `{` $update_size `}` `->`\n custom<ShapedTiedResult>(type($target), $target_size)\n attr-dict-with-keyword\n
Copies a value into a resource based on a byte range. The returned value is the entire updated target value. Updates can be turned into placement allocations, avoiding copies.
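A sketch of the tied-result form described by the assembly format above (offsets, sizes, and SSA names are illustrative):
// Illustrative sketch: write %update into bytes [0, 128) of %target in-place.
%result = stream.async.update %update, %target[%c0 to %c128] : !stream.resource<*>{%c128} -> %target as !stream.resource<*>{%target_size}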
update_size
stream.timepoint.await
Awaits a timepoint before returning a set of resources
operation ::= `stream.timepoint.await` (`on` `(` $affinity^ `)`)?\n $await_timepoint `=` `` `>`\n $resource_operands `:`\n custom<ShapedTypeList>(type($resource_operands),\n type($results), $resource_operand_sizes)\n attr-dict-with-keyword\n
After asynchronous execution scheduling, resources may exist in different states at different points in the execution timeline. This op enables resolving the version of a resource after a particular point in the timeline. Because timepoints chain transitively, the awaited timepoint need only cover the resource's availability and is not limited to its original production timepoint.
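A sketch assuming a single awaited resource (%tp, %resource, and %size are placeholders):
// Illustrative sketch: resolve %resource once the timeline reaches %tp.
%ready = stream.timepoint.await %tp => %resource : !stream.resource<transient>{%size}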
Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface), Stream_AffinityOp, Stream_TimelineOp, TiedOpInterface, Util_SizeAwareOp
stream.timepoint.barrier
Returns a timepoint indicating when a resource is available
operation ::= `stream.timepoint.barrier` (`on` `(` $affinity^ `)`)?\n $resource `:` type($resource) `` `{` $resource_size `}`\n `=` `` `>`\n type($result_timepoint)\n attr-dict-with-keyword\n
After asynchronous execution scheduling, resources may exist in different states at different points in the execution timeline. This op enables identifying when the version of a resource after a particular point in the timeline becomes available. Because timepoints chain transitively, the timepoint need only cover the resource's availability and is not limited to its original production timepoint.
Interfaces: ConditionallySpeculatable, InferTypeOpInterface, NoMemoryEffect (MemoryEffectOpInterface), Stream_AffinityOp, Stream_TimelineOp, TiedOpInterface, Util_SizeAwareOp
resource
resource_size
stream.timepoint.chain_external
Exports a timepoint to an external dialect type
operation ::= `stream.timepoint.chain_external` (`on` `(` $affinity^ `)`)?\n $await_timepoint\n `=` `` `>`\n `(` $external_values `:` type($external_values) `)`\n attr-dict-with-keyword\n
Defines a conversion to an external dialect type such as hal.fence that is resolved during lowering into the stream dialect. This can be used to interoperate between levels of the stack that require specifying stream types and those that, prior to lowering, do not handle them.
hal.fence
Interfaces: Stream_AffinityOp
external_values
stream.timepoint.export
operation ::= `stream.timepoint.export` (`on` `(` $affinity^ `)`)?\n $await_timepoint\n `=` `` `>`\n `(` type($results) `)`\n attr-dict-with-keyword\n
stream.timepoint.immediate
Returns an immediately-available timepoint
operation ::= `stream.timepoint.immediate` attr-dict\n `=` `` `>` type($result_timepoint)\n
Timepoints indicate a point in the execution timeline, and this op can be used to get a placeholder representing the start of the timeline. Any waits on the returned timepoint will resolve immediately. This generally folds away but can be useful when initializing globals or branch arguments.
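Per the assembly format above, the printed form is simply (result name illustrative):
%tp_start = stream.timepoint.immediate => !stream.timepoint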
Traits: AlwaysSpeculatableImplTrait, ConstantLike
ConstantLike
Interfaces: ConditionallySpeculatable, InferTypeOpInterface, NoMemoryEffect (MemoryEffectOpInterface), Stream_TimelineOp
stream.timepoint.import
Imports a timepoint from an external dialect type
operation ::= `stream.timepoint.import` (`on` `(` $affinity^ `)`)?\n $operands `:` `(` type($operands) `)`\n `=` `` `>`\n type($result_timepoint)\n attr-dict-with-keyword\n
Defines a conversion from an external dialect type such as hal.semaphore that is resolved during lowering into the stream dialect. This can be used to interoperate between levels of the stack that require specifying stream types and those that, prior to lowering, do not handle them.
hal.semaphore
Interfaces: ConditionallySpeculatable, InferTypeOpInterface, NoMemoryEffect (MemoryEffectOpInterface), Stream_AffinityOp
stream.timepoint.join
Joins one or more timepoints into the max of all of them
operation ::= `stream.timepoint.join` `max` `(` $await_timepoints `)` `=` `` `>` type($result_timepoint)\n attr-dict-with-keyword\n
Returns a timepoint that indicates that all of the input timepoints have been reached.
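A sketch joining two timepoints (%tp_a and %tp_b are placeholders):
// Illustrative sketch: %tp_all resolves once both inputs have been reached.
%tp_all = stream.timepoint.join max(%tp_a, %tp_b) => !stream.timepoint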
await_timepoints
stream.tensor.clone
operation ::= `stream.tensor.clone` (`on` `(` $affinity^ `)`)?\n $source `:`\n $source_encoding (`` `{` $source_encoding_dims^ `}`)?\n `in`\n type($source) `` `{` $source_size `}`\n `->`\n $result_encoding (`` `{` $result_encoding_dims^ `}`)?\n `in`\n type($result) `` `{` $result_size `}`\n attr-dict-with-keyword\n
Traits: AlwaysSpeculatableImplTrait, AttrSizedOperandSegments, Stream_TensorPhaseOp
Stream_TensorPhaseOp
Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface), Stream_AffinityOp, Stream_StreamableOp, Util_ShapeAwareOp, Util_SizeAwareOp
stream.tensor.constant
Defines a constant tensor value
operation ::= `stream.tensor.constant` (`on` `(` $affinity^ `)`)?\n `:`\n $result_encoding (`` `{` $result_encoding_dims^ `}`)?\n `in`\n type($result)\n `=`\n $value\n attr-dict-with-keyword\n
Returns a typed resource initialized to the given constant value.
Traits: AlwaysSpeculatableImplTrait, Stream_TensorPhaseOp
Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface), OpAsmOpInterface, Stream_AffinityOp, Stream_StreamableOp, Util_ShapeAwareOp
stream.tensor.empty
Defines an empty tensor value
operation ::= `stream.tensor.empty` (`on` `(` $affinity^ `)`)?\n `:`\n $result_encoding (`` `{` $result_encoding_dims^ `}`)?\n `in`\n type($result) `` `{` $result_size `}`\n attr-dict-with-keyword\n
Returns a typed resource initialized with no contents. This still carries shape metadata and may encode to a non-empty resource such as in cases where the empty representation still has data (e.g. sparse tensors). Subsequent writes must populate any ranges of the tensor that are later read.
Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface), OpAsmOpInterface, Stream_AffinityOp, StreamableOpInterface, Util_ShapeAwareOp, Util_SizeAwareOp
stream.tensor.fill
operation ::= `stream.tensor.fill` (`on` `(` $affinity^ `)`)?\n $value `,` $target `[` $start_indices `for` $lengths `]` `:`\n type($value)\n `->`\n $target_encoding (`` `{` $target_encoding_dims^ `}`)?\n `in`\n custom<ShapedTiedResult>(type($target), $target_size)\n attr-dict-with-keyword\n
Equivalent to a stream.tensor.splat + stream.tensor.update.
Traits: AttrSizedOperandSegments, Stream_TensorPhaseOp
Interfaces: InferTypeOpInterface, Stream_AffinityOp, Stream_StreamableOp, TiedOpInterface, Util_ShapeAwareOp, Util_SizeAwareOp
target_encoding_dims
stream.tensor.load
operation ::= `stream.tensor.load` $source (`[` $indices^ `]`)? `:`\n $source_encoding (`` `{` $source_encoding_dims^ `}`)?\n `in`\n type($source) `` `{` $source_size `}`\n `->`\n type($result)\n attr-dict-with-keyword\n
Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface), Util_ShapeAwareOp, Util_SizeAwareOp
stream.tensor.sizeof
Calculates the storage size of a given high-level type
operation ::= `stream.tensor.sizeof` (`on` `(` $affinity^ `)`)?\n $encoding (`{` $encoding_dims^ `}`)?\n attr-dict `:` type($storage_size)\n
Target-dependent storage size calculation using a high-level annotated type. While within the stream dialect, the storage size of a value is left as a placeholder using this op. The requisite target-specific parameters for expanding the size calculation are only available after affinities have been assigned.
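For example, a placeholder size query for a dynamically shaped tensor might look like the sketch below (%dim0 is an illustrative dynamic dimension value):
// Illustrative sketch: resolves to a concrete byte count once targets are known.
%storage_size = stream.tensor.sizeof tensor<?x4xf32>{%dim0} : index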
encoding_dims
stream.tensor.slice
operation ::= `stream.tensor.slice` (`on` `(` $affinity^ `)`)?\n $source `[` $start_indices `for` $lengths `]` `:`\n $source_encoding (`` `{` $source_encoding_dims^ `}`)?\n `in`\n type($source) `` `{` $source_size `}`\n `->`\n $result_encoding (`` `{` $result_encoding_dims^ `}`)?\n `in`\n type($result) `` `{` $result_size `}`\n attr-dict-with-keyword\n
Slices a subrange of a stream resource based on a tensor encoding. Acts as a copy-on-write operation.
stream.tensor.splat
operation ::= `stream.tensor.splat` (`on` `(` $affinity^ `)`)?\n $value\n `:` type($value)\n `->`\n $result_encoding (`` `{` $result_encoding_dims^ `}`)?\n `in`\n type($result) `` `{` $result_size `}`\n attr-dict-with-keyword\n
Returns a typed resource initialized to the given primitive value.
Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface), Stream_AffinityOp, StreamableOpInterface, Util_ShapeAwareOp, Util_SizeAwareOp
stream.tensor.store
operation ::= `stream.tensor.store` $value `,`\n $target (`[` $indices^ `]`)? `:`\n type($value)\n `->`\n $target_encoding (`` `{` $target_encoding_dims^ `}`)?\n `in`\n custom<ShapedTiedResult>(type($target), $target_size)\n attr-dict-with-keyword\n
Interfaces: ConditionallySpeculatable, InferTypeOpInterface, NoMemoryEffect (MemoryEffectOpInterface), TiedOpInterface, Util_ShapeAwareOp, Util_SizeAwareOp
stream.tensor.trace
operation ::= `stream.tensor.trace` $key `=` `[`\n custom<EncodedResourceOperands>(\n $resources, type($resources), $resource_sizes,\n $resource_encodings, $resource_encoding_dims)\n `]` attr-dict-with-keyword\n
Interfaces: ShapeAwareOpInterface, Util_SizeAwareOp
resource_encodings
resource_encoding_dims
stream.tensor.update
operation ::= `stream.tensor.update` (`on` `(` $affinity^ `)`)?\n $update `,` $target `[` $start_indices `]` `:`\n $update_encoding (`` `{` $update_encoding_dims^ `}`)?\n `in`\n type($update) `` `{` $update_size `}`\n `->`\n $target_encoding (`` `{` $target_encoding_dims^ `}`)?\n `in`\n custom<ShapedTiedResult>(type($target), $target_size)\n attr-dict-with-keyword\n
Copies a value into a resource based on tensor encodings. The returned value is the entire updated target value.
update_encoding
update_encoding_dims
#stream.collective<\n CollectiveKind, # kind\n std::optional<CollectiveReductionOp>, # reduction\n CollectiveElementType # element_type\n>\n
named parameter referenced by an optional scope and key
#stream.parameter.named<\n ::mlir::Type, # type\n StringAttr, # scope\n StringAttr, # key\n DictionaryAttr # config\n>\n
Specifies an externally-defined parameter that can be referenced by an optional scope defining a set of parameters and a key uniquely identifying the parameter within its scope.
defines partitioning configuration
Configures the partitioning algorithm to use and its configuration. Partitioning is useful to adjust when scheduling behavior of targets is radically different - such as single-threaded vs. multi-threaded CPUs or bespoke ML accelerators vs. general purpose GPUs. This mechanism controls the amount of concurrency, parallelism, memory consumption, and latency.
IREE::Stream::FavorAttr
defines resource constraints configuration
Defines resource storage constraints. These allow packing and layout algorithms to ensure they produce usable results on target devices.
bool
IREE::Stream::MemoryModel
an immediately-resolved timepoint
Stream constants are immutable values that are available for the lifetime of the program once initialized.
Stream external values represent asynchronously-available and sequenced values that are owned and managed by external code - such as those passed in or out of the program entry points. Though external values are managed during an invocation the same as other stream values, visibility into them does not extend outside of the invocation they are provided to.
Stream values are not usable directly outside of a stream execution or transfer operation. If the contents of the value are needed, they must first be transferred via stream.transfer - which may incur a copy.
stream.transfer
Stream upload/download staging resource. These are used outside of streams and then transferred to other stream resources such as variables or transients for use inside of streams. Dispatches and several other operations cannot directly operate on these resources.
Stream transients represent asynchronously-available and sequenced values that have a short lifetime - often only passed between stream executions. It is expected that transient values are not stored in global state and have minimal lifetime as they may be heavily pooled or suballocated.
A stream resource that has not yet had its lifetime calculated.
Stream variables represent asynchronously-available and sequenced values that have a long lifetime relative to the work being performed on them. These variables are often stored in global state and may live for the entire duration of the program.
a managed resource binding into an executable scope
Syntax: !stream.binding
!stream.binding
A resource binding available within an executable dispatch function. The bindings map 1:1 with the resources bound during dispatch operations.
a collective communication channel
Syntax: !stream.channel
!stream.channel
In programs that model SPMD behavior internally, channels can be created or provided by hosting applications. For example, the program could expose a @set_channels(!util.list<!stream.channel>) method that stores the channels in globals for use throughout the program, allowing for application-controlled channel configuration.
@set_channels(!util.list<!stream.channel>)
a file handle used for I/O operations
Syntax: !stream.file
!stream.file
A file handle that can be asynchronously read and written into/from stream resources.
a managed resource
IREE::Stream::Lifetime
a timepoint indicating execution availability
Syntax: !stream.timepoint
Represents a point in the execution timeline that, when resolved, indicates that all of the execution prior to this timepoint has completed and the results of the execution are available for use. This includes transitive dependencies as well: if timepoint B depends on timepoint A, then when B is available A must be as well.
A dialect used for types common across IREE subdialects.
util.align
operation ::= `util.align` $value `,` $alignment attr-dict `:` type($result)\n
util.sizeof
Returns the size in bytes of a datatype
operation ::= `util.sizeof` $sizedType attr-dict-with-keyword\n
Most datatypes have a static size at all layers of the compilation stack. However, those that only have a size for certain lowering flows can be challenging. This op represents such sizes in a way that can be specialized later.
Returns the size in bytes, rounded up to the next whole byte, of the specified type. This op will fold to a constant index value for IntegerType and FloatType. All others are not folded.
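A minimal sketch per the assembly format above (result name illustrative):
// Illustrative sketch: folds to the constant index value 4 for f32.
%size = util.sizeof f32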
sizedType
size
util.buffer.alloc
Allocates a buffer with undefined contents
operation ::= `util.buffer.alloc` `uninitialized`\n attr-dict\n `:`\n type($result) `` `{` $storage_size `}`\n
Allocates a buffer with undefined contents. Consumers of the allocated result must assume nothing of the contents.
Interfaces: ConditionallySpeculatable, InferTypeOpInterface, MemoryEffectOpInterface (MemoryEffectOpInterface), NoMemoryEffect (MemoryEffectOpInterface), OpAsmOpInterface, Util_SizeAwareOp
util.buffer.compare
Compares a range of two buffers
operation ::= `util.buffer.compare` $lhs `[` $lhs_offset `]` `,`\n $rhs `[` $rhs_offset `]` `,`\n $length `:`\n type($lhs) `` `{` $lhs_size `}` `,`\n type($rhs) `` `{` $rhs_size `}`\n attr-dict-with-keyword\n
Returns true if the two ranges are bitwise equivalent, somewhat like memcmp.
Interfaces: ConditionallySpeculatable, InferTypeOpInterface, MemoryEffectOpInterface (MemoryEffectOpInterface), NoMemoryEffect (MemoryEffectOpInterface), SubrangeOperandOpInterface, Util_SizeAwareOp
Effects: MemoryEffects::Effect{MemoryEffects::Read on ::mlir::SideEffects::DefaultResource}, MemoryEffects::Effect{}
lhs_size
lhs_offset
rhs_size
rhs_offset
util.buffer.constant
operation ::= `util.buffer.constant` ($name^)? attr-dict `:` type($result) `=` $value\n
util.buffer.copy
Copies a range of bytes between buffers
operation ::= `util.buffer.copy` $source `[` $source_offset `]` `,`\n $target `[` $target_offset `]` `,`\n $length `:`\n type($source) `` `{` $source_size `}` `->`\n type($target) `` `{` $target_size `}`\n attr-dict-with-keyword\n
Copies a range of bytes as with memcpy (no overlapping).
Interfaces: MemoryEffectOpInterface (MemoryEffectOpInterface), SubrangeOperandOpInterface, Util_SizeAwareOp
Effects: MemoryEffects::Effect{MemoryEffects::Read on ::mlir::SideEffects::DefaultResource, MemoryEffects::Write on ::mlir::SideEffects::DefaultResource}
MemoryEffects::Effect{MemoryEffects::Read on ::mlir::SideEffects::DefaultResource, MemoryEffects::Write on ::mlir::SideEffects::DefaultResource}
util.buffer.dealloc
Deallocates a buffer
operation ::= `util.buffer.dealloc` $operand `:` type($operand) `{` $operand_size `}`\n attr-dict-with-keyword\n
Hints that the buffer contents can be discarded. Buffers are reference counted and other owners may keep the buffer live beyond the dealloc.
Interfaces: ConditionallySpeculatable, MemoryEffectOpInterface (MemoryEffectOpInterface), NoMemoryEffect (MemoryEffectOpInterface), Util_SizeAwareOp
Effects: MemoryEffects::Effect{MemoryEffects::Free on ::mlir::SideEffects::DefaultResource}, MemoryEffects::Effect{}
util.buffer.fill
Fills a range of bytes with a value
operation ::= `util.buffer.fill` $pattern `,`\n $target `[` $target_offset `for` $length `]` `:`\n type($pattern) `->`\n type($target) `` `{` $target_size `}`\n attr-dict-with-keyword\n
Fills the contents of the buffer in the given byte range with a pattern. The offset and length must match the natural alignment of the pattern type.
util.buffer.hash
Computes the hash of a byte range of a buffer
operation ::= `util.buffer.hash` $source `[` $source_offset `for` $length `]`\n `:` type($source) `` `{` $source_size `}` `->` type($result)\n attr-dict-with-keyword\n
Computes the SipHash-2-4 of a value at a byte offset with the given length. This always uses a seed of 0x0001020304...0e0f and produces a single 64-bit value.
0x0001020304...0e0f
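A sketch of the printed form (offset, length, and size SSA names are placeholders):
// Illustrative sketch: hash %length bytes of %buffer starting at offset 0.
%hash = util.buffer.hash %buffer[%c0 for %length] : !util.buffer{%buffer_size} -> i64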
Interfaces: InferTypeOpInterface, MemoryEffectOpInterface (MemoryEffectOpInterface), SubrangeOperandOpInterface, Util_SizeAwareOp
util.buffer.load
Loads a value from a buffer
operation ::= `util.buffer.load` $source `[` $source_offset `for` $length `]`\n `:` type($source) `` `{` $source_size `}` `->` type($result)\n attr-dict-with-keyword\n
Loads a value at a byte offset. Must be aligned to the natural size of the result type.
util.buffer.size
Returns the total buffer storage size in bytes
operation ::= `util.buffer.size` $operand\n `:` type($operand)\n attr-dict-with-keyword\n
Returns the total length of the buffer in bytes from its base offset.
Interfaces: ConditionallySpeculatable, InferTypeOpInterface, NoMemoryEffect (MemoryEffectOpInterface), OpAsmOpInterface, Util_SizeAwareOp
util.buffer.slice
Clones a subregion of a buffer
operation ::= `util.buffer.slice` $source `[` $source_offset `]` attr-dict `:`\n type($source) `` `{` $source_size `}` `->`\n type($result) `` `{` $result_size `}`\n
Returns a copy of the contents from the source buffer.
Interfaces: ConditionallySpeculatable, InferTypeOpInterface, MemoryEffectOpInterface (MemoryEffectOpInterface), NoMemoryEffect (MemoryEffectOpInterface), OpAsmOpInterface, SubrangeOperandOpInterface, Util_SizeAwareOp
Effects: MemoryEffects::Effect{MemoryEffects::Allocate on ::mlir::SideEffects::DefaultResource, MemoryEffects::Read on ::mlir::SideEffects::DefaultResource}, MemoryEffects::Effect{}
MemoryEffects::Effect{MemoryEffects::Allocate on ::mlir::SideEffects::DefaultResource, MemoryEffects::Read on ::mlir::SideEffects::DefaultResource}
util.buffer.storage
Returns the underlying buffer storage range
operation ::= `util.buffer.storage` $operand\n `:` type($operand) `` `{` $operand_size `}` `->` `(` type($result) `,` type($offset) `)`\n attr-dict-with-keyword\n
Returns the buffer storage as a memref that must be offset and restricted to the returned range. The memref may be of any type and the user is responsible for ensuring that the reinterpret_cast-like behavior makes sense for the data they are accessing.
Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface), OpAsmOpInterface, Util_SizeAwareOp
util.buffer.store
Stores a value into a buffer
operation ::= `util.buffer.store` $source `,`\n $target `[` $target_offset `for` $length `]`\n `:` type($source) `->` type($target) `` `{` $target_size `}`\n attr-dict-with-keyword\n
Stores a value at a byte offset. Must be aligned to the natural size of the source type.
util.buffer.subspan
Returns a reference to a subrange of a buffer
operation ::= `util.buffer.subspan` $source `[` $source_offset `]` `:`\n type($source) `` `{` $source_size `}` `->`\n type($result) `` `{` $result_size `}`\n attr-dict-with-keyword\n
Returns a logical view into an underlying source buffer. This induces aliasing and multiple SSA values may allow access to the same underlying buffer storage.
Subspans are a compiler-only concept and are propagated by an analysis pass to result in absolute offsets on accesses anywhere the subrange would have been used.
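A sketch of an aliasing subspan (names and sizes are illustrative):
// Illustrative sketch: a logical view of %span_size bytes starting at %offset.
%span = util.buffer.subspan %buffer[%offset] : !util.buffer{%buffer_size} -> !util.buffer{%span_size}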
Interfaces: ConditionallySpeculatable, InferTypeOpInterface, NoMemoryEffect (MemoryEffectOpInterface), OpAsmOpInterface, SubrangeOperandOpInterface, TiedOpInterface, Util_SizeAwareOp, Util_SubrangeOp, ViewLikeOpInterface
operation ::= `util.optimization_barrier` attr-dict\n ($operands^ `:` type($operands))?\n
A constant that cannot be folded by the compiler.
Similar to a std.constant, but declared as having a side effect and with no folder. This is really just syntactic sugar, as it is canonicalized to a std.constant wrapped in a util.optimization_barrier.
util.unreachable
Unreachable assertion op
operation ::= `util.unreachable` $message attr-dict\n
Signals to the compiler that the parent block should not be reachable. This may be converted into a runtime assertion, though ideally it is stripped during translation.
^bb0:
  %true = arith.constant true
  cond_br %true, ^bb2, ^bb1
^bb1:
  // Indicates that this branch should never be taken.
  util.unreachable "shouldn't be here"
^bb2:
  ...
Traits: ReturnLike, Terminator
Interfaces: NoMemoryEffect (MemoryEffectOpInterface), RegionBranchTerminatorOpInterface
util.numeric.optional_narrow
Memorializes an optional numeric narrowing that is valid
operation ::= `util.numeric.optional_narrow` $operand `:` type($operand) `as` $semantic_type attr-dict\n
Serves as a placeholder for points in the computation where an optional numeric narrowing can be performed without loss of information. Such ops can guide optimization passes wishing to perform precision reduction.
In addition to the operand and result type, this op takes an additional semantic_type attribute representing the semantic target type which can be:
* FloatType
* Signed IntegerType
* Unsigned IntegerType
semantic_type
Note that this semantic_type must be a sign-carrying integer if using an integer type and cannot be IndexType (i.e. it can be used to indicate a possible narrowing of an IndexType to a specific integer).
If the operand is a TensorType, then the result must be a TensorType. The semantic_type constrains the element type.
Optionally, the minimum and maximum integer values (for integer semantic types) are tracked if known.
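A hedged sketch, assuming an i32 value known to fit in a signed 8-bit range (names are illustrative):
// Illustrative sketch: hint that %value could be narrowed to si8 without loss.
%narrow = util.numeric.optional_narrow %value : i32 as si8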
min_value
max_value
util.global.address
operation ::= `util.global.address` (`immutable` $is_immutable^)?\n $global attr-dict `:` qualified(type($result))\n
Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface), OpAsmOpInterface, SymbolUserOpInterface, Util_GlobalAddressOpInterface
Util_GlobalAddressOpInterface
is_immutable
util.global.load.indirect
operation ::= `util.global.load.indirect` (`immutable` $is_immutable^)?\n $global attr-dict `:` qualified(type($global)) `->` type($result)\n
Returns a copy of the global variable value.
Interfaces: Util_GlobalLoadIndirectOpInterface
Util_GlobalLoadIndirectOpInterface
util.global.load
operation ::= `util.global.load` (`immutable` $is_immutable^)?\n $global attr-dict `:` type($result)\n
Returns a global variable value.
Interfaces: MemoryEffectOpInterface, OpAsmOpInterface, SymbolUserOpInterface, Util_GlobalLoadOpInterface
Util_GlobalLoadOpInterface
util.global
operation ::= `util.global` custom<SymbolVisibility>($sym_visibility)\n (`mutable` $is_mutable^)?\n $sym_name\n attr-dict\n custom<TypeOrAttr>($type, $initial_value)\n
Declares a global variable that maintains its value across invocations. The value is tied to the execution context of the module and different contexts will have different variable storage.
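Two illustrative declarations following the format above (symbol names and values are placeholders):
// Mutable global with per-context storage, no initial value.
util.global private mutable @counter : i32
// Immutable global with an initial value attribute.
util.global private @pi = 3.14159 : f32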
Interfaces: Symbol, Util_GlobalOpInterface
Util_GlobalOpInterface
inlining_policy
util.global.store.indirect
operation ::= `util.global.store.indirect` $value `,` $global attr-dict `:` type($value) `->` qualified(type($global))\n
Stores a copy of the value into a global variable.
Interfaces: Util_GlobalStoreIndirectOpInterface
Util_GlobalStoreIndirectOpInterface
util.global.store
operation ::= `util.global.store` $value `,` $global attr-dict `:` type($value)\n
Interfaces: SymbolUserOpInterface, Util_GlobalStoreOpInterface
Util_GlobalStoreOpInterface
Ops for !util.list<T> (mostly just a placeholder for now).
util.list.create
operation ::= `util.list.create` ($initial_capacity^)? attr-dict `:` qualified(type($result))\n
util.list.get
operation ::= `util.list.get` $list `[` $index `]` attr-dict `:` custom<ListTypeGet>(type($list), type($result))\n
util.list.resize
operation ::= `util.list.resize` operands attr-dict `:` qualified(type($list))\n
util.list.set
operation ::= `util.list.set` $list `[` $index `]` `,` $value attr-dict `:` custom<ListTypeSet>(type($list), type($value))\n
util.list.size
operation ::= `util.list.size` operands attr-dict `:` qualified(type($list))\n
util.range.extents
Returns the min/max of a union of a set of ranges
operation ::= `util.range.extents` custom<RangeList>($offsets, $lengths) attr-dict `:` type($min)\n
Computes min(offsets) and max(offsets + lengths). Though it's possible to express this with standard arithmetic, this op enables more semantically meaningful folding/optimizations.
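A hedged sketch assuming two (offset, length) ranges of index type (all SSA names are illustrative):
// Illustrative sketch: %min = min(%o0, %o1), %max = max(%o0 + %l0, %o1 + %l1).
%min, %max = util.range.extents [%o0 for %l0], [%o1 for %l1] : index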
Traits: AlwaysSpeculatableImplTrait, SameOperandsAndResultType, SameVariadicOperandSize
min
util.range.max
Returns the max of all values
operation ::= `util.range.max` $operands attr-dict `:` type($result)\n
Computes the max of a variadic list of operands. Though it's possible to express this with standard arithmetic, this op enables more semantically meaningful folding/optimizations.
util.range.min
Returns the min of all values
operation ::= `util.range.min` $operands attr-dict `:` type($result)\n
Computes the min of a variadic list of operands. Though it's possible to express this with standard arithmetic, this op enables more semantically meaningful folding/optimizations.
util.status.check_ok
Raises a global failure if a status is not 'ok'
operation ::= `util.status.check_ok` $status (`,` $message^)? attr-dict\n
When the status is not 'ok', this signals a runtime failure that causes the entire active invocation - and possibly all in-flight and pending invocations - to fail with the given status. The status will be propagated back via the available runtime error handling mechanisms such as semaphores or synchronous invocation results.
As the IREE execution model is deeply pipelined, it's possible that failures have a latency between when they are emitted and when the application can observe the failure. It's also possible that other work that is in-flight or pending when the failure occurs will complete.
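A sketch of typical usage (the %status value and message text are illustrative):
// Illustrative sketch: abort the invocation if %status is not 'ok'.
util.status.check_ok %status, "device query failed"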
util.call
Function call operation
operation ::= `util.call` $callee `(` $operands `)`\n attr-dict `:`\n custom<OperandTypeList>(type($operands))\n `->`\n custom<TiedFunctionResultList>(ref($operands),\n ref(type($operands)),\n type($results),\n $tied_operands)\n
Represents a direct call to a function that is within the same symbol scope as the call. The operands and result types of the call must match the specified function type.
Calls support tied operands which indicate that specific results alias a specific operand. The operand and result types are allowed to differ if a cast is performed within the callee.
util.func @fn(%arg0: i32, %arg1: tensor<f32>) -> (f32, %arg1 as tensor<i32>)
...
%0 = util.call @fn(%0, %1) : (i32, tensor<f32>) -> (f32, %1 as tensor<i32>)
Interfaces: CallOpInterface, SymbolUserOpInterface, Util_TiedOpInterface
Util_TiedOpInterface
util.func
Function operation containing a CFG region
An operation declaring a callable function.
An external function declaration (used when referring to a function declared in some other module) has no body.
Traits: AffineScope, AutomaticAllocationScope, IsolatedFromAbove
AffineScope
AutomaticAllocationScope
Interfaces: CallableOpInterface, FunctionOpInterface, OpAsmOpInterface, Symbol
util.initializer
Global initialization function
A function that is called in definition order upon module initialization. Must not load any globals that are defined or initialized after it in the module.
Interfaces: CallableOpInterface, FunctionOpInterface, Symbol, Util_InitializerOpInterface
Util_InitializerOpInterface
util.return
Return from a util.initializer
operation ::= `util.return` attr-dict\n ($operands^ `:` type($operands))?\n
Returns control from an initializer function.
Traits: AlwaysSpeculatableImplTrait, HasParent<IREE::Util::InitializerOp, IREE::Util::FuncOp>, ReturnLike, Terminator
HasParent<IREE::Util::InitializerOp, IREE::Util::FuncOp>
util.cast
Casts one util type to another, similar to static_cast/dynamic_cast
operation ::= `util.cast` $operand attr-dict `:` type($operand) `to` type($result)\n
Performs a type cast between object types known to the util dialect.
Interfaces: CastOpInterface, ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface), TiedOpInterface
CastOpInterface
util.cmp.eq
Compares two values for equality
operation ::= `util.cmp.eq` operands attr-dict `:` type($lhs)\n
Compares two operands for equality. This is intended for comparing IREE reference types (like !util.buffer) that cannot be used with std.cmpi.
Traits: AlwaysSpeculatableImplTrait, Commutative
Commutative
util.null
Returns a null type value
operation ::= `util.null` attr-dict `:` type($result)\n
Defines an SSA value that is lowered into dialects supporting null/undefined/optional/etc values.
util.switch
Primitive switch operation
operation ::= `util.switch` type($default_value) `from`\n custom<TypedValueList>(ref(type($default_value)), $values, type($values))\n `at` $index\n `else` $default_value\n attr-dict\n `:` type($result)\n
Returns the value with the given index in values or default_value if the index is out of bounds.
// Switch %index to cases of %c100/%c200/%c300 if index==0, ==1, ==2.
// If %index is out of range (<0 or >2) then default to %c5.
%0 = util.switch %index[%c100, %c200, %c300] else %c5 : i32
an attribute containing a filled byte pattern
#util.byte_pattern<\n ::mlir::Type, # type\n int64_t # pattern\n>\n
A dense serializable attribute with the given byte pattern.
defines a range of bytes
Specifies a starting offset and total length in bytes.
an attribute composed of a sequence of attributes
Models a concatenated set of serializable attributes that, when combined, form a single sequence of i8 elements. As each value references the uniqued storage of the composite element, this attribute is cheap to construct. When the full flattened range is required it can be efficiently streamed via the SerializableAttrInterface. All values must also be serializable.
All values are tightly packed to byte boundaries. If padding is required it can be inserted as splat elements attributes with the padding value (usually 0). Sub-byte aligned element types will have their individual components padded to byte alignment.
forces inlining on the associated function when possible
Syntax: #util.inline.always
#util.inline.always
Skips any cost-model decisions as to whether a function should be inlined into call-sites and allows the inlining to happen. Any policies that prevent inlining will still be observed and inlining may fail if any are not satisfied.
disables inlining on the associated function
Syntax: #util.inline.never
#util.inline.never
Disables inlining of the function the attribute is associated with into any call-site.
an attribute specifying uninitialized storage
#util.uninitialized<\n ::mlir::Type # type\n>\n
The contents of the storage backing this attribute may be uninitialized at runtime. This is a hint to implementations that, if policy allows, memory allocated for the storage of this attribute type is allowed to have undefined contents upon return.
Syntax: !util.buffer
!util.buffer
dense list container type
!util.list<\n Type # element_type\n>\n
Typed container supporting variant storage.
Type
a placeholder for an unspecified object type
Syntax: !util.object
!util.object
Describes a runtime object type. These may be reference counted or garbage collected at runtime.
a pointer-like reference
!util.ptr<\n Type # target_type\n>\n
A typed indirect reference to a value. These define a runtime addressable value that is strongly referenced.
a placeholder for a variant type (?)
Syntax: !util.variant
!util.variant
Describes a runtime variant type. These may be primitives (i32, f32, etc) or object types.
A dialect representing operations against an abstract virtual machine.
The virtual machine ops are designed to be either serialized to a bytecode representation that can be interpreted at runtime or lowered further to static representations such as LLVM IR, C, etc. The idea is that the types and operations performed are generally just encoding resource ownership rules and control flow that can be represented in many different ways by target runtimes. For example, it should be possible to lower the VM dialect to SPIR-V and run the VM entirely within a persistent Vulkan kernel.
With this scalable runtime approach, we make some limiting assumptions to keep the required implementations simple. As we assume all real math happens within dispatch regions, the only math we provide is scalar operations used for offset and shape calculations. This also enables simple flow control such as fixed-range loops.
Besides integer values, the only other storage type is a variant reference modeling an abstract iree_vm_ref_t. This allows automated reference counting to be relied upon by other dialects built on top of the VM dialect and avoids the need for more verbose manual reference counting logic (which may be difficult or impossible to manage given the coroutine-like nature of the VM). Lowering targets can insert the reference counting as needed.
The types in the VM dialect correspond to the storage rather than value type, with the interpretation of the type encoded on the op.
Unconditional fiber yield operation
operation ::= `vm.yield` $dest (`(` $destOperands^ `:` type($destOperands) `)`)? attr-dict\n
Yields the fiber for some (likely short) amount of time. This can be used to perform cooperative scheduling and ensure fair (enough) execution. Execution resumes at the specified target branch.
^bb0:
  vm.yield ^on_resume
^on_resume:
  ...
Traits: HasParent<IREE::VM::FuncOp>, Terminator, Util_YieldPoint
HasParent<IREE::VM::FuncOp>
Interfaces: BranchOpInterface, VMSerializableOp, VM_OpInterface
BranchOpInterface
VMSerializableOp
VM_OpInterface
destOperands
dest
vm.shl.i32
Integer shift left operation
operation ::= `vm.shl.i32` $operand `,` $amount attr-dict `:` type($operand)\n
Shifts the operand in a direction by the number of bits specified.
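A sketch of the printed form (operand names are illustrative):
// Illustrative sketch: %result = %value << %amount.
%result = vm.shl.i32 %value, %amount : i32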
Interfaces: NoMemoryEffect (MemoryEffectOpInterface), VMSerializableOp, VM_OpInterface
amount
vm.shl.i64
operation ::= `vm.shl.i64` $operand `,` $amount attr-dict `:` type($operand)\n
vm.shr.i32.s
Signed integer (arithmetic) shift right operation
operation ::= `vm.shr.i32.s` $operand `,` $amount attr-dict `:` type($operand)\n
vm.shr.i32.u
Unsigned integer (logical) shift right operation
operation ::= `vm.shr.i32.u` $operand `,` $amount attr-dict `:` type($operand)\n
vm.shr.i64.s
operation ::= `vm.shr.i64.s` $operand `,` $amount attr-dict `:` type($operand)\n
vm.shr.i64.u
operation ::= `vm.shr.i64.u` $operand `,` $amount attr-dict `:` type($operand)\n
vm.buffer.alloc
Allocates a new zero-initialized buffer
operation ::= `vm.buffer.alloc` operands attr-dict `:` type($result)\n
Allocates a new zero-initialized buffer with the given size in bytes.
Interfaces: ConditionallySpeculatable, MemoryEffectOpInterface (MemoryEffectOpInterface), VMSerializableOp, VM_OpInterface
vm.buffer.clone
Clones a buffer
operation ::= `vm.buffer.clone` operands attr-dict `:` type($source_buffer) `->` type($result)\n
Clones a range of the source buffer to produce a mutable buffer with the same contents.
Effects: MemoryEffects::Effect{MemoryEffects::Allocate on ::mlir::SideEffects::DefaultResource, MemoryEffects::Read on ::mlir::SideEffects::DefaultResource}
vm.buffer.compare
Compares a range of a buffer to another
operation ::= `vm.buffer.compare` operands attr-dict `:` type($lhs_buffer) `,` type($rhs_buffer)\n
Returns 1 if the two ranges are bitwise equivalent, somewhat like memcmp.
Interfaces: MemoryEffectOpInterface (MemoryEffectOpInterface), VMSerializableOp, VM_OpInterface
lhs_buffer
rhs_buffer
vm.buffer.copy
Copies a range of a buffer to another
operation ::= `vm.buffer.copy` operands attr-dict `:` type($source_buffer) `->` type($target_buffer)\n
Copies a range of one buffer to another, like memcpy.
vm.buffer.fill.f32
Fills the buffer with the given repeating 32-bit value
operation ::= `vm.buffer.fill.f32` $target_buffer `,` $target_offset `,` $length `,` $value\n attr-dict `:` type($value) `->` type($target_buffer)\n
Fills an element range of the buffer with the given value, like memset.
Traits: VM_ExtF32
VM_ExtF32
vm.buffer.fill.f64
Fills the buffer with the given repeating 64-bit value
operation ::= `vm.buffer.fill.f64` $target_buffer `,` $target_offset `,` $length `,` $value\n attr-dict `:` type($value) `->` type($target_buffer)\n
Traits: VM_ExtF64
VM_ExtF64
vm.buffer.fill.i16
Fills the buffer with the given repeating 16-bit value
operation ::= `vm.buffer.fill.i16` $target_buffer `,` $target_offset `,` $length `,` $value\n attr-dict `:` type($value) `->` type($target_buffer)\n
vm.buffer.fill.i32
operation ::= `vm.buffer.fill.i32` $target_buffer `,` $target_offset `,` $length `,` $value\n attr-dict `:` type($value) `->` type($target_buffer)\n
vm.buffer.fill.i64
operation ::= `vm.buffer.fill.i64` $target_buffer `,` $target_offset `,` $length `,` $value\n attr-dict `:` type($value) `->` type($target_buffer)\n
vm.buffer.fill.i8
Fills the buffer with the given repeating 8-bit value
operation ::= `vm.buffer.fill.i8` $target_buffer `,` $target_offset `,` $length `,` $value\n attr-dict `:` type($value) `->` type($target_buffer)\n
vm.buffer.hash
operation ::= `vm.buffer.hash` $source_buffer `,` $source_offset `,` $length\n attr-dict `:` type($source_buffer) `->` type($result)\n
Computes the SipHash-2-4 of the source buffer at the given offset for |length| bytes using seed 0x0001020304...0e0f.
vm.buffer.length
Returns the byte length of a buffer
operation ::= `vm.buffer.length` operands attr-dict `:` type($buffer) `->` type($result)\n
Returns the total byte length of the given buffer. This is the exact value as specified during buffer allocation, though the underlying system buffer may have additional padding.
Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface), VMSerializableOp, VM_OpInterface
vm.buffer.load.f32
32-bit floating-point load
operation ::= `vm.buffer.load.f32` $source_buffer `[` $source_offset `]`\n attr-dict `:` type($source_buffer) `->` type($result)\n
Loads a value from the buffer at the given element offset.
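A sketch of the printed form (buffer and offset names are illustrative):
// Illustrative sketch: read one f32 element from %src_buffer at %offset.
%element = vm.buffer.load.f32 %src_buffer[%offset] : !vm.buffer -> f32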
vm.buffer.load.f64
64-bit floating-point load
operation ::= `vm.buffer.load.f64` $source_buffer `[` $source_offset `]`\n attr-dict `:` type($source_buffer) `->` type($result)\n
vm.buffer.load.i16.s
Signed 16-bit integer load
operation ::= `vm.buffer.load.i16.s` $source_buffer `[` $source_offset `]`\n attr-dict `:` type($source_buffer) `->` type($result)\n
vm.buffer.load.i16.u
Unsigned 16-bit integer load
operation ::= `vm.buffer.load.i16.u` $source_buffer `[` $source_offset `]`\n attr-dict `:` type($source_buffer) `->` type($result)\n
vm.buffer.load.i32
32-bit integer load
operation ::= `vm.buffer.load.i32` $source_buffer `[` $source_offset `]`\n attr-dict `:` type($source_buffer) `->` type($result)\n
vm.buffer.load.i64
64-bit integer load
operation ::= `vm.buffer.load.i64` $source_buffer `[` $source_offset `]`\n attr-dict `:` type($source_buffer) `->` type($result)\n
vm.buffer.load.i8.s
Signed 8-bit integer load
operation ::= `vm.buffer.load.i8.s` $source_buffer `[` $source_offset `]`\n attr-dict `:` type($source_buffer) `->` type($result)\n
vm.buffer.load.i8.u
Unsigned 8-bit integer load
operation ::= `vm.buffer.load.i8.u` $source_buffer `[` $source_offset `]`\n attr-dict `:` type($source_buffer) `->` type($result)\n
vm.buffer.store.f32
32-bit floating-point store
operation ::= `vm.buffer.store.f32` $value `,` $target_buffer `[` $target_offset `]`\n attr-dict `:` type($value) `->` type($target_buffer)\n
Stores a value to the buffer at the given element offset.
vm.buffer.store.f64
64-bit floating-point store
operation ::= `vm.buffer.store.f64` $value `,` $target_buffer `[` $target_offset `]`\n attr-dict `:` type($value) `->` type($target_buffer)\n
vm.buffer.store.i16
Unsigned 16-bit integer store
operation ::= `vm.buffer.store.i16` $value `,` $target_buffer `[` $target_offset `]`\n attr-dict `:` type($value) `->` type($target_buffer)\n
vm.buffer.store.i32
32-bit integer store
operation ::= `vm.buffer.store.i32` $value `,` $target_buffer `[` $target_offset `]`\n attr-dict `:` type($value) `->` type($target_buffer)\n
vm.buffer.store.i64
64-bit integer store
operation ::= `vm.buffer.store.i64` $value `,` $target_buffer `[` $target_offset `]`\n attr-dict `:` type($value) `->` type($target_buffer)\n
vm.buffer.store.i8
Unsigned 8-bit integer store
operation ::= `vm.buffer.store.i8` $value `,` $target_buffer `[` $target_offset `]`\n attr-dict `:` type($value) `->` type($target_buffer)\n
Casting and type conversion/emulation.
vm.bitcast.f32.i32
Bitcast from a 32-bit floating-point value to a 32-bit integer
operation ::= `vm.bitcast.f32.i32` $operand attr-dict `:` type($operand) `->` type($result)\n
Traits: AlwaysSpeculatableImplTrait, VM_ExtF32
vm.bitcast.f64.i64
Bitcast from a 64-bit floating-point value to a 64-bit integer
operation ::= `vm.bitcast.f64.i64` $operand attr-dict `:` type($operand) `->` type($result)\n
Traits: AlwaysSpeculatableImplTrait, VM_ExtF64
vm.bitcast.i32.f32
Bitcast from a 32-bit integer to a 32-bit floating-point value
operation ::= `vm.bitcast.i32.f32` $operand attr-dict `:` type($operand) `->` type($result)\n
vm.bitcast.i64.f64
Bitcast from a 64-bit integer to a 64-bit floating-point value
operation ::= `vm.bitcast.i64.f64` $operand attr-dict `:` type($operand) `->` type($result)\n
vm.cast.any.ref
Casts from any ref to a specific ref type
operation ::= `vm.cast.any.ref` $operand attr-dict `:` type($operand) `->` type($result)\n
Performs a runtime cast of an opaque !vm.ref<?> to a specific !vm.ref<T> and raises an error if the operand does not match the expected type. Null refs can always be cast between types.
!vm.ref<?>
!vm.ref<T>
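A hedged sketch; !hal.buffer here stands in for whatever concrete ref object type is expected (names are illustrative):
// Illustrative sketch: raises an error at runtime if %any does not hold a !hal.buffer.
%typed = vm.cast.any.ref %any : !vm.ref<?> -> !vm.ref<!hal.buffer>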
vm.cast.f32.si32
Cast from a floating-point value to a signed integer
operation ::= `vm.cast.f32.si32` $operand attr-dict `:` type($operand) `->` type($result)\n
vm.cast.f32.ui32
Cast from a floating-point value to an unsigned integer
operation ::= `vm.cast.f32.ui32` $operand attr-dict `:` type($operand) `->` type($result)\n
vm.cast.ref.any
Casts from a specific ref to any ref type
operation ::= `vm.cast.ref.any` $operand attr-dict `:` type($operand) `->` type($result)\n
Performs a compile-time widening cast of a specific !vm.ref<T> to an opaque !vm.ref<?>.
Traits: AlwaysSpeculatableImplTrait, VM_AssignmentOp
VM_AssignmentOp
Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface), VM_OpInterface
vm.cast.si32.f32
Cast from a signed integer to a floating-point value
operation ::= `vm.cast.si32.f32` $operand attr-dict `:` type($operand) `->` type($result)\n
vm.cast.ui32.f32
Cast from an unsigned integer to a floating-point value
operation ::= `vm.cast.ui32.f32` $operand attr-dict `:` type($operand) `->` type($result)\n
vm.ext.f32.f64
Floating-point zero extend 32 bits to 64 bits
operation ::= `vm.ext.f32.f64` $operand attr-dict `:` type($operand) `->` type($result)\n
vm.ext.i16.i32.s
Integer sign extend 16 bits to 32 bits
operation ::= `vm.ext.i16.i32.s` $operand attr-dict `:` type($operand) `->` type($result)\n
vm.ext.i16.i32.u
Integer zero extend 16 bits to 32 bits
operation ::= `vm.ext.i16.i32.u` $operand attr-dict `:` type($operand) `->` type($result)\n
vm.ext.i16.i64.s
Integer sign extend 16 bits to 64 bits
operation ::= `vm.ext.i16.i64.s` $operand attr-dict `:` type($operand) `->` type($result)\n
vm.ext.i16.i64.u
Integer zero extend 16 bits to 64 bits
operation ::= `vm.ext.i16.i64.u` $operand attr-dict `:` type($operand) `->` type($result)\n
vm.ext.i32.i64.s
Integer sign extend 32 bits to 64 bits
operation ::= `vm.ext.i32.i64.s` $operand attr-dict `:` type($operand) `->` type($result)\n
vm.ext.i32.i64.u
Integer zero extend 32 bits to 64 bits
operation ::= `vm.ext.i32.i64.u` $operand attr-dict `:` type($operand) `->` type($result)\n
vm.ext.i8.i32.s
Integer sign extend 8 bits to 32 bits
operation ::= `vm.ext.i8.i32.s` $operand attr-dict `:` type($operand) `->` type($result)\n
vm.ext.i8.i32.u
Integer zero extend 8 bits to 32 bits
operation ::= `vm.ext.i8.i32.u` $operand attr-dict `:` type($operand) `->` type($result)\n
vm.ext.i8.i64.s
Integer sign extend 8 bits to 64 bits
operation ::= `vm.ext.i8.i64.s` $operand attr-dict `:` type($operand) `->` type($result)\n
vm.ext.i8.i64.u
Integer zero extend 8 bits to 64 bits
operation ::= `vm.ext.i8.i64.u` $operand attr-dict `:` type($operand) `->` type($result)\n
vm.trunc.f64.f32
Floating-point truncate to 32 bits
operation ::= `vm.trunc.f64.f32` $operand attr-dict `:` type($operand) `->` type($result)\n
vm.trunc.i16.i8
Integer truncate to 8 bits
operation ::= `vm.trunc.i16.i8` $operand attr-dict `:` type($operand) `->` type($result)\n
vm.trunc.i32.i16
Integer truncate to 16 bits
operation ::= `vm.trunc.i32.i16` $operand attr-dict `:` type($operand) `->` type($result)\n
vm.trunc.i32.i8
operation ::= `vm.trunc.i32.i8` $operand attr-dict `:` type($operand) `->` type($result)\n
vm.trunc.i64.i16
operation ::= `vm.trunc.i64.i16` $operand attr-dict `:` type($operand) `->` type($result)\n
vm.trunc.i64.i32
Integer truncate to 32 bits
operation ::= `vm.trunc.i64.i32` $operand attr-dict `:` type($operand) `->` type($result)\n
vm.trunc.i64.i8
operation ::= `vm.trunc.i64.i8` $operand attr-dict `:` type($operand) `->` type($result)\n
vm.cmp.eq.i32
Integer equality comparison operation
operation ::= `vm.cmp.eq.i32` operands attr-dict `:` type($lhs)\n
Compares two operands with the specified predicate.
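A sketch (operand names are illustrative; the result is an i32 truth value):
%eq = vm.cmp.eq.i32 %lhs, %rhs : i32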
Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface), OpAsmOpInterface, VMSerializableOp, VM_OpInterface
vm.cmp.eq.i64
operation ::= `vm.cmp.eq.i64` operands attr-dict `:` type($lhs)\n
vm.cmp.gte.i32.s
Signed integer greater-than-or-equal comparison operation
operation ::= `vm.cmp.gte.i32.s` operands attr-dict `:` type($lhs)\n
Traits: AlwaysSpeculatableImplTrait, VM_PseudoOp
VM_PseudoOp
Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface), OpAsmOpInterface, VM_OpInterface
vm.cmp.gte.i32.u
Unsigned integer greater-than-or-equal comparison operation
operation ::= `vm.cmp.gte.i32.u` operands attr-dict `:` type($lhs)\n
vm.cmp.gte.i64.s
operation ::= `vm.cmp.gte.i64.s` operands attr-dict `:` type($lhs)\n
vm.cmp.gte.i64.u
operation ::= `vm.cmp.gte.i64.u` operands attr-dict `:` type($lhs)\n
vm.cmp.gt.i32.s
Signed integer greater-than comparison operation
operation ::= `vm.cmp.gt.i32.s` operands attr-dict `:` type($lhs)\n
vm.cmp.gt.i32.u
Unsigned integer greater-than comparison operation
operation ::= `vm.cmp.gt.i32.u` operands attr-dict `:` type($lhs)\n
vm.cmp.gt.i64.s
operation ::= `vm.cmp.gt.i64.s` operands attr-dict `:` type($lhs)\n
vm.cmp.gt.i64.u
operation ::= `vm.cmp.gt.i64.u` operands attr-dict `:` type($lhs)\n
vm.cmp.lte.i32.s
Signed integer less-than-or-equal comparison operation
operation ::= `vm.cmp.lte.i32.s` operands attr-dict `:` type($lhs)\n
vm.cmp.lte.i32.u
Unsigned integer less-than-or-equal comparison operation
operation ::= `vm.cmp.lte.i32.u` operands attr-dict `:` type($lhs)\n
vm.cmp.lte.i64.s
operation ::= `vm.cmp.lte.i64.s` operands attr-dict `:` type($lhs)\n
vm.cmp.lte.i64.u
operation ::= `vm.cmp.lte.i64.u` operands attr-dict `:` type($lhs)\n
vm.cmp.lt.i32.s
Signed integer less-than comparison operation
operation ::= `vm.cmp.lt.i32.s` operands attr-dict `:` type($lhs)\n
vm.cmp.lt.i32.u
Unsigned integer less-than comparison operation
operation ::= `vm.cmp.lt.i32.u` operands attr-dict `:` type($lhs)\n
vm.cmp.lt.i64.s
operation ::= `vm.cmp.lt.i64.s` operands attr-dict `:` type($lhs)\n
vm.cmp.lt.i64.u
operation ::= `vm.cmp.lt.i64.u` operands attr-dict `:` type($lhs)\n
vm.cmp.ne.i32
Integer inequality comparison operation
operation ::= `vm.cmp.ne.i32` operands attr-dict `:` type($lhs)\n
vm.cmp.ne.i64
operation ::= `vm.cmp.ne.i64` operands attr-dict `:` type($lhs)\n
vm.cmp.nz.i32
Integer non-zero comparison operation
operation ::= `vm.cmp.nz.i32` $operand attr-dict `:` type($operand)\n
Compares the given integer operand for a non-zero value.
vm.cmp.nz.i64
operation ::= `vm.cmp.nz.i64` $operand attr-dict `:` type($operand)\n
vm.select.f32
Floating-point select operation
operation ::= `vm.select.f32` operands attr-dict `:` type($result)\n
Chooses one value based on a binary condition supplied as its first operand. If the value of the condition is true, the true_value operand is chosen; otherwise the false_value operand is chosen. The true and false values must have the same types. For example, the maximum operation is obtained by combining "select" with "cmpi" as follows:
true_value
false_value
%2 = vm.cmp.gt.i32.s %0, %1 : i32
%3 = vm.select.i32 %2, %0, %1 : i32
condition
vm.select.f64
operation ::= `vm.select.f64` operands attr-dict `:` type($result)\n
vm.select.i32
Integer select operation
operation ::= `vm.select.i32` operands attr-dict `:` type($result)\n
vm.select.i64
operation ::= `vm.select.i64` operands attr-dict `:` type($result)\n
vm.select.ref
Ref select operation
operation ::= `vm.select.ref` operands attr-dict `:` type($result)\n
Chooses one value based on a binary condition supplied as its first operand. If the value of the condition is true, the true_value operand is chosen; otherwise the false_value operand is chosen.
vm.switch.f32
Floating-point switch operation
operation ::= `vm.switch.f32` $index `[` $values `]` `else` $default_value attr-dict `:` type($result)\n
// Switch %index to cases of %c100/%c200/%c300 if index==0, ==1, ==2.
// If %index is out of range (<0 or >2) then default to %c5.
%0 = vm.switch.f32 %index[%c100, %c200, %c300] else %c5 : f32
vm.switch.f64
operation ::= `vm.switch.f64` $index `[` $values `]` `else` $default_value attr-dict `:` type($result)\n
vm.switch.i32
Integer switch operation
operation ::= `vm.switch.i32` $index `[` $values `]` `else` $default_value attr-dict `:` type($result)\n
// Switch %index to cases of %c100/%c200/%c300 if index==0, ==1, ==2.
// If %index is out of range (<0 or >2) then default to %c5.
%0 = vm.switch.i32 %index[%c100, %c200, %c300] else %c5 : i32
vm.switch.i64
operation ::= `vm.switch.i64` $index `[` $values `]` `else` $default_value attr-dict `:` type($result)\n
vm.switch.ref
Ref switch operation
// Switch %index to cases of %r0/%r1/%r2 if index==0, ==1, ==2.
// If %index is out of range (<0 or >2) then default to %null.
%0 = vm.switch.ref %index[%r0, %r1, %r2] else %null : vm.ref<!foo>
vm.const.f32
32-bit floating-point constant operation
operation ::= `vm.const.f32` $value attr-dict\n
Defines a constant value that is treated as a scalar literal at runtime.
Traits: AlwaysSpeculatableImplTrait, ConstantLike, VM_ExtF32
vm.const.f32.zero
32-bit floating-point constant zero operation
operation ::= `vm.const.f32.zero` attr-dict\n
Defines a constant zero primitive.
vm.const.f64
64-bit floating-point constant operation
operation ::= `vm.const.f64` $value attr-dict\n
Traits: AlwaysSpeculatableImplTrait, ConstantLike, VM_ExtF64
vm.const.f64.zero
64-bit floating-point constant zero operation
operation ::= `vm.const.f64.zero` attr-dict\n
vm.const.i32
32-bit integer constant operation
operation ::= `vm.const.i32` $value attr-dict\n
vm.const.i32.zero
32-bit integer constant zero operation
operation ::= `vm.const.i32.zero` attr-dict\n
vm.const.i64
64-bit integer constant operation
operation ::= `vm.const.i64` $value attr-dict\n
vm.const.i64.zero
64-bit integer constant zero operation
operation ::= `vm.const.i64.zero` attr-dict\n
vm.const.ref.rodata
Constant rodata access operation
operation ::= `vm.const.ref.rodata` $rodata attr-dict `:` type($value)\n
Returns a reference to a read-only buffer.
rodata
vm.const.ref.zero
Null ref constant operation
operation ::= `vm.const.ref.zero` `:` type($result) attr-dict\n
Defines a constant null ref that can be used in comparisons and initialization.
vm.rodata.inline
Inlined constant rodata
operation ::= `vm.rodata.inline` ($name^)? attr-dict `:` type($result) `=` $value\n
vm.rodata that can be embedded inline in functions. See vm.rodata for more information.
vm.rodata
Read-only data definition operation
operation ::= `vm.rodata` custom<SymbolVisibility>($sym_visibility) $sym_name attr-dict $value\n
Defines a blob of read-only constant data that can be represented as a ref. This can be used to store arbitrary data within modules such as large constant buffers and other file contents.
Note that the data is reference counted as a way to track its usage once the value leaves the module. For example, returning rodata from an exported function must keep the data (possibly backed by mmap) valid for its entire lifetime.
By default all rodata will be aligned in the final module output at a 16-byte granularity. An optional alignment can be specified to override the default for cases where larger or smaller alignments are needed.
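As an illustrative sketch (the @lut symbol, function name, and the !vm.buffer ref type here are assumptions), a rodata blob can be declared at module scope and handed out as a ref via vm.const.ref.rodata:
vm.module @example {
  // Read-only constant blob; aligned to 16 bytes in the final module output by default.
  vm.rodata private @lut dense<[1, 2, 3, 4]> : tensor<4xi32>
  vm.func @get_lut() -> !vm.buffer {
    // The returned ref keeps the backing data valid for its lifetime.
    %0 = vm.const.ref.rodata @lut : !vm.buffer
    vm.return %0 : !vm.buffer
  }
}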
Traits: HasParent<IREE::VM::ModuleOp>, IsolatedFromAbove
HasParent<IREE::VM::ModuleOp>
Interfaces: Symbol, VM_OpInterface
vm.rodata.table.inline
Inlined constant rodata table
operation ::= `vm.rodata.table.inline` $table_type attr-dict `:` type($table_result) `,` type($data_result) `=` $data_array\n
Defines inline vm.rodata together with an associated vm.rodata table recording (offset, size) pairs that describe subviews into the flattened data. The table is a flat array of 32- or 64-bit integers storing (offset, size) in element order.
The optional alignment attribute applies to both the table and data rodata. The data_alignment attribute can be used to specify an alignment for the elements of the table, padding to the data alignment with zeros. The element sizes reflect the unpadded attribute storage sizes.
See vm.rodata for more information.
table_name
data_name
table_type
data_array
data_alignment
table_result
data_result
vm.br
Unconditional branch operation
operation ::= `vm.br` $dest (`(` $destOperands^ `:` type($destOperands) `)`)? attr-dict\n
Represents an unconditional branch operation that branches to a target block with the given set of arguments.
^bb0(...):
  vm.br ^bb1(%a)
^bb1(%blockArg1):
  ...
vm.br_table
Branch table operation
operation ::= `vm.br_table` $index ` ` `{` `\n`
              custom<BranchTableCases>(
                  $defaultDestination, $defaultOperands, type($defaultOperands),
                  $caseDestinations, $caseOperands, type($caseOperands))
              `}`
              attr-dict
Represents a branch table instructing execution to branch to the block with the specified index. If the index is out of bounds then execution will branch to the default block.
vm.br_table %index { default: ^bb1(%a : i64), 0: ^bb2, 1: ^bb3(%c : i64) }
Traits: AlwaysSpeculatableImplTrait, AttrSizedOperandSegments, Terminator
Interfaces: BranchOpInterface, ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface), VMSerializableOp, VM_OpInterface
case_operand_segments
defaultOperands
caseOperands
defaultDestination
caseDestinations
Call operation
operation ::= `vm.call` $callee `(` operands `)` attr-dict `:` functional-type(operands, results)\n
Calls an internal VM function with the given arguments.
Interfaces: CallOpInterface, MemoryEffectOpInterface, VMSerializableOp, VM_OpInterface
vm.call.variadic
Call operation with variadic arguments
Calls an internal VM function with the given arguments. One or more of the arguments may be variadic, encoded as segmented sized operand lists.
Variadic arguments must be specified with a total count in the segment_sizes attribute.
segment_sizes
segment_types
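A minimal sketch of a variadic call, assuming a hypothetical imported function @my.module.sum; the bracketed group marks the variadic segment and the trailing ... in the type list marks the variadic type:
// %a, %b, %c form the variadic segment; %bias is a regular trailing argument.
%0 = vm.call.variadic @my.module.sum([%a, %b, %c], %bias) : (i32 ..., i32) -> i32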
vm.check.eq
Raises a global failure if the condition is true
operation ::= `vm.check.eq` $lhs `,` $rhs (`,` $message^)? attr-dict `:` type($lhs)\n
When the condition is true this signals a runtime failure that causes the entire active invocation - and possibly all in-flight and pending invocations - to fail. The status will be propagated back via the available runtime error handling mechanisms such as semaphores or synchronous invocation results.
This is implemented as a pseudo-op that transforms into a vm.cond_fail operation.
vm.check.eq %a, %b, "a == b" : i32
vm.check.nz %ref, "!null" : !vm.ref<?>
Traits: Commutative, VM_PseudoOp
Interfaces: VM_OpInterface
vm.check.ne
operation ::= `vm.check.ne` $lhs `,` $rhs (`,` $message^)? attr-dict `:` type($lhs)\n
vm.check.nz
operation ::= `vm.check.nz` $value (`,` $message^)? attr-dict `:` type($value)\n
Traits: VM_PseudoOp
vm.check.nearly_eq
operation ::= `vm.check.nearly_eq` $lhs `,` $rhs (`,` $message^)? attr-dict `:` type($lhs)\n
vm.cond_br
Conditional branch operation
operation ::= `vm.cond_br` $condition `,`
              $trueDest (`(` $trueDestOperands^ `:` type($trueDestOperands) `)`)? `,`
              $falseDest (`(` $falseDestOperands^ `:` type($falseDestOperands) `)`)?
              attr-dict
Represents a conditional branch operation that branches to one of the two target blocks with the given set of arguments.
^bb0(...):
  vm.cond_br %condition, ^bb1(%a), ^bb2(%b)
^bb1(%blockArg1):
  ...
^bb2(%blockArg2):
  ...
Traits: AttrSizedOperandSegments, Terminator
trueDestOperands
falseDestOperands
trueDest
falseDest
vm.cond_fail
When the condition is true this signals a runtime failure that causes the entire active invocation - and possibly all in-flight and pending invocations - to fail with the given status. The status will be propagated back via the available runtime error handling mechanisms such as semaphores or synchronous invocation results.
This is implemented as a pseudo-op that transforms into a vm.fail operation guarded by the condition.
%nz = vm.cmp.nz.i32 %value : i32
%statusCode = vm.const.i32 9
vm.cond_fail %nz, %statusCode, "expected non-zero"
vm.fail
Raises a global failure
operation ::= `vm.fail` $status (`,` $message^)? attr-dict\n
Signals a runtime failure that causes the entire active invocation - and possibly all in-flight and pending invocations - to fail with the given status. The status will be propagated back via the available runtime error handling mechanisms such as semaphores or synchronous invocation results.
%statusCode = vm.const.i32 9
vm.fail %statusCode, "oh no!"
Interfaces: VMSerializableOp, VM_OpInterface
vm.import.resolved
Returns true if an optional import was resolved at runtime
operation ::= `vm.import.resolved` $import attr-dict `:` type($result)\n
Allows for checking whether an optional import was resolved at runtime. If this returns false then attempting to call the imported function will result in a failure at runtime.
Interfaces: NoMemoryEffect (MemoryEffectOpInterface), OpAsmOpInterface, VMSerializableOp, VM_OpInterface
import
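A hedged sketch of guarding a call on an optional import (the @optional.module.fn symbol and the block names are hypothetical):
// Branch to a fallback path when the optional import was not resolved.
%resolved = vm.import.resolved @optional.module.fn : i32
vm.cond_br %resolved, ^call_import, ^use_fallback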
vm.return
Return operation
operation ::= `vm.return` attr-dict ($operands^ `:` type($operands))?\n
Represents a return operation within a function.
vm.func @foo(%0: i32, %1: f8) -> (i32, f8) {
  vm.return %0, %1 : i32, f8
}
Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface), RegionBranchTerminatorOpInterface, VMSerializableOp, VM_OpInterface
vm.break
Unconditional debug break operation
operation ::= `vm.break` $dest (`(` $destOperands^ `:` type($destOperands) `)`)? attr-dict\n
Breaks into the attached debugger, or requests that a debugger be attached if none is present. After resuming (or if a debugger is not attached) execution will continue at the target block.
Traits: Terminator, Util_YieldPoint, VM_DebugOnly, VM_FullBarrier
VM_DebugOnly
VM_FullBarrier
vm.cond_break
Conditional debug break operation
operation ::= `vm.cond_break` $condition `,` $dest (`(` $destOperands^ `:` type($destOperands) `)`)?\n attr-dict\n
Breaks into the attached debugger, or requests that a debugger be attached if none is present, when the provided condition is true. After resuming (or if a debugger is not attached) execution will continue at the target block.
vm.print
Message printing operation
operation ::= `vm.print` $message `(` operands `)` attr-dict `:` type(operands)\n
Prints the given string message and zero or more values.
Traits: VM_DebugOnly, VM_FullBarrier
vm.trace
operation ::= `vm.trace` $event_name `(` operands `)` attr-dict `:` type(operands)\n
Traces one or more values at the time the operation is executed. These values will be encoded into the active trace depending on the active trace verbosity setting.
event_name
vm.abs.f32
Floating point absolute-value operation
operation ::= `vm.abs.f32` $operand attr-dict `:` type($result)\n
vm.abs.f64
operation ::= `vm.abs.f64` $operand attr-dict `:` type($result)\n
vm.add.f32
Floating-point add operation
operation ::= `vm.add.f32` operands attr-dict `:` type($result)\n
Traits: AlwaysSpeculatableImplTrait, Commutative, VM_ExtF32
vm.add.f64
operation ::= `vm.add.f64` operands attr-dict `:` type($result)\n
Traits: AlwaysSpeculatableImplTrait, Commutative, VM_ExtF64
vm.ceil.f32
Floating point ceiling operation
operation ::= `vm.ceil.f32` $operand attr-dict `:` type($result)\n
vm.ceil.f64
operation ::= `vm.ceil.f64` $operand attr-dict `:` type($result)\n
vm.div.f32
Floating point division operation
operation ::= `vm.div.f32` operands attr-dict `:` type($result)\n
vm.div.f64
operation ::= `vm.div.f64` operands attr-dict `:` type($result)\n
vm.fma.f32
Floating point fused multiply-add operation (a*b+c)
operation ::= `vm.fma.f32` operands attr-dict `:` type($result)\n
b
vm.fma.f64
operation ::= `vm.fma.f64` operands attr-dict `:` type($result)\n
vm.floor.f32
Floating point floor operation
operation ::= `vm.floor.f32` $operand attr-dict `:` type($result)\n
vm.floor.f64
operation ::= `vm.floor.f64` $operand attr-dict `:` type($result)\n
vm.max.f32
Floating point maximum operation
operation ::= `vm.max.f32` operands attr-dict `:` type($result)\n
vm.max.f64
operation ::= `vm.max.f64` operands attr-dict `:` type($result)\n
vm.min.f32
Floating point minimum operation
operation ::= `vm.min.f32` operands attr-dict `:` type($result)\n
vm.min.f64
operation ::= `vm.min.f64` operands attr-dict `:` type($result)\n
vm.mul.f32
Floating point multiplication operation
operation ::= `vm.mul.f32` operands attr-dict `:` type($result)\n
vm.mul.f64
operation ::= `vm.mul.f64` operands attr-dict `:` type($result)\n
vm.neg.f32
Floating point negation operation
operation ::= `vm.neg.f32` $operand attr-dict `:` type($result)\n
vm.neg.f64
operation ::= `vm.neg.f64` $operand attr-dict `:` type($result)\n
vm.rem.f32
Floating point remainder operation
operation ::= `vm.rem.f32` operands attr-dict `:` type($result)\n
vm.rem.f64
operation ::= `vm.rem.f64` operands attr-dict `:` type($result)\n
vm.round.f32.even
Rounds the value to the nearest even integer
operation ::= `vm.round.f32.even` $operand attr-dict `:` type($result)\n
vm.round.f32
Rounds the value to the nearest integer away from zero
operation ::= `vm.round.f32` $operand attr-dict `:` type($result)\n
vm.round.f64.even
operation ::= `vm.round.f64.even` $operand attr-dict `:` type($result)\n
vm.round.f64
operation ::= `vm.round.f64` $operand attr-dict `:` type($result)\n
vm.sub.f32
Floating point subtraction operation
operation ::= `vm.sub.f32` operands attr-dict `:` type($result)\n
vm.sub.f64
operation ::= `vm.sub.f64` operands attr-dict `:` type($result)\n
vm.cmp.eq.f32.near
Near floating-point equality comparison operation
operation ::= `vm.cmp.eq.f32.near` operands attr-dict `:` type($lhs)\n
Traits: AlwaysSpeculatableImplTrait, Commutative, VM_ExtF32, VM_PseudoOp
vm.cmp.eq.f32.o
Ordered floating-point equality comparison operation
operation ::= `vm.cmp.eq.f32.o` operands attr-dict `:` type($lhs)\n
vm.cmp.eq.f32.u
Unordered floating-point equality comparison operation
operation ::= `vm.cmp.eq.f32.u` operands attr-dict `:` type($lhs)\n
vm.cmp.eq.f64.near
operation ::= `vm.cmp.eq.f64.near` operands attr-dict `:` type($lhs)\n
Traits: AlwaysSpeculatableImplTrait, Commutative, VM_ExtF64, VM_PseudoOp
vm.cmp.eq.f64.o
operation ::= `vm.cmp.eq.f64.o` operands attr-dict `:` type($lhs)\n
vm.cmp.eq.f64.u
operation ::= `vm.cmp.eq.f64.u` operands attr-dict `:` type($lhs)\n
vm.cmp.gte.f32.o
Ordered floating-point greater-than-or-equal comparison operation
operation ::= `vm.cmp.gte.f32.o` operands attr-dict `:` type($lhs)\n
Traits: AlwaysSpeculatableImplTrait, VM_ExtF32, VM_PseudoOp
vm.cmp.gte.f32.u
Unordered floating-point greater-than-or-equal comparison operation
operation ::= `vm.cmp.gte.f32.u` operands attr-dict `:` type($lhs)\n
vm.cmp.gte.f64.o
operation ::= `vm.cmp.gte.f64.o` operands attr-dict `:` type($lhs)\n
Traits: AlwaysSpeculatableImplTrait, VM_ExtF64, VM_PseudoOp
vm.cmp.gte.f64.u
operation ::= `vm.cmp.gte.f64.u` operands attr-dict `:` type($lhs)\n
vm.cmp.gt.f32.o
Ordered floating-point greater-than comparison operation
operation ::= `vm.cmp.gt.f32.o` operands attr-dict `:` type($lhs)\n
vm.cmp.gt.f32.u
Unordered floating-point greater-than comparison operation
operation ::= `vm.cmp.gt.f32.u` operands attr-dict `:` type($lhs)\n
vm.cmp.gt.f64.o
operation ::= `vm.cmp.gt.f64.o` operands attr-dict `:` type($lhs)\n
vm.cmp.gt.f64.u
operation ::= `vm.cmp.gt.f64.u` operands attr-dict `:` type($lhs)\n
vm.cmp.lte.f32.o
Ordered floating-point less-than-or-equal comparison operation
operation ::= `vm.cmp.lte.f32.o` operands attr-dict `:` type($lhs)\n
vm.cmp.lte.f32.u
Unordered floating-point less-than-or-equal comparison operation
operation ::= `vm.cmp.lte.f32.u` operands attr-dict `:` type($lhs)\n
vm.cmp.lte.f64.o
operation ::= `vm.cmp.lte.f64.o` operands attr-dict `:` type($lhs)\n
vm.cmp.lte.f64.u
operation ::= `vm.cmp.lte.f64.u` operands attr-dict `:` type($lhs)\n
vm.cmp.lt.f32.o
Ordered floating-point less-than comparison operation
operation ::= `vm.cmp.lt.f32.o` operands attr-dict `:` type($lhs)\n
vm.cmp.lt.f32.u
Unordered floating-point less-than comparison operation
operation ::= `vm.cmp.lt.f32.u` operands attr-dict `:` type($lhs)\n
vm.cmp.lt.f64.o
operation ::= `vm.cmp.lt.f64.o` operands attr-dict `:` type($lhs)\n
vm.cmp.lt.f64.u
operation ::= `vm.cmp.lt.f64.u` operands attr-dict `:` type($lhs)\n
vm.cmp.ne.f32.o
Ordered floating-point inequality comparison operation
operation ::= `vm.cmp.ne.f32.o` operands attr-dict `:` type($lhs)\n
vm.cmp.ne.f32.u
Unordered floating-point inequality comparison operation
operation ::= `vm.cmp.ne.f32.u` operands attr-dict `:` type($lhs)\n
vm.cmp.ne.f64.o
operation ::= `vm.cmp.ne.f64.o` operands attr-dict `:` type($lhs)\n
vm.cmp.ne.f64.u
operation ::= `vm.cmp.ne.f64.u` operands attr-dict `:` type($lhs)\n
vm.cmp.nz.f32.o
Ordered floating-point non-zero comparison operation
operation ::= `vm.cmp.nz.f32.o` operands attr-dict `:` type($operand)\n
Compares the given floating-point operand for a non-zero value.
vm.cmp.nz.f32.u
Unordered floating-point non-zero comparison operation
operation ::= `vm.cmp.nz.f32.u` operands attr-dict `:` type($operand)\n
vm.cmp.nz.f64.o
operation ::= `vm.cmp.nz.f64.o` operands attr-dict `:` type($operand)\n
vm.cmp.nz.f64.u
operation ::= `vm.cmp.nz.f64.u` operands attr-dict `:` type($operand)\n
vm.cmp.nan.f32
Floating-point NaN comparison operation
operation ::= `vm.cmp.nan.f32` $operand attr-dict `:` type($operand)\n
Returns 1 if the value is NaN.
vm.cmp.nan.f64
operation ::= `vm.cmp.nan.f64` $operand attr-dict `:` type($operand)\n
These map directly to the math dialect.
vm.atan2.f32
2-argument arcus tangent of the given values
operation ::= `vm.atan2.f32` operands attr-dict `:` type($result)\n
vm.atan2.f64
operation ::= `vm.atan2.f64` operands attr-dict `:` type($result)\n
vm.atan.f32
Arcus tangent of the given value
operation ::= `vm.atan.f32` $operand attr-dict `:` type($result)\n
vm.atan.f64
operation ::= `vm.atan.f64` $operand attr-dict `:` type($result)\n
vm.cos.f32
Cosine of the specified value
operation ::= `vm.cos.f32` $operand attr-dict `:` type($result)\n
vm.cos.f64
operation ::= `vm.cos.f64` $operand attr-dict `:` type($result)\n
vm.erf.f32
Computes the error function of the specified value
operation ::= `vm.erf.f32` $operand attr-dict `:` type($result)\n
vm.erf.f64
operation ::= `vm.erf.f64` $operand attr-dict `:` type($result)\n
vm.exp2.f32
Base-2 exponential of the specified value
operation ::= `vm.exp2.f32` $operand attr-dict `:` type($result)\n
vm.exp2.f64
operation ::= `vm.exp2.f64` $operand attr-dict `:` type($result)\n
vm.exp.f32
Base-e exponential of the specified value
operation ::= `vm.exp.f32` $operand attr-dict `:` type($result)\n
vm.exp.f64
operation ::= `vm.exp.f64` $operand attr-dict `:` type($result)\n
vm.expm1.f32
Base-e exponential of the specified value minus 1
operation ::= `vm.expm1.f32` $operand attr-dict `:` type($result)\n
vm.expm1.f64
operation ::= `vm.expm1.f64` $operand attr-dict `:` type($result)\n
vm.log10.f32
Base-10 logarithm of the specified value
operation ::= `vm.log10.f32` $operand attr-dict `:` type($result)\n
vm.log10.f64
operation ::= `vm.log10.f64` $operand attr-dict `:` type($result)\n
vm.log1p.f32
Natural logarithm of one plus the given value
operation ::= `vm.log1p.f32` $operand attr-dict `:` type($result)\n
vm.log1p.f64
operation ::= `vm.log1p.f64` $operand attr-dict `:` type($result)\n
vm.log2.f32
Base-2 logarithm of the specified value
operation ::= `vm.log2.f32` $operand attr-dict `:` type($result)\n
vm.log2.f64
operation ::= `vm.log2.f64` $operand attr-dict `:` type($result)\n
vm.log.f32
Base-e logarithm of the specified value
operation ::= `vm.log.f32` $operand attr-dict `:` type($result)\n
vm.log.f64
operation ::= `vm.log.f64` $operand attr-dict `:` type($result)\n
vm.pow.f32
Floating-point exponentiation operation (first operand raised to the power of the second)
operation ::= `vm.pow.f32` operands attr-dict `:` type($result)\n
vm.pow.f64
operation ::= `vm.pow.f64` operands attr-dict `:` type($result)\n
vm.rsqrt.f32
Reciprocal of sqrt (1 / sqrt of the specified value)
operation ::= `vm.rsqrt.f32` $operand attr-dict `:` type($result)\n
vm.rsqrt.f64
operation ::= `vm.rsqrt.f64` $operand attr-dict `:` type($result)\n
vm.sin.f32
Sine of the specified value
operation ::= `vm.sin.f32` $operand attr-dict `:` type($result)\n
vm.sin.f64
operation ::= `vm.sin.f64` $operand attr-dict `:` type($result)\n
vm.sqrt.f32
Sqrt of the specified value
operation ::= `vm.sqrt.f32` $operand attr-dict `:` type($result)\n
vm.sqrt.f64
operation ::= `vm.sqrt.f64` $operand attr-dict `:` type($result)\n
vm.tanh.f32
Hyperbolic tangent of the specified value
operation ::= `vm.tanh.f32` $operand attr-dict `:` type($result)\n
vm.tanh.f64
operation ::= `vm.tanh.f64` $operand attr-dict `:` type($result)\n
vm.global.address
operation ::= `vm.global.address` (`immutable` $is_immutable^)?\n $global attr-dict `:` type($result)\n
Returns an indirect address reference to the given global. During export the address will be converted to the natural format of the global table (for example, ordinals for refs and byte offsets for primitive types).
Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface), SymbolUserOpInterface, Util_GlobalAddressOpInterface, VM_OpInterface
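A minimal sketch pairing vm.global.address with an indirect load; the @counter global and the !util.ptr<i32> address type are assumptions here:
// Take the address of a global and load it indirectly.
%ptr = vm.global.address @counter : !util.ptr<i32>
%value = vm.global.load.indirect.i32 %ptr : !util.ptr<i32> -> i32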
vm.global.f32
32-bit floating-point global declaration
operation ::= `vm.global.f32` custom<SymbolVisibility>($sym_visibility)
              (`mutable` $is_mutable^)?
              $sym_name
              attr-dict
              custom<TypeOrAttr>($type, $initial_value)
Defines a global value that is treated as a scalar literal at runtime. Initialized to zero unless an initial value is specified.
Traits: HasParent<IREE::VM::ModuleOp>, IsolatedFromAbove, VM_ExtF32
Interfaces: GlobalOpInterface, Symbol, VM_OpInterface
GlobalOpInterface
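As a sketch (symbol names are hypothetical, and the initializer spelling follows the custom<TypeOrAttr> format above), a mutable f32 global with an initial value, loaded and stored from within a function:
vm.global.f32 private mutable @learning_rate = 1.000000e-02 : f32
// ...inside a vm.func, with %new_rate assumed to be defined earlier:
%0 = vm.global.load.f32 @learning_rate : f32
vm.global.store.f32 %new_rate, @learning_rate : f32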
vm.global.f64
64-bit floating-point global declaration
operation ::= `vm.global.f64` custom<SymbolVisibility>($sym_visibility)
              (`mutable` $is_mutable^)?
              $sym_name
              attr-dict
              custom<TypeOrAttr>($type, $initial_value)
Traits: HasParent<IREE::VM::ModuleOp>, IsolatedFromAbove, VM_ExtF64
vm.global.i32
32-bit integer global declaration
operation ::= `vm.global.i32` custom<SymbolVisibility>($sym_visibility)
              (`mutable` $is_mutable^)?
              $sym_name
              attr-dict
              custom<TypeOrAttr>($type, $initial_value)
vm.global.i64
64-bit integer global declaration
operation ::= `vm.global.i64` custom<SymbolVisibility>($sym_visibility)
              (`mutable` $is_mutable^)?
              $sym_name
              attr-dict
              custom<TypeOrAttr>($type, $initial_value)
vm.global.load.f32
Global 32-bit floating-point load operation
operation ::= `vm.global.load.f32` (`immutable` $is_immutable^)?\n $global attr-dict `:` type($value)\n
Loads the value of a global containing a primitive value.
Interfaces: MemoryEffectOpInterface, OpAsmOpInterface, SymbolUserOpInterface, Util_GlobalLoadOpInterface, VMSerializableOp, VM_OpInterface
vm.global.load.f64
Global 64-bit floating-point load operation
operation ::= `vm.global.load.f64` (`immutable` $is_immutable^)?\n $global attr-dict `:` type($value)\n
vm.global.load.i32
Global 32-bit integer load operation
operation ::= `vm.global.load.i32` (`immutable` $is_immutable^)?\n $global attr-dict `:` type($value)\n
vm.global.load.i64
Global 64-bit integer load operation
operation ::= `vm.global.load.i64` (`immutable` $is_immutable^)?\n $global attr-dict `:` type($value)\n
vm.global.load.indirect.f32
operation ::= `vm.global.load.indirect.f32` (`immutable` $is_immutable^)?\n $global attr-dict `:` type($global) `->` type($value)\n
Loads the value of a global containing a primitive value.
Interfaces: Util_GlobalLoadIndirectOpInterface, VMSerializableOp, VM_OpInterface
vm.global.load.indirect.f64
operation ::= `vm.global.load.indirect.f64` (`immutable` $is_immutable^)?\n $global attr-dict `:` type($global) `->` type($value)\n
vm.global.load.indirect.i32
operation ::= `vm.global.load.indirect.i32` (`immutable` $is_immutable^)?\n $global attr-dict `:` type($global) `->` type($value)\n
vm.global.load.indirect.i64
operation ::= `vm.global.load.indirect.i64` (`immutable` $is_immutable^)?\n $global attr-dict `:` type($global) `->` type($value)\n
vm.global.load.indirect.ref
Global ref load operation
operation ::= `vm.global.load.indirect.ref` (`immutable` $is_immutable^)?\n $global attr-dict `:` type($global) `->` type($value)\n
Loads the value of a global containing a ref of the given type.
vm.global.load.ref
operation ::= `vm.global.load.ref` (`immutable` $is_immutable^)?\n $global attr-dict `:` type($value)\n
vm.global.ref
Ref global declaration
operation ::= `vm.global.ref` custom<SymbolVisibility>($sym_visibility)
              (`mutable` $is_mutable^)?
              $sym_name
              attr-dict
              `:` $type
Defines a global value that is a ref of a specific type. The global will retain the ref object for the lifetime of the context or until the value is replaced with a store or reset. Initialized to null unless an initial value is specified.
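A hedged sketch of a mutable ref global (the !hal.buffer ref type and symbol names are assumptions):
vm.global.ref private mutable @cached_buffer : !vm.ref<!hal.buffer>
// ...inside a vm.func, with %buf previously defined:
vm.global.store.ref %buf, @cached_buffer : !vm.ref<!hal.buffer>
%0 = vm.global.load.ref @cached_buffer : !vm.ref<!hal.buffer>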
vm.global.store.f32
Global 32-bit floating-point store operation
operation ::= `vm.global.store.f32` $value `,` $global attr-dict `:` type($value)\n
Stores a primitive value to a global.
Interfaces: SymbolUserOpInterface, Util_GlobalStoreOpInterface, VMSerializableOp, VM_OpInterface
vm.global.store.f64
Global 64-bit floating-point store operation
operation ::= `vm.global.store.f64` $value `,` $global attr-dict `:` type($value)\n
vm.global.store.i32
Global 32-bit integer store operation
operation ::= `vm.global.store.i32` $value `,` $global attr-dict `:` type($value)\n
vm.global.store.i64
Global 64-bit integer store operation
operation ::= `vm.global.store.i64` $value `,` $global attr-dict `:` type($value)\n
vm.global.store.indirect.f32
operation ::= `vm.global.store.indirect.f32` $value `,` $global attr-dict `:` type($value) `->` type($global)\n
Stores a primitive value to a global.
Interfaces: Util_GlobalStoreIndirectOpInterface, VMSerializableOp, VM_OpInterface
vm.global.store.indirect.f64
operation ::= `vm.global.store.indirect.f64` $value `,` $global attr-dict `:` type($value) `->` type($global)\n
vm.global.store.indirect.i32
operation ::= `vm.global.store.indirect.i32` $value `,` $global attr-dict `:` type($value) `->` type($global)\n
vm.global.store.indirect.i64
operation ::= `vm.global.store.indirect.i64` $value `,` $global attr-dict `:` type($value) `->` type($global)\n
vm.global.store.indirect.ref
Global ref store operation
operation ::= `vm.global.store.indirect.ref` $value `,` $global attr-dict `:` type($value) `->` type($global)\n
Stores a ref to a global, retaining it until the global is reset.
vm.global.store.ref
operation ::= `vm.global.store.ref` $value `,` $global attr-dict `:` type($value)\n
vm.abs.i32
Integer absolute-value operation
operation ::= `vm.abs.i32` $operand attr-dict `:` type($result)\n
vm.abs.i64
operation ::= `vm.abs.i64` $operand attr-dict `:` type($result)\n
vm.add.i32
Integer add operation
operation ::= `vm.add.i32` operands attr-dict `:` type($result)\n
Traits: Commutative
vm.add.i64
operation ::= `vm.add.i64` operands attr-dict `:` type($result)\n
vm.div.i32.s
Signed integer division operation
operation ::= `vm.div.i32.s` operands attr-dict `:` type($result)\n
vm.div.i32.u
Unsigned integer division operation
operation ::= `vm.div.i32.u` operands attr-dict `:` type($result)\n
vm.div.i64.s
operation ::= `vm.div.i64.s` operands attr-dict `:` type($result)\n
vm.div.i64.u
operation ::= `vm.div.i64.u` operands attr-dict `:` type($result)\n
vm.fma.i32
Integer fused-multiply add operation (a*b+c)
operation ::= `vm.fma.i32` operands attr-dict `:` type($result)\n
vm.fma.i64
operation ::= `vm.fma.i64` operands attr-dict `:` type($result)\n
vm.max.i32.s
Signed integer maximum operation
operation ::= `vm.max.i32.s` operands attr-dict `:` type($result)\n
vm.max.i32.u
Unsigned integer maximum operation
operation ::= `vm.max.i32.u` operands attr-dict `:` type($result)\n
vm.max.i64.s
operation ::= `vm.max.i64.s` operands attr-dict `:` type($result)\n
vm.max.i64.u
operation ::= `vm.max.i64.u` operands attr-dict `:` type($result)\n
vm.min.i32.s
Signed integer minimum operation
operation ::= `vm.min.i32.s` operands attr-dict `:` type($result)\n
vm.min.i32.u
Unsigned integer minimum operation
operation ::= `vm.min.i32.u` operands attr-dict `:` type($result)\n
vm.min.i64.s
operation ::= `vm.min.i64.s` operands attr-dict `:` type($result)\n
vm.min.i64.u
operation ::= `vm.min.i64.u` operands attr-dict `:` type($result)\n
vm.mul.i32
Integer multiplication operation
operation ::= `vm.mul.i32` operands attr-dict `:` type($result)\n
vm.mul.i64
operation ::= `vm.mul.i64` operands attr-dict `:` type($result)\n
vm.rem.i32.s
Signed integer division remainder operation
operation ::= `vm.rem.i32.s` operands attr-dict `:` type($result)\n
vm.rem.i32.u
Unsigned integer division remainder operation
operation ::= `vm.rem.i32.u` operands attr-dict `:` type($result)\n
vm.rem.i64.s
operation ::= `vm.rem.i64.s` operands attr-dict `:` type($result)\n
vm.rem.i64.u
operation ::= `vm.rem.i64.u` operands attr-dict `:` type($result)\n
vm.sub.i32
Integer subtract operation
operation ::= `vm.sub.i32` operands attr-dict `:` type($result)\n
vm.sub.i64
operation ::= `vm.sub.i64` operands attr-dict `:` type($result)\n
vm.and.i32
Integer binary and operation
operation ::= `vm.and.i32` operands attr-dict `:` type($result)\n
vm.and.i64
operation ::= `vm.and.i64` operands attr-dict `:` type($result)\n
vm.ctlz.i32
Counts the leading zeros in an integer value
operation ::= `vm.ctlz.i32` $operand attr-dict `:` type($result)\n
vm.ctlz.i64
operation ::= `vm.ctlz.i64` $operand attr-dict `:` type($result)\n
vm.not.i32
Integer binary not operation
operation ::= `vm.not.i32` $operand attr-dict `:` type($result)\n
vm.not.i64
operation ::= `vm.not.i64` $operand attr-dict `:` type($result)\n
vm.or.i32
Integer binary or operation
operation ::= `vm.or.i32` operands attr-dict `:` type($result)\n
vm.or.i64
operation ::= `vm.or.i64` operands attr-dict `:` type($result)\n
vm.xor.i32
Integer binary exclusive-or operation
operation ::= `vm.xor.i32` operands attr-dict `:` type($result)\n
vm.xor.i64
operation ::= `vm.xor.i64` operands attr-dict `:` type($result)\n
vm.list.alloc
Allocates a new empty list
operation ::= `vm.list.alloc` operands attr-dict `:` `(` type($initial_capacity) `)` `->` type($result)\n
Allocates a new typed list with a minimum initial_capacity.
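A minimal sketch tying the list ops together (constants and names here are illustrative); it also uses the vm.list.set.i32 and vm.list.get.i32 accessors described below:
// Allocate a typed list with capacity for 8 elements, then set and read back one value.
%capacity = vm.const.i32 8
%list = vm.list.alloc %capacity : (i32) -> !vm.list<i32>
%index = vm.const.i32.zero
%value = vm.const.i32 42
vm.list.set.i32 %list, %index, %value : (!vm.list<i32>, i32, i32)
%loaded = vm.list.get.i32 %list, %index : (!vm.list<i32>, i32) -> i32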
vm.list.get.f32
Primitive type element accessor
operation ::= `vm.list.get.f32` operands attr-dict `:` `(` type($list) `,` type($index) `)` `->` type($result)\n
Returns the value of the element at the given index.
vm.list.get.f64
operation ::= `vm.list.get.f64` operands attr-dict `:` `(` type($list) `,` type($index) `)` `->` type($result)\n
vm.list.get.i32
operation ::= `vm.list.get.i32` operands attr-dict `:` `(` type($list) `,` type($index) `)` `->` type($result)\n
vm.list.get.i64
operation ::= `vm.list.get.i64` operands attr-dict `:` `(` type($list) `,` type($index) `)` `->` type($result)\n
vm.list.get.ref
Ref type element accessor
operation ::= `vm.list.get.ref` operands attr-dict `:` `(` type($list) `,` type($index) `)` `->` type($result)\n
Returns the ref value of the element at the given index. Note that the value may be null if the element is null or the type does not match.
vm.list.reserve
Reserves capacity for list growth
operation ::= `vm.list.reserve` operands attr-dict `:` `(` type($list) `,` type($minimum_capacity) `)`\n
Reserves storage for at least minimum_capacity elements. If the list already has at least the specified capacity the operation is ignored.
Effects: MemoryEffects::Effect{MemoryEffects::Allocate on ::mlir::SideEffects::DefaultResource, MemoryEffects::Read on ::mlir::SideEffects::DefaultResource, MemoryEffects::Write on ::mlir::SideEffects::DefaultResource}
MemoryEffects::Effect{MemoryEffects::Allocate on ::mlir::SideEffects::DefaultResource, MemoryEffects::Read on ::mlir::SideEffects::DefaultResource, MemoryEffects::Write on ::mlir::SideEffects::DefaultResource}
minimum_capacity
vm.list.resize
operation ::= `vm.list.resize` operands attr-dict `:` `(` type($list) `,` type($new_size) `)`\n
Resizes the list to contain new_size elements. This will either truncate the list if the existing size is greater than new_size, or extend the list with the default value (0 for primitive element types, null for refs).
vm.list.set.f32
Primitive type element mutator
operation ::= `vm.list.set.f32` operands attr-dict `:` `(` type($list) `,` type($index) `,` type($value) `)`\n
vm.list.set.f64
operation ::= `vm.list.set.f64` operands attr-dict `:` `(` type($list) `,` type($index) `,` type($value) `)`\n
vm.list.set.i32
operation ::= `vm.list.set.i32` operands attr-dict `:` `(` type($list) `,` type($index) `,` type($value) `)`\n
vm.list.set.i64
operation ::= `vm.list.set.i64` operands attr-dict `:` `(` type($list) `,` type($index) `,` type($value) `)`\n
vm.list.set.ref
Ref type element mutator
operation ::= `vm.list.set.ref` operands attr-dict `:` `(` type($list) `,` type($index) `,` type($value) `)`\n
Sets the element at the given index to the new ref value (possibly null).
vm.list.size
operation ::= `vm.list.size` operands attr-dict `:` `(` type($list) `)` `->` type($result)\n
Comparison ops for vm.ref.
vm.ref
vm.cmp.eq.ref
Ref equality comparison operation
operation ::= `vm.cmp.eq.ref` operands attr-dict `:` type($lhs)\n
vm.cmp.ne.ref
Ref inequality comparison operation
operation ::= `vm.cmp.ne.ref` operands attr-dict `:` type($lhs)\n
vm.cmp.nz.ref
Ref non-zero comparison operation
operation ::= `vm.cmp.nz.ref` $operand attr-dict `:` type($operand)\n
Compares the given ref operand for a non-null value.
vm.export
Exports a function from the module
Specifies an exported function with an externally-visible alias. Multiple exports can reference the same internal functions.
Interfaces: SymbolUserOpInterface, VM_OpInterface
export_name
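A sketch of exporting a function; the second line showing an aliased export via as(...) is an assumption about the assembly form, based on the export_name attribute listed above:
vm.export @compute
vm.export @compute as("compute_v2")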
vm.func
Function defined with VM control flow ops
Represents a function containing VM ops and those of compatible dialects. All flow control is performed by VM ops.
Interfaces: CallableOpInterface, FunctionOpInterface, Symbol, VM_OpInterface
vm.import
Imports a function from an external module
Specifies a function that should be imported from either the runtime or an external VM module.
Required imports can be declared with a minimum version of the module that contains the import. The maximum declared minimum version of all required imports from the module will become the required minimum version at runtime.
Optional imports not present at runtime will be invalid to call and whether they were resolved can be queried with vm.import.resolved.
is_optional
minimum_version
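A hedged sketch of import declarations; the module and function names, the placement of the optional keyword, and the minimum_version attribute spelling are assumptions based on the operands and attributes listed above:
// Required import with a declared minimum module version (assumed attribute spelling).
vm.import private @some.module.fn(%arg : i32) -> i32 attributes {minimum_version = 1 : i32}
// Optional import; resolution can be queried with vm.import.resolved.
vm.import private optional @some.module.maybe_fn(%arg : i32) -> i32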
vm.initializer
Interfaces: CallableOpInterface, FunctionOpInterface, Symbol, Util_InitializerOpInterface, VM_OpInterface
vm.module
Module containing VM functions and variables
operation ::= `vm.module` custom<SymbolVisibility>($sym_visibility)
              $sym_name
              attr-dict-with-keyword
              regions
Top-level container for VM functions.
Traits: IsolatedFromAbove, SingleBlockImplicitTerminator<IREE::VM::ModuleTerminatorOp>, SingleBlock, SymbolTable
SingleBlockImplicitTerminator<IREE::VM::ModuleTerminatorOp>
ordinal_counts
version
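A small end-to-end sketch of a module combining the declarations above (all symbol names are illustrative):
vm.module @my_module {
  // Mutable module state, initialized to zero.
  vm.global.i32 private mutable @counter : i32
  vm.func @add_one(%arg0 : i32) -> i32 {
    %c1 = vm.const.i32 1
    %0 = vm.add.i32 %arg0, %c1 : i32
    vm.return %0 : i32
  }
  vm.export @add_one
}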
vm.module_terminator
Terminator pseudo-op for the module op
operation ::= `vm.module_terminator` attr-dict\n
Traits: HasParent<IREE::VM::ModuleOp>, Terminator
#vm.ordinal_counts<
  int32_t,  # import_funcs
  int32_t,  # export_funcs
  int32_t,  # internal_funcs
  int32_t,  # global_bytes
  int32_t,  # global_refs
  int32_t,  # rodatas
  int32_t   # rwdatas
>
int32_t
Vector extensions to the IREE VM.
This is a reference dialect representing a simple IREE VM-based linear algebra module that is used as a library at runtime. The ops in this dialect map (roughly) 1:1 with the exported functions in the runtime module.
See vmvx.imports.mlir for the full list of exported functions.
vmvx.imports.mlir
vmvx.binary
Performs a strided elementwise operation on two same-rank buffers
operation ::= `vmvx.binary` `op` `` `(` $opcode `:` $element_type `)`
              `lhs` `` `(` $lhs_buffer `offset` $lhs_offset `strides` `[` $lhs_strides `]` `:` type($lhs_buffer) `)`
              `rhs` `` `(` $rhs_buffer `offset` $rhs_offset `strides` `[` $rhs_strides `]` `:` type($rhs_buffer) `)`
              `out` `` `(` $out_buffer `offset` $out_offset `strides` `[` $out_strides `]` `:` type($out_buffer) `)`
              `sizes` `` `(` $sizes `)`
              attr-dict
Performs the operation in-place as if:
OUT = OP(LHS, RHS)\n
Where OP is a concrete operation name as defined in ukernel/elementwise.h
OP
opcode
lhs_strides
rhs_strides
out_buffer
out_offset
out_strides
vmvx.copy
Copy from one buffer to another
operation ::= `vmvx.copy` `in` `` `(` $in_buffer `offset` $in_offset `strides` `[` $in_strides `]` `:` type($in_buffer) `)`
              `out` `` `(` $out_buffer `offset` $out_offset `strides` `[` $out_strides `]` `:` type($out_buffer) `)`
              `sizes` `` `(` $sizes `)`
              `:` $element_type
              attr-dict
in_buffer
in_offset
in_strides
vmvx.fill2d
Fill a tile with a scalar
operation ::= `vmvx.fill2d` `scalar` `` `(` $scalar `:` type($scalar) `)`
              `out` `` `(` $out_buffer `offset` $out_offset `row_stride` $out_row_stride `:` type($out_buffer) `)`
              `sizes` `` `(` $m `,` $n `)`
              attr-dict
Fills a tile with dimensions [m, n] with a scalar.
scalar
out_row_stride
m
n
vmvx.unary
Performs a strided elementwise unary operation
operation ::= `vmvx.unary` `op` `` `(` $opcode `:` $element_type `)`
              `in` `` `(` $in_buffer `offset` $in_offset `strides` `[` $in_strides `]` `:` type($in_buffer) `)`
              `out` `` `(` $out_buffer `offset` $out_offset `strides` `[` $out_strides `]` `:` type($out_buffer) `)`
              `sizes` `` `(` $sizes `)`
              attr-dict
OUT = OP(IN)\n
vmvx.get_buffer_descriptor
Late binds a base buffer/offset/strides
operation ::= `vmvx.get_buffer_descriptor` $source `:` type($source) `->` type(results) attr-dict\n
Queries a base buffer, offset and strides. This op is late bound to its source (alloca, binding, etc), allowing additional layers of transformations to be added as lowering progresses (or for buffers to be combined).
This op has canonicalization rules which will bubble it up through the view stack. A final reconciliation pass is used explicitly to bind it to concrete sources.
base_buffer
vmvx.get_raw_interface_binding_buffer
Gets the raw buffer associated with a binding
operation ::= `vmvx.get_raw_interface_binding_buffer` `set` `(` $set `)` `binding` `(` $binding `)` attr-dict\n
Normally, a slice of a binding buffer is returned via hal.interface.binding.subspan. However, the normal VMVX lowering flow for this presumes that the result is a memref, and upon final conversion, it will offset the memref automatically to make it consistent.
This op is used in situations where earlier in a lowering, we have fully resolved the binding to a buffer and would just like the raw backing buffer as passed to the interface.