AMDGPU ukernels: Bazel build, separate bitcode files, c-embed archives. #19274

bjacob · 2024-11-22T21:00:44Z

Implement Bazel, generate CMake from Bazel.
Split .bc bitcode files, one .bc file <-> one ukernel function.
Generate embedded-data archives.
Update the compiler code to use the embedded-data archives.
Simplify setAlwaysInline now that we are no longer dealing with HIP symbols.

gfx942_rocm

build_tools/cmake/iree_bitcode_library.cmake

ScottTodd · 2024-11-25T17:34:17Z

compiler/plugins/target/ROCM/builtins/ukernel/BUILD.bazel

+gpu_archs = [
+    "gfx90a",
+    "gfx942",
+    "gfx1030",
+    "gfx1100",
+]


How many of these architectures are we going to need [built into the compiler]? How much binary size and compile time does each add?

If less than 10, that seems acceptable. Any more than that and I'd worry about maintenance, binary size, and compile time costs.

Some suggestions:

Could add a comment here with a link to a list of available architectures and some description of which are worth including here. Docs: https://iree.dev/guides/deployment-configurations/gpu-rocm/#compile-a-program and https://llvm.org/docs/AMDGPUUsage.html#processors could work.

Would we want to add a CMake option to add other architectures to the list? (and possibly Bazel...)

Can iree-compile generate these on-demand if a user passes a target chip that isn't in the list of packaged ukernels? (Or, if these are embedded in the runtime, can they be configured when building programs like iree-run-module?)

The current argmax ukernel here is more aggressive in supporting "all" architectures than what I envision adding going forward.

I mostly envision ukernels for iree_gpu.multi_mma , each painstakingly hand-written in amdgpu assembly (wrapped as inline-assembly into a C form). My whole motivation is to work around defects in the LLVM AMDGPU backend, and assembly is the only thing that doesn't ultimately depend on that. So, the ukernels that I will be adding will by construction be specific to one architecture, and expensive in engineering time.

This argmax ukernel is different: written in C and compiled for "all currently relevant architectures", what it works around is the difficulty of generating efficient code for this kind of kernel in the MLIR programming model, due to the very specific things that argmax needs to do (see wave reductions (wfred) and ballots in the C sources).

Even then, even for such argmax-style C ukernels, the number of architectures is still less than 10 in any foreseeable future -- since this is an optimization-only thing (argmax still works without ukernels, which is also why they are not enabled by default) --- we never need to support more architectures than we really care about performance on. At the moment, the number of architectures that we really care about performance on, equals 1 (that is: CDNA3)...

Would we want to add a CMake option to add other architectures to the list? (and possibly Bazel...)

That would make sense if the current argmax-style ukernels (which are C that can simply be recompiled to any target arch) were the generic case going forward (but see above explanation).

Can iree-compile generate these on-demand if a user passes a target chip that isn't in the list of packaged ukernels?

In a future where this genericity were really important (meaning, a future where such retargetable C ukernels are important and we have many target architectures) we could do that by actually shipping ukernels in source C form... and building clang C-compiling powers into iree-compile... just a theoretical possibility. See above explanation for why we are not headed in that direction.

The really important high level point here is that our plan A is still to root for LLVM AMDGPU improvements. So that before any such future happens... LLVM AMDGPU will be in a better shape and will remove the need for ukernels. And then for things like argmax where the compiler can't magically invent the right implementation with wave reductions and ballot, it'll be possible to at least implement that as a MLIR lowering, mirroring the current C code in that ukernel. (While for my use case with multi_mma, it really is just a compiler bug that it isn't currently generating the fastest code).

Thanks, that helps a lot. I don't have any extra concerns with this approach for now then.

Can we get some of that text into a comment somewhere in the source and/or a markdown file?

I just added an allusion to that in addressing your review comment on BUILD.bazel. I expect that the code is going to start conveying intent better once (1) the next PR moves the code loading the bitcode to LowerToUkernels (see other reply) and (2) I start adding multi_mma ukernels that provide the better model for the generic case going forward. At that point we'll have fewer things requiring comment (because always best when the code speaks for itself) and more stable places to add comments on what still needs commenting.

ScottTodd · 2024-11-25T17:50:49Z

compiler/plugins/target/ROCM/ROCMTargetUtils.cpp

+#include "compiler/plugins/target/ROCM/builtins/ukernel/iree_uk_amdgpu_gfx1030.h"
+#include "compiler/plugins/target/ROCM/builtins/ukernel/iree_uk_amdgpu_gfx1100.h"
+#include "compiler/plugins/target/ROCM/builtins/ukernel/iree_uk_amdgpu_gfx90a.h"
+#include "compiler/plugins/target/ROCM/builtins/ukernel/iree_uk_amdgpu_gfx942.h"


Could group these into a single header file, perhaps with the contents of that header file generated by the build system.

There is by construction one header file per c_embed_data library. So the question becomes whether to group these 4 c_embed_data libraries into one. I have a specific reason not to, which will become apparent in the next PR: I want the c_embed_data library for a given target architecture to contain exactly the ukernels that are supported on that architecture, so that just by looking up the table of contents, LowerToUkernels will be able to tell whether a ukernel lowering should succeed or fail.

ScottTodd · 2024-11-25T17:53:34Z

compiler/plugins/target/ROCM/ROCMTargetUtils.cpp

+static SmallVector<llvm::Expected<std::unique_ptr<llvm::Module>>>
+loadUKernelBitcodeFiles(llvm::LLVMContext &context, StringRef targetGpuArch) {
+  const iree_file_toc_t *toc = nullptr;
+  int toc_size = 0;
+  if (targetGpuArch == "gfx90a") {
+    toc = iree_uk_amdgpu_gfx90a_create();
+    toc_size = iree_uk_amdgpu_gfx90a_size();
+  } else if (targetGpuArch == "gfx942") {
+    toc = iree_uk_amdgpu_gfx942_create();
+    toc_size = iree_uk_amdgpu_gfx942_size();
+  } else if (targetGpuArch == "gfx1030") {
+    toc = iree_uk_amdgpu_gfx1030_create();
+    toc_size = iree_uk_amdgpu_gfx1030_size();
+  } else if (targetGpuArch == "gfx1100") {
+    toc = iree_uk_amdgpu_gfx1100_create();
+    toc_size = iree_uk_amdgpu_gfx1100_size();
+  } else {
+    return {};
+  }


Similar comment here - could have the build system generate this if the list is expected to grow or be dynamic.

Or have a registration mechanism / dependency injection, so we don't have 5 places to update whenever adding a new architecture.

See reply to previous comment. This code, currently in ROCMTargetUtils (ie linking-time) will move up to the earlier LowerToUkernels time where it will be key to how LowerToUkernels decides whether to succeed or fail. Separate arch-specific TOC's are going to be key.

ScottTodd

LGTM. May want at least one other person familiar with AMDGPU codegen to review though.

kuhar

LGTM % nits

build_tools/bazel/iree_bitcode_library.bzl

compiler/plugins/target/ROCM/ROCMTargetUtils.cpp

Signed-off-by: Benoit Jacob <jacob.benoit.1@gmail.com>

…s. (iree-org#19274) 1. Implement Bazel, generate CMake from Bazel. 2. Split .bc bitcode files, one .bc file <-> one ukernel function. 3. Generate embedded-data archives. 4. Update the compiler code to use the embedded-data archives. 5. Simplify setAlwaysInline now that we are no longer dealing with HIP symbols. --------- Signed-off-by: Benoit Jacob <jacob.benoit.1@gmail.com>

…s. (iree-org#19274) 1. Implement Bazel, generate CMake from Bazel. 2. Split .bc bitcode files, one .bc file <-> one ukernel function. 3. Generate embedded-data archives. 4. Update the compiler code to use the embedded-data archives. 5. Simplify setAlwaysInline now that we are no longer dealing with HIP symbols. --------- Signed-off-by: Benoit Jacob <jacob.benoit.1@gmail.com> Signed-off-by: Giacomo Serafini <179146510+giacs-epic@users.noreply.github.com>

bjacob mentioned this pull request Nov 22, 2024

Bunch of AMDGPU ukernel changes #19272

Closed

bjacob requested review from manupak, raikonenfnu, Groverkss and benvanik November 22, 2024 21:03

bjacob marked this pull request as ready for review November 22, 2024 21:04

bjacob requested review from kuhar, MaheshRavishankar and ScottTodd as code owners November 22, 2024 21:04

Base automatically changed from users/bjacob/uk-split to main November 24, 2024 02:14

bjacob requested review from qedawkins and antiagainst as code owners November 24, 2024 02:14

bjacob force-pushed the users/bjacob/uk-bazel-embed branch from b0bbed8 to bd7f1af Compare November 25, 2024 16:39

ScottTodd reviewed Nov 25, 2024

View reviewed changes

bjacob requested a review from ScottTodd November 25, 2024 18:20

ScottTodd approved these changes Nov 25, 2024

View reviewed changes

kuhar approved these changes Nov 27, 2024

View reviewed changes

bjacob added 2 commits November 27, 2024 11:37

build-system do-over; switch to embedded data files

570bf31

Signed-off-by: Benoit Jacob <jacob.benoit.1@gmail.com>

review comments

a483247

Signed-off-by: Benoit Jacob <jacob.benoit.1@gmail.com>

bjacob force-pushed the users/bjacob/uk-bazel-embed branch from 43af469 to a483247 Compare November 27, 2024 19:16

bjacob enabled auto-merge (squash) November 27, 2024 19:17

bjacob merged commit 8272490 into main Nov 27, 2024
40 checks passed

bjacob deleted the users/bjacob/uk-bazel-embed branch November 27, 2024 19:36

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

AMDGPU ukernels: Bazel build, separate bitcode files, c-embed archives. #19274

AMDGPU ukernels: Bazel build, separate bitcode files, c-embed archives. #19274

bjacob commented Nov 22, 2024

ScottTodd Nov 25, 2024

bjacob Nov 25, 2024 •

edited

Loading

ScottTodd Nov 25, 2024

bjacob Nov 25, 2024

ScottTodd Nov 25, 2024

bjacob Nov 25, 2024

ScottTodd Nov 25, 2024

bjacob Nov 25, 2024

ScottTodd left a comment

kuhar left a comment

AMDGPU ukernels: Bazel build, separate bitcode files, c-embed archives. #19274

AMDGPU ukernels: Bazel build, separate bitcode files, c-embed archives. #19274

Conversation

bjacob commented Nov 22, 2024

ScottTodd Nov 25, 2024

Choose a reason for hiding this comment

bjacob Nov 25, 2024 • edited Loading

Choose a reason for hiding this comment

ScottTodd Nov 25, 2024

Choose a reason for hiding this comment

bjacob Nov 25, 2024

Choose a reason for hiding this comment

ScottTodd Nov 25, 2024

Choose a reason for hiding this comment

bjacob Nov 25, 2024

Choose a reason for hiding this comment

ScottTodd Nov 25, 2024

Choose a reason for hiding this comment

bjacob Nov 25, 2024

Choose a reason for hiding this comment

ScottTodd left a comment

Choose a reason for hiding this comment

kuhar left a comment

Choose a reason for hiding this comment

bjacob Nov 25, 2024 •

edited

Loading