GPU target parameters for data tiling. #18839
base: main
Conversation
Signed-off-by: Benoit Jacob <jacob.benoit.1@gmail.com>
Cool! Very nice comments for describing the heuristics. I'm probably being too picky about supporting narrow N/M cases, and it is fine to land as is for now, but I left a few comments.
// Number of SIMDs per workgroup
OptionalParameter<"std::optional<int32_t>">:$workgroup_simds,
nit: rename to simds_per_workgroup
std::array<int32_t, 3> maxWorkgroupCounts;
std::optional<int32_t> maxLoadInstructionBits;
std::optional<int32_t> workgroupSimds;
nit: simdsPerWorkgroup, to match my above review comment.
int interleavingIdx =
    getInnermostCrossThreadDimIdx(swizzle.expandShape[1]);
Nice! This is much clearer to me than before.
// variables, we self-impose a second constraint for now: that the unrolling
// shape should be square, i.e. unrollM == unrollN. Now we have only one
// variable, call it x, to solve for.
This will break down for narrow M or narrow N cases. Perhaps we can impose additional constraints in the event that M or N are narrow. For example, when M is small enough that the solution for unrollM == unrollN is too large, then we can impose an additional constraint on unrollM to be equal to the largest possible unrolling factor, and solve for unrollN:
x * M_max * sizeInBits(intrinsicC) + M_max * unrollK * sizeInBits(intrinsicA) + x * unrollK * sizeInBits(intrinsicB) == wgp.getVgprSpaceBits()
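To make the suggestion concrete, here is a small hedged sketch of what solving that constraint for the remaining unknown x = unrollN could look like. All names here (solveUnrollNForNarrowM, mMax, bitsA/bitsB/bitsC) are illustrative, not taken from the actual patch:

```cpp
#include <algorithm>

// Hypothetical sketch: when M is narrow, pin unrollM to the largest
// unrolling factor M allows (mMax) and solve
//   x * mMax * bitsC + mMax * unrollK * bitsA + x * unrollK * bitsB
//     == vgprSpaceBits
// for x = unrollN.
static int solveUnrollNForNarrowM(int mMax, int unrollK, int bitsA, int bitsB,
                                  int bitsC, int vgprSpaceBits) {
  // Rearranged: x * (mMax * bitsC + unrollK * bitsB)
  //               == vgprSpaceBits - mMax * unrollK * bitsA.
  int denominator = mMax * bitsC + unrollK * bitsB;
  int numerator = vgprSpaceBits - mMax * unrollK * bitsA;
  // Round down and clamp so we never return a degenerate factor.
  return std::max(1, numerator / denominator);
}
```

This keeps the one-variable structure of the existing heuristic; only the constraint pinned to mMax changes.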
// 3. totalUnrollN >= totalUnrollM.
//    * Reason: Just like the previous constraint, that is also motivated by
//      the code below currently putting all the unroll-to-subgroups in the N
//      dimension, which requires a sufficiently large totalUnrollN.
This is fine for now, since we are still not looking much at narrow cases, but just noting that it is sort of leading in the direction of doing transpose narrow N on GPU as well.
nice work!
const int unrollK = std::max(
    1, static_cast<int>(*wgp.getMaxLoadInstructionBits() /
                        std::min(intrinsicA.getElementTypeBitWidth() *
                                     intrinsicA.getNumElements(),
                                 intrinsicB.getElementTypeBitWidth() *
                                     intrinsicB.getNumElements())));
A couple of suggestions, because this statement is very long:
- We can declare numATotalBits and numBTotalBits, which helps the readability a bit.
- You can write std::max<int>(), then you don't need the static_cast.
auto sizeInBits = [](VectorType type) {
  return type.getElementTypeBitWidth() * type.getNumElements();
};
ah, you can move this lambda to above and reuse it in the unrollK computation.
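For illustration, a hedged sketch of what the unrollK computation could look like after applying these review suggestions (hoisted sizeInBits helper, named totals, std::max<int>). Plain structs stand in for the real VectorType and wgp accessors, which are not reproduced here:

```cpp
#include <algorithm>
#include <cstdint>

// Stand-in for the MMA intrinsic operand type; illustrative only.
struct FakeIntrinsicOperand {
  int elementTypeBitWidth;
  int numElements;
};

static int computeUnrollK(FakeIntrinsicOperand intrinsicA,
                          FakeIntrinsicOperand intrinsicB,
                          int64_t maxLoadInstructionBits) {
  // Hoisted helper, reusable by later code as the reviewer suggests.
  auto sizeInBits = [](FakeIntrinsicOperand type) {
    return type.elementTypeBitWidth * type.numElements;
  };
  int numATotalBits = sizeInBits(intrinsicA);
  int numBTotalBits = sizeInBits(intrinsicB);
  // std::max<int> converts both arguments, so no static_cast is needed.
  return std::max<int>(
      1, maxLoadInstructionBits / std::min(numATotalBits, numBTotalBits));
}
```

The behavior is unchanged from the quoted snippet; only the intermediate names and the explicit template argument differ.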
static int64_t getInnermostCrossThreadDimIdx(
    const TileSwizzle::ExpandShapeDimVectorType &shape) {
We are missing the logic for when there are no CrossThread kind of tiles. If that case is very wrong, we can add an assertion at the end. If not (i.e., it is expected input), we should make the return type std::optional<int64_t> and ask the callers to handle the failure.
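A hedged sketch of the std::optional<int64_t> variant being suggested, using a stand-in Dim struct and std::vector in place of the real TileSwizzle::ExpandShapeDimVectorType (the actual kind enumerators may differ):

```cpp
#include <cstdint>
#include <optional>
#include <vector>

// Illustrative stand-in for TileSwizzle::Dim.
struct Dim {
  enum class Kind { Internal, CrossThread, CrossIntrinsic } kind;
};

static std::optional<int64_t>
getInnermostCrossThreadDimIdx(const std::vector<Dim> &shape) {
  // Walk from the innermost (last) dim outwards, skipping non-CrossThread
  // dims. Return std::nullopt when no CrossThread dim exists, so callers
  // must handle the failure explicitly instead of getting a bogus index.
  for (int64_t i = static_cast<int64_t>(shape.size()) - 1; i >= 0; --i) {
    if (shape[i].kind == Dim::Kind::CrossThread)
      return i;
  }
  return std::nullopt;
}
```

Callers would then branch on has_value() rather than assuming a valid index.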
// The maximum number of workgroups per X/Y/Z dimension in a dispatch.
"DenseI32ArrayAttr":$max_workgroup_counts,
// Max load instruction size in bits
OptionalParameter<"std::optional<int32_t>">:$max_load_instruction_bits,
// Number of SIMDs per workgroup
OptionalParameter<"std::optional<int32_t>">:$workgroup_simds,
// VGPR register space size in bits
style nit: let's add a period at the end of comments for consistency?
Suggested change:
// The maximum number of workgroups per X/Y/Z dimension in a dispatch.
"DenseI32ArrayAttr":$max_workgroup_counts,
// Max load instruction size in bits.
OptionalParameter<"std::optional<int32_t>">:$max_load_instruction_bits,
// Number of SIMDs per workgroup.
OptionalParameter<"std::optional<int32_t>">:$workgroup_simds,
// VGPR register space size in bits.
// Number of SIMDs per workgroup
OptionalParameter<"std::optional<int32_t>">:$workgroup_simds,
// VGPR register space size in bits
OptionalParameter<"std::optional<int32_t>">:$vgpr_space_bits,
I followed the discussion on discord about why they are optional, very helpful. Can we add a comment that explains why they are optional in the td file?
This replaces some constants that were hardcoded in GPUMaterializeEncoding.cpp by actual GPU target parameters.

The logic in getSwizzle was doing wonky things with its own local const int targetPreferredLoadBitWidth = 128;, using it in a helper function inferring interleaving dimensions. That was all dating back to early days -- it was effectively trying to infer which inner-most dimensions to skip to get at the first CrossThread dimension, so that is one more thing that we can fix now that we have TileSwizzle::Dim::Kind. See getInnermostCrossThreadDimIdx.

The heuristic in chooseDataTiledMMAAttr becomes much more robust, and is tested more extensively by gpu_materialize_encoding.mlir, now that we can pass arbitrary parameters in ad-hoc #iree_gpu.target attributes; see the test updates. It's unfortunately verbose (one screenful of MLIR code for each testcase) because each has to be a complete function with flow.dispatch ops, but that's a separate problem.