GPU target parameters for data tiling. #18839

Open · wants to merge 1 commit into main
Conversation

@bjacob (Contributor) commented Oct 18, 2024

This replaces some constants that were hardcoded in GPUMaterializeEncoding.cpp with actual GPU target parameters.

The logic in getSwizzle was doing wonky things with its own local const int targetPreferredLoadBitWidth = 128;, using it in a helper function that inferred interleaving dimensions. That all dated back to the early days: it was effectively trying to infer which innermost dimensions to skip to get at the first CrossThread dimension. That is one more thing we can fix now that we have TileSwizzle::Dim::Kind; see getInnermostCrossThreadDimIdx.

The heuristic in chooseDataTiledMMAAttr becomes much more robust, and is tested more extensively by gpu_materialize_encoding.mlir, now that we can pass arbitrary parameters in ad-hoc #iree_gpu.target attributes; see the test updates. The tests are unfortunately verbose (one screenful of MLIR code per testcase) because each has to be a complete function with flow.dispatch ops, but that's a separate problem.

Signed-off-by: Benoit Jacob <jacob.benoit.1@gmail.com>
@bjacob bjacob marked this pull request as ready for review October 18, 2024 20:37
@bjacob bjacob requested review from hanhanW and Max191 October 18, 2024 20:38
@Max191 (Contributor) left a comment


Cool! Very nice comments for describing the heuristics. I'm probably being too picky about supporting narrow N/M cases, and it is fine to land as is for now, but I left a few comments.

Comment on lines +343 to +344
// Number of SIMDs per workgroup
OptionalParameter<"std::optional<int32_t>">:$workgroup_simds,

nit: rename to simds_per_workgroup

std::array<int32_t, 3> maxWorkgroupCounts;
std::optional<int32_t> maxLoadInstructionBits;
std::optional<int32_t> workgroupSimds;

nit: simdsPerWorkgroup to match my above review comment.

Comment on lines +168 to +169
int interleavingIdx =
    getInnermostCrossThreadDimIdx(swizzle.expandShape[1]);

Nice! This is much clearer to me than before.

Comment on lines +123 to +125
// variables, we self-impose a second constraint for now: that the unrolling
// shape should be square, i.e. unrollM == unrollN. Now we have only one
// variable, call it x, to solve for.

This will break down for narrow M or narrow N cases. Perhaps we can impose additional constraints in the event that M or N are narrow. For example, when M is small enough that the solution for unrollM == unrollN is too large, then we can impose an additional constraint on unrollM to be equal to the largest possible unrolling factor, and solve for unrollN:

x * M_max * sizeInBits(intrinsicC) + M_max * unrollK * sizeInBits(intrinsicA) + x * unrollK * sizeInBits(intrinsicB) == wgp.getVgprSpaceBits()

Comment on lines +166 to +169
// 3. totalUnrollN >= totalUnrollM.
// * Reason: Just like the previous constraint, that is also motivated by
// the code below currently putting all the unroll-to-subgroups in the N
// dimension, which requires a sufficiently large totalUnrollN.

This is fine for now, since we are still not looking much at narrow cases, but just noting that it is sort of leading in the direction of doing transpose narrow N on GPU as well.

@hanhanW (Contributor) left a comment

nice work!

Comment on lines +105 to +110
const int unrollK = std::max(
    1, static_cast<int>(*wgp.getMaxLoadInstructionBits() /
                        std::min(intrinsicA.getElementTypeBitWidth() *
                                     intrinsicA.getNumElements(),
                                 intrinsicB.getElementTypeBitWidth() *
                                     intrinsicB.getNumElements())));

A couple of suggestions, because this statement is very long:

  1. We can declare numATotalBits and numBTotalBits, which helps readability a bit.
  2. You can write std::max<int>(); then you don't need the static_cast.

Comment on lines +135 to +137
auto sizeInBits = [](VectorType type) {
  return type.getElementTypeBitWidth() * type.getNumElements();
};

Ah, you can move this lambda above and reuse it in the unrollK computation.

Comment on lines +147 to +148
static int64_t getInnermostCrossThreadDimIdx(
    const TileSwizzle::ExpandShapeDimVectorType &shape) {

We are missing the logic for when there are no CrossThread tiles. If that would be very wrong, we can add an assertion at the end. If not (i.e., it is expected input), we should make the return type std::optional<int64_t> and ask the callers to handle the failure.

Comment on lines +339 to +345
// The maximum number of workgroups per X/Y/Z dimension in a dispatch.
"DenseI32ArrayAttr":$max_workgroup_counts,
// Max load instruction size in bits
OptionalParameter<"std::optional<int32_t>">:$max_load_instruction_bits,
// Number of SIMDs per workgroup
OptionalParameter<"std::optional<int32_t>">:$workgroup_simds,
// VGPR register space size in bits

style nit: let's add a period at the end of comments for consistency?

Suggested change (adding the trailing periods):

// The maximum number of workgroups per X/Y/Z dimension in a dispatch.
"DenseI32ArrayAttr":$max_workgroup_counts,
// Max load instruction size in bits.
OptionalParameter<"std::optional<int32_t>">:$max_load_instruction_bits,
// Number of SIMDs per workgroup.
OptionalParameter<"std::optional<int32_t>">:$workgroup_simds,
// VGPR register space size in bits.

// Number of SIMDs per workgroup
OptionalParameter<"std::optional<int32_t>">:$workgroup_simds,
// VGPR register space size in bits
OptionalParameter<"std::optional<int32_t>">:$vgpr_space_bits,

I followed the discussion on Discord about why these are optional; it was very helpful. Can we add a comment in the td file that explains why they are optional?
