GPU target parameters for data tiling. #18839
base: main
Conversation
Signed-off-by: Benoit Jacob <jacob.benoit.1@gmail.com>
Cool! Very nice comments for describing the heuristics. I'm probably being too picky about supporting narrow N/M cases, and it is fine to land as is for now, but I left a few comments.
// Number of SIMDs per workgroup
OptionalParameter<"std::optional<int32_t>">:$workgroup_simds,
nit: rename to simds_per_workgroup
std::array<int32_t, 3> maxWorkgroupCounts;
std::optional<int32_t> maxLoadInstructionBits;
std::optional<int32_t> workgroupSimds;
nit: simdsPerWorkgroup, to match my above review comment.
int interleavingIdx =
    getInnermostCrossThreadDimIdx(swizzle.expandShape[1]);
Nice! This is much clearer to me than before.
// variables, we self-impose a second constraint for now: that the unrolling
// shape should be square, i.e. unrollM == unrollN. Now we have only one
// variable, call it x, to solve for.
This will break down for narrow M or narrow N cases. Perhaps we can impose additional constraints in the event that M or N are narrow. For example, when M is small enough that the solution for unrollM == unrollN is too large, then we can impose an additional constraint on unrollM to be equal to the largest possible unrolling factor, and solve for unrollN:
x * M_max * sizeInBits(intrinsicC) + M_max * unrollK * sizeInBits(intrinsicA) + x * unrollK * sizeInBits(intrinsicB) == wgp.getVgprSpaceBits()
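To make the suggestion concrete, here is a small hedged sketch of what solving that constraint for the remaining unknown x = unrollN could look like. All names here (solveUnrollNForNarrowM, mMax, bitsA/bitsB/bitsC) are illustrative, not taken from the actual patch:

```cpp
#include <algorithm>

// Hypothetical sketch: when M is narrow, pin unrollM to the largest
// unrolling factor M allows (mMax) and solve
//   x * mMax * bitsC + mMax * unrollK * bitsA + x * unrollK * bitsB
//     == vgprSpaceBits
// for x = unrollN.
static int solveUnrollNForNarrowM(int mMax, int unrollK, int bitsA, int bitsB,
                                  int bitsC, int vgprSpaceBits) {
  // Rearranged: x * (mMax * bitsC + unrollK * bitsB)
  //               == vgprSpaceBits - mMax * unrollK * bitsA.
  int denominator = mMax * bitsC + unrollK * bitsB;
  int numerator = vgprSpaceBits - mMax * unrollK * bitsA;
  // Round down and clamp so we never return a degenerate factor.
  return std::max(1, numerator / denominator);
}
```

This keeps the one-variable structure of the existing heuristic; only the constraint pinned to mMax changes.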
// 3. totalUnrollN >= totalUnrollM.
//    * Reason: Just like the previous constraint, that is also motivated by
//      the code below currently putting all the unroll-to-subgroups in the N
//      dimension, which requires a sufficiently large totalUnrollN.
This is fine for now, since we are still not looking much at narrow cases, but just noting that it is sort of leading in the direction of doing transpose narrow N on GPU as well.
nice work!
const int unrollK = std::max(
    1, static_cast<int>(*wgp.getMaxLoadInstructionBits() /
                        std::min(intrinsicA.getElementTypeBitWidth() *
                                     intrinsicA.getNumElements(),
                                 intrinsicB.getElementTypeBitWidth() *
                                     intrinsicB.getNumElements())));
A couple of suggestions, because this statement is very long:
- We can declare numATotalBits and numBTotalBits, which helps the readability a bit.
- You can write std::max<int>(), then you don't need the static_cast.
auto sizeInBits = [](VectorType type) {
  return type.getElementTypeBitWidth() * type.getNumElements();
};
ah, you can move this lambda to above and reuse it in the unrollK computation.
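For illustration, a hedged sketch of what the unrollK computation could look like after applying these review suggestions (hoisted sizeInBits helper, named totals, std::max<int>). Plain structs stand in for the real VectorType and wgp accessors, which are not reproduced here:

```cpp
#include <algorithm>
#include <cstdint>

// Stand-in for the MMA intrinsic operand type; illustrative only.
struct FakeIntrinsicOperand {
  int elementTypeBitWidth;
  int numElements;
};

static int computeUnrollK(FakeIntrinsicOperand intrinsicA,
                          FakeIntrinsicOperand intrinsicB,
                          int64_t maxLoadInstructionBits) {
  // Hoisted helper, reusable by later code as the reviewer suggests.
  auto sizeInBits = [](FakeIntrinsicOperand type) {
    return type.elementTypeBitWidth * type.numElements;
  };
  int numATotalBits = sizeInBits(intrinsicA);
  int numBTotalBits = sizeInBits(intrinsicB);
  // std::max<int> converts both arguments, so no static_cast is needed.
  return std::max<int>(
      1, maxLoadInstructionBits / std::min(numATotalBits, numBTotalBits));
}
```

The behavior is unchanged from the quoted snippet; only the intermediate names and the explicit template argument differ.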
static int64_t getInnermostCrossThreadDimIdx(
    const TileSwizzle::ExpandShapeDimVectorType &shape) {
We are missing the logic for when there are no CrossThread kind of tiles. If that case is very wrong, we can add an assertion at the end. If not (i.e., it is expected input), we should make the return type std::optional<int64_t> and ask the callers to handle the failure.
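A hedged sketch of the std::optional<int64_t> variant being suggested, using a stand-in Dim struct and std::vector in place of the real TileSwizzle::ExpandShapeDimVectorType (the actual kind enumerators may differ):

```cpp
#include <cstdint>
#include <optional>
#include <vector>

// Illustrative stand-in for TileSwizzle::Dim.
struct Dim {
  enum class Kind { Internal, CrossThread, CrossIntrinsic } kind;
};

static std::optional<int64_t>
getInnermostCrossThreadDimIdx(const std::vector<Dim> &shape) {
  // Walk from the innermost (last) dim outwards, skipping non-CrossThread
  // dims. Return std::nullopt when no CrossThread dim exists, so callers
  // must handle the failure explicitly instead of getting a bogus index.
  for (int64_t i = static_cast<int64_t>(shape.size()) - 1; i >= 0; --i) {
    if (shape[i].kind == Dim::Kind::CrossThread)
      return i;
  }
  return std::nullopt;
}
```

Callers would then branch on has_value() rather than assuming a valid index.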
// The maximum number of workgroups per X/Y/Z dimension in a dispatch.
"DenseI32ArrayAttr":$max_workgroup_counts,
// Max load instruction size in bits
OptionalParameter<"std::optional<int32_t>">:$max_load_instruction_bits,
// Number of SIMDs per workgroup
OptionalParameter<"std::optional<int32_t>">:$workgroup_simds,
// VGPR register space size in bits
style nit: let's add a period at the end of comments for consistency?
Suggested change:
// The maximum number of workgroups per X/Y/Z dimension in a dispatch.
"DenseI32ArrayAttr":$max_workgroup_counts,
// Max load instruction size in bits.
OptionalParameter<"std::optional<int32_t>">:$max_load_instruction_bits,
// Number of SIMDs per workgroup.
OptionalParameter<"std::optional<int32_t>">:$workgroup_simds,
// VGPR register space size in bits.
// Number of SIMDs per workgroup
OptionalParameter<"std::optional<int32_t>">:$workgroup_simds,
// VGPR register space size in bits
OptionalParameter<"std::optional<int32_t>">:$vgpr_space_bits,
I followed the discussion on discord about why they are optional, very helpful. Can we add a comment that explains why they are optional in the td file?
This replaces some constants that were hardcoded in GPUMaterializeEncoding.cpp by actual GPU target parameters.

The logic in getSwizzle was doing wonky things with its own local const int targetPreferredLoadBitWidth = 128;, using it in a helper function inferring interleaving dimensions. That was all dating back to early days -- it was effectively trying to infer which inner-most dimensions to skip to get at the first CrossThread dimension, so that is one more thing that we can fix now that we have TileSwizzle::Dim::Kind. See getInnermostCrossThreadDimIdx.

The heuristic in chooseDataTiledMMAAttr becomes much more robust, and is tested more extensively by gpu_materialize_encoding.mlir, now that we can pass arbitrary parameters in ad-hoc #iree_gpu.target attributes; see the test updates. It's unfortunately verbose (one screenful of MLIR code for each testcase) because each has to be a complete function with flow.dispatch ops, but that's a separate problem.