Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SM68] Initial proposal for Wave Matrix #61

Open
wants to merge 5 commits into
base: main
Choose a base branch
from
Open
Changes from 1 commit
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
396 changes: 396 additions & 0 deletions proposals/xxxx-wave-matrix.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,396 @@
<!-- {% raw %} -->

# Wave Matrix

* Proposal: [NNNN](NNNN-wave-matrix.md)
* Author(s): [Chris Bieneman](https://github.com/llvm-beanz)
* Sponsor: [Greg Roth](https://github.com/pow2clk), [Tex Riddell](https://github.com/tex3d)
* Status: **Under Consideration**
* Planned Version: Shader Model 6.8


## Introduction

This proposal adds HLSL support for a new DXIL feature WaveMatrix. The new data
types and operations add support for wave cooperative matrix multiplication and
accumulation. The underlying hardware and driver interfaces for this feature are
introduced in Shader Model 6.8.

## Motivation

GPUs have added dedicated hardware to support cooperative matrix multiplication
across SIMD lanes. This feature proposes adding new data types and built-in
functions to enable higher throughput matrix operations that fully utilize GPU
SIMD hardware.

These higher throughput matrix operations are required for optimal performance
of many machine learning and image processing workloads. Adding native support
to HLSL will enable high-performance matrix operations across all supported
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's minor, but I fear the notion of "supported" is vague here. It may be interpreted as all hardware that supports SM 6.8, which is probably not the case. My feeble attempt to recraft the sentence:

Adding support to HLSL will enable the high-performance of native matrix operations across all hardware with such support through Shader Model 6.8 drivers.

hardware with Shader Model 6.8 drivers.

## Proposed solution

WaveMatrix introduces new matrix templates to facilitate wave cooperative
operations:

```c++
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It doesn't seem to make a lick of difference in this case, but ```hlsl is an option in github:

WaveMatrixLeft  <TYPE_IN, M, N> ;             // M x K
WaveMatrixLeft  <TYPE_IN, M, N> ;             // M x K

Maybe someday we'll get proper syntax highlighting?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI if you check the 3rd party grammars supported by Github's Linguist integration here, HLSL is listed as being supported by Tim's textmate grammar (so hlsl is a valid fenced code block and we just need to update that grammar to update it as the language changes). I would actually suggest having an "official" Microsoft-maintained grammar file that you can update as the compiler updates if it isn't too much trouble. It'd be super helpful for HLSL tool maintainers to have a more "official" syntax highlighter.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this code block the HLSL highlighting is probably okay, but GitHub’s HLSL syntax highlighting doesn’t handle HLSL 2021 particularly well. The C++ mode syntax highlighting does much better IMO.

As an example here’s HLSL highlighted:

namespace detail {
template <typename ElTy, int NRows, int NCols>
class WaveMatrixBase {
};
} // namespace detail

And the same code C++ highlighted:

namespace detail {
template <typename ElTy, int NRows, int NCols>
class WaveMatrixBase {
};
} // namespace detail

I’m fine changing the simpler code samples to be HLSL highlighted, but I’d greatly prefer to keep the ones that have templates C++-mode. I had used C++ everywhere for consistency, but that isn’t strictly necessary.

Copy link

@jeremyong jeremyong Aug 22, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, I think the thing to do is to fork the C++ tmLanguage file as a base, adding HLSL specific constructs/keywords, and then swap it as the official highlighting grammar for github (instructions for doing that are here. It'd be a shame for this to be fixed later and then have a bunch of HLSL snippets throughout the github ecosystem be tagged as C++ snippets into perpetuity I think!

// Matrix Depth (K) dimension is hardware dependent
// With TYPE_IN one of {float32_t, float16_t, uint8_t4_packed, int8_t4_packed}
WaveMatrixLeft <TYPE_IN, M, N> ; // M x K
WaveMatrixRight <TYPE_IN, M, N> ; // K x N

// With TYPE_ACC one of {float32_t, float16_t, int32_t}
WaveMatrixAccumulator <TYPE_ACC, M, N> ; // M x N
// WaveMatrixLeftColAcc and WaveMatrixRightRowAcc are provided support for
// quantization algorithms. See Zero Point section

// For accumulating columns from WaveMatrixLeft into a single column of sums
WaveMatrixLeftColAcc <TYPE_ACC, M, N> ; // M x 1

// For accumulating rows from WaveMatrixRight into a single row of sums
WaveMatrixRightRowAcc <TYPE_ACC, M, N> ; // 1 x N
```

WaveMatrix accumulator object methods support operating on corresponding Left
and Right operands. Results of operations can be stored or accumulated back into
the accumulator. A simple example of multiplication is:

```c++
[numthreads(64,1,1)]
void main(uint3 GTID : SV_GroupThreadID, uint GIDX : SV_GroupIndex)
{
WaveMatrixLeft<float, 16, 16> Left;
WaveMatrixRight<float, 16, 16> Right;
WaveMatrixAccumulator<float, 16, 16> Acc;

Acc.Multiply(Left, Right); // Stores Left * Right into Acc
Acc.MultiplyAccumulate(Left, Right); // Adds Left * Right to Acc
}
```

## Detailed design

### Reading This Spec

The next few sections include HLSL object definitions written in HLSL 2021
syntax and using inheritance to represent interface composition. The objects in
the `detail` namespace are not exposed in the HLSL runtime. They are provided
here to make the specification more concise to consume. Objects in no namespace
are exposed as public interfaces.

### WaveMatrix Fill

All WaveMatrix objects have a `Fill` method of the form `void Fill(ElTy Value)`
where `ElTy` is the element type.

The `Fill` method fills the matrix or matrix fragment with the provided value.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think a more technical description would be "Assigns the given Value to every element in the matrix or matrix fragment".

All wave threads must provide the same value or the result is undefined. All
WaveMatrix objects have the same `Fill` method with the same behavior.

### WaveMatrix Matrix Objects
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like the explanation of Fill above would be a bit more logical after this section


The code below approximately declares the base interface that WaveMatrix matrix
objects implement.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe add "the following sections will explain in detail what these methods do and what the parameters represent."


```c++
namespace detail {
template <typename ElTy, int NRows, int NCols>
class WaveMatrixBase {
void Fill(ElTy Val);
void Load(ByteAddressBuffer Res, uint StartOffset, uint Stride, bool ColMajor,
uint Align = 0);
void Load(RWByteAddressBuffer Res, uint StartOffset, uint Stride,
bool ColMajor, uint Align = sizeof(ElTy));

void Load(groupshared ElTy Arr[], uint StartIdx, uint Stride, bool ColMajor);

void Store(RWByteAddressBuffer Res, uint StartOffset, uint Stride,
bool ColMajor, uint Align = sizeof(ElTy));

void Store(groupshared ElTy Arr[], uint StartIdx, uint Stride, bool ColMajor);
};
} // namespace detail
```

#### Loading and Storing WaveMatrix Matrix Objects

Values for WaveMatrix matrix objects can be loaded from and stored to
`[RW]ByteAddressBuffer` or `groupshared` array of type `ElTy`.

Loading from or storing to a `[RW]ByteAddressBuffer` takes a start offset in
bytes, row stride in bytes, and optional row alignment in bytes. Rows begin at
algined offsets from the `StartOffset` based on the `Stride` and `Alignment`
(`StartOffset + (RowIdx * [(Stride % Alignment) + Stride])`).

The `Alignment` must be at least `sizeof(ElTy)` otherwise the load behavior is
undefined.

Loading from or storing to a `groupshared` array takes the starting index, and
the row stride as a number of elements.

##### Orientation

When loading and storing WaveMatrix matrices a boolean parameter is provided to
indicate if the matrix being loaded or stored is column major.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe a brief explanation of what column major means is in order. I think we should say that, when false, row-major orientation is indicated


Matrices may be stored in row or column layout. Matrices are always loaded into
row-major orientation in memory for WaveMatrix objects. The `Load` and `Store`
methods perform matrix transposition when loading or storing column major
matrices.

##### Stride

When loading and storing from `groupshared` arrays, the stride is expressed in
number of elements.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As above, perhaps a quick definition of what the stride means in this context in addition to the units for the matrix types


When loading and storing from `[RW]ByteAddressBuffer` types, the stride is
expressed in bytes. The row stride must be a multiple of the size of the
element, and greater than or equal to the size of the element multiplied by the
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think a comma is needed here since there are just two operands to "and"

number of elements per row. Any value below the minimum legal values are
ignored. The behavior of row stride values that are not a multiple of element
stride is undefined.

#### WaveMatrix Left & Right
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Text similar to the above "The code below approximately declares the base interface that WaveMatrix matrix
objects implement." would make clear the intent of this section


```c++
template <typename ElTy, int NRows, int NCols>
class WaveMatrixLeft : detail::WaveMatrixBase<ElTy, NRows, NCols> {
uint MatrixDepth();
};

template <typename ElTy, int NRows, int NCols>
class WaveMatrixRight : detail::WaveMatrixBase<ElTy, NRows, NCols> {
uint MatrixDepth();
};
```

`ElTy` must be either a 32 or 16 bit floating point type or an 8 bit packed
signed or unsigned integer type. `NRows` and `NCols` must be compile-time
constant expressions.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is fairly obvious, but perhaps stating that NRows is the number of rows and NCols the number of columns.

I think we should explain what Left and Right matrices are and how they can be used

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I want to be a little careful about how deep this document goes in terms of usage. This isn't intended to be user documentation, this is a specification of the language feature.


##### WaveMatrix(Left|Right) MatrixDepth

The `MatrixDepth` method returns the hardware-dependent depth for the matrix
multiplication unit. The resulting value must be an even multiple of 16.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you're using "even" here to indicate that there is no remainder. I'm not sure it's necessary and it might be interpreted as the number itself will not be odd. Obviously a multiple of 16 won't be, but still it's a bit ambiguous initially.


### WaveMatrix Matrix Fragment Objects

The code below approximately declares the base interface that all WaveMatrix
fragment objects implement. A WaveMatrix fragment object stores a single row or
column of a WaveMatrix matrix.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does what it represents depend on whether the matrix or row or column major?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, it depends on which specific fragment type you use, either WaveMatrixLeftColAcc or WaveMatrixRightRowAcc, which are a column or row respectively.


```c++
namespace detail {
template <typename ElTy, int NRows, int NCols>
class WaveMatrixFragmentBase : WaveMatrixBase<ElTy, NRows, NCols> {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The inheritance here would seem to suggest that redeclaring these methods isn't strictly necessary. I agree that repeating it makes things clearer, but then I don't know if the inheritance adds anything and it makes the reader check above to see if there are any differences. I did exactly that briefly and saw none. Are there any?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The inheritance should not be there. The fragment classes don't support orientation specification.

void Fill(ElTy Val);

void Load(ByteAddressBuffer Res, uint StartOffset, uint Stride,
uint Align = 0);
void Load(RWByteAddressBuffer Res, uint StartOffset, uint Stride,
uint Align = 0);

void Load(groupshared ElTy Arr[], uint StartIdx, uint Stride);

void Store(RWByteAddressBuffer Res, uint StartOffset, uint Stride,
uint Align = 0);

void Store(groupshared ElTy Arr[], uint StartIdx, uint Stride);
};
} // namespace detail
```

#### Loading and Storing WaveMatrix Fragment Objects

Values for WaveMatrix fragment objects can be loaded from and stored to
`[RW]ByteAddressBuffer` or `groupshared` array of type `ElTy`.

Loading from or storing to a `[RW]ByteAddressBuffer` takes a start offset in
bytes, element stride in bytes, and optional alignment in bytes. The alignment
must be at least `sizeof(ElTy)` otherwise the load behavior is undefined.

Loading from or storing to a `groupshared` array takes the starting index, and
the element stride as a number of elements.

##### Stride

When loading and storing from `groupshared` arrays, the stride is expressed in
number of elements.

When loading and storing from `[RW]ByteAddressBuffer` types the stride must be
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This lacks the clarification that these are specified in bytes

greater than or equal to the size of the element type. Any value below the
minimum legal values are ignored. The behavior of row stride values that are not
a multiple of element stride is undefined.

### WaveMatrix Accumulator Objects

The WaveMatrix Accumulator objects come in three forms which are represented by
two categories: matrix accumulators and fragment accumulators. All accumulators
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found "which are represented by" to be confusing because there are 3 of one and 2 of another. Would "fall into two categories" be equally accurate?

implement the interface below:

```c++
namespace detail {
template <typename ElTy, int NRows, int NCols, typename BaseT>
class WaveMatrixAccumulatorBase : BaseT<ElTy, NRows, NCols> {

void ScalarMultiply(ElTy Value);
void ScalarDivide(ElTy Value);
void ScalarAdd(ElTy Value);
void ScalarSubtract(ElTy Value);
};
} // namespace detail
```

Each of these operations performs the corresponding element-wise arithmetic
operation on the accumulator and the provided scalar value, storing the result
back into the accumulator.

### WaveMatrixAccumulator

```c++
namespace detail {
template <typename T>
struct is_8bit_packed_int_type =
std::enable_if_t<(std::is_same<T, int8_t4_packed>::value ||
std::is_same<T, uint8_t4_packed>::value),
std::true_type>;

template <typename T> struct is_8bit_packed_int_type = std::false_type;

template <typename T>
struct is_32bit_int_type = std::enable_if_t<(std::is_same<T, int32_t>::value ||
std::is_same<T, uint32_t>::value),
std::true_type>;

template <typename T> struct is_32bit_int_type = std::false_type;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm afraid I'm not familiar with this feature of C++ and it looks like you're redeclaring variables. Am I exceptionally ignorant of this feature (possible) or is it going to confuse others?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's because I mixed syntaxes and made this invalid. That should be a type alias:

using is_32bit_int_type = std::false_type;

} // namespace detail

template <typename ElTy, int NRows, int NCols>
class WaveMatrixAccumulator
: WaveMatrixAccumulatorBase<ElTy, NRows, NCols, detail::WaveMatrixBase> {
void Add(WaveMatrixLeftColAcc<ElTy, NRows, NCols> LeftMatrix);
void Add(WaveMatrixRightRowAcc<ElTy, NRows, NCols> RightMatrix);
void Add(WaveMatrixAccumulator<ElTy, NRows, NCols> Matrix);

void Multiply(WaveMatrixLeft<ElTy, NRows, NCols> LHS,
WaveMatrixRight<ElTy, NRows, NCols> RHS);
void MultiplyAccumulate(WaveMatrixLeft<ElTy, NRows, NCols> LHS,
WaveMatrixRight<ElTy, NRows, NCols> RHS);

template <typename MatElTy1, typename MatElTy2>
std::enable_if_t<detail::is_32bit_int_type<ElTy> &&
detail::is_8bit_packed_int_type<MatElTy1> &&
detail::is_8bit_packed_int_type<MatElTy2>,
void>::type
Multiply(WaveMatrixLeft<MatElTy1, NRows, NCols> LHS,
WaveMatrixRight<MatElTy2, NRows, NCols> RHS);

template <typename MatElTy1, typename MatElTy2>
std::enable_if_t<detail::is_32bit_int_type<ElTy> &&
detail::is_8bit_packed_int_type<MatElTy1> &&
detail::is_8bit_packed_int_type<MatElTy2>,
void>::type
MultiplyAccumulate(WaveMatrixLeft<MatElTy1, NRows, NCols> LHS,
WaveMatrixRight<MatElTy2, NRows, NCols> RHS);
};

```

#### Broadcast Add

The `WaveMatrixAccumulator::Add` methods perform an element-wise addition of an
accumulator matrix and the provided matrix or fragment accumulator.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The last "accumulator" seems out of place to me. Is the intent to say that it adds an accumulator matrix (which I think I know what is) to a matrix accumulator or a [matrix] fragment accumulator? I'm not sure it's clear what those are and the terms are kind of ambiguous with the accumulator matrix term

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it adds one accumulator to another. It doesn't work with matrix types, just the accumulator types on both sides.


If the argument is a fragment object, the `Add` method broadcasts the argument
accumulator up from an Mx1 or 1xN matrix to an MxN matrix then performs
element-wise addition.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is what I tend to call a "splat". Is it a conversion to a full matrix that copies the values of the fragment into a temporary full matrix that gets accumulated? How do the values get selected? I think I can guess, but I think it should be explicit.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is explicitly not specified. The hardware can decide if it needs to copy the fragment to a full matrix or if it can more optimally perform a row/column-wise operation.


#### WaveMatrixAccumulator::Multiply

The `WaveMatrixAccumulator::Multiply` method performs multiplication of the left
and right arguments and stores the result back into the `WaveMatrixAccumulator`.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know you specify that it "stores" the result, but given that this is the only one that doesn't accumulate (right?) maybe we should call out explicitly that it overwrites the contents of the accumulator

This is a wave-level operation and cannot be used inside divergent control flow.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it is, is the result undefined?

Longer term goal, but perhaps we should crib from other specs' formal definitions of "should", "must" etc.


For `Multiply` operations the matrix element types must match the accumulator
type unless the accumulator is a signed or unsigned 32-bit integer type. For
signed or unsigned 32-bit integer accumulators, signed or unsigned 8-bit packed
integers are also supported and can be mixed interchangeably.

#### WaveMatrixAccumulator::MultiplyAccumulate

The `WaveMatrixAccumulator::MultiplyAccumulate` method performs multiplication
of the left and right arguments and adds the result back into the
`WaveMatrixAccumulator`. This is a wave-level operation and cannot be used
inside divergent control flow.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as above, what if I do?


For `MultiplyAccumulate` operations the matrix element types must match the
accumulator type unless the accumulator is a signed or unsigned 32-bit integer
type. For signed or unsigned 32-bit integer accumulators, signed or unsigned
8-bit packed integers are also supported and can be mixed interchangeably.

### WaveMatrix Fragment Accumulators

WaveMatrix intrinsics are defined to support quantization calculations.
Including calculating a sum for the rows of the left matrix and a sum of the
columns of the right matrix. The `WaveMatrixRightRowAcc` and
`WaveMatrixLeftColAcc` fragment accumulators perform this operation.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should they have "Fragment" in the name? I was initially confused why they took full matrices as parameters, but I see through the inheritance that they are fragment accumulators

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't love the names of those types, but I think if we add "Fragment" to the name we're going to be getting dangerously close to a type name that can't fit on one line of code with reasonable line wrapping rules.


```c++
template <typename ElTy, int NRows, int NCols>
class WaveMatrixLeftColAcc
: WaveMatrixAccumulatorBase<ElTy, NRows, NCols,
detail::WaveMatrixFragmentBase> {
template <typename MatElTy>
void SumAccumulate(WaveMatrixLeft<MatElTy, NRows, NCols> LeftMatrix);
};

template <typename ElTy, int NRows, int NCols>
class WaveMatrixRightRowAcc
: WaveMatrixAccumulatorBase<ElTy, NRows, NCols,
detail::WaveMatrixFragmentBase> {
void SumAccumulate(WaveMatrixRight<MatElTy, NRows, NCols> RightMatrix);
};
```

#### Zero Point

The following is the equation for matrix multiplication with zero point
adjustment included:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This section feels out of place in between the definition of WaveMatrix Fragment Acuumulators and the description of their SumAccumulate method. I recognize that it is referenced in the section just after it, but if this is defining the multiply operations, perhaps it can go after the multiply method description and the next section can link back to it?


$C_{[x,y]} = (\sum_{i=0}^{K} A_{[x,i]} * B_{[i,y]}) - Z_a * (\sum_{i=0}^{K} B_{[i,y]}) - Z_b * (\sum_{i=0}^{K} A_{[x,i]}) + Z_a * Z_b * K$

$(\sum_{i=0}^{K} A_{[x,i]} * B_{[i,y]})$ is basic matrix multiplication.

$- Z_a * (\sum_{i=0}^{K} B_{[i,y]})$ is the zero point adjustment for matrix $A$

$- Z_b * (\sum_{i=0}^{K} A_{[x,i]})$ is the zero point adjustment for matrix $B$

$+ Z_a * Z_b * K$ is the static zero point adjustment for both matrix $A$ and $B$

$Z_*$ are constant zero points values

#### Wave Matrix SumAccumulate
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe "WaveMatrix Fragment SumAccumulate" to be consistent with the title of that section?


The `SumAccumulate` methods accumulate the values of the argument matrix into
the WaveMatrix fragment accumulator. The fragment WaveMatrix must have the same
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reading this and the description above the class, it's not clear to me what actually gets placed into the accumulator. Is it the sum of all elements in each row of a row-major matrix and the sum of all elements in each column of a column-major matrix?

data type as the fragment accumulator.

This intrinsics is used to calculated
$(\sum_{i=0}^{K} A_{[x,i]})$ and $(\sum_{i=0}^{K} B_{[i,y]})$ from our above
equation.


## Acknowledgments

This spec was developed as an extensive collaboration between the Microsoft HLSL
and Direct3D teams and IHV partners.

Special thanks to:

Claire Andrews
Nick Feeney
Amar Patel
Tex Riddell
Greg Roth

<!-- {% endraw %} -->