This section summarizes the interfaces and implementations of common numerical algorithms performed on `Tensor`s.

The implementation of these algorithms may be found in the `include/cute/algorithm/` directory.

CuTe's `copy` algorithm copies the elements of a source `Tensor` into the elements of a destination `Tensor`. The various overloads of `copy` can be found in `include/cute/algorithm/copy.hpp`.

A `Tensor` encapsulates the data type, data location, and possibly also the shape and stride of the tensor at compile time. As a result, `copy` can and does dispatch, based on the types of its arguments, to any of various synchronous or asynchronous hardware copy instructions.

The `copy` algorithm has two main overloads. The first just takes the source `Tensor` and the destination `Tensor`.

```c++
template <class SrcEngine, class SrcLayout,
          class DstEngine, class DstLayout>
CUTE_HOST_DEVICE
void
copy(Tensor<SrcEngine, SrcLayout> const& src,
     Tensor<DstEngine, DstLayout>      & dst);
```

The second takes those two parameters, plus a `Copy_Atom`.

```c++
template <class... CopyArgs,
          class SrcEngine, class SrcLayout,
          class DstEngine, class DstLayout>
CUTE_HOST_DEVICE
void
copy(Copy_Atom<CopyArgs...> const& copy_atom,
     Tensor<SrcEngine, SrcLayout> const& src,
     Tensor<DstEngine, DstLayout>      & dst);
```

The two-parameter `copy` overload picks a default implementation based only on the types of the two `Tensor` parameters. The `Copy_Atom` overload lets callers override that default by specifying a nondefault `copy` implementation.

Either the default implementation or the implementation selected by a `Copy_Atom` overload may use anywhere from none to all of the available parallelism, and may have a variety of synchronization semantics. The behavior depends on `copy`'s parameter types. Users are expected to figure this out based on their knowledge of the architecture on which they are running. (Developers often write a custom optimized kernel for each GPU architecture.)

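For example, a kernel might call either overload as sketched below. This is a minimal sketch: the function name `copy_example` is ours, and the choice of `UniversalCopy<float>` (CuTe's generic element-wise copy operation) is an illustrative assumption, not a recommendation.

```c++
#include <cute/tensor.hpp>

// A sketch of calling both copy overloads (copy_example is our name, not CuTe's).
template <class SrcEngine, class SrcLayout,
          class DstEngine, class DstLayout>
CUTE_HOST_DEVICE
void
copy_example(cute::Tensor<SrcEngine, SrcLayout> const& src,
             cute::Tensor<DstEngine, DstLayout>      & dst)
{
  using namespace cute;

  // Two-parameter overload: the implementation is chosen
  // from the tensor types alone.
  copy(src, dst);

  // Copy_Atom overload: explicitly request UniversalCopy
  // (assuming float elements here) instead of whatever
  // default the two-parameter overload would have picked.
  copy(Copy_Atom<UniversalCopy<float>, float>{}, src, dst);
}
```
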
The `copy` algorithm may be sequential per thread, or it may be parallel across some collection of threads (e.g., a block or cluster). If `copy` is parallel, then the collection of participating threads may need synchronization before any thread in the collection may assume that the copy operation has completed. For example, if the participating threads form a thread block, then users must invoke `__syncthreads()` or the Cooperative Groups equivalent before they may use the results of `copy`.

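A sketch of that pattern, assuming `tAgA` and `tAsA` are this thread's partitions of global- and shared-memory tensors and that every thread of the block participates:

```c++
#include <cute/tensor.hpp>

// Sketch: cooperative copy into shared memory, then a block-wide barrier.
template <class GTensor, class STensor>
__device__ void block_staged_copy(GTensor const& tAgA, STensor& tAsA)
{
  cute::copy(tAgA, tAsA);  // each thread copies its own partition
  __syncthreads();         // no thread may read tAsA before this barrier
}
```
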
The `copy` algorithm may use asynchronous copy instructions, such as `cp.async` or its C++ interface, `memcpy_async`. In that case, users will need to perform the additional synchronization appropriate to that underlying implementation before they may use the results of the `copy` algorithm. The CuTe GEMM tutorial example shows one such synchronization method. More optimized GEMM implementations use pipelining techniques to overlap asynchronous `copy` operations with other useful work.

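On SM80 and newer, one such method pairs `cp.async` with CuTe's `cp_async_fence`/`cp_async_wait` helpers. A sketch, assuming `tAgA` and `tAsA` are per-thread partitions and `copy_atom` wraps a `cp.async` operation:

```c++
#include <cute/tensor.hpp>

// Sketch: synchronize an asynchronous copy (SM80+).
template <class CopyAtom, class GTensor, class STensor>
__device__ void async_staged_copy(CopyAtom const& copy_atom,
                                  GTensor const& tAgA, STensor& tAsA)
{
  cute::copy(copy_atom, tAgA, tAsA);  // may lower to cp.async instructions
  cute::cp_async_fence();             // commit all cp.async issued so far
  cute::cp_async_wait<0>();           // wait for the committed copies to finish
  __syncthreads();                    // make the shared-memory writes visible
}
```
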
A simple example of a generic `copy` implementation for any two `Tensor`s looks like this.

```c++
template <class TA, class ALayout,
          class TB, class BLayout>
CUTE_HOST_DEVICE
void
copy(Tensor<TA, ALayout> const& src,  // Any logical shape
     Tensor<TB, BLayout>      & dst)  // Any logical shape
{
  for (int i = 0; i < size(src); ++i) {
    dst(i) = src(i);
  }
}
```

This generic `copy` algorithm addresses both `Tensor`s with 1-D logical coordinates, thus traversing both `Tensor`s in a logical column-major order. Some reasonable architecture-independent optimizations would include the following.

- If the two `Tensor`s have known memory spaces with optimized access instructions (like `cp.async`), then dispatch to the custom instruction.
- If the two `Tensor`s have static layouts and it can be proven that element vectorization is valid -- for example, four `ld.global.b32` loads can be combined into a single `ld.global.b128` -- then vectorize the source and destination tensors (see the sketch after this list).
- If possible, validate that the copy instruction to be used is appropriate for the source and destination tensors.

CuTe's optimized copy implementations can do all of these.
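
One way to express the vectorization above by hand is to reinterpret both tensors at a wider type with `recast`. A sketch, assuming float elements, layouts that divide evenly into groups of four contiguous elements, and 16-byte-aligned data (`recast` checks divisibility for static layouts, but alignment is the caller's responsibility):

```c++
#include <cute/tensor.hpp>

// Sketch: fuse four 32-bit accesses into one 128-bit access by recasting.
template <class SrcTensor, class DstTensor>
CUTE_HOST_DEVICE
void vectorized_copy(SrcTensor const& src, DstTensor& dst)
{
  using namespace cute;
  auto src128 = recast<uint128_t const>(src);
  auto dst128 = recast<uint128_t>(dst);
  copy(src128, dst128);  // now moves 128 bits per element
}
```
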
CuTe's `copy_if` algorithm lives in the same header as `copy`, `include/cute/algorithm/copy.hpp`. The algorithm takes source and destination `Tensor` parameters like `copy`, but it also takes a "predication `Tensor`" with the same shape as the input and output. Elements of the source `Tensor` are only copied if the corresponding predication `Tensor` element is nonzero. For details on why and how to use `copy_if`, please refer to the "predication" section of the tutorial.

The `gemm` algorithm takes three `Tensor`s, A, B, and C. What it does depends on the number of modes that its `Tensor` parameters have. We express these modes using letters.

- V indicates a "vector," a mode of independent elements.
- M and N indicate the number of rows and columns, respectively, of the matrix result C of the BLAS GEMM routine.
- K indicates the "reduction mode" of GEMM, that is, the mode along which GEMM sums. Please see the GEMM tutorial for details.

We list the modes of the input `Tensor`s A and B, and the output `Tensor` C, using the notation `(...) x (...) => (...)`. The two leftmost `(...)` describe A and B (in that order), and the `(...)` to the right of the `=>` describes C.

1. `(V) x (V) => (V)`. The element-wise product of vectors: `C_v += A_v * B_v`. Dispatches to FMA or MMA.
2. `(M) x (N) => (M,N)`. The outer product of vectors: `C_mn += A_m * B_n`. Dispatches to (4) with V=1.
3. `(M,K) x (N,K) => (M,N)`. The product of matrices: `C_mn += A_mk * B_nk`. Dispatches to (2) for each K.
4. `(V,M) x (V,N) => (V,M,N)`. The batched outer product of vectors: `C_vmn += A_vm * B_vn`. Optimizes for register reuse and dispatches to (1) for each M, N.
5. `(V,M,K) x (V,N,K) => (V,M,N)`. The batched product of matrices: `C_vmn += A_vmk * B_vnk`. Dispatches to (4) for each K.

Please refer to the GEMM tutorial for an overview of CuTe's convention for ordering the modes. For example, if K appears, it always appears rightmost ("outermost"). If V appears, it always appears leftmost ("innermost").
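
As a reference point, the semantics of case (3), `(M,K) x (N,K) => (M,N)`, can be written as plain loops. This is illustrative only; the real `gemm` dispatches toward FMA or MMA instructions rather than looping element by element.

```c++
#include <cute/tensor.hpp>

// Reference semantics only: C_mn += A_mk * B_nk, written as naive loops.
template <class TensorA, class TensorB, class TensorC>
CUTE_HOST_DEVICE
void gemm_reference(TensorA const& A,  // (M,K)
                    TensorB const& B,  // (N,K)
                    TensorC      & C)  // (M,N)
{
  using namespace cute;
  for (int m = 0; m < size<0>(A); ++m) {
    for (int n = 0; n < size<0>(B); ++n) {
      for (int k = 0; k < size<1>(A); ++k) {
        C(m,n) += A(m,k) * B(n,k);
      }
    }
  }
}
```
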
Just like with `copy`, CuTe's implementation of `gemm` uses its `Tensor` arguments' types to dispatch to an appropriately optimized implementation. Also like `copy`, `gemm` takes an optional `MMA_Atom` parameter that lets callers override the default FMA instruction that CuTe would select based on the `Tensor` arguments' types.

For more information on `MMA_Atom` and on specialization of `gemm` for different architectures, please refer to the MMA section of the tutorial.

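A sketch of that override, assuming `tCrA`, `tCrB`, and `tCrC` are register fragments already partitioned to match the atom's thread and value layouts (`SM80_16x8x16_F32F16F16F32_TN` is a real CuTe MMA operation; whether it fits a given problem is architecture- and layout-dependent):

```c++
#include <cute/tensor.hpp>

// Sketch: override gemm's default FMA with an explicit MMA_Atom.
template <class TensorA, class TensorB, class TensorC>
__device__ void gemm_with_atom(TensorA const& tCrA,
                               TensorB const& tCrB,
                               TensorC      & tCrC)
{
  using namespace cute;
  MMA_Atom<SM80_16x8x16_F32F16F16F32_TN> mma;  // an SM80 tensor-core MMA op
  gemm(mma, tCrA, tCrB, tCrC);                 // C += A * B via the chosen MMA
}
```
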
The `axpby` algorithm lives in the header file `include/cute/algorithm/axpby.hpp`. The name stands for "Alpha times X Plus Beta times Y": it assigns to the `Tensor` y the elementwise result of alpha times x plus beta times y, generalizing the original BLAS "AXPY" routine ("Alpha times X Plus Y").

The `fill` algorithm lives in the header file `include/cute/algorithm/fill.hpp`. It overwrites the elements of its `Tensor` output argument with a given scalar value.

The `clear` algorithm lives in the header file `include/cute/algorithm/clear.hpp`. It overwrites the elements of its `Tensor` output argument with zeros.

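Both are one-liners in practice; a small sketch (the tensor size is arbitrary):

```c++
#include <cute/tensor.hpp>

// Sketch: fill overwrites with a scalar; clear overwrites with zeros.
CUTE_HOST_DEVICE void fill_clear_example()
{
  using namespace cute;
  auto t = make_tensor<float>(Int<4>{});  // small stack-backed tensor
  fill(t, 3.14f);  // every element becomes 3.14f
  clear(t);        // every element becomes 0
}
```
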
CuTe provides other algorithms as well. Their header files can be found in the `include/cute/algorithm` directory.