Name
NV_cooperative_matrix2
Name Strings
GL_NV_cooperative_matrix2
Contact
Jeff Bolz, NVIDIA (jbolz 'at' nvidia.com)
Contributors
Karthik Vaidyanathan, NVIDIA
Status
Complete
Version
Last Modified: October 21, 2024
Revision: 1
Dependencies
This extension can be applied to OpenGL GLSL versions 4.50
(#version 450) and higher.
This extension can be applied to OpenGL ES ESSL versions 3.20
(#version 320) and higher.
This extension depends on GL_KHR_cooperative_matrix.
Overview
This extension adds several new features building on the cooperative matrix
types added in GL_KHR_cooperative_matrix. The goal is to add and accelerate
features beyond just simple GEMM kernels, including adding support for type/use
conversions, reductions, per-element operations, and tensor addressing, and
also to improve usability and out-of-the-box performance by adding support
for more flexible matrix sizes, and workgroup scope matrices with
compiler-managed staging through shared memory.
Mapping to SPIR-V
-----------------
For informational purposes (non-normative), the following is an
expected way for an implementation to map GLSL constructs to SPIR-V
constructs:
tensorLayoutNV -> OpTypeTensorLayoutNV
createTensorLayoutNV -> OpCreateTensorLayoutNV
setTensorLayoutDimensionNV -> OpTensorLayoutSetDimensionNV
setTensorLayoutStrideNV -> OpTensorLayoutSetStrideNV
sliceTensorLayoutNV -> OpTensorLayoutSliceNV
setTensorLayoutClampValueNV -> OpTensorLayoutSetClampValueNV
setTensorLayoutBlockSizeNV -> OpTensorLayoutSetBlockSizeNV
tensorViewNV -> OpTypeTensorViewNV
createTensorViewNV -> OpCreateTensorViewNV
setTensorViewDimensionsNV -> OpTensorViewSetDimensionNV
setTensorViewStrideNV -> OpTensorViewSetStrideNV
setTensorViewClipNV -> OpTensorViewSetClipNV
coopMatLoadTensorNV -> OpCooperativeMatrixLoadTensorNV
coopMatStoreTensorNV -> OpCooperativeMatrixStoreTensorNV
coopMatReduceNV -> OpCooperativeMatrixReduceNV
coopmat constructor changing component type or Use -> OpCooperativeMatrixConvertNV
coopMatPerElementNV -> OpCooperativeMatrixPerElementOpNV
coopMatTransposeNV -> OpCooperativeMatrixTransposeNV
Modifications to the OpenGL Shading Language Specification, Version 4.60
Including the following line in a shader can be used to control the
language features described in this extension:
#extension GL_NV_cooperative_matrix2 : <behavior>
where <behavior> is as specified in section 3.3.
New preprocessor #defines are added to the OpenGL Shading Language:
#define GL_NV_cooperative_matrix2 1
Update Section 5.4.X, Cooperative Matrix Type Constructors
Cooperative matrices can be constructed from another cooperative matrix
type with the same scope, number of rows, and number of columns, and where
the use of the source value is gl_MatrixUseAccumulator and the use of the
result type is gl_MatrixUseA or gl_MatrixUseB. This performs a
component-wise type conversion to initialize the new cooperative matrix.
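    For illustration only (non-normative), a minimal sketch of such a
    conversion, assuming M and N are compile-time constants and acc holds an
    accumulator result:

        // Non-normative sketch: reuse an accumulator result as the A operand
        // of a following multiply, converting the component type
        // (float32_t -> float16_t) and the use (Accumulator -> A) in one
        // constructor call.
        coopmat<float32_t, gl_ScopeWorkgroup, M, N, gl_MatrixUseAccumulator> acc;
        coopmat<float16_t, gl_ScopeWorkgroup, M, N, gl_MatrixUseA> nextA =
            coopmat<float16_t, gl_ScopeWorkgroup, M, N, gl_MatrixUseA>(acc);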
Add a new Section 4.1.X, Tensor Layout Types
Tensor layout and tensor view types are representations of the mapping
between matrix coordinates and tensor memory layout. They each have a
number of dimensions in the range [1,5], with dimension 0 being the
outermost dimension and the last dimension being the innermost. These types
have the following logical state:
    struct tensorLayoutNV<uint32_t Dim,
                          ClampMode Mode = gl_CooperativeMatrixClampModeUndefinedNV>
    {
        static constexpr uint32_t LDim = Dim;
        static constexpr ClampMode clampMode = Mode;
        uint32_t blockSize[LDim];
        uint32_t layoutDimension[LDim];
        uint32_t stride[LDim];
        int32_t offset[LDim];
        uint32_t span[LDim];
        uint32_t clampValue;
    };

    struct tensorViewNV<uint Dim, bool hasDimensions, uint32_t p0, ..., uint32_t p<Dim-1>>
    {
        static constexpr uint32_t VDim = Dim;
        static constexpr bool hasDim = hasDimensions;
        static constexpr uint32_t permutation[VDim] = {p0, ..., p<Dim-1>};
        uint32_t viewDimension[VDim];
        uint32_t viewStride[VDim];
        uint32_t clipRowOffset, clipRowSpan, clipColOffset, clipColSpan;
    };
A tensor layout represents the layout of values in memory (number of
dimensions and size), along with a region being accessed (offset and span).
    [Figure: A 2D tensor layout of size layoutDimension0 x layoutDimension1,
    with a slice of size span0 x span1 at offset (offset0, offset1)
    selecting a region within it.]
A tensor view allows reinterpreting the dimensions of the region being
accessed, including changing the number of dimensions, reordering the
dimensions as they are loaded or stored, and clipping the region of the
matrix that is loaded or stored. Often the span will have the
same number of elements as the matrix, but in some more advanced uses
that may not be the case.
Loads and stores can either use just a tensor layout, or a tensor layout and
tensor view. The addressing starts by treating the matrix itself as a 2D
"view" and mapping the (row,col) coordinate to a 1D index. If there is only a
tensor layout parameter, then that 1D index is mapped to an N-D coordinate
within the slice. If there is both a tensor layout and a tensor view, then
the 1D index is first mapped to a coordinate within the view, the
coordinate components may be permuted, and the result is converted back to
a 1D index, which is then run through the tensor layout addressing
calculation.
The tensor view dimensions and stride can be used to do more complex
addressing calculations. If the tensor view type has "hasDimensions" false,
then the dimensions of the tensor layout span are used instead.
The tensor view "clip" region restricts which elements of the matrix are
loaded or stored, and also affects the shape of the implicit 2D "view".
Unlike some other ML APIs, tensor layouts and views only describe
addressing calculations and never involve making copies of tensors. For
this reason, the functionality is slightly more limited (e.g. there's no
way to slice, then permute, then slice again).
See Section 8.X, Cooperative Matrix Functions for more details on the
addressing calculations. While these calculations may look expensive in
their full generality, certain calculations can be skipped when they're
not needed, and the common cases should be quite efficient.
Tensor layouts are created by calling:
    tensorLayoutNV<Dim, Mode> createTensorLayoutNV(uint32_t Dim,
                                                   uint32_t Mode = gl_CooperativeMatrixClampModeUndefinedNV);
The layoutDimension, stride, span, and offset elements are initialized to
zero. The blockSize elements are initialized to one. clampValue is
initialized to zero. ClampMode can take the following values:
const int gl_CooperativeMatrixClampModeUndefinedNV = 0;
const int gl_CooperativeMatrixClampModeConstantNV = 1;
const int gl_CooperativeMatrixClampModeClampToEdgeNV = 2;
const int gl_CooperativeMatrixClampModeRepeatNV = 3;
const int gl_CooperativeMatrixClampModeMirrorRepeatNV = 4;
If clampMode is Undefined, then out of bounds accesses have undefined
behavior. If clampMode is Constant, then out of bounds loads return
the bit pattern in the LSBs of the layout's clampValue and out of bounds
stores are dropped. If clampMode is ClampToEdge, Repeat, or MirrorRepeat,
out of
bounds coordinates are clamped, repeated or reflected as described in
Section 8.X, Cooperative Matrix Functions.
The layout's block size can be set by calling
tensorLayoutNV<N, ...> setTensorLayoutBlockSizeNV(tensorLayoutNV<N, ...> t, uint32_t blockSize0, ..., uint32_t blockSize<N-1>);
The returned tensorLayoutNV is initialized to a copy of _t_. The blockSize
elements are set to the blockSize parameters. The blockSize should be set
before the dimensions, because it affects the implicit stride calculation.
When the blockSize is not 1, the strides are considered to be in blocks
rather than in elements.
The layout's dimensions and span can be initialized by calling:
tensorLayoutNV<N, ...> setTensorLayoutDimensionNV(tensorLayoutNV<N, ...> t, uint32_t dim0, ..., uint32_t dim<N-1>);
The returned tensorLayoutNV is initialized to a copy of _t_. The
layoutDimension and span elements are set to the dimension parameters,
in order. offset elements are set to zero. stride[i] is set as follows:
        uint32_t s = 1;
        for (int32_t i = N-1; i >= 0; --i) {
            stride[i] = s;
            s *= ceiling(dimensions[i] / blockSize[i]);
        }
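    For example (non-normative), with the default block size of one in each
    dimension, a three-dimensional layout gets the usual dense strides (H, W,
    and C are placeholders):

        tensorLayoutNV<3> t = createTensorLayoutNV(3);
        t = setTensorLayoutDimensionNV(t, H, W, C);
        // Resulting state:
        //   layoutDimension = {H, W, C}, span = {H, W, C}, offset = {0, 0, 0}
        //   stride          = {W*C, C, 1}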
The layout's stride can be set by calling:
tensorLayoutNV<N, ...> setTensorLayoutStrideNV(tensorLayoutNV<N, ...> t, uint32_t s0, ..., uint32_t s<N-1>);
The returned tensorLayoutNV is initialized to a copy of _t_. The
stride elements are set to the _s_ parameters, in order. s<i> must be
at least s<i+1>*ceiling(dim<i+1> / t.blockSize[i+1]).
The offset and span members can be updated by slicing the tensor layout
by calling:
tensorLayoutNV<N, ...> sliceTensorLayoutNV(tensorLayoutNV<N, ...> t, int32_t offset0, uint32_t span0, ..., int32_t offset<N-1>, uint32_t span<N-1>);
The returned tensorLayoutNV is initialized to a copy of _t_. The offset
elements have the offset parameters added to them, and the span elements
are set to the span parameters.
The clamp value of a tensor layout can be set by calling:
tensorLayoutNV<...> setTensorLayoutClampValueNV(tensorLayoutNV<...> t, uint32_t value);
The returned tensorLayoutNV is initialized to a copy of _t_, and clampValue
is set to _value_.
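    For example (non-normative), the setters above can be chained to describe
    a clamped 2D region; rowCount, colCount, rowBase, colBase, M, and N are
    placeholders:

        // Non-normative sketch: a rowCount x colCount tensor with constant
        // clamping; out-of-bounds loads return the bit pattern 0.
        tensorLayoutNV<2, gl_CooperativeMatrixClampModeConstantNV> t =
            createTensorLayoutNV(2, gl_CooperativeMatrixClampModeConstantNV);
        t = setTensorLayoutDimensionNV(t, rowCount, colCount);
        t = setTensorLayoutClampValueNV(t, 0);
        // Select an M x N region starting at (rowBase, colBase); the region
        // may extend past the tensor edge.
        t = sliceTensorLayoutNV(t, rowBase, M, colBase, N);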
Tensor views are created by calling:
    tensorViewNV<Dim, hasDimensions, p0, ..., p<Dim-1>>
        createTensorViewNV(uint32_t Dim,
                           bool hasDimensions = false,
                           uint32_t p0 = 0,
                           ...,
                           uint32_t p<Dim-1> = Dim-1);
The viewDimension and viewStride elements are initialized to zero. The clip values
are initialized to offsets of 0, spans of 0xFFFFFFFF.
The view's dimensions can be initialized by calling:
tensorViewNV<N> setTensorViewDimensionsNV(tensorViewNV<N> v, uint32_t dim0, ..., uint32_t dim<N-1>);
The returned tensorViewNV is initialized to a copy of _v_. The viewDimension
elements are initialized to the dimension parameters. viewStride[i] is set
to the product of dim<i+1> to dim<N-1> (and viewStride[N-1] is set to 1).
The view's stride can be set by calling:
tensorViewNV<N, ...> setTensorViewStrideNV(tensorViewNV<N, ...> v, uint32_t s0, ..., uint32_t s<N-1>);
The returned tensorViewNV is initialized to a copy of _v_. The
viewStride elements are set to the _s_ parameters, in order.
The clip values can be updated by calling:
tensorViewNV<N> setTensorViewClipNV(tensorViewNV<N> v, uint clipRowOffset, uint clipRowSpan, uint clipColOffset, uint clipColSpan);
The returned tensorViewNV is initialized to a copy of _v_. The clip elements
are set to the corresponding parameters.
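    For example (non-normative), a view that swaps the row and column order
    of the slice and restricts the load to the top-left validRows x validCols
    elements of the matrix; validRows and validCols are placeholders, and t,
    mat, input.buf, and elementoffset are as in the Examples section below:

        tensorViewNV<2, false, 1, 0> v = createTensorViewNV(2, false, 1, 0);
        v = setTensorViewClipNV(v, 0, validRows, 0, validCols);
        coopMatLoadTensorNV(mat, input.buf, elementoffset, t, v);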
Tensor layouts and views are used in cooperative matrix load and store
functions to determine address calculations and clamping, as described
in Section 8.X, Cooperative Matrix Functions.
Modify Section 5.9, Expressions
Conversions are allowed between cooperative matrix types assuming the
scope, row size, column size, and use are the same, or the use of the
source is gl_MatrixUseAccumulator and the use of the result type is
gl_MatrixUseA or gl_MatrixUseB.
Modify Section 8.X, Cooperative Matrix Functions
Add the following to the list of cooperative matrix load and store
functions:
void coopMatLoadTensorNV(inout coopmat m, volatile coherent T[] buf, uint elementOffset, tensorLayoutNV layout);
void coopMatLoadTensorNV(inout coopmat m, volatile coherent T[] buf, uint elementOffset, tensorLayoutNV layout, tensorViewNV view);
void coopMatLoadTensorNV(inout coopmat m, volatile coherent T[] buf, uint elementOffset, tensorLayoutNV layout, T2 decodeFunc);
void coopMatLoadTensorNV(inout coopmat m, volatile coherent T[] buf, uint elementOffset, tensorLayoutNV layout, tensorViewNV view, T2 decodeFunc);
Description: Load a cooperative matrix from buf.
void coopMatStoreTensorNV(coopmat m, volatile coherent out T[] buf, uint elementOffset, tensorLayoutNV layout);
void coopMatStoreTensorNV(coopmat m, volatile coherent out T[] buf, uint elementOffset, tensorLayoutNV layout, tensorViewNV view);
Description: Store a cooperative matrix to buf.
where T can be any type and T2 is a decode function type as described below.
For load and store functions with no _view_ parameter, an element index
is computed according to the matrixCoordToTensorElement function for each
(row,col) of the matrix _m_, where _m_ has M rows and N columns:
    constexpr uint32_t MAX_DIM = 5;
    using Coord = array<uint32_t, MAX_DIM>;

    uint32_t matrixCoordToLinear(tensorLayoutNV t, uint32_t row, uint32_t col, uint32_t N)
    {
        uint32_t index = row * N + col;
        return index;
    }

    Coord linearToSpanCoord(tensorLayoutNV t, uint32_t index)
    {
        Coord spanCoord {};
        for (int32_t dim = t.LDim-1; dim >= 0; --dim) {
            spanCoord[dim] = index % t.span[dim];
            index /= t.span[dim];
        }
        return spanCoord;
    }
    auto spanCoordToTensorCoord(tensorLayoutNV t, Coord spanCoord)
    {
        Coord blockCoord {};
        Coord coordInBlock {};
        for (uint32_t dim = 0; dim <= t.LDim-1; ++dim) {
            int32_t c = spanCoord[dim] + t.offset[dim];
            if (c < 0 || c >= t.layoutDimension[dim]) {
                ClampMode clampMode = t.clampMode;
                // For stores, other than Undefined, everything is treated as "discard"
                if (operation is a store && clampMode != Undefined) {
                    clampMode = Constant;
                }
                // remainders are computed as defined in OpSMod
                switch (clampMode) {
                case Undefined:
                    undefined behavior;
                case Constant:
                    For load, set result value to t.clampValue;
                    For store, discard the store;
                    terminate index calculation;
                case ClampToEdge:
                    c = min(max(c, 0), t.layoutDimension[dim]-1);
                    break;
                case Repeat:
                    c = c % t.layoutDimension[dim];
                    break;
                case MirrorRepeat:
                    c = c % (2*t.layoutDimension[dim]-2);
                    c = (c >= t.layoutDimension[dim]) ? (2*t.layoutDimension[dim]-2-c) : c;
                    break;
                }
            }
            coordInBlock[dim] = c % t.blockSize[dim];
            blockCoord[dim] = c / t.blockSize[dim];
        }
        return tuple(blockCoord, coordInBlock);
    }
    uint32_t tensorCoordToLinear(tensorLayoutNV t, Coord blockCoord)
    {
        uint32_t index = 0;
        for (uint32_t dim = 0; dim <= t.LDim-1; ++dim) {
            index += blockCoord[dim] * t.stride[dim];
        }
        return index;
    }

    // map (row,col) -> linear index in span -> span coordinate ->
    // tensor coordinate -> linear index in tensor
    uint32_t matrixCoordToTensorElement(tensorLayoutNV t, uint32_t row, uint32_t col, uint32_t N)
    {
        uint32_t index = matrixCoordToLinear(t, row, col, N);
        Coord spanCoord = linearToSpanCoord(t, index);
        Coord blockCoord;
        Coord coordInBlock;
        tie(blockCoord, coordInBlock) = spanCoordToTensorCoord(t, spanCoord);
        index = tensorCoordToLinear(t, blockCoord);
        return index;
    }
This index is then multiplied by the size of the component type of _m_ and
treated as a byte offset from &buf[elementOffset]. The matrix element is
loaded from or stored to this location. If the Load function has a decode
function parameter, then the blockCoord and coordInBlock arrays are passed
to it as parameters.
_elementOffset_ multiplied by the size of T must be a multiple of 16B. But
the elements selected by the tensor layout and view need not be so aligned.
For load and store functions with a _view_ parameter, an element index
is computed according to the matrixCoordToTensorElementWithView function
for each (row,col) of the matrix _m_, where _m_ has M rows and N columns:
    uint32_t matrixCoordToLinear(tensorLayoutNV t, tensorViewNV v, uint32_t row, uint32_t col, uint32_t N)
    {
        if (row < v.clipRowOffset ||
            row >= v.clipRowOffset + v.clipRowSpan ||
            col < v.clipColOffset ||
            col >= v.clipColOffset + v.clipColSpan) {
            Load or store is skipped. For load, the matrix element is unmodified.
            terminate index calculation;
        }
        row -= v.clipRowOffset;
        col -= v.clipColOffset;
        uint32_t width = min(N, v.clipColSpan);
        uint32_t index = row * width + col;
        return index;
    }

    Coord linearToViewCoord(tensorLayoutNV t, tensorViewNV v, uint32_t index)
    {
        auto &dimensions = v.hasDimensions ? v.viewDimension : t.span;
        Coord viewCoord {};
        for (int32_t dim = v.VDim-1; dim >= 0; --dim) {
            uint32_t i = v.permutation[dim];
            viewCoord[i] = index % dimensions[i];
            index /= dimensions[i];
        }
        return viewCoord;
    }

    uint32_t viewCoordToLinear(tensorLayoutNV t, tensorViewNV v, Coord viewCoord)
    {
        Coord stride {};
        if (v.hasDimensions) {
            stride = v.viewStride;
        } else {
            // set stride to match t.span
            stride[v.VDim-1] = 1;
            for (int32_t dim = v.VDim-2; dim >= 0; --dim) {
                stride[dim] = stride[dim+1] * t.span[dim+1];
            }
        }
        uint32_t index = 0;
        for (int32_t dim = v.VDim-1; dim >= 0; --dim) {
            index += viewCoord[dim] * stride[dim];
        }
        return index;
    }
    // map (row,col) -> linear index in view -> view coordinate -> linear index in span ->
    // span coordinate -> tensor coordinate -> linear index in tensor
    uint32_t matrixCoordToTensorElementWithView(tensorLayoutNV t, tensorViewNV v, uint32_t row, uint32_t col, uint32_t N)
    {
        uint32_t index = matrixCoordToLinear(t, v, row, col, N);
        Coord viewCoord = linearToViewCoord(t, v, index);
        index = viewCoordToLinear(t, v, viewCoord);
        Coord spanCoord = linearToSpanCoord(t, index);
        Coord blockCoord;
        Coord coordInBlock;
        tie(blockCoord, coordInBlock) = spanCoordToTensorCoord(t, spanCoord);
        index = tensorCoordToLinear(t, blockCoord);
        return index;
    }
The final result is then multiplied by the size of the component type of
_m_ and treated as a byte offset from &buf[elementOffset]. The matrix
element is loaded from or stored to this location.
For Load functions with a _decodeFunc_ parameter, rather than loading a
value, the _decodeFunc_ is invoked for each matrix element at least once.
_decodeFunc_ must be a function whose return type matches the component
type of _result_. The first parameter must be a buffer_reference type,
and the parameter is filled with a pointer computed by multiplying the index
returned by matrixCoordToTensorElement(WithView) by the size of the struct the buffer_reference
points to. The second and third parameters must each be an array of
uint32_t whose dimension matches the tensor dimension. The second parameter
is filled with the blockCoord, and the third parameter with the
coordInBlock, for the matrix element being decoded. All parameter types
must be qualified as 'const in'. The return value is stored in the
corresponding element of _result_. _buf_ must point to buffer memory
(either an SSBO or buffer_reference).
In any function used as a _decodeFunc_ parameter, and any function
called directly or indirectly by those functions, tangled instructions
(as defined in the SPIR-V spec) are not allowed.
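    For example (non-normative), a decode function for a tensor whose blocks
    each hold sixteen 8-bit values that are dequantized to float16_t on load.
    The PackedBlock type and the 4x4 block size are assumptions for
    illustration; 8-bit and 16-bit types and the pointer type require the
    corresponding GL_EXT_shader_explicit_arithmetic_types features and
    GL_EXT_buffer_reference:

        layout(buffer_reference, std430) buffer PackedBlock {
            uint8_t values[16];   // one 4x4 block of quantized values
        };

        float16_t decodeU8(const in PackedBlock b, const in uint32_t blockCoord[2],
                           const in uint32_t coordInBlock[2])
        {
            // Dequantize one element of the block addressed by blockCoord.
            uint8_t q = b.values[coordInBlock[0] * 4 + coordInBlock[1]];
            return float16_t(q) * float16_t(1.0 / 255.0);
        }

        // Used with a 2D tensor layout whose block size is 4x4:
        //   coopMatLoadTensorNV(mat, input.buf, elementoffset, t, decodeU8);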
Elements of a matrix can have a reduction operation applied by calling:
void coopMatReduceNV(out coopmat result, coopmat m, int reduceMask, T combineOp);
Description: Reduce the values in each row, column, 2x2, or entire matrix
by applying the combineOp function to combine values of the elements. The
result matrix has the reduced values in all the corresponding elements of
the matrix. _m_ must have a floating-point component type.
_m_ and _result_ must each have use of gl_MatrixUseAccumulator.
If reduceMask includes gl_CooperativeMatrixReduce2x2, it must not include
gl_CooperativeMatrixReduceRow or gl_CooperativeMatrixReduceColumn.
If reduceMask includes gl_CooperativeMatrixReduce2x2, the dimensions of
_result_ must be half the dimensions of _m_.
If reduceMask equals gl_CooperativeMatrixReduceRow, then elements of each
row are combined and the resulting value is assigned to all elements of the
corresponding row of the result, and _result_ must have the same number of
rows as _m_.
If reduceMask equals gl_CooperativeMatrixReduceColumn, then elements of each
column are combined and the resulting value is assigned to all elements of
the corresponding column of the result, and _result_ must have the same number
of columns as _m_.
If reduceMask equals gl_CooperativeMatrixReduceRowAndColumn, all elements
are combined and the resulting value is assigned to all elements of the result,
and _result_ can have any number of rows and columns.
_combineOp_ must be the identifier of a function. It must have two
parameters, each qualified as 'const in', with the same type as the
component type of _m_. It will be called on implementation-dependent
elements of _m_ or combinations thereof, to compute the combination of
all elements in the row, column, 2x2, or entire matrix.
gl_CooperativeMatrixReduce* are constant integer values which can be used for
the reduceMask parameter in coopMatReduceNV.
const int gl_CooperativeMatrixReduceRowNV = 0x1;
const int gl_CooperativeMatrixReduceColumnNV = 0x2;
const int gl_CooperativeMatrixReduceRowAndColumnNV = 0x3;
const int gl_CooperativeMatrixReduce2x2NV = 0x4;
Note that sum-reductions can be efficiently performed on UseA and UseB
matrices by multiplying by a matrix filled with the value one.
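    For example (non-normative), a row-wise maximum reduction; M and N are
    placeholders and acc holds the values to reduce:

        float16_t maxCombine(const in float16_t a, const in float16_t b)
        {
            return max(a, b);
        }

        coopmat<float16_t, gl_ScopeWorkgroup, M, N, gl_MatrixUseAccumulator> acc;
        coopmat<float16_t, gl_ScopeWorkgroup, M, N, gl_MatrixUseAccumulator> rowMax;
        // Every element of row i of rowMax receives the maximum of row i of acc.
        coopMatReduceNV(rowMax, acc, gl_CooperativeMatrixReduceRowNV, maxCombine);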
An operation can be performed on each element of a matrix by calling:
void coopMatPerElementNV(out coopmat result, coopmat m, T elemOp, ...);
_elemOp_ must be the identifier of a user-defined function. All parameter
types must be qualified as 'const in'. The first two parameters of elemOp
must be uint32_t values which are passed the row and column number of the
element being operated on. The third parameter must have type matching the
component type of _m_, and is passed the value of the element being
operated on. The number of additional parameters and their types must match
the signature of _elemOp_, with any additional cooperative matrix
parameters having component type that matches the type of the corresponding
formal parameter. Any additional cooperative matrix parameters must be the
same type as _m_, and the corresponding element of that parameter is passed
to the function. _result_ must be the same type as _m_, and the return type
of _elemOp_ must match the component type of _result_.
coopMatPerElementNV treats the cooperative matrices as composite types, and
invokes _elemOp_ at least once per element of the composite, with the
return values of the function forming the corresponding elements of the
return value of coopMatPerElementNV. The calls to _elemOp_
are considered to be unordered against each other.
In any function used as an _elemOp_ parameter, and any function
called directly or indirectly by those functions, tangled instructions
(as defined in the SPIR-V spec) are not allowed.
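    For example (non-normative), an element-wise bias-add and ReLU, where the
    bias is provided as a second matrix of the same type; M and N are
    placeholders:

        float16_t biasRelu(const in uint32_t row, const in uint32_t col,
                           const in float16_t x, const in float16_t b)
        {
            return max(x + b, float16_t(0.0));
        }

        coopmat<float16_t, gl_ScopeWorkgroup, M, N, gl_MatrixUseAccumulator> acc, bias, activated;
        // biasRelu is invoked per element; the element of bias at the same
        // (row,col) is passed as the extra parameter.
        coopMatPerElementNV(activated, acc, biasRelu, bias);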
A gl_MatrixUseAccumulator matrix can be transposed to a gl_MatrixUseB
matrix by calling:
void coopMatTransposeNV(out coopmat result, coopmat m);
_m_ must have use of gl_MatrixUseAccumulator, _result_ must have use of
gl_MatrixUseB. _m_ and _result_ must have the same scope and component
type, and the number of rows of _m_ must match the number of columns of
_result_ and the number of columns of _m_ must match the number of rows of
_result_. _result_ is filled with the transpose of _m_.
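    For example (non-normative), transposing an M x N accumulator into an
    N x M gl_MatrixUseB matrix (M and N are placeholders):

        coopmat<float16_t, gl_ScopeWorkgroup, M, N, gl_MatrixUseAccumulator> acc;
        coopmat<float16_t, gl_ScopeWorkgroup, N, M, gl_MatrixUseB> bT;
        coopMatTransposeNV(bT, acc);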
Modify Section 9, Shading Language Grammar for Core Profile
Add to tokens list:
TENSORLAYOUTNV
TENSORVIEWNV
FUNCTION
Add to type_specifier_nonarray:
TENSORLAYOUTNV
TENSORVIEWNV
FUNCTION
Examples
Load from row-major matrix:

    // Replaces coopMatLoad(mat, input.buf, elementoffset + row*NumCols + col, NumCols, gl_CooperativeMatrixLayoutRowMajor);
    coopmat<float16_t, gl_ScopeWorkgroup, M, N, gl_MatrixUseA> mat;
    tensorLayoutNV<2> t = createTensorLayoutNV(2);
    t = setTensorLayoutDimensionNV(t, NumRows, NumCols);
    t = sliceTensorLayoutNV(t, row, M, col, N);
    coopMatLoadTensorNV(mat, input.buf, elementoffset, t);
Load from col-major matrix:

    // Replaces coopMatLoad(mat, input.buf, elementoffset + col*NumRows + row, NumRows, gl_CooperativeMatrixLayoutColumnMajor);
    coopmat<float16_t, gl_ScopeWorkgroup, M, N, gl_MatrixUseB> mat;
    tensorLayoutNV<2> t = createTensorLayoutNV(2);
    // columns are the outermost dimension
    t = setTensorLayoutDimensionNV(t, NumCols, NumRows);
    t = sliceTensorLayoutNV(t, col, N, row, M);
    // Create a view matching the tensor's dimensions, permuting
    // dimensions 0 <-> 1 to swap row/col indices and to match the matrix
    // layout
    coopMatLoadTensorNV(mat, input.buf, elementoffset, t, createTensorViewNV(2, false, 1, 0));
Load an 8x8 patch, where each element of the patch has 32 channels:

    coopmat<float16_t, gl_ScopeWorkgroup, 8*8, 32, gl_MatrixUseA> mat;
    // HWC layout
    tensorLayoutNV<3> t = createTensorLayoutNV(3);
    t = setTensorLayoutDimensionNV(t, NumRows, NumCols, 32);
    // Slice an 8x8 32 channel region
    t = sliceTensorLayoutNV(t, row, 8, col, 8, 0, 32);
    coopMatLoadTensorNV(mat, input.buf, elementoffset, t);
Perform 2x2 space_to_depth transform while loading from memory:

    coopmat<float16_t, gl_ScopeWorkgroup, H/2*W/2, 4*NumCh, gl_MatrixUseAccumulator> mat;
    // Memory layout is HWC
    tensorLayoutNV<3> t = createTensorLayoutNV(3);
    t = setTensorLayoutDimensionNV(t, H, W, NumCh);
    // No slicing, we're loading the whole matrix
    // View of tensor is H/2 x 2 x W/2 x 2 x NumCh, and is permuted to
    // H/2 x W/2 x 2 x 2 x NumCh during the load
    tensorViewNV<5, true, 0, 2, 1, 3, 4> v = createTensorViewNV(5, true, 0, 2, 1, 3, 4);
    v = setTensorViewDimensionsNV(v, H/2, 2, W/2, 2, NumCh);
    coopMatLoadTensorNV(mat, input.buf, elementoffset, t, v);
Issues
(1) Alignment rules?
RESOLVED: The base of the tensor (buf/elementOffset) passed into
coopMatLoadTensorNV or coopMatStoreTensorNV must be 16B aligned.
The offset/span don't have any alignment requirements. The compiler
can detect greater alignment for those when it's available.
(2) Should we replace _element_ with _byteOffset_ in the new load/store
functions?
RESOLVED: While byte offsets are often desirable, leaving this as element
offset to match GL_KHR_cooperative_matrix.
(3) What matrix dimensions are supported?
RESOLVED: The API extension should provide a mechanism to query supported
matrix sizes.
(4) How can we support loading from matrices encoded using block-based
quantization?
RESOLVED:
Treat addressing calculations similarly to block-compressed
textures, i.e. "element size" is the size of the block, and strides are
implicitly shrunken by the block dimensions.
        // Assume the tensor in memory is logically NumRows x NumCols, and each
        // block stores information for a block of size BlockRows x BlockCols
        // in a struct of type S.
        // We want to load M x N elements while converting to float16_t.
        coopmat<float16_t, gl_ScopeWorkgroup, M, N, gl_MatrixUseA> mat;

        tensorLayoutNV<2> t = createTensorLayoutNV(2);
        t = setTensorLayoutBlockSizeNV(t, BlockRows, BlockCols);
        t = setTensorLayoutDimensionNV(t, NumRows, NumCols);
        // setTensorLayoutDimensionNV implicitly sets strides to
        //   stride[LDim-1] = 1
        //   stride[LDim-2] = ceiling(dimensions[LDim-1] / blockSize[LDim-1]) * stride[LDim-1];
        //   stride[LDim-3] = ceiling(dimensions[LDim-2] / blockSize[LDim-2]) * stride[LDim-2];
        //   ...
        t = sliceTensorLayoutNV(t, row, M, col, N);

        float16_t myDecodeFunc(/*buffer reference type pointing to S*/ Sref s, uint32_t blockCoord[2], uint32_t coordInBlock[2]);

        coopMatLoadTensorNV(mat, input.buf, elementoffset, t, myDecodeFunc);
Tensor layout coordinate and stride calculations work in block
coordinates:
        uint32_t coordInBlock[t.LDim] {};
        index = 0;
        for (uint32_t dim = 0; dim <= t.LDim-1; ++dim) {
            int32_t c = coord[dim] + t.offset[dim];
            // bounds checking logic (not shown)
            /*--- block coordinate calculation code ---*/
            coordInBlock[dim] = c % t.blockSize[dim];
            c /= t.blockSize[dim];
            /*--- end block coordinate calculation code ---*/
            index += c * t.stride[dim];
        }
The index, rather than being multiplied by the size of the matrix
component type, is multiplied by the size of the structure that is the
pointee type of the first function parameter in the decode function passed
to Load. The blockCoord and coordInBlock values are also passed to
the decode function. The decode function is called for each matrix
element, with a reference to memory containing the block data and
the block-relative coordinates passed in. The return type must match
the component type of the matrix, and the return value is stored in
the corresponding element of the matrix.
Lacking a way to express pointers to shared memory, this is limited
to buffer and buffer_reference inputs.
Revision History
Revision 1
- Initial revision.