Summary:
Pull Request resolved: pytorch#4336
This diff introduces a profiler that obtains the maximum and minimum bandwidth for reading unique addresses from 3D textures in each of its dimensions, using the following shader, where A is a 3D texture and B is a writeonly buffer.
The texel position `pos` depends on the dimension being benchmarked:
x : pos = ivec3(offset, 0, 0)
y : pos = ivec3(0, offset, 0)
z : pos = ivec3(0, 0, offset)
    void main() {
        vec4 sum = vec4(0);
        const uint workgroup_width = local_group_size * niter * ${NUNROLL};
        uint offset = (gl_WorkGroupID[0] * workgroup_width + gl_LocalInvocationID[0]) & addr_mask;
        int i = 0;
        for (; i < niter; ++i)
        {
            sum *= texelFetch(A, pos, 0);
            offset = (offset + local_group_size) & addr_mask;
            ...
            ...
            sum *= texelFetch(A, pos, 0);
            offset = (offset + local_group_size) & addr_mask;
        }
        vec4 zero = vec4(i>>31);
        B[gl_LocalInvocationID[0]] = sum + zero;
    }
The address mask controls how many unique addresses are accessed. For example, if the number of unique vectors we want to read is 3, the offset jumps between three unique addresses throughout the iterations, which gives us the bandwidth for that specific size of data. If the size of the unique data read is larger than the workgroup size, then each run has its own block of data to read, defined by the initial offset, which is computed from the workgroup ID and the local invocation ID.
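To make the masking behavior concrete, the following is a minimal host-side sketch in C++ that replays the shader's offset arithmetic for one workgroup and collects the distinct offsets it touches. It is not part of the profiler: `ProbeParams`, `texel_pos`, and `workgroup_offsets` are hypothetical names, and the sketch assumes `addr_mask` is a power of two minus one and that each loop iteration performs `NUNROLL` fetches.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <set>

// Hypothetical container for the shader's parameters (names mirror the shader).
struct ProbeParams {
  std::uint32_t local_group_size;
  std::uint32_t niter;
  std::uint32_t nunroll;    // stands in for ${NUNROLL}
  std::uint32_t addr_mask;  // assumption: (number of unique addresses) - 1
};

// Maps a linear offset to a texel position for the benchmarked dimension.
std::array<std::uint32_t, 3> texel_pos(char dim, std::uint32_t offset) {
  switch (dim) {
    case 'x': return {offset, 0u, 0u};
    case 'y': return {0u, offset, 0u};
    default:  return {0u, 0u, offset};  // 'z'
  }
}

// Replays the offset sequence of every invocation in one workgroup and
// returns the set of distinct offsets that the workgroup reads.
std::set<std::uint32_t> workgroup_offsets(const ProbeParams& p,
                                          std::uint32_t workgroup_id) {
  assert(p.local_group_size > 0);
  const std::uint32_t workgroup_width =
      p.local_group_size * p.niter * p.nunroll;
  std::set<std::uint32_t> seen;
  for (std::uint32_t local_id = 0; local_id < p.local_group_size; ++local_id) {
    std::uint32_t offset =
        (workgroup_id * workgroup_width + local_id) & p.addr_mask;
    // Assumption: niter iterations, each unrolled into nunroll fetches.
    for (std::uint32_t k = 0; k < p.niter * p.nunroll; ++k) {
      seen.insert(offset);
      offset = (offset + p.local_group_size) & p.addr_mask;
    }
  }
  return seen;
}
```

In this model, with `local_group_size = 64`, `niter = 4`, and `nunroll = 2`, a mask of 3 makes workgroup 0 touch 4 distinct offsets, while a mask of 1023 gives workgroup 0 the block 0..511 and workgroup 1 the block 512..1023.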
Finally, we make sure to use the `sum` and `i` variables so that the compiler's optimizer does not flatten the loops.
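The role of `i>>31` can be illustrated with a small C++ sketch, using a scalar `float` in place of `vec4`. The function name `guarded_result` is hypothetical. The point is that for any non-negative loop counter the shifted value is 0, so adding it leaves the result unchanged while still making the stored value depend on `i`.

```cpp
#include <cstdint>

// Hypothetical scalar model of the shader's last two statements.
// For a non-negative 32-bit counter, i >> 31 is 0, so the returned value
// equals sum, but the expression still reads the final loop counter.
float guarded_result(float sum, std::int32_t i) {
  const float zero = static_cast<float>(i >> 31);
  return sum + zero;
}
```

Because the compiler cannot simply discard `i` or `sum` without changing what is written to the output, the loop that produces them has to be kept.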
On a Samsung S22, the bandwidth behaves as follows for each of the dimensions:
{F1767497386}
Comparing the bandwidth for the X dimension against OpenCL results obtained through [ArchProbe](https://github.com/microsoft/ArchProbe), we can observe that, although the behavior is the same, Vulkan achieves higher bandwidth for most access sizes.
{F1767497972}
Comparing against buffers, we can observe that the texture bandwidth is similar to that of regular buffers, but still much lower than that of UBOs at small access sizes.
{F1767497707}
Reviewed By: jorgep31415
Differential Revision: D59980139