Add 3D Texture Bandwidth metric #4336

Closed

Commits on Jul 30, 2024

  1. Add 3D Texture Bandwidth metric (pytorch#4336)

    Summary:
    Pull Request resolved: pytorch#4336
    
    This diff introduces a profiler that measures the maximum and minimum bandwidth for reading unique addresses from a 3D texture along each of its dimensions, using the following shader, where A is a 3D texture and B is a writeonly buffer.
    
    The calculation of the texel position depends on the dimension being benchmarked:
    
    x : pos = ivec3(offset, 0, 0)
    y : pos = ivec3(0, offset, 0)
    z : pos = ivec3(0, 0, offset)
    
      void main() {
        vec4 sum = vec4(0);
        const uint workgroup_width = local_group_size * niter * ${NUNROLL};
        uint offset = (gl_WorkGroupID[0] * workgroup_width  + gl_LocalInvocationID[0]) & addr_mask;
    
        int i = 0;
        for (; i < niter; ++i)
        {
            sum *= texelFetch(A, pos, 0);
            offset = (offset + local_group_size) & addr_mask;
            ...
            ...
            sum *= texelFetch(A, pos, 0);
            offset = (offset + local_group_size) & addr_mask;
        }
    
        vec4 zero = vec4(i>>31);
    
        B[gl_LocalInvocationID[0]] = sum + zero;
      }
    
    The address mask allows us to control how many unique addresses we are accessing. If the number of unique vectors we want to read is 3, the offset will jump between three unique addresses throughout the iterations, giving us the bandwidth for that specific size of data. If the size of the unique data read is larger than the work group size, then each run will have its own block of data to read, defined by the initial offset calculation, where the offset is obtained through the workgroup ID and the local invocation ID.
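
    The effect of the address mask can be checked on the CPU by replaying the shader's offset arithmetic. The following is a minimal Python sketch, not part of this diff: the helper names (`visited_offsets`, `unique_addresses`, `texel_pos`) are hypothetical, and it assumes that `addr_mask` equals the number of unique vectors minus one, with that number being a power of two.

```python
def visited_offsets(workgroup_id, local_id, local_group_size, niter, nunroll, addr_mask):
    """Replay the shader's offset arithmetic for a single invocation."""
    workgroup_width = local_group_size * niter * nunroll
    offset = (workgroup_id * workgroup_width + local_id) & addr_mask
    offsets = []
    # Each loop iteration performs `nunroll` fetches, so there are
    # niter * nunroll fetches in total.
    for _ in range(niter * nunroll):
        offsets.append(offset)
        offset = (offset + local_group_size) & addr_mask
    return offsets


def unique_addresses(num_workgroups, local_group_size, niter, nunroll, addr_mask):
    """Collect every offset touched by all invocations of all workgroups."""
    seen = set()
    for wg in range(num_workgroups):
        for lid in range(local_group_size):
            seen.update(
                visited_offsets(wg, lid, local_group_size, niter, nunroll, addr_mask)
            )
    return seen


def texel_pos(dim, offset):
    """Map an offset to the texel position for the benchmarked dimension."""
    pos = [0, 0, 0]
    pos["xyz".index(dim)] = offset
    return tuple(pos)


# Assumption: 4 unique vectors, so addr_mask = 4 - 1 = 3.
small = unique_addresses(num_workgroups=1, local_group_size=2, niter=2, nunroll=2, addr_mask=3)

# Assumption: 64 unique vectors (more than one workgroup's 16 fetches per thread stride),
# so the second workgroup starts in its own block.
second_block = visited_offsets(1, 0, local_group_size=4, niter=2, nunroll=2, addr_mask=63)
```

    With the small mask, the two invocations keep cycling over the same four offsets. With the larger mask, workgroup 1 starts at offset 16 rather than 0, so it reads a block that workgroup 0 does not touch.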
    
    Finally, we make sure that both the `sum` and `i` variables feed into the value written to B, so that the compiler's optimizer does not flatten the loop.
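
    One way to read the `zero` term: after the loop, `i` equals `niter`, which is non-negative, so an arithmetic right shift by 31 bits of a 32-bit signed integer yields 0. Adding it leaves the result unchanged while making the output depend on the loop counter. The Python sketch below only models that arithmetic; the function name `keep_alive_term` is hypothetical, and the 32-bit range check stands in for the GLSL `int` type.

```python
def keep_alive_term(i):
    """Model GLSL `i >> 31` for a 32-bit signed int (arithmetic shift)."""
    # Assumption: i fits in a 32-bit signed integer, as a GLSL `int` does.
    if not -2**31 <= i < 2**31:
        raise ValueError("i must fit in a 32-bit signed integer")
    # Python's >> on ints is also an arithmetic shift, so the sign is preserved.
    return float(i >> 31)


# For any non-negative loop counter the term is exactly zero,
# so `sum + zero` equals `sum`.
zero_for_small = keep_alive_term(16)
zero_for_max = keep_alive_term(2**31 - 1)

# A negative counter would give -1, which is why the value cannot be
# replaced by a constant without knowing the sign of i.
negative_case = keep_alive_term(-1)
```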
    
    On a Samsung S22, the bandwidth behaves as follows for each of the dimensions:
    {F1767497386}
    
    Comparing the bandwidth for the X dimension with OpenCL, which was measured with [ArchProbe](https://github.com/microsoft/ArchProbe), we can observe that, although the behavior is the same, Vulkan achieves higher bandwidth for most access sizes.
    
    {F1767497972}
    
    Compared with buffers, the texture bandwidth is similar to that of regular buffers, but still much lower than that of UBOs at small access sizes.
    
    {F1767497707}
    
    Reviewed By: jorgep31415
    
    Differential Revision: D59980139
    Esteban Padilla Cerdio authored and facebook-github-bot committed Jul 30, 2024
    e203ace