New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

Sign up for GitHub

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Jump to bottom

Add 3D Texture Bandwidth metric #4336

Closed

Esteb37 wants to merge 1 commit into pytorch:main from Esteb37:export-D59980139

Contributor

Esteb37 commented Jul 22, 2024 •

edited

Loading

Differential Revision: D59980139

Pull Request resolved: #4336

This diff introduces a profiler that obtains the maximum and minimum bandwidth for reading unique addresses from 3D textures in each of its dimensions, using the following shader, where A is a 3D texture and B is a writeonly buffer.

The calculation of the texel position will depend on the dimension that is being benchmarked

x : pos = ivec3(offset, 0, 0)
y : pos = ivec3(0, offset, 0)
z : pos = ivec3(0, 0, offset)

void main() {
vec4 sum = vec4(0);
const uint workgroup_width = local_group_size * niter * ${NUNROLL};
uint offset = (gl_WorkGroupID[0] * workgroup_width + gl_LocalInvocationID[0]) & addr_mask;

int i = 0;
for (; i < niter; ++i)
{
    sum *= texelFetch(A, pos, 0);
    offset = (offset + local_group_size) & addr_mask;
    ...
    ...
    sum *= texelFetch(A, pos, 0);
    offset = (offset + local_group_size) & addr_mask;
}

vec4 zero = vec4(i>>31);

B[gl_LocalInvocationID[0]] = sum + zero;

}

The address mask allows us to control how many unique addresses we are accessing. If the number of unique vectors we want to read is 3, the offset will jump between three unique addresses throughout the iterations, giving us the bandwidth for that specific size of data. If the size of the unique data read is larger than the work group size, then each run will have its own block of data to read, defined by the initial offset calculation, where the offset is obtained through the workgroup ID and the local invocation ID.

Finally, we make sure to use the sum and i variables so that the compiler's optimizer does not flatten the loops.

For a Samsung S22, the bandwidth behaves like this for each of the dimensions.

Comparing the bandwidth for the X dimension to OpenCL, which was obtained through ArchProbe, we can observe that, although the behavior is the same, Vulkan has an increased bandwidth for most access sizes.

Comparing to the bandwidth for buffers, we can observe that the bandwidth is similar to regular buffers, but still much smaller than UBOs at small access sizes.

pytorch-bot bot commented Jul 22, 2024 •

edited

Loading

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/4336

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit e203ace with merge base b7c8378 ():
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

facebook-github-bot added the CLA Signed label

Contributor

facebook-github-bot commented Jul 22, 2024

This pull request was exported from Phabricator. Differential Revision: D59980139

facebook-github-bot added the fb-exported label

jorgep31415 approved these changes

View reviewed changes

Esteb37 pushed a commit to Esteb37/executorch that referenced this pull request


          Add 3D Texture Bandwidth metric (pytorch#4336)

1efbca1

Summary:
Pull Request resolved: pytorch#4336

This diff introduces a profiler that obtains the maximum and minimum bandwidth for reading unique addresses from 3D textures in each of its dimensions, using the following shader, where A is a 3D texture and B is a writeonly buffer.

The calculation of the texel position will depend on the dimension that is being benchmarked

x : pos = ivec3(offset, 0, 0)
y : pos = ivec3(0, offset, 0)
z : pos = ivec3(0, 0, offset)

  void main() {
    vec4 sum = vec4(0);
    const uint workgroup_width = local_group_size * niter * ${NUNROLL};
    uint offset = (gl_WorkGroupID[0] * workgroup_width  + gl_LocalInvocationID[0]) & addr_mask;

    int i = 0;
    for (; i < niter; ++i)
    {
        sum *= texelFetch(A, pos, 0);
        offset = (offset + local_group_size) & addr_mask;
        ...
        ...
        sum *= texelFetch(A, pos, 0);
        offset = (offset + local_group_size) & addr_mask;
    }

    vec4 zero = vec4(i>>31);

    B[gl_LocalInvocationID[0]] = sum + zero;
  }

The address mask allows us to control how many unique addresses we are accessing. If the number of unique vectors we want to read is 3, the offset will jump between three unique addresses throughout the iterations, giving us the bandwidth for that specific size of data. If the size of the unique data read is larger than the work group size, then each run will have its own block of data to read, defined by the initial offset calculation, where the offset is obtained through the workgroup ID and the local invocation ID.

Finally, we make sure to use the `sum` and `i	` variables so that the compiler's optimizer does not flatten the loops.

For a Samsung S22, the bandwidth behaves like this for each of the dimensions.
{F1767497386}

Comparing the bandwidth for the X dimension to OpenCL, we can observe that, although the behavior is the same, Vulkan has an increased bandwidth for most access sizes.

{F1767497972}

Comparing to the bandwidth for buffers, we can observe that the bandwidth is similar to regular buffers, but still much smaller than UBOs at small access sizes.

 {F1767497707}

Differential Revision: https://internalfb.com/D59980139

Esteb37 pushed a commit to Esteb37/executorch that referenced this pull request


          Add 3D Texture Bandwidth metric (pytorch#4336)

832c4eb

Summary:
Pull Request resolved: pytorch#4336

This diff introduces a profiler that obtains the maximum and minimum bandwidth for reading unique addresses from 3D textures in each of its dimensions, using the following shader, where A is a 3D texture and B is a writeonly buffer.

The calculation of the texel position will depend on the dimension that is being benchmarked

x : pos = ivec3(offset, 0, 0)
y : pos = ivec3(0, offset, 0)
z : pos = ivec3(0, 0, offset)

  void main() {
    vec4 sum = vec4(0);
    const uint workgroup_width = local_group_size * niter * ${NUNROLL};
    uint offset = (gl_WorkGroupID[0] * workgroup_width  + gl_LocalInvocationID[0]) & addr_mask;

    int i = 0;
    for (; i < niter; ++i)
    {
        sum *= texelFetch(A, pos, 0);
        offset = (offset + local_group_size) & addr_mask;
        ...
        ...
        sum *= texelFetch(A, pos, 0);
        offset = (offset + local_group_size) & addr_mask;
    }

    vec4 zero = vec4(i>>31);

    B[gl_LocalInvocationID[0]] = sum + zero;
  }

The address mask allows us to control how many unique addresses we are accessing. If the number of unique vectors we want to read is 3, the offset will jump between three unique addresses throughout the iterations, giving us the bandwidth for that specific size of data. If the size of the unique data read is larger than the work group size, then each run will have its own block of data to read, defined by the initial offset calculation, where the offset is obtained through the workgroup ID and the local invocation ID.

Finally, we make sure to use the `sum` and `i	` variables so that the compiler's optimizer does not flatten the loops.

For a Samsung S22, the bandwidth behaves like this for each of the dimensions.
{F1767497386}

Comparing the bandwidth for the X dimension to OpenCL, we can observe that, although the behavior is the same, Vulkan has an increased bandwidth for most access sizes.

{F1767497972}

Comparing to the bandwidth for buffers, we can observe that the bandwidth is similar to regular buffers, but still much smaller than UBOs at small access sizes.

 {F1767497707}

Differential Revision: https://internalfb.com/D59980139

Esteb37 pushed a commit to Esteb37/executorch that referenced this pull request


          Add 3D Texture Bandwidth metric (pytorch#4336)

Summary:
Pull Request resolved: pytorch#4336

This diff introduces a profiler that obtains the maximum and minimum bandwidth for reading unique addresses from 3D textures in each of its dimensions, using the following shader, where A is a 3D texture and B is a writeonly buffer.

The calculation of the texel position will depend on the dimension that is being benchmarked

x : pos = ivec3(offset, 0, 0)
y : pos = ivec3(0, offset, 0)
z : pos = ivec3(0, 0, offset)

  void main() {
    vec4 sum = vec4(0);
    const uint workgroup_width = local_group_size * niter * ${NUNROLL};
    uint offset = (gl_WorkGroupID[0] * workgroup_width  + gl_LocalInvocationID[0]) & addr_mask;

    int i = 0;
    for (; i < niter; ++i)
    {
        sum *= texelFetch(A, pos, 0);
        offset = (offset + local_group_size) & addr_mask;
        ...
        ...
        sum *= texelFetch(A, pos, 0);
        offset = (offset + local_group_size) & addr_mask;
    }

    vec4 zero = vec4(i>>31);

    B[gl_LocalInvocationID[0]] = sum + zero;
  }

The address mask allows us to control how many unique addresses we are accessing. If the number of unique vectors we want to read is 3, the offset will jump between three unique addresses throughout the iterations, giving us the bandwidth for that specific size of data. If the size of the unique data read is larger than the work group size, then each run will have its own block of data to read, defined by the initial offset calculation, where the offset is obtained through the workgroup ID and the local invocation ID.

Finally, we make sure to use the `sum` and `i	` variables so that the compiler's optimizer does not flatten the loops.

For a Samsung S22, the bandwidth behaves like this for each of the dimensions.
{F1767497386}

Comparing the bandwidth for the X dimension to OpenCL, we can observe that, although the behavior is the same, Vulkan has an increased bandwidth for most access sizes.

{F1767497972}

Comparing to the bandwidth for buffers, we can observe that the bandwidth is similar to regular buffers, but still much smaller than UBOs at small access sizes.

 {F1767497707}

Differential Revision: https://internalfb.com/D59980139

Esteb37 pushed a commit to Esteb37/executorch that referenced this pull request


          Add 3D Texture Bandwidth metric (pytorch#4336)

cc1d42b

Summary:
Pull Request resolved: pytorch#4336

This diff introduces a profiler that obtains the maximum and minimum bandwidth for reading unique addresses from 3D textures in each of its dimensions, using the following shader, where A is a 3D texture and B is a writeonly buffer.

The calculation of the texel position will depend on the dimension that is being benchmarked

x : pos = ivec3(offset, 0, 0)
y : pos = ivec3(0, offset, 0)
z : pos = ivec3(0, 0, offset)

  void main() {
    vec4 sum = vec4(0);
    const uint workgroup_width = local_group_size * niter * ${NUNROLL};
    uint offset = (gl_WorkGroupID[0] * workgroup_width  + gl_LocalInvocationID[0]) & addr_mask;

    int i = 0;
    for (; i < niter; ++i)
    {
        sum *= texelFetch(A, pos, 0);
        offset = (offset + local_group_size) & addr_mask;
        ...
        ...
        sum *= texelFetch(A, pos, 0);
        offset = (offset + local_group_size) & addr_mask;
    }

    vec4 zero = vec4(i>>31);

    B[gl_LocalInvocationID[0]] = sum + zero;
  }

The address mask allows us to control how many unique addresses we are accessing. If the number of unique vectors we want to read is 3, the offset will jump between three unique addresses throughout the iterations, giving us the bandwidth for that specific size of data. If the size of the unique data read is larger than the work group size, then each run will have its own block of data to read, defined by the initial offset calculation, where the offset is obtained through the workgroup ID and the local invocation ID.

Finally, we make sure to use the `sum` and `i	` variables so that the compiler's optimizer does not flatten the loops.

For a Samsung S22, the bandwidth behaves like this for each of the dimensions.
{F1767497386}

Comparing the bandwidth for the X dimension to OpenCL, we can observe that, although the behavior is the same, Vulkan has an increased bandwidth for most access sizes.

{F1767497972}

Comparing to the bandwidth for buffers, we can observe that the bandwidth is similar to regular buffers, but still much smaller than UBOs at small access sizes.

 {F1767497707}

Differential Revision: https://internalfb.com/D59980139

Contributor

facebook-github-bot commented Jul 26, 2024

This pull request was exported from Phabricator. Differential Revision: D59980139

Esteb37 pushed a commit to Esteb37/executorch that referenced this pull request


          Add 3D Texture Bandwidth metric (pytorch#4336)

a5f48c2

Summary:
Pull Request resolved: pytorch#4336

This diff introduces a profiler that obtains the maximum and minimum bandwidth for reading unique addresses from 3D textures in each of its dimensions, using the following shader, where A is a 3D texture and B is a writeonly buffer.

The calculation of the texel position will depend on the dimension that is being benchmarked

x : pos = ivec3(offset, 0, 0)
y : pos = ivec3(0, offset, 0)
z : pos = ivec3(0, 0, offset)

  void main() {
    vec4 sum = vec4(0);
    const uint workgroup_width = local_group_size * niter * ${NUNROLL};
    uint offset = (gl_WorkGroupID[0] * workgroup_width  + gl_LocalInvocationID[0]) & addr_mask;

    int i = 0;
    for (; i < niter; ++i)
    {
        sum *= texelFetch(A, pos, 0);
        offset = (offset + local_group_size) & addr_mask;
        ...
        ...
        sum *= texelFetch(A, pos, 0);
        offset = (offset + local_group_size) & addr_mask;
    }

    vec4 zero = vec4(i>>31);

    B[gl_LocalInvocationID[0]] = sum + zero;
  }

The address mask allows us to control how many unique addresses we are accessing. If the number of unique vectors we want to read is 3, the offset will jump between three unique addresses throughout the iterations, giving us the bandwidth for that specific size of data. If the size of the unique data read is larger than the work group size, then each run will have its own block of data to read, defined by the initial offset calculation, where the offset is obtained through the workgroup ID and the local invocation ID.

Finally, we make sure to use the `sum` and `i	` variables so that the compiler's optimizer does not flatten the loops.

For a Samsung S22, the bandwidth behaves like this for each of the dimensions.
{F1767497386}

Comparing the bandwidth for the X dimension to OpenCL, which was obtained through [ArchProbe](https://github.com/microsoft/ArchProbe), we can observe that, although the behavior is the same, Vulkan has an increased bandwidth for most access sizes.

{F1767497972}

Comparing to the bandwidth for buffers, we can observe that the bandwidth is similar to regular buffers, but still much smaller than UBOs at small access sizes.

 {F1767497707}

Reviewed By: jorgep31415

Differential Revision: D59980139

Esteb37 force-pushed the export-D59980139 branch from ad0f834 to a5f48c2 Compare

July 26, 2024 20:25

Esteb37 pushed a commit to Esteb37/executorch that referenced this pull request


          Add 3D Texture Bandwidth metric (pytorch#4336)

b1054bb

Summary:
Pull Request resolved: pytorch#4336

This diff introduces a profiler that obtains the maximum and minimum bandwidth for reading unique addresses from 3D textures in each of its dimensions, using the following shader, where A is a 3D texture and B is a writeonly buffer.

The calculation of the texel position will depend on the dimension that is being benchmarked

x : pos = ivec3(offset, 0, 0)
y : pos = ivec3(0, offset, 0)
z : pos = ivec3(0, 0, offset)

  void main() {
    vec4 sum = vec4(0);
    const uint workgroup_width = local_group_size * niter * ${NUNROLL};
    uint offset = (gl_WorkGroupID[0] * workgroup_width  + gl_LocalInvocationID[0]) & addr_mask;

    int i = 0;
    for (; i < niter; ++i)
    {
        sum *= texelFetch(A, pos, 0);
        offset = (offset + local_group_size) & addr_mask;
        ...
        ...
        sum *= texelFetch(A, pos, 0);
        offset = (offset + local_group_size) & addr_mask;
    }

    vec4 zero = vec4(i>>31);

    B[gl_LocalInvocationID[0]] = sum + zero;
  }

The address mask allows us to control how many unique addresses we are accessing. If the number of unique vectors we want to read is 3, the offset will jump between three unique addresses throughout the iterations, giving us the bandwidth for that specific size of data. If the size of the unique data read is larger than the work group size, then each run will have its own block of data to read, defined by the initial offset calculation, where the offset is obtained through the workgroup ID and the local invocation ID.

Finally, we make sure to use the `sum` and `i	` variables so that the compiler's optimizer does not flatten the loops.

For a Samsung S22, the bandwidth behaves like this for each of the dimensions.
{F1767497386}

Comparing the bandwidth for the X dimension to OpenCL, we can observe that, although the behavior is the same, Vulkan has an increased bandwidth for most access sizes.

{F1767497972}

Comparing to the bandwidth for buffers, we can observe that the bandwidth is similar to regular buffers, but still much smaller than UBOs at small access sizes.

 {F1767497707}

Differential Revision: https://internalfb.com/D59980139

Esteb37 pushed a commit to Esteb37/executorch that referenced this pull request


          Add 3D Texture Bandwidth metric (pytorch#4336)

ba66d43

Summary:
Pull Request resolved: pytorch#4336

This diff introduces a profiler that obtains the maximum and minimum bandwidth for reading unique addresses from 3D textures in each of its dimensions, using the following shader, where A is a 3D texture and B is a writeonly buffer.

The calculation of the texel position will depend on the dimension that is being benchmarked

x : pos = ivec3(offset, 0, 0)
y : pos = ivec3(0, offset, 0)
z : pos = ivec3(0, 0, offset)

  void main() {
    vec4 sum = vec4(0);
    const uint workgroup_width = local_group_size * niter * ${NUNROLL};
    uint offset = (gl_WorkGroupID[0] * workgroup_width  + gl_LocalInvocationID[0]) & addr_mask;

    int i = 0;
    for (; i < niter; ++i)
    {
        sum *= texelFetch(A, pos, 0);
        offset = (offset + local_group_size) & addr_mask;
        ...
        ...
        sum *= texelFetch(A, pos, 0);
        offset = (offset + local_group_size) & addr_mask;
    }

    vec4 zero = vec4(i>>31);

    B[gl_LocalInvocationID[0]] = sum + zero;
  }

The address mask allows us to control how many unique addresses we are accessing. If the number of unique vectors we want to read is 3, the offset will jump between three unique addresses throughout the iterations, giving us the bandwidth for that specific size of data. If the size of the unique data read is larger than the work group size, then each run will have its own block of data to read, defined by the initial offset calculation, where the offset is obtained through the workgroup ID and the local invocation ID.

Finally, we make sure to use the `sum` and `i	` variables so that the compiler's optimizer does not flatten the loops.

For a Samsung S22, the bandwidth behaves like this for each of the dimensions.
{F1767497386}

Comparing the bandwidth for the X dimension to OpenCL, we can observe that, although the behavior is the same, Vulkan has an increased bandwidth for most access sizes.

{F1767497972}

Comparing to the bandwidth for buffers, we can observe that the bandwidth is similar to regular buffers, but still much smaller than UBOs at small access sizes.

 {F1767497707}

Differential Revision: https://internalfb.com/D59980139

Esteb37 pushed a commit to Esteb37/executorch that referenced this pull request


          Add 3D Texture Bandwidth metric (pytorch#4336)

Summary:
Pull Request resolved: pytorch#4336

This diff introduces a profiler that obtains the maximum and minimum bandwidth for reading unique addresses from 3D textures in each of its dimensions, using the following shader, where A is a 3D texture and B is a writeonly buffer.

The calculation of the texel position will depend on the dimension that is being benchmarked

x : pos = ivec3(offset, 0, 0)
y : pos = ivec3(0, offset, 0)
z : pos = ivec3(0, 0, offset)

  void main() {
    vec4 sum = vec4(0);
    const uint workgroup_width = local_group_size * niter * ${NUNROLL};
    uint offset = (gl_WorkGroupID[0] * workgroup_width  + gl_LocalInvocationID[0]) & addr_mask;

    int i = 0;
    for (; i < niter; ++i)
    {
        sum *= texelFetch(A, pos, 0);
        offset = (offset + local_group_size) & addr_mask;
        ...
        ...
        sum *= texelFetch(A, pos, 0);
        offset = (offset + local_group_size) & addr_mask;
    }

    vec4 zero = vec4(i>>31);

    B[gl_LocalInvocationID[0]] = sum + zero;
  }

The address mask allows us to control how many unique addresses we are accessing. If the number of unique vectors we want to read is 3, the offset will jump between three unique addresses throughout the iterations, giving us the bandwidth for that specific size of data. If the size of the unique data read is larger than the work group size, then each run will have its own block of data to read, defined by the initial offset calculation, where the offset is obtained through the workgroup ID and the local invocation ID.

Finally, we make sure to use the `sum` and `i	` variables so that the compiler's optimizer does not flatten the loops.

For a Samsung S22, the bandwidth behaves like this for each of the dimensions.
{F1767497386}

Comparing the bandwidth for the X dimension to OpenCL, which was obtained through [ArchProbe](https://github.com/microsoft/ArchProbe), we can observe that, although the behavior is the same, Vulkan has an increased bandwidth for most access sizes.

{F1767497972}

Comparing to the bandwidth for buffers, we can observe that the bandwidth is similar to regular buffers, but still much smaller than UBOs at small access sizes.

 {F1767497707}

Reviewed By: jorgep31415

Differential Revision: D59980139

Esteb37 force-pushed the export-D59980139 branch from a5f48c2 to 8321514 Compare

July 29, 2024 15:10

Contributor

facebook-github-bot commented Jul 29, 2024

This pull request was exported from Phabricator. Differential Revision: D59980139

Esteb37 pushed a commit to Esteb37/executorch that referenced this pull request


          Add 3D Texture Bandwidth metric (pytorch#4336)

7b96a5b

Summary:
Pull Request resolved: pytorch#4336

This diff introduces a profiler that obtains the maximum and minimum bandwidth for reading unique addresses from 3D textures in each of its dimensions, using the following shader, where A is a 3D texture and B is a writeonly buffer.

The calculation of the texel position will depend on the dimension that is being benchmarked

x : pos = ivec3(offset, 0, 0)
y : pos = ivec3(0, offset, 0)
z : pos = ivec3(0, 0, offset)

  void main() {
    vec4 sum = vec4(0);
    const uint workgroup_width = local_group_size * niter * ${NUNROLL};
    uint offset = (gl_WorkGroupID[0] * workgroup_width  + gl_LocalInvocationID[0]) & addr_mask;

    int i = 0;
    for (; i < niter; ++i)
    {
        sum *= texelFetch(A, pos, 0);
        offset = (offset + local_group_size) & addr_mask;
        ...
        ...
        sum *= texelFetch(A, pos, 0);
        offset = (offset + local_group_size) & addr_mask;
    }

    vec4 zero = vec4(i>>31);

    B[gl_LocalInvocationID[0]] = sum + zero;
  }

The address mask allows us to control how many unique addresses we are accessing. If the number of unique vectors we want to read is 3, the offset will jump between three unique addresses throughout the iterations, giving us the bandwidth for that specific size of data. If the size of the unique data read is larger than the work group size, then each run will have its own block of data to read, defined by the initial offset calculation, where the offset is obtained through the workgroup ID and the local invocation ID.

Finally, we make sure to use the `sum` and `i	` variables so that the compiler's optimizer does not flatten the loops.

For a Samsung S22, the bandwidth behaves like this for each of the dimensions.
{F1767497386}

Comparing the bandwidth for the X dimension to OpenCL, we can observe that, although the behavior is the same, Vulkan has an increased bandwidth for most access sizes.

{F1767497972}

Comparing to the bandwidth for buffers, we can observe that the bandwidth is similar to regular buffers, but still much smaller than UBOs at small access sizes.

 {F1767497707}

Differential Revision: https://internalfb.com/D59980139

Esteb37 force-pushed the export-D59980139 branch from 8321514 to 8268828 Compare

July 30, 2024 17:41

Contributor

facebook-github-bot commented Jul 30, 2024

This pull request was exported from Phabricator. Differential Revision: D59980139

Esteb37 pushed a commit to Esteb37/executorch that referenced this pull request


          Add 3D Texture Bandwidth metric (pytorch#4336)

Summary:
Pull Request resolved: pytorch#4336

This diff introduces a profiler that obtains the maximum and minimum bandwidth for reading unique addresses from 3D textures in each of its dimensions, using the following shader, where A is a 3D texture and B is a writeonly buffer.

The calculation of the texel position will depend on the dimension that is being benchmarked

x : pos = ivec3(offset, 0, 0)
y : pos = ivec3(0, offset, 0)
z : pos = ivec3(0, 0, offset)

  void main() {
    vec4 sum = vec4(0);
    const uint workgroup_width = local_group_size * niter * ${NUNROLL};
    uint offset = (gl_WorkGroupID[0] * workgroup_width  + gl_LocalInvocationID[0]) & addr_mask;

    int i = 0;
    for (; i < niter; ++i)
    {
        sum *= texelFetch(A, pos, 0);
        offset = (offset + local_group_size) & addr_mask;
        ...
        ...
        sum *= texelFetch(A, pos, 0);
        offset = (offset + local_group_size) & addr_mask;
    }

    vec4 zero = vec4(i>>31);

    B[gl_LocalInvocationID[0]] = sum + zero;
  }

The address mask allows us to control how many unique addresses we are accessing. If the number of unique vectors we want to read is 3, the offset will jump between three unique addresses throughout the iterations, giving us the bandwidth for that specific size of data. If the size of the unique data read is larger than the work group size, then each run will have its own block of data to read, defined by the initial offset calculation, where the offset is obtained through the workgroup ID and the local invocation ID.

Finally, we make sure to use the `sum` and `i	` variables so that the compiler's optimizer does not flatten the loops.

For a Samsung S22, the bandwidth behaves like this for each of the dimensions.
{F1767497386}

Comparing the bandwidth for the X dimension to OpenCL, which was obtained through [ArchProbe](https://github.com/microsoft/ArchProbe), we can observe that, although the behavior is the same, Vulkan has an increased bandwidth for most access sizes.

{F1767497972}

Comparing to the bandwidth for buffers, we can observe that the bandwidth is similar to regular buffers, but still much smaller than UBOs at small access sizes.

 {F1767497707}

Reviewed By: jorgep31415

Differential Revision: D59980139

Esteb37 pushed a commit to Esteb37/executorch that referenced this pull request


          Add 3D Texture Bandwidth metric (pytorch#4336)

ee5f6a5

Summary:
Pull Request resolved: pytorch#4336

This diff introduces a profiler that obtains the maximum and minimum bandwidth for reading unique addresses from 3D textures in each of its dimensions, using the following shader, where A is a 3D texture and B is a writeonly buffer.

The calculation of the texel position will depend on the dimension that is being benchmarked

x : pos = ivec3(offset, 0, 0)
y : pos = ivec3(0, offset, 0)
z : pos = ivec3(0, 0, offset)

  void main() {
    vec4 sum = vec4(0);
    const uint workgroup_width = local_group_size * niter * ${NUNROLL};
    uint offset = (gl_WorkGroupID[0] * workgroup_width  + gl_LocalInvocationID[0]) & addr_mask;

    int i = 0;
    for (; i < niter; ++i)
    {
        sum *= texelFetch(A, pos, 0);
        offset = (offset + local_group_size) & addr_mask;
        ...
        ...
        sum *= texelFetch(A, pos, 0);
        offset = (offset + local_group_size) & addr_mask;
    }

    vec4 zero = vec4(i>>31);

    B[gl_LocalInvocationID[0]] = sum + zero;
  }

The address mask allows us to control how many unique addresses we are accessing. If the number of unique vectors we want to read is 3, the offset will jump between three unique addresses throughout the iterations, giving us the bandwidth for that specific size of data. If the size of the unique data read is larger than the work group size, then each run will have its own block of data to read, defined by the initial offset calculation, where the offset is obtained through the workgroup ID and the local invocation ID.

Finally, we make sure to use the `sum` and `i	` variables so that the compiler's optimizer does not flatten the loops.

For a Samsung S22, the bandwidth behaves like this for each of the dimensions.
{F1767497386}

Comparing the bandwidth for the X dimension to OpenCL, we can observe that, although the behavior is the same, Vulkan has an increased bandwidth for most access sizes.

{F1767497972}

Comparing to the bandwidth for buffers, we can observe that the bandwidth is similar to regular buffers, but still much smaller than UBOs at small access sizes.

 {F1767497707}

Differential Revision: https://internalfb.com/D59980139


          Add 3D Texture Bandwidth metric (pytorch#4336)

e203ace

Summary:
Pull Request resolved: pytorch#4336

This diff introduces a profiler that obtains the maximum and minimum bandwidth for reading unique addresses from 3D textures in each of its dimensions, using the following shader, where A is a 3D texture and B is a writeonly buffer.

The calculation of the texel position will depend on the dimension that is being benchmarked

x : pos = ivec3(offset, 0, 0)
y : pos = ivec3(0, offset, 0)
z : pos = ivec3(0, 0, offset)

  void main() {
    vec4 sum = vec4(0);
    const uint workgroup_width = local_group_size * niter * ${NUNROLL};
    uint offset = (gl_WorkGroupID[0] * workgroup_width  + gl_LocalInvocationID[0]) & addr_mask;

    int i = 0;
    for (; i < niter; ++i)
    {
        sum *= texelFetch(A, pos, 0);
        offset = (offset + local_group_size) & addr_mask;
        ...
        ...
        sum *= texelFetch(A, pos, 0);
        offset = (offset + local_group_size) & addr_mask;
    }

    vec4 zero = vec4(i>>31);

    B[gl_LocalInvocationID[0]] = sum + zero;
  }

The address mask allows us to control how many unique addresses we are accessing. If the number of unique vectors we want to read is 3, the offset will jump between three unique addresses throughout the iterations, giving us the bandwidth for that specific size of data. If the size of the unique data read is larger than the work group size, then each run will have its own block of data to read, defined by the initial offset calculation, where the offset is obtained through the workgroup ID and the local invocation ID.

Finally, we make sure to use the `sum` and `i	` variables so that the compiler's optimizer does not flatten the loops.

For a Samsung S22, the bandwidth behaves like this for each of the dimensions.
{F1767497386}

Comparing the bandwidth for the X dimension to OpenCL, which was obtained through [ArchProbe](https://github.com/microsoft/ArchProbe), we can observe that, although the behavior is the same, Vulkan has an increased bandwidth for most access sizes.

{F1767497972}

Comparing to the bandwidth for buffers, we can observe that the bandwidth is similar to regular buffers, but still much smaller than UBOs at small access sizes.

 {F1767497707}

Reviewed By: jorgep31415

Differential Revision: D59980139

Contributor

facebook-github-bot commented Jul 30, 2024

This pull request was exported from Phabricator. Differential Revision: D59980139

Esteb37 force-pushed the export-D59980139 branch from 8268828 to e203ace Compare

July 30, 2024 17:47

Esteb37 pushed a commit to Esteb37/executorch that referenced this pull request


          Add 3D Texture Bandwidth metric (pytorch#4336)

2ddafed

Summary:
Pull Request resolved: pytorch#4336

This diff introduces a profiler that obtains the maximum and minimum bandwidth for reading unique addresses from 3D textures in each of its dimensions, using the following shader, where A is a 3D texture and B is a writeonly buffer.

The calculation of the texel position will depend on the dimension that is being benchmarked

x : pos = ivec3(offset, 0, 0)
y : pos = ivec3(0, offset, 0)
z : pos = ivec3(0, 0, offset)

  void main() {
    vec4 sum = vec4(0);
    const uint workgroup_width = local_group_size * niter * ${NUNROLL};
    uint offset = (gl_WorkGroupID[0] * workgroup_width  + gl_LocalInvocationID[0]) & addr_mask;

    int i = 0;
    for (; i < niter; ++i)
    {
        sum *= texelFetch(A, pos, 0);
        offset = (offset + local_group_size) & addr_mask;
        ...
        ...
        sum *= texelFetch(A, pos, 0);
        offset = (offset + local_group_size) & addr_mask;
    }

    vec4 zero = vec4(i>>31);

    B[gl_LocalInvocationID[0]] = sum + zero;
  }

The address mask allows us to control how many unique addresses we are accessing. If the number of unique vectors we want to read is 3, the offset will jump between three unique addresses throughout the iterations, giving us the bandwidth for that specific size of data. If the size of the unique data read is larger than the work group size, then each run will have its own block of data to read, defined by the initial offset calculation, where the offset is obtained through the workgroup ID and the local invocation ID.

Finally, we make sure to use the `sum` and `i	` variables so that the compiler's optimizer does not flatten the loops.

For a Samsung S22, the bandwidth behaves like this for each of the dimensions.
{F1767497386}

Comparing the bandwidth for the X dimension to OpenCL, which was obtained through [ArchProbe](https://github.com/microsoft/ArchProbe), we can observe that, although the behavior is the same, Vulkan has an increased bandwidth for most access sizes.

{F1767497972}

Comparing to the bandwidth for buffers, we can observe that the bandwidth is similar to regular buffers, but still much smaller than UBOs at small access sizes.

 {F1767497707}

Reviewed By: jorgep31415

Differential Revision: D59980139

facebook-github-bot closed this in

ea0c017

facebook-github-bot added the Merged label

Contributor

facebook-github-bot commented Jul 30, 2024

This pull request has been merged in ea0c017.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

CLA Signed fb-exported Merged