
Refactor and class split #4432

Closed
wants to merge 4 commits

Conversation


@Esteb37 Esteb37 commented Jul 26, 2024

Summary:
Big classes are scary ☹️

This diff subdivides the tests into categories and moves them out of the App class into free functions in the gpuinfo namespace; the App class is now responsible only for persisting device information and configuration.

Differential Revision: D60290882
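The split described above can be sketched roughly as follows. This is a hypothetical Python analogue (the actual code is C++), and the names `tex_bandwidth` and `buf_cacheline_size` are illustrative, not the real signatures:

```python
# Hypothetical analogue of the refactor: App only persists device
# information and configuration, while the tests live as free
# functions (standing in for the C++ gpuinfo namespace) that take
# the App as a parameter instead of being methods on it.

class App:
    """Persists device information and configuration only."""

    def __init__(self, device_name, config):
        self.device_name = device_name
        self.config = config

# "gpuinfo" free functions, grouped by category rather than
# attached to App.
def tex_bandwidth(app):
    return f"tex_bandwidth on {app.device_name}"

def buf_cacheline_size(app):
    return f"buf_cacheline_size on {app.device_name}"

app = App("Samsung S22", {})
print(tex_bandwidth(app))
```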


pytorch-bot bot commented Jul 26, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/4432

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure

As of commit ba7f3ab with merge base b7c8378:

NEW FAILURE - The following job has failed:

  • pull / unittest / macos / macos-job (gh)
    RuntimeError: Failed to compile /var/folders/bm/fnn3xd1d39lcpbxrgwys1c140000gn/T/tmpxdba_hzc/data.json to /var/folders/bm/fnn3xd1d39lcpbxrgwys1c140000gn/T/tmpxdba_hzc/data.pte. Set ET_EXIR_SAVE_FLATC_INPUTS_ON_FAILURE=1 to save input files on failure.

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot facebook-github-bot added the CLA Signed label Jul 26, 2024
@facebook-github-bot

This pull request was exported from Phabricator. Differential Revision: D60290882

Esteb37 pushed a commit to Esteb37/executorch that referenced this pull request Jul 26, 2024
Esteban Padilla Cerdio added 4 commits July 30, 2024 11:52
Summary:
Pull Request resolved: pytorch#4336

This diff introduces a profiler that obtains the maximum and minimum bandwidth for reading unique addresses from 3D textures in each of its dimensions, using the following shader, where A is a 3D texture and B is a writeonly buffer.

The calculation of the texel position depends on the dimension being benchmarked:

x: pos = ivec3(offset, 0, 0)
y: pos = ivec3(0, offset, 0)
z: pos = ivec3(0, 0, offset)

  void main() {
    vec4 sum = vec4(0);
    const uint workgroup_width = local_group_size * niter * ${NUNROLL};
    uint offset = (gl_WorkGroupID[0] * workgroup_width  + gl_LocalInvocationID[0]) & addr_mask;

    int i = 0;
    for (; i < niter; ++i)
    {
        sum *= texelFetch(A, pos, 0);
        offset = (offset + local_group_size) & addr_mask;
        ...
        ...
        sum *= texelFetch(A, pos, 0);
        offset = (offset + local_group_size) & addr_mask;
    }

    vec4 zero = vec4(i>>31);

    B[gl_LocalInvocationID[0]] = sum + zero;
  }

The address mask allows us to control how many unique addresses we are accessing. If the number of unique vectors we want to read is 3, the offset will jump between three unique addresses throughout the iterations, giving us the bandwidth for that specific data size. If the size of the unique data read is larger than the work group size, then each run will have its own block of data to read, defined by the initial offset calculation, where the offset is obtained from the workgroup ID and the local invocation ID.
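The masking scheme can be simulated with a short sketch. This is a minimal Python model of the shader's offset arithmetic, assuming the number of unique addresses is a power of two (so the mask is `num_unique - 1`); all names are illustrative:

```python
def simulate_offsets(num_unique, local_group_size, niter, start=0):
    """Model the offsets one invocation visits in the shader loop.

    num_unique must be a power of two so that (num_unique - 1) is a
    valid address mask; the mask wraps every stride back into a block
    of num_unique unique addresses.
    """
    addr_mask = num_unique - 1
    offset = start & addr_mask
    visited = []
    for _ in range(niter):
        visited.append(offset)
        offset = (offset + local_group_size) & addr_mask
    return visited

# With 4 unique addresses and a stride of 1, the offsets cycle
# 0, 1, 2, 3, 0, 1, ... so only 4 distinct texels are ever fetched.
print(simulate_offsets(num_unique=4, local_group_size=1, niter=8))
```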

Finally, we make sure to use the `sum` and `i` variables so that the compiler's optimizer does not flatten the loops.

For a Samsung S22, the bandwidth behaves like this for each of the dimensions.
{F1767497386}

Comparing the bandwidth for the X dimension to OpenCL, which was obtained through [ArchProbe](https://github.com/microsoft/ArchProbe), we can observe that, although the behavior is the same, Vulkan has an increased bandwidth for most access sizes.

{F1767497972}

Comparing against buffers, we can observe that texture bandwidth is similar to that of regular buffers, but still much smaller than that of UBOs at small access sizes.

 {F1767497707}

Reviewed By: jorgep31415

Differential Revision: D59980139
Summary:
Pull Request resolved: pytorch#4337

Now that the tool is getting larger, a configuration file for defining which tests to run and which to skip, as well as for specifying values like thresholds and ranges, comes in handy. This diff adds support for a JSON config file with specifications for each test.

Reviewed By: jorgep31415

Differential Revision: D60060188
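A config loader along these lines might look like the following sketch. The JSON schema shown is invented for illustration (field names like `enabled`, `threshold`, and `niter` are assumptions, not the tool's actual format):

```python
import json

# Hypothetical config: which tests to run or skip, plus per-test values.
CONFIG_TEXT = """
{
  "tex_bandwidth": {"enabled": true, "threshold": 0.01, "niter": 100},
  "buf_cacheline_size": {"enabled": false}
}
"""

def load_config(text):
    """Parse the JSON config; tests not listed default to enabled."""
    return json.loads(text)

cfg = load_config(CONFIG_TEXT)
enabled = [name for name, spec in cfg.items() if spec.get("enabled", True)]
print(enabled)  # only tex_bandwidth is enabled
```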
Summary:
Pull Request resolved: pytorch#4421

This diff introduces a metric to calculate the maximum concurrent cache line accesses for each dimension of a 3D texture. The experiment works by allowing each thread to access a different texel on the texture and slowly increasing the number of threads, until the cache line is no longer able to handle all simultaneous accesses. By detecting a jump in latency, we can define the optimal maximum size that can be accessed concurrently on each dimension.

NOTE: ArchProbe uses this information to [obtain a supposed cache line size for textures](https://fburl.com/98xiou3g). However, it is unclear why they define the cache line size as the ratio of the larger concurrency value over the lower, times the texel size. It is also unclear how to extend their calculations to three dimensions.

TODO: Understand the relationship between concurrency and cache line size, and modify this metric to output the cache line size.
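The jump detection described above can be sketched as follows; the sample measurements and the 1.5x jump threshold are illustrative assumptions, not the tool's actual values:

```python
def find_concurrency_limit(latencies, jump_ratio=1.5):
    """Return the last thread count before latency jumps.

    latencies: (nthreads, latency) pairs measured with increasing
    thread counts. A jump beyond jump_ratio marks the point where the
    cache line can no longer serve all accesses concurrently.
    """
    for (n_prev, t_prev), (_n, t) in zip(latencies, latencies[1:]):
        if t > t_prev * jump_ratio:
            return n_prev
    return latencies[-1][0]

# Illustrative data: latency stays flat up to 16 threads, then jumps.
data = [(2, 1.0), (4, 1.0), (8, 1.1), (16, 1.05), (32, 2.4)]
print(find_concurrency_limit(data))  # 16
```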

For a Samsung S22, the latency graph looks like this:

 {F1780375117}

Reviewed By: copyrightly

Differential Revision: D60246121
Summary:
Pull Request resolved: pytorch#4432

Big classes are scary ☹️

This diff subdivides the tests into categories and moves them out of the App class into free functions in the gpuinfo namespace; the App class is now responsible only for persisting device information and configuration.

Reviewed By: jorgep31415

Differential Revision: D60290882
@facebook-github-bot

This pull request has been merged in e03181d.

Labels: CLA Signed, fb-exported, Merged