
[ET-VK] Deprecate gpu_sizes_ubo() and toggle packing layout via specialization shader #3181

Closed
wants to merge 2 commits

Conversation

SS-JIA (Contributor) commented Apr 19, 2024

Stack from ghstack (oldest at bottom):

Context

This changeset cleans up how shaders consume tensor metadata in two ways:

Pass in Packing Layout via Specialization Shader

The packing layout of a tensor determines how to convert between logical tensor indices and physical texture coordinates. Currently, a completely new variant of each shader is generated for every packing layout, which is rather expensive in terms of build size.
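
As an illustration, the index-to-texel conversion that depends on the packing layout can be sketched in Python. The function name, the `(W, H, C)` ordering, and the exact mapping are illustrative assumptions, not the actual ET-VK code:

```python
# Illustrative sketch (not the actual ET-VK implementation): mapping a
# logical (w, h, c) tensor index to a texel position plus a component,
# given which dimension is packed 4-to-a-texel.
def tensor_idx_to_texel(idx, packed_dim):
    """idx: (w, h, c) logical index; packed_dim: 0 (width), 1 (height), 2 (channels).
    Returns ((x, y, z) texel position, component within the texel)."""
    pos = list(idx)
    component = pos[packed_dim] % 4  # which of the 4 packed elements
    pos[packed_dim] //= 4            # collapse the packed dim into texels
    return tuple(pos), component

# Width-packed: neighbouring w indices share a texel.
assert tensor_idx_to_texel((5, 2, 0), 0) == ((1, 2, 0), 1)
# Channels-packed: neighbouring c indices share a texel.
assert tensor_idx_to_texel((5, 2, 6), 2) == ((5, 2, 1), 2)
```

With a specialization constant, a single shader binary can select the appropriate mapping at pipeline-creation time instead of baking one mapping into each variant.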

Support for specialization constants was added a while back, which enables the packing layout to be communicated to the shader via a specialization constant. This is a much more natural way for a shader to determine the packing layout of its tensors and vary its behaviour accordingly.

The primary benefit of this is that we can vastly reduce the number of variants that are generated; generating shader variants for every combination of dtype and memory layout can lead to a combinatorial explosion in build size.
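
A back-of-the-envelope sketch of the savings, with hypothetical dtype and layout counts (not the actual shader registry):

```python
# Hypothetical counts, for illustration only.
dtypes = ["float", "half", "int8"]
layouts = ["width_packed", "height_packed", "channels_packed"]

# Before: one shader variant per (dtype, layout) combination.
variants_per_op_before = len(dtypes) * len(layouts)
# After: layout is selected via a specialization constant at
# pipeline-creation time, so only dtype forces a new variant.
variants_per_op_after = len(dtypes)

assert variants_per_op_before == 9
assert variants_per_op_after == 3
```

The gap widens further with each additional layout or op, since the "before" count multiplies while the "after" count stays flat.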

Note that dtype cannot be passed as a specialization constant since it impacts the types used in the layout portion of a shader.

Deprecate GPU sizes

Currently there are two representations of a tensor's sizes: cpu_sizes() and gpu_sizes(). The GPU sizes are a simple modification of the CPU sizes, where the packed dimension is aligned to the next multiple of 4.

However, oftentimes shaders need to reference the original sizes of the tensor, so we end up passing both CPU sizes and GPU sizes. The CPU sizes are used to determine out-of-bounds elements, and the GPU sizes are used to convert between logical tensor indices and physical texture coordinates.

Since the GPU sizes are easily derived from the CPU sizes given the packing layout, this change deprecates GPU sizes and uses CPU sizes exclusively as the canonical tensor sizes. Hence cpu_sizes() is renamed to simply sizes().
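
A minimal Python sketch of the derivation, assuming a helper equivalent to the shader-side `alignup4` (the Python function names are illustrative stand-ins):

```python
# Sketch: GPU sizes derived on the fly from the canonical sizes() plus
# the packing layout, making a separate gpu_sizes() redundant.
def alignup4(n):
    """Round n up to the next multiple of 4."""
    return (n + 3) & ~3

def gpu_sizes(sizes, packed_dim):
    """Only the packed dimension is padded; all other dims are unchanged."""
    padded = list(sizes)
    padded[packed_dim] = alignup4(padded[packed_dim])
    return padded

# Channels-packed tensor of sizes (W=5, H=3, C=6): only C is padded.
assert gpu_sizes([5, 3, 6], packed_dim=2) == [5, 3, 8]
```

Because this derivation is a single alignment op, shaders can reconstruct the GPU sizes locally instead of binding a second uniform buffer.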

The primary benefits of this change are:

  1. Less confusion over how to reference the tensor sizes
  2. Fewer descriptors to bind when constructing compute pipelines

Differential Revision: D56377775


pytorch-bot bot commented Apr 19, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/3181

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (1 Unrelated Failure)

As of commit 6859982 with merge base 36453fc:

BROKEN TRUNK - The following job failed but was present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Apr 19, 2024
SS-JIA added a commit that referenced this pull request Apr 19, 2024
@facebook-github-bot (Contributor)

This pull request was exported from Phabricator. Differential Revision: D56377775

SS-JIA added a commit that referenced this pull request Apr 20, 2024
…ing layout via specialization constants

Pull Request resolved: #3181

## Context

This changeset cleans up how shaders consume tensor metadata in two ways:

### Pass in Packing Layout via Specialization Shader

The packing layout of a tensor determines how to convert between logical tensor indices and physical texture coordinates. Currently, a completely new variant of each shader is generated for every packing layout, which is rather expensive in terms of build size.

Support for specialization constants was added a while back, which enables the packing layout to be communicated to the shader via a specialization constant. This is a much more natural way for a shader to determine the packing layout of its tensors and vary its behaviour accordingly.

The primary benefit of this is that we can vastly reduce the number of variants that are generated; generating shader variants for every combination of dtype and memory layout can lead to a combinatorial explosion in build size.

Note that dtype cannot be passed as a specialization constant since it impacts the types used in the layout portion of a shader.

### Deprecate GPU sizes and Extents

Currently there are three representations of a tensor's sizes: `cpu_sizes()`, `gpu_sizes()`, and `extents()`. The GPU sizes are a simple modification of the CPU sizes, where the packed dimension is aligned to the next multiple of 4. Extents represent the physical extents of the image texture used to store the tensor.

However, oftentimes shaders need to reference the original sizes of the tensor, so we end up passing two different representations of the tensor sizes. The CPU sizes and extents are used to determine out-of-bounds elements, and the GPU sizes are used to convert between logical tensor indices and physical texture coordinates.

Since the GPU sizes and extents are easily derived from the CPU sizes given the packing layout, this change deprecates GPU sizes and uses CPU sizes exclusively as the canonical tensor sizes. Hence `cpu_sizes()` is renamed to simply `sizes()`.

The primary benefits of this change are:

1. Less confusion over how to reference the tensor sizes
2. Fewer descriptors to bind when constructing compute pipelines
3. Fewer uniform buffers to update when resizing tensors between inferences

Differential Revision: [D56377775](https://our.internmc.facebook.com/intern/diff/D56377775/)
ghstack-source-id: 223317313
@facebook-github-bot (Contributor)

This pull request was exported from Phabricator. Differential Revision: D56377775

@facebook-github-bot (Contributor)

This pull request has been merged in c350e58.

SS-JIA added a commit that referenced this pull request Apr 22, 2024
## Context

#3181 deprecated the `gpu_sizes_ubo()` and `extents_ubo()` functions of `vTensor` in order to standardize how shaders consume shape/size metadata of input tensors. However, this came at the cost of increased overhead for bounds checking, which is needed to support dynamic shapes: shaders now need to convert the input sizes to texture limits before checking whether a given texel position is out of bounds.

Benchmarking revealed that this overhead can be quite significant, especially on lower-power mobile GPUs. In the interest of preserving performance, `extents_ubo()` is re-introduced, since bounds checking is an operation common to every single shader. However, some improvements are made:

* Instead of `extents`, the nomenclature `texture_limits` is used, in order to differentiate it from the physical image extents of the texture.
* `texture_limits` is represented as an `ivec3` (previously `uvec4`); this means that using it for bounds checking requires no implicit cast from `uvec` to `ivec` and no swizzling.

Also introduced in this changeset is the convention of passing in both the texture limits and the tensor sizes instead of using `pos_out_of_bounds()`. Passing in the texture limits is probably cheaper than calling `pos_out_of_bounds()`. There are some exceptions, though, where I chose not to migrate to this pattern to avoid passing in too many variants of tensor metadata.
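
The trade-off can be sketched in Python (standing in for GLSL; the helper names and the exact limit derivation are illustrative assumptions, not the actual ET-VK shaders):

```python
# Sketch of the two bounds-checking strategies. With texture limits passed
# in, the check is a plain comparison; deriving limits from sizes repeats
# the alignment work in every shader invocation.
def alignup4(n):
    return (n + 3) & ~3

def out_of_bounds_with_limits(pos, texture_limits):
    # Cheap path: limits are precomputed on the host, once per resize.
    return any(p >= lim for p, lim in zip(pos, texture_limits))

def out_of_bounds_from_sizes(pos, sizes, packed_dim):
    # Expensive path: every invocation re-derives the limits from sizes,
    # collapsing the packed dim into texel counts (illustrative model).
    limits = list(sizes)
    limits[packed_dim] = alignup4(limits[packed_dim]) // 4
    return any(p >= lim for p, lim in zip(pos, limits))

# Width-packed tensor of sizes (W=6, H=3, C=2) -> texture limits (2, 3, 2):
assert not out_of_bounds_with_limits((1, 2, 1), (2, 3, 2))
assert out_of_bounds_from_sizes((2, 0, 0), (6, 3, 2), packed_dim=0)
```

Both paths reject the same positions; the difference is only where the alignment arithmetic runs, host-side once versus per texel in the shader.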

### What about `gpu_sizes_ubo`?

I will hold off on re-introducing `gpu_sizes_ubo` for now, since converting `sizes` to `gpu_sizes` is much cheaper than `pos_out_of_bounds()`:

```
ivec4 gpu_sizes = sizes;
gpu_sizes[packed_dim] = alignup4(sizes[packed_dim]);
```

Will perform some additional benchmarking on this to see if the overhead of the alignment warrants an explicit API for passing in GPU sizes to shaders.

Differential Revision: [D56435574](https://our.internmc.facebook.com/intern/diff/D56435574/)

[ghstack-poisoned]
SS-JIA added a commit that referenced this pull request Apr 22, 2024
facebook-github-bot pushed a commit that referenced this pull request Apr 22, 2024
Summary:
Pull Request resolved: #3217

ghstack-source-id: 223453651
exported-using-ghexport

Reviewed By: yipjustin, jorgep31415

Differential Revision: D56435574

fbshipit-source-id: 656f79eecbfc7c77cbe067df6c9ea54c51c50633
@mergennachin mergennachin mentioned this pull request Apr 26, 2024