Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Benchamarking #1353

Merged
merged 1 commit into from
Nov 27, 2024
Merged

Benchamarking #1353

merged 1 commit into from
Nov 27, 2024

Conversation

metascroy
Copy link
Contributor

Summary: Add benchmarks for experimental torchao kernels.

Differential Revision: D66512859

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Nov 26, 2024
Copy link

pytorch-bot bot commented Nov 26, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/1353

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 2a4a964 with merge base 5eb6339 (image):
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot
Copy link

This pull request was exported from Phabricator. Differential Revision: D66512859

metascroy added a commit to metascroy/ao that referenced this pull request Nov 26, 2024
Summary:

Add benchmarks for experimental torchao kernels.

Differential Revision: D66512859
@facebook-github-bot
Copy link

This pull request was exported from Phabricator. Differential Revision: D66512859

@metascroy metascroy requested a review from jerryzh168 November 26, 2024 21:40
import os
import subprocess
import tempfile
def cmake_build_torchao_ops(temp_build_dir):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this required? why not add to torchao default build?

Copy link
Contributor

@jerryzh168 jerryzh168 Nov 26, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this is required, I'd suggest to put these in a util and probably call it in int8_dynamic_activation_intx_weight

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Happy to add it to a default build for ARM CPU if you have a code pointer?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that would be great, here is our build code for kernels I think:

ao/setup.py

Line 53 in 5eb6339

def get_extensions():

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It doesn't look like cmake is being used there, and it would require more refactoring to use it. I factored out the install script into a utility, though.

Copy link
Contributor Author

@metascroy metascroy Nov 26, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok I'll land this for now. On creating build scripts in torchao/setup.py, I'll leave it up to you all. We already have build scripts for these kernels in ExecuTorch and torchchat, so they are not difficult to use on those surfaces.

cc @kimishpatel

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but if you add build scripts to torchao, will you still need to maintain these in torchchat and executorch?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, because both executorch/torchchat need to build executorch kernels, but if we added build scripts to torchao, they'd likely only build aten kernels (unless torchao takes a dependency on executorch).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, so looks like it's the same kernel implementation, but different build script for torchao/executorch/torchchat in this case

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

@jerryzh168 jerryzh168 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LG thanks, had one question/comment

Summary:

Add benchmarks for experimental torchao kernels.

Differential Revision: D66512859
@facebook-github-bot
Copy link

This pull request was exported from Phabricator. Differential Revision: D66512859

@jerryzh168 jerryzh168 added the topic: improvement Use this tag if this PR is an improvement (doesn't fit into any of the other categories) label Nov 26, 2024
@metascroy metascroy merged commit 04a25e7 into pytorch:main Nov 27, 2024
18 of 21 checks passed
# Build kernels in temp location, and load them in torch
# This requires an ARM CPU
from torchao.experimental.temp_build import temp_build_and_load_torchao_ops
temp_build_and_load_torchao_ops(cmake_lists_path=os.path.dirname(os.path.realpath(__file__)) + "/../../experimental")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I dont follow why we are doing this way? Why can we not just try to load the ops and if not found raise exception with build steps needed? Ideally we can detect the platform and build this deps as part of some pre-req for running benchmarks under _model directory. When we are able to ship these kernels as part of pip package, we may not need this

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When we are able to ship these kernels as part of pip package, we may not need this

what is the blocker for this?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we need to move out of the experimental. I think we need to follow up on this. I believe we have sufficient evidence now? @supriyar ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I dont follow why we are doing this way? Why can we not just try to load the ops and if not found raise exception with build steps needed? Ideally we can detect the platform and build this deps as part of some pre-req for running benchmarks under _model directory. When we are able to ship these kernels as part of pip package, we may not need this

I guess by "just try to load the ops" you mean something like this: https://github.com/pytorch/executorch/blob/main/extension/llm/custom_ops/sdpa_with_kv_cache.py#L21-L34 (but with the except block replaced by install instructions)

We could do that, but in the current setup, won't the try block always fail when running this benchmarking script (unless we build/load the ops in setup.py)? I'm also not sure what the instructions would say to make the script runnable without telling the user to modify the script by adding a torch.load_library line. I guess we could ask them to define an environment variable with the library location?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but with the except block replaced by install instructions

I'm also not sure what the instructions would say to make the script runnable without telling the user to modify the script by adding a torch.load_library line.

@metascroy Yes, but I dont follow the second part. Why user needs to modify the script. Can we not just import torchao.experimental.lowbit_ops which internally does try/except? But I do kind of get what you are doing here because any build scripts will ahve to figure out where to put the build artifact (.so) and then we need to load from there.

Ideally it should be installed as part of the setup instructions or pip package. So we can also follow something like https://github.com/pytorch/ao/blob/main/setup.py#L53 and add extra option to build with experimental lowbit quant features. So if user invokes the benchmarking script without building experimental kernels you can suggest please do python setup.py --use_experimental_low_bit_kernels or pip install . --use_experimental_low_bit_kernels. This feels a bit cleaner to me but curious to know your thoughts and also from @msaroufim

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The concern is whether we can reuse the cmake setup stuff we already have (e.g., this function that sets up the parallel compute: https://github.com/pytorch/ao/blob/main/torchao/experimental/Utils.cmake#L7)? If we bring in KleidiAI/CPUInfo via CMake, that will be more stuff to worry about.

I haven't used the torch CppExtension in setup.py, but it looks fairly simplistic compared to cmake. Perhaps we could do something like what this blog does, if it does not already exist in PyTorch: https://martinopilia.com/posts/2018/09/15/building-python-extension.html

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe I am missing something but Executorch's pybinding extension with xnnpack builds xnnpack which build with cpuinfo and pthreadpool and everything. Granted that is a whole lot to build but it does work in ET. So likely there is a nuance that you are worried about that I am not understanding

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess this goes back to one of my earlier comments to @jerryzh168: the current setup.py in torchao does not really support cmake, and it would require a good amount of refactoring to support it. Currently setup.py in torchao is built on utilities in torch.utils.cpp_extension, which look somewhat simplistic and as far as I can tell, do not support cmake.

ET's setup.py defines a custom extension to support cmake and doing something similar to torchao looks like a sizable refactor to their setup?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok I will have to rely on your answer for this since I havent looked at all the details of cpp_extension. I do remember it being simple but I dont know if there are ways to add deps as part of cpp_extension. I doubt but see if possible.

If not, I think it is worth proposing this for the sake of making ARM kernels available on mac builds. @drisspg any thoughts? discussion here is largely around more complex setup.py in order to allow us to build cpp package extensions that package cpu kernels and make it available as part of python package

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I edited setup.py to build the experimental kernels here: D67777662

Need feedback from torchao folks on whether the changes are acceptable.

yanbing-j pushed a commit to yanbing-j/ao that referenced this pull request Dec 9, 2024
Co-authored-by: Jack-Khuu <jack.khuu.7@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. fb-exported topic: improvement Use this tag if this PR is an improvement (doesn't fit into any of the other categories)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants