Benchmarking #1353
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/1353
Note: Links to docs will display an error until the docs builds have been completed.
✅ No Failures as of commit 2a4a964 with merge base 5eb6339.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
This pull request was exported from Phabricator. Differential Revision: D66512859
Summary: Add benchmarks for experimental torchao kernels. Differential Revision: D66512859
Force-pushed from de00766 to 75a0c6d
This pull request was exported from Phabricator. Differential Revision: D66512859
torchao/_models/llama/generate.py
Outdated
import os
import subprocess
import tempfile

def cmake_build_torchao_ops(temp_build_dir):
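A minimal sketch of how such a helper might be implemented, assuming CMake is available and the experimental kernels expose a top-level CMakeLists.txt; the flags and paths below are illustrative, not the PR's actual code:

```python
import os
import subprocess

def cmake_build_torchao_ops(temp_build_dir):
    # Hypothetical sketch: configure and build the experimental torchao
    # kernels with CMake into a temporary build directory.
    cmake_lists_path = os.path.join(
        os.path.dirname(os.path.realpath(__file__)), "..", "..", "experimental"
    )
    subprocess.run(
        ["cmake", "-S", cmake_lists_path, "-B", temp_build_dir,
         "-DCMAKE_BUILD_TYPE=Release"],
        check=True,
    )
    subprocess.run(["cmake", "--build", temp_build_dir, "-j"], check=True)
```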
Is this required? Why not add it to the torchao default build?
If this is required, I'd suggest putting these in a util and probably calling it in int8_dynamic_activation_intx_weight.
Happy to add it to a default build for ARM CPU if you have a code pointer?
That would be great. Here is our build code for kernels, I think (setup.py, line 53 at 5eb6339):

def get_extensions():
It doesn't look like cmake is being used there, and it would require more refactoring to use it. I factored out the install script into a utility, though.
OK, I'll land this for now. On creating build scripts in torchao's setup.py, I'll leave it up to you all. We already have build scripts for these kernels in ExecuTorch and torchchat, so they are not difficult to use on those surfaces.
cc @kimishpatel
But if you add build scripts to torchao, will you still need to maintain these in torchchat and executorch?
Yes, because both executorch/torchchat need to build executorch kernels, but if we added build scripts to torchao, they'd likely only build aten kernels (unless torchao takes a dependency on executorch).
OK, so it looks like it's the same kernel implementation but a different build script for torchao/executorch/torchchat in this case.
Example in ET for cmake: https://github.com/pytorch/executorch/blob/main/setup.py#L496
LG, thanks. Had one question/comment.
Summary: Add benchmarks for experimental torchao kernels. Differential Revision: D66512859
Force-pushed from 75a0c6d to 2a4a964
This pull request was exported from Phabricator. Differential Revision: D66512859
# Build kernels in temp location, and load them in torch
# This requires an ARM CPU
from torchao.experimental.temp_build import temp_build_and_load_torchao_ops
temp_build_and_load_torchao_ops(cmake_lists_path=os.path.dirname(os.path.realpath(__file__)) + "/../../experimental")
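For readers, a rough sketch of what a utility like temp_build_and_load_torchao_ops could do under the hood; the CMake invocation and library glob are assumptions, not the PR's actual code:

```python
import glob
import subprocess
import tempfile

import torch

def temp_build_and_load_torchao_ops(cmake_lists_path):
    # Build the experimental kernels into a temp dir that outlives this
    # call, then register every compiled shared library with PyTorch.
    temp_build_dir = tempfile.mkdtemp()
    subprocess.run(["cmake", "-S", cmake_lists_path, "-B", temp_build_dir], check=True)
    subprocess.run(["cmake", "--build", temp_build_dir, "-j"], check=True)
    libs = glob.glob(f"{temp_build_dir}/**/*.so", recursive=True)
    libs += glob.glob(f"{temp_build_dir}/**/*.dylib", recursive=True)  # macOS
    for lib in libs:
        torch.ops.load_library(lib)
```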
I don't follow why we are doing it this way. Why can we not just try to load the ops and, if they're not found, raise an exception with the build steps needed? Ideally we can detect the platform and build these deps as part of some pre-req for running the benchmarks under the _models directory. When we are able to ship these kernels as part of the pip package, we may not need this.
> When we are able to ship these kernels as part of the pip package, we may not need this

What is the blocker for this?
We need to move out of experimental. I think we need to follow up on this; I believe we have sufficient evidence now. @supriyar?
> I don't follow why we are doing it this way. Why can we not just try to load the ops and, if they're not found, raise an exception with the build steps needed? Ideally we can detect the platform and build these deps as part of some pre-req for running the benchmarks under the _models directory. When we are able to ship these kernels as part of the pip package, we may not need this.
I guess by "just try to load the ops" you mean something like this: https://github.com/pytorch/executorch/blob/main/extension/llm/custom_ops/sdpa_with_kv_cache.py#L21-L34 (but with the except block replaced by install instructions)
We could do that, but in the current setup, won't the try block always fail when running this benchmarking script (unless we build/load the ops in setup.py)? I'm also not sure what the instructions would say to make the script runnable without telling the user to modify the script by adding a torch.ops.load_library line. I guess we could ask them to define an environment variable with the library location?
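For concreteness, the pattern under discussion would look roughly like this; the environment-variable name and the error message are placeholders, not an existing torchao API:

```python
import os

import torch

# Sketch: try to load prebuilt experimental ops; on failure, surface
# build instructions instead of a bare import error.
try:
    lib = os.environ["TORCHAO_EXPERIMENTAL_OPS_LIB"]  # hypothetical variable
    torch.ops.load_library(lib)
except (KeyError, OSError) as e:
    raise RuntimeError(
        "Experimental torchao kernels not found. Build them with CMake under "
        "torchao/experimental, then set TORCHAO_EXPERIMENTAL_OPS_LIB to the "
        "path of the resulting shared library."
    ) from e
```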
> but with the except block replaced by install instructions

> I'm also not sure what the instructions would say to make the script runnable without telling the user to modify the script by adding a torch.ops.load_library line.
@metascroy Yes, but I don't follow the second part. Why does the user need to modify the script? Can we not just import torchao.experimental.lowbit_ops, which internally does try/except? But I do kind of get what you are doing here, because any build script will have to figure out where to put the build artifact (.so), and then we need to load it from there.
Ideally it should be installed as part of the setup instructions or pip package. So we could also follow something like https://github.com/pytorch/ao/blob/main/setup.py#L53 and add an extra option to build with the experimental low-bit quant features. Then, if the user invokes the benchmarking script without building the experimental kernels, you can suggest: please run python setup.py --use_experimental_low_bit_kernels or pip install . --use_experimental_low_bit_kernels. This feels a bit cleaner to me, but I'm curious to know your thoughts, and also @msaroufim's.
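As a rough sketch of that option: custom command-line flags are awkward in setuptools, so an environment-variable gate (in the spirit of the existing gates in torchao's setup.py) might look like the following; the variable name is hypothetical:

```python
# setup.py (sketch): gate the experimental kernel build behind an env var.
import os

from setuptools import setup

use_experimental = os.environ.get("USE_EXPERIMENTAL_LOW_BIT_KERNELS", "0") == "1"

def get_extensions():
    extensions = []
    # ... existing C++/CUDA extensions ...
    if use_experimental:
        # Build and bundle the experimental ARM CPU kernels here, e.g. by
        # invoking CMake from a custom build_ext (see the sketch further down).
        pass
    return extensions

setup(name="torchao", ext_modules=get_extensions())
```

The benchmarking script could then suggest USE_EXPERIMENTAL_LOW_BIT_KERNELS=1 pip install . when the ops are missing.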
The concern is whether we can reuse the cmake setup we already have (e.g., this function that sets up the parallel compute: https://github.com/pytorch/ao/blob/main/torchao/experimental/Utils.cmake#L7). If we bring in KleidiAI/CPUInfo via CMake, that will be more stuff to worry about.
I haven't used the torch CppExtension in setup.py, but it looks fairly simplistic compared to cmake. Perhaps we could do something like what this blog does, if it does not already exist in PyTorch: https://martinopilia.com/posts/2018/09/15/building-python-extension.html
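The pattern from that blog post is roughly the following generic sketch (not torchao's actual setup.py): a sources-less Extension whose compilation is delegated to CMake by a custom build_ext command:

```python
import os
import subprocess

from setuptools import Extension
from setuptools.command.build_ext import build_ext

class CMakeExtension(Extension):
    # An extension with no sources; the real build happens in CMake.
    def __init__(self, name, cmake_lists_dir):
        super().__init__(name, sources=[])
        self.cmake_lists_dir = os.path.abspath(cmake_lists_dir)

class CMakeBuild(build_ext):
    def build_extension(self, ext):
        if not isinstance(ext, CMakeExtension):
            return super().build_extension(ext)
        # Drop the built artifacts where setuptools expects the extension.
        out_dir = os.path.abspath(os.path.dirname(self.get_ext_fullpath(ext.name)))
        os.makedirs(self.build_temp, exist_ok=True)
        subprocess.run(
            ["cmake", ext.cmake_lists_dir,
             f"-DCMAKE_LIBRARY_OUTPUT_DIRECTORY={out_dir}"],
            cwd=self.build_temp, check=True,
        )
        subprocess.run(["cmake", "--build", ".", "-j"], cwd=self.build_temp, check=True)

# Usage: setup(..., ext_modules=[CMakeExtension("torchao_experimental",
#              "torchao/experimental")], cmdclass={"build_ext": CMakeBuild})
```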
Maybe I am missing something, but ExecuTorch's pybinding extension with XNNPACK builds XNNPACK, which builds with cpuinfo and pthreadpool and everything. Granted, that is a whole lot to build, but it does work in ET. So likely there is a nuance you are worried about that I am not understanding.
I guess this goes back to one of my earlier comments to @jerryzh168: the current setup.py in torchao does not really support cmake, and it would require a good amount of refactoring to support it. Currently setup.py in torchao is built on utilities in torch.utils.cpp_extension, which look somewhat simplistic and, as far as I can tell, do not support cmake.
ET's setup.py defines a custom extension to support cmake, and doing something similar in torchao looks like a sizable refactor of its setup.
OK, I will have to rely on your answer for this, since I haven't looked at all the details of cpp_extension. I do remember it being simple, but I don't know if there are ways to add deps as part of cpp_extension. I doubt it, but see if it's possible.
If not, I think it is worth proposing this for the sake of making ARM kernels available in Mac builds. @drisspg, any thoughts? The discussion here is largely around a more complex setup.py that would allow us to build C++ package extensions that package CPU kernels and make them available as part of the Python package.
I edited setup.py to build the experimental kernels here: D67777662
Need feedback from torchao folks on whether the changes are acceptable.
Co-authored-by: Jack-Khuu <jack.khuu.7@gmail.com>
Summary: Add benchmarks for experimental torchao kernels.
Differential Revision: D66512859