Benchmarking #1353
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/1353
Note: Links to docs will display an error until the docs builds have been completed.
✅ No Failures as of commit 2a4a964 with merge base 5eb6339.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
This pull request was exported from Phabricator. Differential Revision: D66512859
Summary: Add benchmarks for experimental torchao kernels. Differential Revision: D66512859
Force-pushed from de00766 to 75a0c6d
This pull request was exported from Phabricator. Differential Revision: D66512859
torchao/_models/llama/generate.py
Outdated
import os
import subprocess
import tempfile

def cmake_build_torchao_ops(temp_build_dir):
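A minimal sketch of how such a helper might be implemented, assuming CMake is available and the experimental kernels expose a top-level CMakeLists.txt; the flags and paths below are illustrative, not the PR's actual code:

```python
import os
import subprocess

def cmake_build_torchao_ops(temp_build_dir):
    # Hypothetical sketch: configure and build the experimental torchao
    # kernels with CMake into a temporary build directory.
    cmake_lists_path = os.path.join(
        os.path.dirname(os.path.realpath(__file__)), "..", "..", "experimental"
    )
    subprocess.run(
        ["cmake", "-S", cmake_lists_path, "-B", temp_build_dir,
         "-DCMAKE_BUILD_TYPE=Release"],
        check=True,
    )
    subprocess.run(["cmake", "--build", temp_build_dir, "-j"], check=True)
```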
Is this required? Why not add it to the torchao default build?
If this is required, I'd suggest putting these in a util and probably calling it in int8_dynamic_activation_intx_weight.
Happy to add it to a default build for ARM CPU if you have a code pointer?
That would be great. Here is our build code for kernels, I think (setup.py, line 53 at 5eb6339):

def get_extensions():
It doesn't look like cmake is being used there, and it would require more refactoring to use it. I factored out the install script into a utility, though.
OK, I'll land this for now. On creating build scripts in torchao's setup.py, I'll leave it up to you all. We already have build scripts for these kernels in ExecuTorch and torchchat, so they are not difficult to use on those surfaces.
cc @kimishpatel
But if you add build scripts to torchao, will you still need to maintain these in torchchat and executorch?
Yes, because both executorch/torchchat need to build executorch kernels, but if we added build scripts to torchao, they'd likely only build aten kernels (unless torchao takes a dependency on executorch).
OK, so it looks like it's the same kernel implementation but a different build script for torchao/executorch/torchchat in this case.
Example in ET for cmake: https://github.com/pytorch/executorch/blob/main/setup.py#L496
LG, thanks. Had one question/comment.
Summary: Add benchmarks for experimental torchao kernels. Differential Revision: D66512859
Force-pushed from 75a0c6d to 2a4a964
This pull request was exported from Phabricator. Differential Revision: D66512859
# Build kernels in temp location, and load them in torch
# This requires an ARM CPU
from torchao.experimental.temp_build import temp_build_and_load_torchao_ops
temp_build_and_load_torchao_ops(cmake_lists_path=os.path.dirname(os.path.realpath(__file__)) + "/../../experimental")
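For readers, a rough sketch of what a utility like temp_build_and_load_torchao_ops could do under the hood; the CMake invocation and library glob are assumptions, not the PR's actual code:

```python
import glob
import subprocess
import tempfile

import torch

def temp_build_and_load_torchao_ops(cmake_lists_path):
    # Build the experimental kernels into a temp dir that outlives this
    # call, then register every compiled shared library with PyTorch.
    temp_build_dir = tempfile.mkdtemp()
    subprocess.run(["cmake", "-S", cmake_lists_path, "-B", temp_build_dir], check=True)
    subprocess.run(["cmake", "--build", temp_build_dir, "-j"], check=True)
    libs = glob.glob(f"{temp_build_dir}/**/*.so", recursive=True)
    libs += glob.glob(f"{temp_build_dir}/**/*.dylib", recursive=True)  # macOS
    for lib in libs:
        torch.ops.load_library(lib)
```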
I don't follow why we are doing it this way. Why can we not just try to load the ops and, if they're not found, raise an exception with the build steps needed? Ideally we can detect the platform and build these deps as part of some pre-req for running the benchmarks under the _models directory. When we are able to ship these kernels as part of the pip package, we may not need this.
> When we are able to ship these kernels as part of the pip package, we may not need this

What is the blocker for this?
We need to move out of experimental. I think we need to follow up on this; I believe we have sufficient evidence now. @supriyar?
> I don't follow why we are doing it this way. Why can we not just try to load the ops and, if they're not found, raise an exception with the build steps needed? Ideally we can detect the platform and build these deps as part of some pre-req for running the benchmarks under the _models directory. When we are able to ship these kernels as part of the pip package, we may not need this.
I guess by "just try to load the ops" you mean something like this: https://github.com/pytorch/executorch/blob/main/extension/llm/custom_ops/sdpa_with_kv_cache.py#L21-L34 (but with the except block replaced by install instructions)
We could do that, but in the current setup, won't the try block always fail when running this benchmarking script (unless we build/load the ops in setup.py)? I'm also not sure what the instructions would say to make the script runnable without telling the user to modify the script by adding a torch.ops.load_library line. I guess we could ask them to define an environment variable with the library location?
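For concreteness, the pattern under discussion would look roughly like this; the environment-variable name and the error message are placeholders, not an existing torchao API:

```python
import os

import torch

# Sketch: try to load prebuilt experimental ops; on failure, surface
# build instructions instead of a bare import error.
try:
    lib = os.environ["TORCHAO_EXPERIMENTAL_OPS_LIB"]  # hypothetical variable
    torch.ops.load_library(lib)
except (KeyError, OSError) as e:
    raise RuntimeError(
        "Experimental torchao kernels not found. Build them with CMake under "
        "torchao/experimental, then set TORCHAO_EXPERIMENTAL_OPS_LIB to the "
        "path of the resulting shared library."
    ) from e
```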
> but with the except block replaced by install instructions

> I'm also not sure what the instructions would say to make the script runnable without telling the user to modify the script by adding a torch.ops.load_library line.
@metascroy Yes, but I don't follow the second part. Why does the user need to modify the script? Can we not just import torchao.experimental.lowbit_ops, which internally does try/except? But I do kind of get what you are doing here, because any build script will have to figure out where to put the build artifact (.so), and then we need to load it from there.
Ideally it should be installed as part of the setup instructions or pip package. So we could also follow something like https://github.com/pytorch/ao/blob/main/setup.py#L53 and add an extra option to build with the experimental low-bit quant features. Then, if the user invokes the benchmarking script without building the experimental kernels, you can suggest: please run python setup.py --use_experimental_low_bit_kernels or pip install . --use_experimental_low_bit_kernels. This feels a bit cleaner to me, but I'm curious to know your thoughts, and also @msaroufim's.
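As a rough sketch of that option: custom command-line flags are awkward in setuptools, so an environment-variable gate (in the spirit of the existing gates in torchao's setup.py) might look like the following; the variable name is hypothetical:

```python
# setup.py (sketch): gate the experimental kernel build behind an env var.
import os

from setuptools import setup

use_experimental = os.environ.get("USE_EXPERIMENTAL_LOW_BIT_KERNELS", "0") == "1"

def get_extensions():
    extensions = []
    # ... existing C++/CUDA extensions ...
    if use_experimental:
        # Build and bundle the experimental ARM CPU kernels here, e.g. by
        # invoking CMake from a custom build_ext (see the sketch further down).
        pass
    return extensions

setup(name="torchao", ext_modules=get_extensions())
```

The benchmarking script could then suggest USE_EXPERIMENTAL_LOW_BIT_KERNELS=1 pip install . when the ops are missing.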
The concern is whether we can reuse the cmake setup we already have (e.g., this function that sets up the parallel compute: https://github.com/pytorch/ao/blob/main/torchao/experimental/Utils.cmake#L7). If we bring in KleidiAI/CPUInfo via CMake, that will be more stuff to worry about.
I haven't used the torch CppExtension in setup.py, but it looks fairly simplistic compared to cmake. Perhaps we could do something like what this blog does, if it does not already exist in PyTorch: https://martinopilia.com/posts/2018/09/15/building-python-extension.html
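The pattern from that blog post is roughly the following generic sketch (not torchao's actual setup.py): a sources-less Extension whose compilation is delegated to CMake by a custom build_ext command:

```python
import os
import subprocess

from setuptools import Extension
from setuptools.command.build_ext import build_ext

class CMakeExtension(Extension):
    # An extension with no sources; the real build happens in CMake.
    def __init__(self, name, cmake_lists_dir):
        super().__init__(name, sources=[])
        self.cmake_lists_dir = os.path.abspath(cmake_lists_dir)

class CMakeBuild(build_ext):
    def build_extension(self, ext):
        if not isinstance(ext, CMakeExtension):
            return super().build_extension(ext)
        # Drop the built artifacts where setuptools expects the extension.
        out_dir = os.path.abspath(os.path.dirname(self.get_ext_fullpath(ext.name)))
        os.makedirs(self.build_temp, exist_ok=True)
        subprocess.run(
            ["cmake", ext.cmake_lists_dir,
             f"-DCMAKE_LIBRARY_OUTPUT_DIRECTORY={out_dir}"],
            cwd=self.build_temp, check=True,
        )
        subprocess.run(["cmake", "--build", ".", "-j"], cwd=self.build_temp, check=True)

# Usage: setup(..., ext_modules=[CMakeExtension("torchao_experimental",
#              "torchao/experimental")], cmdclass={"build_ext": CMakeBuild})
```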
Maybe I am missing something, but ExecuTorch's pybinding extension with XNNPACK builds XNNPACK, which builds with cpuinfo and pthreadpool and everything. Granted, that is a whole lot to build, but it does work in ET. So likely there is a nuance you are worried about that I am not understanding.
I guess this goes back to one of my earlier comments to @jerryzh168: the current setup.py in torchao does not really support cmake, and it would require a good amount of refactoring to support it. Currently setup.py in torchao is built on utilities in torch.utils.cpp_extension, which look somewhat simplistic and, as far as I can tell, do not support cmake.
ET's setup.py defines a custom extension to support cmake, and doing something similar in torchao looks like a sizable refactor of its setup.
OK, I will have to rely on your answer for this, since I haven't looked at all the details of cpp_extension. I do remember it being simple, but I don't know if there are ways to add deps as part of cpp_extension. I doubt it, but see if it's possible.
If not, I think it is worth proposing this for the sake of making ARM kernels available in Mac builds. @drisspg, any thoughts? The discussion here is largely around a more complex setup.py that would allow us to build C++ package extensions that package CPU kernels and make them available as part of the Python package.
I edited setup.py to build the experimental kernels here: D67777662
Need feedback from torchao folks on whether the changes are acceptable.
Co-authored-by: Jack-Khuu <jack.khuu.7@gmail.com>
Summary: Add benchmarks for experimental torchao kernels.
Differential Revision: D66512859