Add support for accelerating with QNN on Windows on ARM #7541
Comments
#5079 is related, and I'd agree it would be great to have NPU support for the systems that have them. Microsoft is pushing DirectML.
@ggerganov Sorry for the trivial question, but the QNN backend doesn't support tensors with different dimensions (this is stated explicitly in their docs). Is this a mandatory requirement of llama.cpp, or is it something that varies by model?
How come? Pretty much all tensors have different dimensions
I'm wondering if I interpreted this wrong @ggerganov. Look for the
I see, I am not sure about the full implications, but I know that certain hardware has no or poor support for computations in which the shapes change after each call. This is the case for Transformer-based LLMs, because some of the tensor shapes grow with the number of tokens in the context. In contrast, CNNs used in computer vision, for example, usually have static shapes for any kind of input. There are tricks we can do to overcome this limitation, but it would make general support for this hardware more difficult, more customized, and in the realm of "proof-of-concept". Again, I'm not really familiar with the details - it's best if people working on this can analyze the limitations and propose what to do
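To make the shape issue concrete, here is a minimal standalone sketch (illustrative only, not llama.cpp code; all names and sizes are made up) of how the attention score matrix for a single head changes shape on every decoding step as the context grows:

```cpp
#include <cstdio>

int main() {
    const int head_dim = 128;
    for (int n_past = 1; n_past <= 4; ++n_past) {
        // The K cache for one head grows with the number of tokens seen so far.
        const int k_rows = n_past;
        // One new token is decoded per step, so Q is a single row.
        const int q_rows = 1;
        // Attention scores = Q * K^T, so the result is q_rows x k_rows,
        // i.e. a different shape on every call.
        std::printf("step %d: K cache %dx%d, Q %dx%d, scores %dx%d\n",
                    n_past, k_rows, head_dim, q_rows, head_dim, q_rows, k_rows);
    }
    // A CNN, by contrast, uses the same tensor shapes for every input,
    // which is what fixed-shape NPU graphs are built for.
    return 0;
}
```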
This issue was closed because it has been inactive for 14 days since being marked as stale.
Prerequisites
Please answer the following questions for yourself before submitting an issue.
Feature Description
Please add support for accelerating with Qualcomm QNN on Windows.
Motivation
Every ARM64 laptop since the Surface Pro X has an NPU. It's not the shiny new 40+ TOPS NPU that the Copilot+ PCs have, but it's fast enough for llama.cpp with certain models. For instance, the Snapdragon 8cx Gen 3 has a 15 TOPS NPU and supports operators that the CPU lacks for accelerating local inference (like MATMUL). QNN will be blazing fast on Copilot+ PCs too.
Possible Implementation
The QNN SDK is freely available from the Qualcomm Developer website as the Qualcomm AI Engine Direct SDK (https://www.qualcomm.com/developer/software/qualcomm-ai-engine-direct-sdk).
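As a rough sketch of the very first step such a backend would need, the snippet below (an assumption-laden illustration, not an existing llama.cpp or SDK sample) dynamically loads the HTP backend library that ships with the SDK on Windows on ARM and resolves its interface-provider entry point. The QnnHtp.dll library name and the QnnInterface_getProviders symbol are taken from the QNN SDK documentation, but exact names, locations, and signatures should be verified against the installed SDK version.

```cpp
#include <windows.h>
#include <cstdio>

int main() {
    // The HTP (NPU) backend ships as QnnHtp.dll in the SDK's Windows-on-ARM lib
    // folder (exact path varies by SDK version); it must be on the DLL search path.
    HMODULE qnn = LoadLibraryA("QnnHtp.dll");
    if (!qnn) {
        std::fprintf(stderr, "failed to load QnnHtp.dll - is the QNN SDK on PATH?\n");
        return 1;
    }

    // Per the QNN SDK docs, every backend library exposes QnnInterface_getProviders.
    // The real signature is declared in QnnInterface.h; here we only check the symbol.
    FARPROC get_providers = GetProcAddress(qnn, "QnnInterface_getProviders");
    std::printf("QnnInterface_getProviders %s\n", get_providers ? "found" : "not found");

    FreeLibrary(qnn);
    return 0;
}
```

A real integration would go on to query the provider table, create a device and context, and map ggml graphs onto QNN graphs, which is where the static-shape limitation discussed above comes into play.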