Add support for accelerating with QNN on Windows on ARM #7541

Closed
hmartinez82 opened this issue May 26, 2024 · 6 comments
Labels
enhancement (New feature or request), stale

Comments

hmartinez82 commented May 26, 2024

Prerequisites

Please answer the following questions for yourself before submitting an issue.

  • I am running the latest code. Development is very rapid so there are no tagged versions as of now.
  • I carefully followed the README.md.
  • I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • I reviewed the Discussions, and have a new bug or useful enhancement to share.

Feature Description

Please add support for accelerating inference with Qualcomm QNN on Windows on ARM.

Motivation

Every ARM64 laptop since the Surface Pro X has an NPU. It's not the shiny new 40+ TOPS one that the Copilot+ PCs have, but it's fast enough for llama.cpp with certain models. For instance, the Snapdragon 8cx Gen 3 has a 15 TOPS NPU, and it supports operators for accelerating local inference that the CPU lacks (like MATMUL). QNN would be blazing fast on Copilot+ PCs too.

Possible Implementation

The QNN SDK is freely available from the Qualcomm developer website (https://www.qualcomm.com/developer/software/qualcomm-ai-engine-direct-sdk) as the Qualcomm AI Engine Direct SDK.
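For reference, QNN backends ship as shared libraries (e.g. QnnHtp.dll for the NPU backend) that the host application loads at runtime and queries through the QnnInterface_getProviders entry point. A minimal sketch of that bootstrap on Windows might look like the following; the types come from the SDK's QnnInterface.h header, the exact signatures should be checked against the SDK version you build with, and error handling is reduced to the bare minimum:

```cpp
// Sketch: dynamically load a QNN backend DLL and enumerate its
// interface providers. Types are from the QNN SDK's QnnInterface.h;
// verify signatures against the headers shipped with your SDK.
#include <windows.h>
#include <cstdio>
#include "QnnInterface.h"

int main() {
    // QnnHtp.dll is the HTP (NPU) backend shipped with the SDK.
    HMODULE lib = LoadLibraryA("QnnHtp.dll");
    if (!lib) { fprintf(stderr, "failed to load QnnHtp.dll\n"); return 1; }

    using GetProvidersFn = Qnn_ErrorHandle_t (*)(const QnnInterface_t***, uint32_t*);
    auto getProviders = reinterpret_cast<GetProvidersFn>(
        GetProcAddress(lib, "QnnInterface_getProviders"));
    if (!getProviders) { fprintf(stderr, "symbol not found\n"); return 1; }

    const QnnInterface_t** providers = nullptr;
    uint32_t numProviders = 0;
    if (getProviders(&providers, &numProviders) != QNN_SUCCESS) return 1;

    printf("backend exposes %u interface provider(s)\n", numProviders);
    // A real integration would pick the provider whose apiVersion matches
    // the QNN API version it was built against, then drive the backend
    // through that provider's function table.
    return 0;
}
```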

hmartinez82 added the enhancement (New feature or request) label on May 26, 2024
wcwong commented May 28, 2024

#5079 is related, and I agree it would be great to have NPU support on the systems that have one.

Microsoft is pushing DirectML.

hmartinez82 commented May 28, 2024

@ggerganov Sorry for the trivial question, but the QNN backend doesn't support tensors with dynamic dimensions (this is stated explicitly in their docs). Is dynamic-dimension support a mandatory requirement of llama.cpp, or is it something that varies by model?

ggerganov (Owner) commented

> QNN backend doesn't support tensors with different dimensions

How come? Pretty much all tensors have different dimensions.

hmartinez82 commented May 29, 2024

I'm wondering if I interpreted this wrong, @ggerganov. Look for the QNN_PROPERTY_TENSOR_SUPPORT_DYNAMIC_DIMENSIONS capability at https://docs.qualcomm.com/bundle/publicresource/topics/80-63442-50/supported_capabilities.html
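For what it's worth, the SDK exposes this as a queryable capability, so a backend could detect it at runtime rather than assuming either way. A rough sketch, assuming the QnnProperty_hasCapability entry point and the key/status names from the SDK's QnnProperty.h (exact spellings should be verified against the headers):

```cpp
// Sketch: ask the loaded QNN backend whether it supports tensors whose
// dimensions change between executions. Function, key, and status names
// follow the SDK's QnnProperty.h; check them against your SDK version.
#include <cstdio>
#include "QnnProperty.h"

bool backend_supports_dynamic_dims() {
    Qnn_ErrorHandle_t err =
        QnnProperty_hasCapability(QNN_PROPERTY_TENSOR_SUPPORT_DYNAMIC_DIMENSIONS);
    return err == QNN_PROPERTY_SUPPORTED;
}

int main() {
    if (backend_supports_dynamic_dims()) {
        printf("dynamic tensor dimensions supported\n");
    } else {
        printf("static shapes only: graphs must be finalized with fixed dims\n");
    }
    return 0;
}
```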

ggerganov (Owner) commented

I see. I am not sure about the full implications, but I know that certain hardware has poor or no support for computations in which the shapes change after each call. This is the case for Transformer-based LLMs, because some of the tensor shapes grow with the number of tokens in the context. In contrast, CNNs used in computer vision usually have static shapes for any kind of input.

There are tricks we can do to overcome this limitation, but they would make general support for this hardware more difficult, customized, and in the realm of "proof-of-concept". Again, I'm not really familiar with the details; it's best if the people working on this analyze the limitations and propose what to do.
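To make the "tricks" concrete: one common workaround on static-shape hardware is to round the growing dimension up to a small set of fixed bucket sizes and pre-build one graph per bucket, trading some wasted compute (and masking of the padded positions) for shapes that never change between calls. A hypothetical sketch of the bucketing logic, not anything llama.cpp currently does:

```cpp
// Sketch: round a growing context length up to a fixed bucket so a
// static-shape accelerator only ever sees a handful of distinct graph
// shapes. Padded positions must be masked out during attention.
#include <array>
#include <cstdint>
#include <cstdio>

// Hypothetical bucket sizes; a real backend would tune these.
constexpr std::array<int32_t, 4> kCtxBuckets = {512, 1024, 2048, 4096};

int32_t padded_ctx(int32_t n_tokens) {
    for (int32_t b : kCtxBuckets) {
        if (n_tokens <= b) return b;
    }
    return -1;  // beyond the largest pre-built graph: fall back to CPU
}

int main() {
    for (int32_t n : {17, 600, 3000, 5000}) {
        printf("n_tokens=%d -> graph built for ctx=%d\n", n, padded_ctx(n));
    }
    return 0;
}
```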

The github-actions bot added the stale label on Jun 29, 2024
github-actions bot commented

This issue was closed because it has been inactive for 14 days since being marked as stale.
