Add support for accelerating with QNN on Windows on ARM #7541
Comments
#5079 is related, and I'd agree it would be great to have NPU support for the systems that have them. Microsoft is pushing DirectML.
@ggerganov Sorry for the trivial question, but the QNN backend doesn't support tensors with different dimensions (this is stated explicitly in their docs). Is this a mandatory requirement of llama.cpp, or is it something that varies by model?
How come? Pretty much all tensors have different dimensions
I'm wondering if I interpreted this wrong @ggerganov. Look for the
I see, I am not sure about the full implications, but I know that certain hardware has no or poor support for computations in which the shapes change after each call. This is the case for Transformer-based LLMs, because some of the tensor shapes grow with the number of tokens in the context. In contrast, CNNs used in computer vision, for example, usually have static shapes for any kind of input. There are tricks we can do to overcome this limitation, but it would make general support for this hardware more difficult, more customized, and in the realm of "proof-of-concept". Again, I'm not really familiar with the details - it's best if people working on this can analyze the limitations and propose what to do
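To make the shape issue concrete, here is a minimal standalone sketch (illustrative only, not llama.cpp code; all names and sizes are made up) of how the attention score matrix for a single head changes shape on every decoding step as the context grows:

```cpp
#include <cstdio>

int main() {
    const int head_dim = 128;
    for (int n_past = 1; n_past <= 4; ++n_past) {
        // The K cache for one head grows with the number of tokens seen so far.
        const int k_rows = n_past;
        // One new token is decoded per step, so Q is a single row.
        const int q_rows = 1;
        // Attention scores = Q * K^T, so the result is q_rows x k_rows,
        // i.e. a different shape on every call.
        std::printf("step %d: K cache %dx%d, Q %dx%d, scores %dx%d\n",
                    n_past, k_rows, head_dim, q_rows, head_dim, q_rows, k_rows);
    }
    // A CNN, by contrast, uses the same tensor shapes for every input,
    // which is what fixed-shape NPU graphs are built for.
    return 0;
}
```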
This issue was closed because it has been inactive for 14 days since being marked as stale.
Prerequisites
Please answer the following questions for yourself before submitting an issue.
Feature Description
Please add support for accelerating with Qualcomm QNN on Windows.
Motivation
Every ARM64 laptop since the Surface Pro X has an NPU. It's not the shiny new 40+ TOPS NPU that the Copilot+ PCs have, but it's fast enough for llama.cpp with certain models. For instance, the Snapdragon 8cx Gen 3 has a 15 TOPS NPU and supports operators that the CPU lacks for accelerating local inference (like MATMUL). QNN will be blazing fast on Copilot+ PCs too.
Possible Implementation
The QNN SDK is freely available from the Qualcomm Developer website as the Qualcomm AI Engine Direct SDK (https://www.qualcomm.com/developer/software/qualcomm-ai-engine-direct-sdk).
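As a rough sketch of the very first step such a backend would need, the snippet below (an assumption-laden illustration, not an existing llama.cpp or SDK sample) dynamically loads the HTP backend library that ships with the SDK on Windows on ARM and resolves its interface-provider entry point. The QnnHtp.dll library name and the QnnInterface_getProviders symbol are taken from the QNN SDK documentation, but exact names, locations, and signatures should be verified against the installed SDK version.

```cpp
#include <windows.h>
#include <cstdio>

int main() {
    // The HTP (NPU) backend ships as QnnHtp.dll in the SDK's Windows-on-ARM lib
    // folder (exact path varies by SDK version); it must be on the DLL search path.
    HMODULE qnn = LoadLibraryA("QnnHtp.dll");
    if (!qnn) {
        std::fprintf(stderr, "failed to load QnnHtp.dll - is the QNN SDK on PATH?\n");
        return 1;
    }

    // Per the QNN SDK docs, every backend library exposes QnnInterface_getProviders.
    // The real signature is declared in QnnInterface.h; here we only check the symbol.
    FARPROC get_providers = GetProcAddress(qnn, "QnnInterface_getProviders");
    std::printf("QnnInterface_getProviders %s\n", get_providers ? "found" : "not found");

    FreeLibrary(qnn);
    return 0;
}
```

A real integration would go on to query the provider table, create a device and context, and map ggml graphs onto QNN graphs, which is where the static-shape limitation discussed above comes into play.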