Will this project support on-device NPUs like the Qualcomm Hexagon? #2687

Closed
AndreaChiChengdu opened this issue Aug 21, 2023 · 40 comments

@AndreaChiChengdu

I am very interested in mobile-side deployment and would like to see if there is an opportunity to use the mobile NPU/GPU in Android devices for acceleration.
thanks~

@monatis
Collaborator

monatis commented Aug 21, 2023

Running computation with the Android NN API requires a dedicated compute backend for it. It's possible to devote some effort to developing such a backend if enough interest in adoption can be found in the community for different usage scenarios. Please note, however, that this can take some time.

@Dampfinchen

Dampfinchen commented Aug 21, 2023

> Running computation with the Android NN API requires a dedicated compute backend for it. It's possible to devote some effort to developing such a backend if enough interest in adoption can be found in the community for different usage scenarios. Please note, however, that this can take some time.

Qualcomm announced Llama 2 is coming to Snapdragon in 2024, and I highly suspect LLMs will become an integral part of the smartphone experience soon. Personally, I would be super excited to run such models on mobile wherever I go and without the need for a cellular connection.

So yes, I think the interest is going to grow bigger and bigger in the upcoming months. Right now it's not really feasible, especially as prompt processing is not yet properly accelerated by the GPU or NPU. So I would like to see that change. There's a lot of potential to tap into with Hexagon and the other ML accelerators in these modern phones.

@monatis
Collaborator

monatis commented Aug 21, 2023

@ggerganov Would it be of interest to introduce Android NN API as a new backend?

To be on the same page:

  • Android NN API is a C library that provides ML op primitives that can be delegated to the NPU, TPU, GPU or CPU on Android.
  • It provides scalar and tensor data types for integers and floats in 32, 16 and 8 bits.
  • You define a DAG with references to the tensors' buffers and then schedule inference with one of the available power/speed tradeoff preferences (a minimal sketch follows below).
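
For concreteness, here is a minimal sketch of that flow (my own illustration, not existing llama.cpp code): it builds a one-operation NNAPI graph that adds two fp32 tensors, compiles it with one of the power/speed preferences, and runs it synchronously. The tensor shape and the single ADD op are placeholders; a real backend would lower a ggml graph instead, and error checks are omitted for brevity.

```cpp
// Minimal Android NN API sketch (NDK, API level >= 29). Illustration only.
#include <android/NeuralNetworks.h>
#include <cstdint>

bool nnapi_add_example(const float* a, const float* b, float* out, uint32_t n) {
    uint32_t dims[1] = {n};

    ANeuralNetworksOperandType tensor_f32{};
    tensor_f32.type           = ANEURALNETWORKS_TENSOR_FLOAT32;
    tensor_f32.dimensionCount = 1;
    tensor_f32.dimensions     = dims;

    ANeuralNetworksOperandType scalar_i32{};
    scalar_i32.type = ANEURALNETWORKS_INT32;

    // 1. Describe the DAG: operands first, then operations referencing them.
    ANeuralNetworksModel* model = nullptr;
    ANeuralNetworksModel_create(&model);
    ANeuralNetworksModel_addOperand(model, &tensor_f32); // 0: input a
    ANeuralNetworksModel_addOperand(model, &tensor_f32); // 1: input b
    ANeuralNetworksModel_addOperand(model, &scalar_i32); // 2: fused activation
    ANeuralNetworksModel_addOperand(model, &tensor_f32); // 3: output

    int32_t fuse = ANEURALNETWORKS_FUSED_NONE;
    ANeuralNetworksModel_setOperandValue(model, 2, &fuse, sizeof(fuse));

    uint32_t op_in[3]  = {0, 1, 2};
    uint32_t op_out[1] = {3};
    ANeuralNetworksModel_addOperation(model, ANEURALNETWORKS_ADD, 3, op_in, 1, op_out);

    uint32_t model_in[2]  = {0, 1};
    uint32_t model_out[1] = {3};
    ANeuralNetworksModel_identifyInputsAndOutputs(model, 2, model_in, 1, model_out);
    ANeuralNetworksModel_finish(model);

    // 2. Compile with one of the power/speed tradeoff preferences.
    ANeuralNetworksCompilation* compilation = nullptr;
    ANeuralNetworksCompilation_create(model, &compilation);
    ANeuralNetworksCompilation_setPreference(compilation, ANEURALNETWORKS_PREFER_SUSTAINED_SPEED);
    ANeuralNetworksCompilation_finish(compilation);

    // 3. Bind the tensor buffers and run.
    ANeuralNetworksExecution* execution = nullptr;
    ANeuralNetworksExecution_create(compilation, &execution);
    ANeuralNetworksExecution_setInput (execution, 0, nullptr, a,   n * sizeof(float));
    ANeuralNetworksExecution_setInput (execution, 1, nullptr, b,   n * sizeof(float));
    ANeuralNetworksExecution_setOutput(execution, 0, nullptr, out, n * sizeof(float));
    int status = ANeuralNetworksExecution_compute(execution);

    ANeuralNetworksExecution_free(execution);
    ANeuralNetworksCompilation_free(compilation);
    ANeuralNetworksModel_free(model);
    return status == ANEURALNETWORKS_NO_ERROR;
}
```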

If we decide to do so, a possible plan of attack might be:

  1. Implement inference of a GGUF model with the pure Android NN API.
  2. If it's promising compared to running directly on the CPU, start implementing it as a compute backend in GGML.

@ggerganov
Owner

It's definitely of interest. It has to be implemented as a new backend in llama.cpp, similar to CUDA, Metal, OpenCL, etc.
The ggml library has to remain backend agnostic.

The best option would be if the Android API allows implementation of custom kernels, so that we can leverage the quantization formats that we currently have. Otherwise, there might not be a good enough argument for integrating the backend with ggml - one could just straight up implement the neural network with the Android building blocks.
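
To make "new backend" a bit more concrete, the integration point is essentially a table of callbacks that ggml can call without knowing anything about the device. The sketch below is purely hypothetical (the names `npu_backend_i`, `npu_tensor`, etc. are made up and do not match the actual ggml headers); a real integration would mirror the structure of the existing CUDA/Metal/OpenCL backends.

```cpp
// Hypothetical illustration only -- NOT the actual ggml backend interface.
// The point: ggml stays backend-agnostic and talks to the device through a
// small set of callbacks, so an NNAPI/Hexagon backend mostly has to answer
// "can you run this op on this type?" and "run this graph".
#include <cstddef>

struct npu_tensor;   // opaque device-side tensor handle (made-up type)
struct npu_graph;    // opaque device-side compute graph  (made-up type)

struct npu_backend_i {
    const char* (*get_name)(void);

    // Buffer management: weights and activations live in device memory.
    npu_tensor* (*alloc_tensor)(size_t nbytes);
    void        (*free_tensor)(npu_tensor* t);
    void        (*set_tensor)(npu_tensor* t, const void* host_data, size_t nbytes);
    void        (*get_tensor)(const npu_tensor* t, void* host_data, size_t nbytes);

    // Capability query: ops/types the device cannot handle (e.g. k-quants
    // without custom kernels) fall back to the CPU path.
    bool        (*supports_op)(int op, int type);

    // Execute a (sub)graph that was lowered from a ggml compute graph.
    bool        (*graph_compute)(npu_graph* g);
};
```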

@monatis
Collaborator

monatis commented Aug 21, 2023

> It's definitely of interest.

Great. I'll give it a test drive this week.

> The best option would be if the Android API allows implementation of custom kernels,

Its support for custom kernels may be limited and we may need to dequantize to 8-bit beforehand, but I'll dig more into this. If not, we may still make use of it for 8/16/32-bit tensors but go to lower-level libraries for k-quants. Let me give it a try and see what's possible.

@BarfingLemurs
Contributor

@monatis did anything interesting come up?

@monatis
Collaborator

monatis commented Sep 6, 2023

@BarfingLemurs I dug into the NPU specs, but it turned out that the NPU's support for custom ops is limited. It supports a set of pre-defined quantization / dequantization ops and 8-bit / 16-bit tensors. Of course there might be workarounds such as dequantizing tensors before offloading, but I doubt that it'll give a performance boost because it becomes I/O-bound this way. We can run 8-bit GGML models there, but even 8B models are too big in 8-bit precision for most smartphones, I think. GPNPU seems to be a better option to support the custom kernels required for GGML's real benefits, but I'm not aware of devices that come with a GPNPU yet. Until then, mobile GPUs seem to be our best bet on Android. With that said, I still want to play with it after some higher-priority work to see what's possible.
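
For illustration, the "dequantize before offloading" workaround would look roughly like the sketch below: it expands Q8_0 blocks (assumed here to be one fp16 scale followed by 32 signed 8-bit quants, written out locally rather than taken from the ggml headers) to fp32 on the CPU before handing the tensor to the accelerator. It also shows why this tends to become I/O-bound: every weight grows from 1 byte to 4 before it is copied across the device boundary.

```cpp
// Sketch of dequantizing Q8_0 to fp32 before offloading. Assumes the block
// layout of one fp16 scale per 32 signed 8-bit quants; illustration only.
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

struct block_q8_0 {
    uint16_t d;       // per-block scale, stored as fp16
    int8_t   qs[32];  // 32 quantized weights
};

// Minimal fp16 -> fp32 conversion (no intrinsics; subnormals treated as zero).
static float fp16_to_fp32(uint16_t h) {
    uint32_t sign = (uint32_t)(h & 0x8000) << 16;
    uint32_t exp  = (h >> 10) & 0x1f;
    uint32_t mant = h & 0x3ff;
    uint32_t bits;
    if (exp == 0)        bits = sign;                                   // +/- 0
    else if (exp == 31)  bits = sign | 0x7f800000u | (mant << 13);      // inf / NaN
    else                 bits = sign | ((exp + 112) << 23) | (mant << 13);
    float f;
    std::memcpy(&f, &bits, sizeof(f));
    return f;
}

// Expand a whole Q8_0 tensor to fp32 before uploading it to the NPU.
// Cost: ~3.8x more bytes to move than the quantized original, which is why
// this approach tends to be I/O-bound rather than compute-bound.
std::vector<float> dequantize_q8_0(const block_q8_0* blocks, size_t n_blocks) {
    std::vector<float> out;
    out.reserve(n_blocks * 32);
    for (size_t i = 0; i < n_blocks; ++i) {
        const float d = fp16_to_fp32(blocks[i].d);
        for (int j = 0; j < 32; ++j) {
            out.push_back(d * blocks[i].qs[j]);
        }
    }
    return out;
}
```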

@dfiru

dfiru commented Sep 29, 2023

Hey @monatis, thanks for the shout-out.

The good news is that our programming model is C++, so looking through how other GPU backends are supported here in ggml, it seems straightforward to enable support for GPNPU in ggml. We support 8W8A quantization in the current release version of our architecture -- we're looking at 4W8A and others in the next version.
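
For readers unfamiliar with the term, "8W8A" means 8-bit weights and 8-bit activations. Below is a generic sketch of that compute pattern (an illustration of the quantization scheme only, not Quadric SDK code): int8 multiplies accumulated in int32, with per-tensor scales applied once at the end.

```cpp
// Generic W8A8 (8-bit weight, 8-bit activation) dot product: exact int32
// accumulation of int8 x int8 products, rescaled to float at the end.
// Illustration of the quantization scheme only, not Quadric SDK code.
#include <cstddef>
#include <cstdint>

float dot_w8a8(const int8_t* w, const int8_t* x, size_t n,
               float w_scale, float x_scale) {
    int32_t acc = 0;
    for (size_t i = 0; i < n; ++i) {
        acc += (int32_t)w[i] * (int32_t)x[i];   // fits in int32 for typical row lengths
    }
    // A single multiply by the combined scale recovers the real-valued result.
    return (float)acc * w_scale * x_scale;
}
```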

We're open to getting GGML contributors access to our C++ SDK and figuring out ways to get support into GGML.

@monatis
Collaborator

monatis commented Sep 29, 2023

Hey @dfiru, thanks for reaching out -- great to have you here from Quadric!

I believe that GPNPU's approach is the right one, and I volunteer and am definitely willing to explore possibilities and contribute to the implementation.

Should I contact you in PM or something to move this further?

@BarfingLemurs
Contributor

BarfingLemurs commented Sep 29, 2023

I'd like to mention I'm still very much interested in an Android NN API backend for generic Android GPUs, as CLBlast on Android doesn't see any performance benefit for single-batch or parallel decoding tasks. (Which would be useful for increasing t/s on a potential Medusa model implementation.)

@dfiru

dfiru commented Sep 30, 2023

@monatis
no dms on gh :/

dm me on twitter (attached to my gh profile) and we can figure something out

@ggerganov
Owner

@monatis You mentioned Android GPU - what are the options to program for mobile GPUs? Vulkan?

@monatis
Collaborator

monatis commented Oct 2, 2023

Yes, Vulkan is the recommended approach: https://developer.android.com/ndk/guides/graphics/getting-started

Apparently, the Nomic team implemented a Vulkan backend for gpt4all, but I'm not sure about the compatibility of their custom license.
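
For anyone who wants to probe what their device exposes, here is a minimal sketch (nothing llama.cpp-specific, just the standard Vulkan C API from the NDK or a desktop loader): it enumerates the physical devices and checks each one for a compute-capable queue, which is the basic prerequisite for any Vulkan compute backend.

```cpp
// Minimal Vulkan compute probe: list devices and report whether each one
// exposes a queue family with compute support. Illustration only.
#include <vulkan/vulkan.h>
#include <cstdio>
#include <vector>

int main() {
    VkApplicationInfo app{};
    app.sType            = VK_STRUCTURE_TYPE_APPLICATION_INFO;
    app.pApplicationName = "vk-compute-probe";
    app.apiVersion       = VK_API_VERSION_1_1;

    VkInstanceCreateInfo ici{};
    ici.sType            = VK_STRUCTURE_TYPE_INSTANCE_CREATE_INFO;
    ici.pApplicationInfo = &app;

    VkInstance instance;
    if (vkCreateInstance(&ici, nullptr, &instance) != VK_SUCCESS) {
        std::printf("no Vulkan instance available\n");
        return 1;
    }

    uint32_t n_dev = 0;
    vkEnumeratePhysicalDevices(instance, &n_dev, nullptr);
    std::vector<VkPhysicalDevice> devices(n_dev);
    vkEnumeratePhysicalDevices(instance, &n_dev, devices.data());

    for (VkPhysicalDevice dev : devices) {
        VkPhysicalDeviceProperties props;
        vkGetPhysicalDeviceProperties(dev, &props);

        uint32_t n_q = 0;
        vkGetPhysicalDeviceQueueFamilyProperties(dev, &n_q, nullptr);
        std::vector<VkQueueFamilyProperties> queues(n_q);
        vkGetPhysicalDeviceQueueFamilyProperties(dev, &n_q, queues.data());

        bool has_compute = false;
        for (const VkQueueFamilyProperties& q : queues) {
            if (q.queueFlags & VK_QUEUE_COMPUTE_BIT) has_compute = true;
        }
        std::printf("%s: compute queue %s\n", props.deviceName,
                    has_compute ? "yes" : "no");
    }

    vkDestroyInstance(instance, nullptr);
    return 0;
}
```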

@nivibilla
Contributor

Would anything change in the implementation to make use of the TPU in Pixel devices?

@xgdgsc

xgdgsc commented Nov 12, 2023

Would using https://developer.qualcomm.com/software/qualcomm-neural-processing-sdk/getting-started suffice for devices with the 8cx Gen 3 and X Elite next year? I'm more interested in Windows-on-ARM support than Android.

@ghost

ghost commented Nov 23, 2023

Coming in here late, and without a lot of experience in either yet, but a lot (most?) of ARM CPUs now have NPUs, as do RISC-V chips. It's not just phones anymore but embedded devices and mid-range desktops, and their power is increasing with each release. There are also the Google Coral-style TPUs, which are accessible to us mere mortals.

@dimopep

dimopep commented Nov 30, 2023

For the SD8 Gen 3, the claim is inference of a 7B Llama 2 "based" model at 20 tokens/sec. As there will be native support for INT4, I would assume 4-bit quantisation. I assume this is the NPU:
https://docs.qualcomm.com/bundle/publicresource/87-71408-1_REV_B_Snapdragon_8_gen_3_Mobile_Platform_Product_Brief.pdf

@agonzalezm

Also support for the Intel NPU and AMD XDNA 2 that are coming in new processors: from 2024 all consumer PCs will have a powerful NPU capable of 50 TOPS, as dictated by Windows 12, and Windows will offload many AI tasks to this NPU.

@dimopep

dimopep commented Dec 11, 2023

The SD8 performance metrics demystified:

https://www.qualcomm.com/news/onq/2023/11/accelerating-generative-ai-at-the-edge

"We reduced the memory bandwidth through knowledge distillation, quantization-aware training, and speculative decoding...We use quantization-aware training with knowledge distillation to address these challenges and achieve an accurate and smaller INT4 model"

@shifeiwen

Are there any new updates to this discussion currently?

@github-actions github-actions bot added the stale label Apr 6, 2024
@EwoutH
Contributor

EwoutH commented Apr 19, 2024

Google recently published a guide and a blog about the new experimental MediaPipe LLM Inference API:

They also have a code example: mediapipe/examples/llm_inference

@github-actions github-actions bot removed the stale label Apr 20, 2024
@scarlettekk

> Google recently published a guide and a blog about the new experimental MediaPipe LLM Inference API:
>
> They also have a code example: mediapipe/examples/llm_inference

It seems like this is restricted to some handpicked models for some reason. I wonder if it is possible to expand this selection without Google's help.

@github-actions github-actions bot added the stale label May 22, 2024
Contributor

github-actions bot commented Jun 6, 2024

This issue was closed because it has been inactive for 14 days since being marked as stale.

@github-actions github-actions bot closed this as completed Jun 6, 2024
@Yemaoxin

So it's hard to use the Qualcomm Hexagon? Am I right?

@EwoutH
Contributor

EwoutH commented Jul 20, 2024

Can we reopen this issue? With Hexagon NPUs finding their way to laptops and desktops, it’s only going to be more relevant.

@Yemaoxin

Yes, I think this is a vital feature.

@scarlettekk

Of note: NNAPI will be deprecated in Android 15 https://developer.android.com/ndk/guides/neuralnetworks/migration-guide

@hpvd hpvd mentioned this issue Aug 30, 2024
@sparkleholic
Contributor

In order to accelerate llama.cpp on Qualcomm, do we need to implement ggml parts that use the 'Qualcomm Neural Processing SDK API' or the 'Hexagon SDK API'?

@AndreasKunar
Contributor

> In order to accelerate llama.cpp on Qualcomm, do we need to implement ggml parts that use the 'Qualcomm Neural Processing SDK API' or the 'Hexagon SDK API'?

To my knowledge, the furthest along Qualcomm/QNN work for llama.cpp is this fork https://github.com/chraac/llama.cpp

@sasskialudin

sasskialudin commented Nov 23, 2024

The Qualcomm Snapdragon X Plus X1P processor has been available for about 5 months, and I now have a limited opportunity to get an ASUS Vivobook S 15 OLED with 32 GB of RAM built around it for 850 USD (Black Friday deal).

So, do we have at last NPU support for the Qualcomm Snapdragon X Plus X1P processor?

Alternatively, is there any NPU support for the AMD Ryzen AI 9 HX 370?
The latter is supposed to offer 50 TOPS vs. 45 TOPS for the Snapdragon chip.

And what is the situation for the latest LMStudio or Ollama builds?

@AndreasKunar
Contributor

> So, do we have at last NPU support for the Qualcomm Snapdragon X Plus X1P processor?

TLDR: I think it makes little sense to do the additional work needed to support the Snapdragon X's GPU/NPU, and it is not supported yet. The Snapdragon X's CPU is (or soon will be) much faster at running llama.cpp with Q4..Q1 models than its GPU/NPU. Qualcomm's and Microsoft's NPU-running models (using QNN, ...) are few and small.

For more details, see issue #8273.

@sasskialudin

Thank you for the feedback.

Meanwhile, due to indecision I just missed the Black Friday offer that motivated my question, but given your answers that is rather a stroke of luck. Besides, I just found the same type of 50% Black Friday deal from another ASUS reseller, but this time with the Qualcomm Snapdragon X Elite X1E-78-100 (5% speedier :-).

But wait, if the NPU is not such a big deal, the AMD Ryzen AI 9 HX 370 looks even more appealing. It is 50% speedier than the Snapdragon X Elite X1E-78-100 on the CPU side and offers 50 TOPS vs. 45 TOPS for its embedded NPU.

And, yes, there are also Black Friday 50% deals for notebooks featuring that AMD part :-)

So, same question but now addressing the AMD Ryzen AI 9 HX 370 (pending the release of the AMD Ryzen AI 9 HX 395, featuring up to 96 GB of VRAM usage, hopefully).

Namely, are there specific llama.cpp builds for the AMD Ryzen AI 9 HX 370 or progress towards it?

@AndreasKunar
Contributor

> So, same question but now addressing the AMD Ryzen AI 9 HX 370 (pending the release of the AMD Ryzen AI 9 HX 395, featuring up to 96 GB of VRAM usage, hopefully).
>
> Namely, are there specific llama.cpp builds for the AMD Ryzen AI 9 HX 370 or progress towards it?

This is the wrong conversation thread to ask this in, and I don't have one of the new Intel/AMD computers. Its NPU is probably not supported (like all NPUs). Its GPU might be, but I don't know which llama.cpp backend is best for it. Its CPU might NOT be better than an ARM-based chip for llama.cpp - the Arm team provided some optimized code for Q4...

I would NOT base a Snapdragon X vs. AMD/Intel purchase decision on llama.cpp. There is some other software which cannot run on the Snapdragon X at all (e.g. Adobe prohibits Lightroom Classic installs on Snapdragon X, there is no support for Nikon Z8/9 compressed raw files in Photoshop or the Lightroom app version, and Nikon's own software refuses to install at all ...). This is not because the software cannot run emulated; it's because of bad software-vendor decisions. I would look more closely at whether the software you need runs on Windows on ARM.

@sasskialudin

Given the CPU benchmark for the Snapdragon vs. the AMD, i.e. around 23,000 vs. 35,000 for the multithread rating, the AMD should be the sensible choice IMHO.

Now, as I said, sooner or later we'll have the Ryzen AI 9 HX 395 with much more leeway to execute large models locally, and sure, Qualcomm will not stay idle either, but I'm leaning toward AMD for now.

@etlweather

Random jump in, but just comparing the "speed" of CPU A against CPU B isn't always the right metric. For some, it's speed per watt. I don't know real numbers, but let's say CPU A does 5 gazillion cycles per second for 1 watt versus 2.5 gazillion cycles for 1 watt - in some cases it's more compelling to buy the CPU that performs more per watt than the other. Because at the end of the day, electricity does cost something.

It really depends on people's use case.

@AndreasKunar
Contributor

If you are looking for improved llama.cpp/ollama/LM Studio/... performance, I strongly recommend looking at real llama.cpp performance numbers like those in discussion #4167. The Geekbench etc. numbers of the new Apple M4 Pro vs. the M1 Ultra paint a totally different picture than the real llama-perf measurements. I have not seen any Ryzen AI 9 llama-perf numbers yet. The Snapdragon X's CPU is now supported quite nicely with very optimized code, and it will get even better with the new mixed-precision lookup algorithm in the works.

But with the Snapdragon X I also learned the hard way that some software vendors artificially block their software from installing (even if it would run emulated) - e.g. Adobe Lightroom Classic cannot be installed. This was an unexpected downside of my Surface and should factor into any purchase decision!

@sasskialudin

So a 50% higher CPU benchmark score for the AMD Ryzen AI 9 HX 370 relative to the Qualcomm Snapdragon X Elite X1E-78-100 is meaningless, really? At least regarding the token/s output rate with llama.cpp?
Then the Qualcomm processor would be the better choice, due to better code optimization in the current llama.cpp builds. Is it really so?

Meanwhile, both configurations with the Black Friday promotion are selling like popcorn; soon enough I will miss those deals too, LOL.

@AndreasKunar
Contributor

Please buy whatever best fits you; I can't advise you.

The Snapdragon X still has some software-compatibility issues, e.g. Lightroom Classic is not installable - it's not all just about llama.cpp performance.

Geekbench single/multi-core performance and other benchmarks commonly have no direct correlation with llama.cpp performance. It depends on RAM bandwidth (for tg, token generation), computing horsepower (for pp, prompt processing, depending on its hardware support), and algorithm optimizations for the computation (GPU support/optimizations, special hardware instructions, ...).

Here is a comparison for my machines (based on the standard llama-bench llama 2 Q4 results):

| model | backend | test | token/s | vs M2 | Geekbench single-core | Geekbench multi-core |
| --- | --- | --- | --- | --- | --- | --- |
| Snapdragon X Elite (with newer llama.cpp builds!) | | | | | | |
| llama 7B Q4 (+re-order) | CPU | pp512 | ~170 | ~0.9x | ~2,400 | ~14,000 |
| llama 7B Q4 (+re-order) | CPU | tg128 | ~24 | 1x | ~0.9x | ~1.4x |
| M2 10-GPU | | | | | | |
| llama 7B Q4_0 | Metal | pp512 | ~180 | | ~2,600 | ~9,800 |
| llama 7B Q4_0 | Metal | tg128 | ~24 | | | |
| M2 Max 38-GPU | | | | | | |
| llama 7B Q4_0 | Metal | pp512 | ~670 | ~3.7x | ~2,800 | ~15,000 |
| llama 7B Q4_0 | Metal | tg128 | ~66 | ~2.8x | ~1.1x | ~1.5x |
| M4 Pro 20-GPU | | | | | | |
| llama 7B Q4_0 | Metal | pp512 | ~440 | ~2.4x | ~3,700 | ~13,000 |
| llama 7B Q4_0 | Metal | tg128 | ~51 | ~2.1x | ~1.4x | ~1.3x |

The Snapdragon, with no supported GPU, should be slower than the M2, but due to the new algorithms it isn't (and it will get faster with the pending lookup PR). Also, the "normalized" Apple silicon llama-perf results don't reflect algorithmic improvements yet; they also got faster with better algorithms.

The M4 Pro should be similar to or faster than the M2 Max according to Geekbench etc., but the M2 Max is much faster (more GPU cores, 400 vs. 273 GB/s memory bandwidth).

I could not find any comparable llama.cpp results for the AMD yet.

@sasskialudin

Those are amazing figures, thank you very much!
I was not aware there was so much progress on the Snapdragon.
Hopefully the AMD AI 300 series will follow, thanks again.

@AndreasKunar
Contributor

AndreasKunar commented Nov 25, 2024

> I was not aware there was so much progress on the Snapdragon.

FYI, it seems most of the innovation was/is not really built because of the Snapdragon X. It works to some extent on most modern ARM CPUs, from Apple silicon to Amazon's cloud and partially the Raspberry Pi and similar. The main innovation was using ARM enhancements for matrix multiplication (GEMM, GEMV). And I think there is ongoing work to also do this for AMD (e.g. PR #9532). Besides accelerating GEMM/GEMV, memory bandwidth will always be the limiting element for response-token generation speed - more CPU/GPU/NPU horsepower is mostly useless there (it's mostly useful for accelerating prompt-processing and training speed).
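
The memory-bandwidth point can be made concrete with a back-of-the-envelope bound: in single-batch generation, each new token streams essentially the whole weight file once, so tokens/s cannot exceed memory bandwidth divided by model size. The sketch below uses illustrative round numbers (approximate Q4_0 file sizes and a few example bandwidth figures), not measurements.

```cpp
// Back-of-the-envelope upper bound for single-batch token generation:
//   tokens/s  <=  memory bandwidth (GB/s) / model size (GB)
// because each generated token has to read (roughly) every weight once.
// All figures below are illustrative round numbers, not benchmark results.
#include <cstdio>

int main() {
    const char*  model_name[] = {"7B Q4_0 (~3.9 GB)", "13B Q4_0 (~7.4 GB)"};
    const double model_gb[]   = {3.9, 7.4};
    const double bw_gbs[]     = {100.0, 135.0, 273.0, 400.0}; // example bandwidths

    for (int m = 0; m < 2; ++m) {
        for (double bw : bw_gbs) {
            std::printf("%-20s @ %5.0f GB/s  ->  <= %5.1f tokens/s\n",
                        model_name[m], bw, bw / model_gb[m]);
        }
    }
    return 0;
}
```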

Also FYI - the most commonly used Q4 (and similar) quantization is not all just 4-bit; a few of the parameters need to be larger, and therefore some of the matrix multiplication needs to be mixed-precision. For strong quantizations (4-bit and less) there seems to be a very fast lookup-table-based acceleration coming with PR #10181 (see the PR details - they promise a significant acceleration and reduced power consumption). I'm always fascinated by how new software/algorithms can outsmart things cast purely into hardware.
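
To give a flavor of the lookup-table idea (a generic toy, not the implementation in that PR): a 4-bit weight can be split into four bit-planes, and for each group of four activations the 16 possible partial sums can be precomputed once, so the inner loop becomes table lookups and additions instead of multiplications. In a real kernel the tables are built once per activation vector and reused across every weight row, typically via SIMD table-lookup instructions.

```cpp
// Toy LUT-based dot product for 4-bit weights (generic idea only).
// w[i] in [0, 15], unsigned and unscaled for simplicity.
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <vector>

float dot_lut4(const std::vector<uint8_t>& w, const std::vector<float>& x) {
    const size_t n        = x.size();   // assumes n % 4 == 0 and w.size() == n
    const size_t n_groups = n / 4;

    // Precompute, for each group of 4 activations, all 16 subset sums.
    std::vector<float> lut(n_groups * 16);
    for (size_t g = 0; g < n_groups; ++g) {
        for (int m = 0; m < 16; ++m) {
            float s = 0.0f;
            for (int k = 0; k < 4; ++k) {
                if (m & (1 << k)) s += x[g * 4 + k];
            }
            lut[g * 16 + m] = s;
        }
    }

    // dot = sum_b 2^b * sum_j bit_b(w_j) * x_j; the inner sum is handled four
    // elements at a time with a single table lookup per group.
    float acc = 0.0f;
    for (int b = 0; b < 4; ++b) {
        float plane = 0.0f;
        for (size_t g = 0; g < n_groups; ++g) {
            int mask = 0;
            for (int k = 0; k < 4; ++k) {
                mask |= ((w[g * 4 + k] >> b) & 1) << k;
            }
            plane += lut[g * 16 + mask];
        }
        acc += plane * (float)(1 << b);
    }
    return acc;
}

int main() {
    std::vector<uint8_t> w = {1, 15, 7, 0, 3, 9, 12, 4};
    std::vector<float>   x = {0.5f, -1.0f, 2.0f, 0.25f, 1.5f, -0.5f, 0.0f, 3.0f};
    float ref = 0.0f;
    for (size_t i = 0; i < w.size(); ++i) ref += (float)w[i] * x[i];
    std::printf("lut: %f   reference: %f\n", dot_lut4(w, x), ref);
    return 0;
}
```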
