Performance of llama.cpp on Snapdragon X Elite/Plus #8273
-
On my 32GB Surface Pro 11, using LM Studio with 4 threads on a Llama 3 Instruct 8B q4_k_m GGUF, I am seeing 12 - 20+ tok/s pretty consistently. Doh, will try bumping LM Studio to 10 threads. The Snapdragon arm64 release version of LM Studio is here: https://lmstudio.ai/snapdragon I don't understand how llama.cpp projects are prioritized and queued, but LM Studio 0.3.0 (beta) supposedly already has some Snapdragon/NPU support somehow? (I'm on the waiting list for the beta bits.) Excitedly anticipating future NPU support!
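For anyone who wants to isolate the thread-count effect outside LM Studio, llama-bench can sweep thread counts in one run. A minimal sketch (the model filename is a placeholder; results will vary with power mode and thermals):

```sh
# Benchmark the same quant at 4, 6 and 10 threads; -t accepts a comma-separated list.
llama-bench -m Meta-Llama-3-8B-Instruct-Q4_K_M.gguf -t 4,6,10
```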
-
Update for Surface Laptop 7 / Snapdragon X Elite - it seems that the Elite utilizes the memory bandwidth better than the asymmetrical Plus (for token generation):
build: cddae48 (3646)
-
What about the GPU and NPU backends?
-
It's been about 5 months since this processor became available, and I have the opportunity to get an ASUS Vivobook S 15 OLED with 32GB of RAM built around it for 850 USD (Black Friday deal). So, do we at last have NPU support for the Qualcomm Snapdragon X Plus X1P processor? Alternatively, is there any NPU support for the AMD Ryzen AI 9 HX 370? And what is the situation for the latest LM Studio or Ollama builds?
-
Thank you for the feedback. Meanwhile, due to indecision, I just missed the Black Friday offer that motivated my question, but given your answers that is rather a lucky break. Besides, I just found the same 50% Black Friday deal from another ASUS reseller, but this time with the Qualcomm Snapdragon X Elite X1E-78-100 (5% speedier :-). But wait, if the NPU is not such a big deal, the AMD Ryzen AI 9 HX 370 looks even more appealing. It is 50% speedier than the Snapdragon X Elite X1E-78-100 on the CPU side and offers 50 TOPS vs 45 TOPS for its embedded NPU. And, yes, there are also 50% Black Friday deals on notebooks featuring that AMD part :-) So, same question but now addressing the AMD Ryzen AI 9 HX 370 (pending the release of the AMD Ryzen AI 9 HX 395, featuring up to 96GB of usable VRAM, hopefully). Namely, are there specific llama.cpp builds for the AMD Ryzen AI 9 HX 370, or progress towards them?
-
Would you be interested in participating in a roundtable discussion with some Qualcomm engineers? They want to discuss (with no commitments or promises, ofc) what they can do to better support open source. Pinging @slaren and @JohannesGaessler too, since you may be interested as well; not sure how much you deal with the low level, or if you're at all interested in better support for Qualcomm/Snapdragon.
-
Apologies for the delayed response to this thread. We've been focusing primarily on the CPU so far: things like the Windows on ARM64 build, threading performance, and advanced features to take advantage of Snapdragon X Elite CPU clusters, etc. Some of our GPU folks (@lhez) are about to join the party: we're getting ready to submit an OpenCL-based backend with Adreno support for the current-gen Snapdragons. I finished rebasing it on top of the dynamic backend-load updates yesterday, and we should be able to start an official PR after some more testing. NPU support will take more effort; sorry, no ETA at this point. I'm fully aware of what is needed for this.
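For anyone who wants to try the Adreno backend once that PR lands, a hypothetical build sketch, assuming it follows the usual ggml CMake convention of a GGML_OPENCL option (the final flag name may differ):

```sh
# Sketch only - the OpenCL/Adreno backend had not been merged when this was written.
cmake -B build -DGGML_OPENCL=ON
cmake --build build --config Release
```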
-
User report and endorsement, fwiw: I've been traveling for the last 3 months in remote parts of Italy, often offline, and am not permitted to use online AI tools. I use LM Studio on a Surface Pro 11, the 32GB model with the 12-core Snapdragon X Elite. Performance and battery life using LM Studio have been excellent. My layman's understanding is that the NPU is not just about efficiency / less battery use but, because it is optimized for matrix operations and activation functions, also increases performance for some AI/ML tasks. Snapdragon CPU work so far: great! NPU work so far: still only promising, becoming irritating. (Same for Microsoft, although 2024 Ignite indicates they are making progress.) I applaud @bartowski1182's roundtable recommendation, including @AndreasKunar.
-
It seems like NPU support for LM Studio is coming soon, as seen here. It's possible they have built a new backend, similar to the MLX engine for Apple.
-
I want to start a discussion on the performance of the new Qualcomm Snapdragon X, similar to the Apple M silicon discussion in #4167.
This post got completely updated, because the "best performance" power setting IS needed. By default, only 4 of the 10 cores are used fully, which prevents thermal throttling but gives much less performance.
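On Windows 11 the power mode can be switched under Settings > System > Power & battery, or from an elevated shell using the overlay aliases below. Note these aliases are commonly used but not officially documented - verify the active scheme afterwards:

```sh
# Activate the "Best performance" power-mode overlay, then confirm it took effect.
powercfg /overlaysetactive overlay_scheme_max_performance
powercfg /getactiveoverlayscheme
```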
I am agnostic to Apple/Intel/AMD/... or any discussion of Windows/macOS/Linux merits - please spare us any "religiosity" about operating systems here, etc. For me it's important to have good tools, and I think running LLMs/SLMs locally via llama.cpp is important. We need good llama.cpp benchmarking to be able to decide. I am currently primarily a Mac user (MacBook Air M2, Mac Studio M2 Max), running macOS, Windows and Linux.
I just got a Surface Pro 11 with the X Plus, and these are my first benchmarks. The Surface Pros have always had thermal constraints, so I got a Plus and not an Elite - even with the Plus, it throttles quickly when all 10 of its CPU cores are used fully. Also, since there has been an optimization of llama.cpp for Snapdragon builds, I am NOT testing with build 8e672ef but with the current build. But I'm trying to produce results comparable to Apple Silicon #4167.
Here are my results for my Surface Pro 11 with a Snapdragon(R) X 10-core X1P64100 @ 3.40 GHz and 16GB RAM, running Windows 11 Enterprise 22H2 26100.1000 - with the 16GB, I could not test fp16, since it swaps.
llama-bench with -t 10 for Q8_0 and, later, after a bit of cool-down, for Q4_0 (the throttled numbers were 40% (!!!) lower). F16 swaps with 16GB RAM, so it's not included.
build: a27152b (3285)
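For reference, the numbers above come from plain CPU llama-bench invocations along these lines (the 7B model filenames are assumptions, chosen to match the #4167 methodology):

```sh
# CPU-only runs on all 10 cores; allow a cool-down between quants to limit throttling.
llama-bench -m llama-2-7b.Q8_0.gguf -t 10
llama-bench -m llama-2-7b.Q4_0.gguf -t 10
```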
Update: Results for a Snapdragon X Elite (Surface Laptop 7 15"):
build: cddae48 (3646)
Update with the new Q4_0_4_4 algorithm (lately done automatically on load of a Q4_0 model; no special model file is needed anymore):
build: 69b9945 (3425)
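To illustrate the two paths (filenames are placeholders): older builds needed an explicitly repacked model file, while later builds repack a plain Q4_0 model into the 4_4 layout at load time, so the explicit step is only needed on older builds:

```sh
# Explicit repack, only needed on older builds; later builds repack Q4_0 on load.
llama-quantize model-F16.gguf model-Q4_0_4_4.gguf Q4_0_4_4
llama-bench -m model-Q4_0.gguf -t 10
```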
I think the new Qualcomm chips are interesting; the numbers are a bit faster than my M2 MacBook Air in CPU-only mode - feedback welcome!
It's early in the life of this SoC, as well as of Windows for arm64, and a lot of optimizations are still needed. There is no GPU/NPU support (yet), and Windows/gcc arm64 is still work-in-progress. DirectML, QNN and ONNX seem to be the main optimization focus for Microsoft/Qualcomm; I will look into this later (maybe the llama.cpp QNN backend of #7541 would also help / be a starting point). So this is work-in-progress.
I tested 2 llama.cpp build methods for Windows with MSVC, and the method in https://www.qualcomm.com/developer/blog/2024/04/big-performance-boost-for-llama-cpp-and-chatglm-cpp-with-windows got me slightly better results than the build method in #7191. I still need to test building with clang, but I don't expect much difference, since clang on Windows targets the MSVC ABI and toolchain.
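For reference, the core of a native arm64 MSVC build is just a generator/architecture choice; a minimal sketch (the Qualcomm blog adds further compiler flags on top of this, and #7191 differs in toolchain details):

```sh
# Run from a VS 2022 ARM64 developer prompt.
cmake -B build -G "Visual Studio 17 2022" -A ARM64
cmake --build build --config Release
```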
Another update/extension: with WSL2/gcc using 10 CPUs / 8 GB RAM and Ubuntu 24.04, the numbers are very similar (all dependent on cool-downs/throttling):
build: a27152b (3285)
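The WSL2 numbers came from a plain CPU build; a minimal sketch under Ubuntu 24.04 (the model filename is again a placeholder):

```sh
# Plain CPU build with gcc under WSL2; no arm64-specific flags needed.
sudo apt install -y build-essential cmake
cmake -B build
cmake --build build --config Release -j 10
./build/bin/llama-bench -m llama-2-7b.Q4_0.gguf -t 10
```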