Performance of llama.cpp on Snapdragon X Elite/Plus #8273
-
On my 32GB Surface Pro 11, using LM Studio with 4 threads on a Llama 3 Instruct 8B q4_k_m GGUF, I am seeing 12 - 20+ tok/s pretty consistently. Doh, will try bumping LM Studio to 10 threads. The Snapdragon arm64 release version of LM Studio is here: https://lmstudio.ai/snapdragon I don't understand how llama.cpp projects are prioritized and queued, but LM Studio 0.3.0 (beta) supposedly already has some Snapdragon/NPU support somehow? (I'm on the waiting list for the beta bits.) Excitedly anticipating future NPU support!
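For anyone who wants to isolate the thread-count effect outside LM Studio, llama-bench can sweep thread counts in one run. A minimal sketch (the model filename is a placeholder; results will vary with power mode and thermals):

```sh
# Benchmark the same quant at 4, 6 and 10 threads; -t accepts a comma-separated list.
llama-bench -m Meta-Llama-3-8B-Instruct-Q4_K_M.gguf -t 4,6,10
```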
-
Update for Surface Laptop 7 / Snapdragon X Elite - it seems that the Elite utilizes the memory bandwidth better than the asymmetrical Plus (for token generation):
build: cddae48 (3646)
-
What about the GPU and NPU backends?
-
It's been about 5 months since this processor became available, and I have the opportunity to get an ASUS Vivobook S 15 OLED with 32GB of RAM built around it for 850 USD (Black Friday deal). So, do we at last have NPU support for the Qualcomm Snapdragon X Plus X1P processor? Alternatively, is there any NPU support for the AMD Ryzen AI 9 HX 370? And what is the situation for the latest LM Studio or Ollama builds?
-
Thank you for the feedback. Meanwhile, due to indecision, I just missed the Black Friday offer that motivated my question, but given your answers that is rather a lucky break. Besides, I just found the same 50% Black Friday deal from another ASUS reseller, but this time with the Qualcomm Snapdragon X Elite X1E-78-100 (5% speedier :-). But wait, if the NPU is not such a big deal, the AMD Ryzen AI 9 HX 370 looks even more appealing. It is 50% speedier than the Snapdragon X Elite X1E-78-100 on the CPU side and offers 50 TOPS vs 45 TOPS for its embedded NPU. And, yes, there are also 50% Black Friday deals on notebooks featuring that AMD part :-) So, same question but now addressing the AMD Ryzen AI 9 HX 370 (pending the release of the AMD Ryzen AI 9 HX 395, featuring up to 96GB of usable VRAM, hopefully). Namely, are there specific llama.cpp builds for the AMD Ryzen AI 9 HX 370, or progress towards them?
-
Would you be interested in participating in a roundtable discussion with some Qualcomm engineers? They want to discuss (with no commitments or promises, ofc) what they can do to better support open source. Pinging @slaren and @JohannesGaessler too, since you may be interested as well; not sure how much you deal with the low level, or if you're at all interested in better support for Qualcomm/Snapdragon.
-
Apologies for the delayed response to this thread. We've been focusing primarily on the CPU so far: things like the Windows on ARM64 build, threading performance, and advanced features to take advantage of Snapdragon X Elite CPU clusters, etc. Some of our GPU folks (@lhez) are about to join the party: we're getting ready to submit an OpenCL-based backend with Adreno support for the current-gen Snapdragons. I finished rebasing it on top of the dynamic backend-load updates yesterday, and we should be able to start an official PR after some more testing. NPU support will take more effort; sorry, no ETA at this point. I'm fully aware of what is needed for this.
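For anyone who wants to try the Adreno backend once that PR lands, a hypothetical build sketch, assuming it follows the usual ggml CMake convention of a GGML_OPENCL option (the final flag name may differ):

```sh
# Sketch only - the OpenCL/Adreno backend had not been merged when this was written.
cmake -B build -DGGML_OPENCL=ON
cmake --build build --config Release
```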
-
User report and endorsement, fwiw: I've been traveling for the last 3 months in remote parts of Italy, often offline, and am not permitted to use online AI tools. I use LM Studio on a Surface Pro 11, the 32GB model with the 12-core Snapdragon X Elite. Performance and battery life using LM Studio have been excellent. My layman's understanding is that the NPU is not just about efficiency / less battery use but, because it is optimized for matrix operations and activation functions, also increases performance for some AI/ML tasks. Snapdragon CPU work so far: great! NPU work so far: still only promising, becoming irritating. (Same for Microsoft, although 2024 Ignite indicates they are making progress.) I applaud @bartowski1182's roundtable recommendation, including @AndreasKunar.
-
It seems like NPU support for LM Studio is coming soon, as seen here. It's possible they have built a new backend, similar to the MLX engine for Apple.
-
I want to start a discussion on the performance of the new Qualcomm Snapdragon X, similar to the Apple M silicon discussion in #4167.
This post got completely updated, because the "best performance" power setting IS needed. By default, only 4 of the 10 cores are used fully, which prevents thermal throttling but gives much less performance.
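On Windows 11 the power mode can be switched under Settings > System > Power & battery, or from an elevated shell using the overlay aliases below. Note these aliases are commonly used but not officially documented - verify the active scheme afterwards:

```sh
# Activate the "Best performance" power-mode overlay, then confirm it took effect.
powercfg /overlaysetactive overlay_scheme_max_performance
powercfg /getactiveoverlayscheme
```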
I am agnostic to Apple/Intel/AMD/... or any discussion of Windows/macOS/Linux merits - please spare us any "religiosity" about operating systems here, etc. For me it's important to have good tools, and I think running LLMs/SLMs locally via llama.cpp is important. We need good llama.cpp benchmarking to be able to decide. I am currently primarily a Mac user (MacBook Air M2, Mac Studio M2 Max), running macOS, Windows and Linux.
I just got a Surface Pro 11 with the X Plus, and these are my first benchmarks. The Surface Pros have always had thermal constraints, so I got a Plus and not an Elite - even with the Plus, it throttles quickly when all 10 of its CPU cores are used fully. Also, since there has been an optimization of llama.cpp for Snapdragon builds, I am NOT testing with build 8e672ef but with the current build. But I'm trying to produce results comparable to Apple Silicon #4167.
Here are my results for my Surface Pro 11 with a Snapdragon(R) X 10-core X1P64100 @ 3.40 GHz and 16GB RAM, running Windows 11 Enterprise 22H2 26100.1000 - with the 16GB, I could not test fp16, since it swaps.
llama-bench with -t 10 for Q8_0 and, later, after a bit of cool-down, for Q4_0 (the throttled numbers were 40% (!!!) lower). F16 swaps with 16GB RAM, so it's not included.
build: a27152b (3285)
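For reference, the numbers above come from plain CPU llama-bench invocations along these lines (the 7B model filenames are assumptions, chosen to match the #4167 methodology):

```sh
# CPU-only runs on all 10 cores; allow a cool-down between quants to limit throttling.
llama-bench -m llama-2-7b.Q8_0.gguf -t 10
llama-bench -m llama-2-7b.Q4_0.gguf -t 10
```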
Update: Results for a Snapdragon X Elite (Surface Laptop 7 15"):
build: cddae48 (3646)
Update with the new Q4_0_4_4 algorithm (lately done automatically on load of a Q4_0 model; no special model file is needed anymore):
build: 69b9945 (3425)
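To illustrate the two paths (filenames are placeholders): older builds needed an explicitly repacked model file, while later builds repack a plain Q4_0 model into the 4_4 layout at load time, so the explicit step is only needed on older builds:

```sh
# Explicit repack, only needed on older builds; later builds repack Q4_0 on load.
llama-quantize model-F16.gguf model-Q4_0_4_4.gguf Q4_0_4_4
llama-bench -m model-Q4_0.gguf -t 10
```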
I think the new Qualcomm chips are interesting; the numbers are a bit faster than my M2 MacBook Air in CPU-only mode - feedback welcome!
It's early in the life of this SoC, as well as of Windows for arm64, and a lot of optimizations are still needed. There is no GPU/NPU support (yet), and Windows/gcc arm64 is still work-in-progress. DirectML, QNN and ONNX seem to be the main optimization focus for Microsoft/Qualcomm; I will look into this later (maybe the llama.cpp QNN backend of #7541 would also help / be a starting point). So this is work-in-progress.
I tested 2 llama.cpp build methods for Windows with MSVC, and the method in https://www.qualcomm.com/developer/blog/2024/04/big-performance-boost-for-llama-cpp-and-chatglm-cpp-with-windows got me slightly better results than the build method in #7191. I still need to test building with clang, but I don't expect much difference, since clang on Windows targets the MSVC ABI and toolchain.
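For reference, the core of a native arm64 MSVC build is just a generator/architecture choice; a minimal sketch (the Qualcomm blog adds further compiler flags on top of this, and #7191 differs in toolchain details):

```sh
# Run from a VS 2022 ARM64 developer prompt.
cmake -B build -G "Visual Studio 17 2022" -A ARM64
cmake --build build --config Release
```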
Another update/extension: with WSL2/gcc using 10 CPUs / 8 GB RAM and Ubuntu 24.04, the numbers are very similar (all dependent on cool-downs/throttling):
build: a27152b (3285)
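The WSL2 numbers came from a plain CPU build; a minimal sketch under Ubuntu 24.04 (the model filename is again a placeholder):

```sh
# Plain CPU build with gcc under WSL2; no arm64-specific flags needed.
sudo apt install -y build-essential cmake
cmake -B build
cmake --build build --config Release -j 10
./build/bin/llama-bench -m llama-2-7b.Q4_0.gguf -t 10
```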