This is our full-featured multimodal large language model (LLM) Android app.
- Multimodal Support: Enables functionality across diverse tasks, including text-to-text, image-to-text, audio-to-text, and text-to-image generation (via diffusion models).
- CPU Inference Optimization: MNN-LLM demonstrates exceptional CPU performance on Android, achieving prefill speed improvements of 8.6x over llama.cpp and 20.5x over fastllm, with decoding speeds that are 2.3x and 8.9x faster, respectively (benchmarked running Qwen-7B).
- Broad Model Compatibility: Supports multiple leading model providers, such as Qwen, Gemma, Llama (including TinyLlama and MobileLLM), Baichuan, Yi, DeepSeek, InternLM, Phi, ReaderLM, and SmolLM.
- Privacy First: Runs entirely on-device, ensuring complete data privacy with no information uploaded to external servers.
- You can download the app from Releases or build it yourself.
- After installing the app, you can browse all supported models, download them, and chat with them directly within the app.
- Your chat history is available in the sidebar, so you can revisit previous conversations seamlessly.
> **Warning:** This version has been tested exclusively on the OnePlus 13 and Xiaomi 14 Ultra; its stability on other devices cannot be guaranteed. Because large language models (LLMs) have demanding performance requirements, many budget or low-spec devices may experience issues such as slow inference, application instability, or failure to run entirely. If you encounter any issues, please feel free to open an issue for assistance.
- Clone the repository:

```sh
git clone https://github.com/alibaba/MNN.git
```
- Build the library:

```sh
cd project/android
mkdir build_64
cd build_64
../build_64.sh "-DMNN_LOW_MEMORY=true -DMNN_CPU_WEIGHT_DEQUANT_GEMM=true -DMNN_BUILD_LLM=true -DMNN_SUPPORT_TRANSFORMER_FUSE=true -DMNN_ARM82=true -DMNN_USE_LOGCAT=true -DMNN_OPENCL=true -DLLM_SUPPORT_VISION=true -DMNN_BUILD_OPENCV=true -DMNN_IMGCODECS=true -DLLM_SUPPORT_AUDIO=true -DMNN_BUILD_AUDIO=true -DMNN_BUILD_DIFFUSION=ON -DMNN_SEP_BUILD=ON"
```
- Copy the built libraries into the Android app project:

```sh
find . -name "*.so" -exec cp {} ../apps/MnnLlmApp/app/src/main/jniLibs/arm64-v8a/ \;
```
- Build and install the Android app:

```sh
cd ../apps/MnnLlmApp/
./gradlew installDebug
```
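
If the install step fails, a quick sanity check is to confirm that a device is visible to adb and that the native libraries were copied where the app expects them. A minimal sketch, assuming the steps above were run as written:

```sh
# List connected devices; the target phone must show as "device", not "unauthorized"
adb devices

# The copied MNN shared libraries should appear here (run from apps/MnnLlmApp/)
ls app/src/main/jniLibs/arm64-v8a/
```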
- Click here to download
- Support for ModelScope downloads
- Optimization of DeepSeek's multi-turn conversation capabilities and UI presentation
- Added support for including debug information when submitting feedback or issues
- Click here to download
- This is our first publicly released version; you can:
- search all supported models, download them, and chat with them in the app;
- diffusion model:
- stable-diffusion-v1-5
- audio model:
- qwen2-audio-7b
- visual models:
- qwen-vl-chat
- qwen2-vl-2b
- qwen2-vl-7b
MNN-LLM is a versatile inference framework designed to optimize and accelerate the deployment of large language models on both mobile devices and local PCs, addressing challenges like high memory consumption and computational cost through innovations such as model quantization, hybrid storage, and hardware-specific optimizations.

In CPU benchmarks, MNN-LLM excels, achieving prefill speed boosts of 8.6x over llama.cpp and 20.5x over fastllm, complemented by decoding speeds that are 2.3x and 8.9x faster, respectively. In GPU benchmarks, MNN-LLM's performance declines slightly relative to MLC-LLM, particularly when using Qwen2-7B with shorter prompts, because MLC-LLM benefits from its symmetric quantization technique. Even so, MNN-LLM achieves up to 25.3x faster prefill and 7.1x faster decoding than llama.cpp, and 2.8x and 1.7x improvements over MLC-LLM, respectively. For more detailed information, please refer to the paper: MNN-LLM: A Generic Inference Engine for Fast Large Language Model Deployment on Mobile Devices.
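
As a generic illustration of why symmetric quantization helps here (a sketch of the standard scheme, not necessarily the exact variant either engine implements): a symmetric 4-bit quantizer for a weight block $w$ stores only a scale $s$,

$$
s = \frac{\max_i |w_i|}{7}, \qquad q_i = \operatorname{round}\!\left(\frac{w_i}{s}\right) \in [-7, 7], \qquad w_i \approx s\,q_i,
$$

so dequantization inside the GEMM inner loop is a single multiply per weight. An asymmetric quantizer additionally stores a zero point $z$ and reconstructs $w_i \approx s\,(q_i - z)$, which costs an extra subtraction or a precomputed correction term per block.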
This project is built upon the following open-source projects: