xLLM is an efficient and user-friendly intelligent LLM inference framework that provides enterprise-level service guarantees and high-performance engine computing capabilities for model inference on domestic AI accelerators.
LLMs with parameter scales ranging from tens of billions to trillions are being rapidly deployed in core business scenarios such as intelligent customer service, real-time recommendation, and content generation, and efficient support for domestic computing hardware has become a core requirement for low-cost inference deployment. Existing inference engines struggle to adapt to the architectural characteristics of dedicated accelerators such as domestic chips: low utilization of computing units, load imbalance and communication bottlenecks under the MoE architecture, and difficulties in KV cache management all restrict request throughput and system scalability. The xLLM inference engine improves resource efficiency across the entire "communication-computation-storage" performance path and currently supports JD.com's online services across multiple scenarios and models.
xLLM delivers robust intelligent computing capabilities. By leveraging hardware system optimization and algorithm-driven decision control, it jointly accelerates the inference process, enabling high-throughput, low-latency distributed inference services.
Full Graph Pipeline Execution Orchestration
- Asynchronous decoupled scheduling at the request scheduling layer to reduce computational bubbles.
- Asynchronous parallelism of computation and communication at the model graph layer, overlapping computation and communication.
- Pipelining of heterogeneous computing units at the operator kernel layer, overlapping computation and memory access.
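The scheduling/execution decoupling can be pictured with a toy producer-consumer sketch (illustrative only, not xLLM code): while the device executes batch N, the scheduler is already assembling batch N+1, so the device is not left idle waiting on host-side work.

```python
# Minimal sketch of decoupled scheduling vs. execution. All names are illustrative.
import queue
import threading
import time

def scheduler(batch_queue: queue.Queue, num_batches: int) -> None:
    for step in range(num_batches):
        time.sleep(0.01)           # stand-in for batching / input preparation on the host
        batch_queue.put(f"batch-{step}")
    batch_queue.put(None)          # sentinel: no more work

def executor(batch_queue: queue.Queue) -> None:
    while True:
        batch = batch_queue.get()
        if batch is None:
            break
        time.sleep(0.03)           # stand-in for the device-side forward pass
        print(f"executed {batch}")

q = queue.Queue(maxsize=2)         # a small buffer keeps the device continuously fed
t = threading.Thread(target=scheduler, args=(q, 5))
t.start()
executor(q)
t.join()
```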
Graph Optimization for Dynamic Shapes
- Dynamic shape adaptation based on parameterization and multi-graph caching to enhance the flexibility of static graphs.
- Managed tensor memory pool to ensure address safety and reusability.
- Integration and adaptation of performance-critical custom operators (e.g., PageAttention, AllReduce).
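As an illustration of the multi-graph caching idea (an assumed scheme, not xLLM's actual implementation), input lengths can be bucketed into a small set of padded sizes, with one compiled static graph cached per bucket:

```python
# Illustrative multi-graph cache keyed by shape bucket.
BUCKETS = [128, 256, 512, 1024, 2048]
_graph_cache = {}

def bucket_for(seq_len: int) -> int:
    for b in BUCKETS:
        if seq_len <= b:
            return b
    raise ValueError(f"sequence length {seq_len} exceeds the largest bucket")

def get_graph(seq_len: int):
    """Return a cached graph for the padded shape, capturing/compiling it on first use."""
    b = bucket_for(seq_len)
    if b not in _graph_cache:
        # placeholder for an expensive capture/compile of a static graph at size b
        _graph_cache[b] = f"static-graph(padded_len={b})"
    return _graph_cache[b]

print(get_graph(300))   # compiles and caches the 512 bucket
print(get_graph(480))   # reuses the cached 512 graph
```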
Kernel Optimization
- GroupMatmul optimization to improve computational efficiency.
- Chunked Prefill optimization to support long-sequence inputs.
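The intent of chunked prefill can be sketched as follows (a toy example; the real kernels operate on device tensors and the KV cache): a long prompt is split into fixed-size chunks so that each forward pass stays within a bounded token budget and can be interleaved with decode steps of other requests.

```python
# Toy chunked-prefill loop; chunk_size is illustrative.
def chunked_prefill(prompt_tokens: list[int], chunk_size: int = 512):
    for start in range(0, len(prompt_tokens), chunk_size):
        chunk = prompt_tokens[start:start + chunk_size]
        # placeholder for one forward pass that appends this chunk's KV cache
        yield start, len(chunk)

for offset, n in chunked_prefill(list(range(1300)), chunk_size=512):
    print(f"prefilled tokens [{offset}, {offset + n})")
```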
Efficient Memory Optimization
- Mapping management between discrete physical memory and continuous virtual memory.
- On-demand memory allocation to reduce memory fragmentation.
- Intelligent scheduling of memory pages to increase memory reusability.
- Adaptation of corresponding operators for domestic accelerators.
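A simplified sketch of the discrete-physical/continuous-virtual mapping idea (names and sizes are illustrative, not xLLM's data structures): each sequence keeps a block table from logical block indices to physical blocks, and physical blocks are allocated only when the sequence actually grows into them.

```python
# On-demand paged KV memory, reduced to a block table.
BLOCK_SIZE = 16                       # tokens per physical block (illustrative)

class BlockTable:
    def __init__(self, free_blocks: list[int]):
        self.free_blocks = free_blocks
        self.table: list[int] = []    # logical block index -> physical block id

    def ensure_capacity(self, num_tokens: int) -> None:
        needed = (num_tokens + BLOCK_SIZE - 1) // BLOCK_SIZE
        while len(self.table) < needed:
            self.table.append(self.free_blocks.pop())  # allocate a physical block on demand

seq = BlockTable(free_blocks=list(range(100)))
seq.ensure_capacity(40)               # 40 tokens -> 3 physical blocks
print(seq.table)
```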
Global KV Cache Management
- Intelligent offloading and prefetching of KV in hierarchical caches.
- KV cache-centric distributed storage architecture.
- Intelligent KV routing among computing nodes.
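A toy sketch of the hierarchical offload/prefetch idea (illustrative only): cold KV blocks are moved from device memory to a host-side capacity tier instead of being discarded, and prefetched back before a request that reuses them is scheduled.

```python
# Two-tier KV block cache with offload and prefetch; sizes and policies are illustrative.
device_cache: dict[str, bytes] = {}   # fast tier (e.g., NPU HBM)
host_cache: dict[str, bytes] = {}     # capacity tier (e.g., DRAM / SSD)
DEVICE_CAPACITY = 2

def put(block_id: str, data: bytes) -> None:
    if len(device_cache) >= DEVICE_CAPACITY:
        victim = next(iter(device_cache))                 # evict the oldest block (real policies are smarter)
        host_cache[victim] = device_cache.pop(victim)     # offload instead of discarding
    device_cache[block_id] = data

def prefetch(block_id: str) -> None:
    # bring a block back to the fast tier (capacity check omitted for brevity)
    if block_id not in device_cache and block_id in host_cache:
        device_cache[block_id] = host_cache.pop(block_id)

put("req1/blk0", b"...")
put("req1/blk1", b"...")
put("req2/blk0", b"...")   # triggers an offload to the host tier
prefetch("req1/blk0")      # reload a reused block before scheduling its request
print(sorted(device_cache), sorted(host_cache))
```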
Algorithm-driven Acceleration
- Speculative decoding optimization to improve efficiency through multi-core parallelism.
- Dynamic load balancing of MoE experts to achieve efficient adjustment of expert distribution.
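The verify-and-accept logic at the heart of speculative decoding can be sketched as follows (a greedy toy variant; production implementations use probabilistic acceptance and run draft and target work in parallel on the device):

```python
# Schematic speculative decoding step: a cheap draft proposes k tokens, the target
# verifies them in one batched pass, and the longest matching prefix is kept.
def speculative_step(draft_propose, target_verify, context: list[int], k: int = 4):
    draft = draft_propose(context, k)           # k proposed tokens
    verified = target_verify(context, draft)    # target's tokens for the same positions
    accepted = []
    for d, t in zip(draft, verified):
        if d == t:
            accepted.append(d)                  # proposal matches: accepted "for free"
        else:
            accepted.append(t)                  # first mismatch: take the target token and stop
            break
    return context + accepted

# Toy stand-ins: the draft guesses four tokens, the target agrees on the first two.
ctx = [1, 2, 3]
print(speculative_step(lambda c, k: [4, 5, 9, 9],
                       lambda c, d: [4, 5, 6, 7],
                       ctx))
```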
├── xllm/
│   :  main source folder
│   ├── api_service/             # code for api services
│   ├── core/
│   │   :  xllm core features folder
│   │   ├── common/
│   │   ├── distributed_runtime/ # code for distributed and pd serving
│   │   ├── framework/           # code for execution orchestration
│   │   ├── kernels/             # adaptation for npu kernels
│   │   ├── layers/              # model layers impl
│   │   ├── runtime/             # code for worker and executor
│   │   ├── scheduler/           # code for batch and pd scheduler
│   │   └── util/
│   ├── models/                  # models impl
│   ├── processors/              # code for vlm pre-processing
│   ├── proto/                   # communication protocol
│   └── server/                  # xLLM server
├── examples/                    # examples of calling xLLM
├── tools/                       # code for npu time generation
└── xllm.cpp                     # entrypoint of xLLM
Supported models list:
- DeepSeek-V3/R1
- DeepSeek-R1-Distill-Qwen
- Kimi-k2
- Llama2/3
- MiniCPM-V
- Qwen2/2.5/QwQ
- Qwen2.5-VL
- Qwen3 / Qwen3-MoE
First, download the image we provide:
docker pull xllm/xllm-ai:xllm-0.6.0-dev-800I-A2-py3.11-openeuler24.03-lts
Then create the corresponding container:
sudo docker run -it --ipc=host -u 0 --privileged --name mydocker --network=host \
    --device=/dev/davinci0 --device=/dev/davinci_manager --device=/dev/devmm_svm --device=/dev/hisi_hdc \
    -v /var/queue_schedule:/var/queue_schedule \
    -v /mnt/cfs/9n-das-admin/llm_models:/mnt/cfs/9n-das-admin/llm_models \
    -v /usr/local/Ascend/driver:/usr/local/Ascend/driver \
    -v /usr/local/Ascend/add-ons/:/usr/local/Ascend/add-ons/ \
    -v /usr/local/sbin/npu-smi:/usr/local/sbin/npu-smi \
    -v /usr/local/sbin/:/usr/local/sbin/ \
    -v /var/log/npu/conf/slog/slog.conf:/var/log/npu/conf/slog/slog.conf \
    -v /var/log/npu/slog/:/var/log/npu/slog \
    -v /export/home:/export/home -w /export/home \
    -v ~/.ssh:/root/.ssh \
    -v /var/log/npu/profiling/:/var/log/npu/profiling \
    -v /var/log/npu/dump/:/var/log/npu/dump \
    -v /home/:/home/ -v /runtime/:/runtime/ \
    xllm/xllm-ai:xllm-0.6.0-dev-800I-A2-py3.11-openeuler24.03-lts
Clone the official repo and initialize the submodules:
git clone https://github.com/jd-opensource/xllm
cd xllm
git submodule init
git submodule update
When compiling, vcpkg will be downloaded by default. Alternatively, you can download vcpkg in advance and then set the environment variable:
git clone https://github.com/microsoft/vcpkg.git
export VCPKG_ROOT=/your/path/to/vcpkg
Install python dependencies:
cd xllm
pip install -r cibuild/requirements-dev.txt -i https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple
pip install --upgrade setuptools wheel
Compile to generate the executable build/xllm/core/server/xllm under build/:
python setup.py build
Or, compile directly using the following command to generate the whl package under dist/:
python setup.py bdist_wheel
Run the following command to start the xLLM engine:
./build/xllm/core/server/xllm \   # launch the xllm server
  --model=/path/to/your/llm \     # model path (replace with your own path)
  --backend=llm \                 # indicate the LLM backend
  --port=9977 \                   # set the service port to 9977
  --max_memory_utilization 0.90   # set the maximum utilization of device memory
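Once the server is up, you can send it a request. The snippet below is a hypothetical client call that assumes an OpenAI-style completions endpoint on the configured port; check the examples/ directory for the request formats xLLM actually supports.

```python
# Hypothetical client call; endpoint path and field names are assumptions.
import requests

resp = requests.post(
    "http://127.0.0.1:9977/v1/completions",
    json={
        "model": "your-model-name",   # placeholder, replace with your deployed model
        "prompt": "Hello, xLLM!",
        "max_tokens": 32,
    },
    timeout=60,
)
print(resp.json())
```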
There are several ways you can contribute to xLLM:
- Reporting Issues (Bugs & Errors)
- Suggesting Enhancements
- Improving Documentation
- Fork the repository
- Add your changes to the documentation
- Send your pull request
- Writing Code
- Fork the repository
- Create a new branch
- Add your feature or improvement
- Send your pull request
We appreciate all kinds of contributions! 🎉🎉🎉 If you have questions about development, please check our documentation: Document
If you encounter any issues along the way, you are welcome to submit reproducible steps and log snippets in the project's Issues area, or contact the xLLM Core team directly via your internal Slack.
Feel free to contact us:
This project was made possible thanks to the following open-source projects:
- ScaleLLM - xLLM draws inspiration from ScaleLLM's graph construction method and references its runtime execution.
- MindIE - xLLM kernels are developed based on MindIE's kernel optimizations and ATB.
- Mooncake - xLLM's hybrid KV cache management is built on Mooncake.
- brpc - xLLM's high-performance HTTP service is built on brpc.
- tokenizers-cpp - xLLM's C++ tokenizer is built on tokenizers-cpp.
- safetensors - xLLM relies on safetensors' C-binding capabilities.
- Partial JSON Parser - xLLM's C++ partial JSON parser is implemented with insights from the Python and Go implementations.
Thanks to all the following developers who have contributed to xLLM.