xLLM is an efficient and user-friendly intelligent LLM inference framework that provides enterprise-level service guarantees and high-performance engine computing capabilities for model inference on domestic AI accelerators.
LLMs with parameter scales ranging from tens of billions to trillions are being rapidly deployed in core business scenarios such as intelligent customer service, real-time recommendation, and content generation, and efficient support for domestic computing hardware has become a core requirement for low-cost inference deployment. Existing inference engines struggle to adapt to the architectural characteristics of dedicated accelerators such as domestic chips: low utilization of computing units, load imbalance and communication bottlenecks under the MoE architecture, and difficulties in KV cache management all restrict request throughput and system scalability. The xLLM inference engine improves resource efficiency across the entire "communication-computation-storage" performance path and currently supports JD.com's online services across multiple scenarios and models.
xLLM delivers robust intelligent computing capabilities. By leveraging hardware system optimization and algorithm-driven decision control, it jointly accelerates the inference process, enabling high-throughput, low-latency distributed inference services.
Full Graph Pipeline Execution Orchestration
- Asynchronous decoupled scheduling at the request scheduling layer to reduce computational bubbles.
- Asynchronous parallelism of computation and communication at the model graph layer, overlapping computation and communication.
- Pipelining of heterogeneous computing units at the operator kernel layer, overlapping computation and memory access.
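The scheduling/execution decoupling can be pictured with a toy producer-consumer sketch (illustrative only, not xLLM code): while the device executes batch N, the scheduler is already assembling batch N+1, so the device is not left idle waiting on host-side work.

```python
# Minimal sketch of decoupled scheduling vs. execution. All names are illustrative.
import queue
import threading
import time

def scheduler(batch_queue: queue.Queue, num_batches: int) -> None:
    for step in range(num_batches):
        time.sleep(0.01)           # stand-in for batching / input preparation on the host
        batch_queue.put(f"batch-{step}")
    batch_queue.put(None)          # sentinel: no more work

def executor(batch_queue: queue.Queue) -> None:
    while True:
        batch = batch_queue.get()
        if batch is None:
            break
        time.sleep(0.03)           # stand-in for the device-side forward pass
        print(f"executed {batch}")

q = queue.Queue(maxsize=2)         # a small buffer keeps the device continuously fed
t = threading.Thread(target=scheduler, args=(q, 5))
t.start()
executor(q)
t.join()
```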
Graph Optimization for Dynamic Shapes
- Dynamic shape adaptation based on parameterization and multi-graph caching to enhance the flexibility of static graphs.
- Managed tensor memory pool to ensure address safety and reusability.
- Integration and adaptation of performance-critical custom operators (e.g., PageAttention, AllReduce).
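As an illustration of the multi-graph caching idea (an assumed scheme, not xLLM's actual implementation), input lengths can be bucketed into a small set of padded sizes, with one compiled static graph cached per bucket:

```python
# Illustrative multi-graph cache keyed by shape bucket.
BUCKETS = [128, 256, 512, 1024, 2048]
_graph_cache = {}

def bucket_for(seq_len: int) -> int:
    for b in BUCKETS:
        if seq_len <= b:
            return b
    raise ValueError(f"sequence length {seq_len} exceeds the largest bucket")

def get_graph(seq_len: int):
    """Return a cached graph for the padded shape, capturing/compiling it on first use."""
    b = bucket_for(seq_len)
    if b not in _graph_cache:
        # placeholder for an expensive capture/compile of a static graph at size b
        _graph_cache[b] = f"static-graph(padded_len={b})"
    return _graph_cache[b]

print(get_graph(300))   # compiles and caches the 512 bucket
print(get_graph(480))   # reuses the cached 512 graph
```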
Kernel Optimization
- GroupMatmul optimization to improve computational efficiency.
- Chunked Prefill optimization to support long-sequence inputs.
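The intent of chunked prefill can be sketched as follows (a toy example; the real kernels operate on device tensors and the KV cache): a long prompt is split into fixed-size chunks so that each forward pass stays within a bounded token budget and can be interleaved with decode steps of other requests.

```python
# Toy chunked-prefill loop; chunk_size is illustrative.
def chunked_prefill(prompt_tokens: list[int], chunk_size: int = 512):
    for start in range(0, len(prompt_tokens), chunk_size):
        chunk = prompt_tokens[start:start + chunk_size]
        # placeholder for one forward pass that appends this chunk's KV cache
        yield start, len(chunk)

for offset, n in chunked_prefill(list(range(1300)), chunk_size=512):
    print(f"prefilled tokens [{offset}, {offset + n})")
```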
Efficient Memory Optimization
- Mapping management between discrete physical memory and continuous virtual memory.
- On-demand memory allocation to reduce memory fragmentation.
- Intelligent scheduling of memory pages to increase memory reusability.
- Adaptation of corresponding operators for domestic accelerators.
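A simplified sketch of the discrete-physical/continuous-virtual mapping idea (names and sizes are illustrative, not xLLM's data structures): each sequence keeps a block table from logical block indices to physical blocks, and physical blocks are allocated only when the sequence actually grows into them.

```python
# On-demand paged KV memory, reduced to a block table.
BLOCK_SIZE = 16                       # tokens per physical block (illustrative)

class BlockTable:
    def __init__(self, free_blocks: list[int]):
        self.free_blocks = free_blocks
        self.table: list[int] = []    # logical block index -> physical block id

    def ensure_capacity(self, num_tokens: int) -> None:
        needed = (num_tokens + BLOCK_SIZE - 1) // BLOCK_SIZE
        while len(self.table) < needed:
            self.table.append(self.free_blocks.pop())  # allocate a physical block on demand

seq = BlockTable(free_blocks=list(range(100)))
seq.ensure_capacity(40)               # 40 tokens -> 3 physical blocks
print(seq.table)
```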
Global KV Cache Management
- Intelligent offloading and prefetching of KV in hierarchical caches.
- KV cache-centric distributed storage architecture.
- Intelligent KV routing among computing nodes.
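A toy sketch of the hierarchical offload/prefetch idea (illustrative only): cold KV blocks are moved from device memory to a host-side capacity tier instead of being discarded, and prefetched back before a request that reuses them is scheduled.

```python
# Two-tier KV block cache with offload and prefetch; sizes and policies are illustrative.
device_cache: dict[str, bytes] = {}   # fast tier (e.g., NPU HBM)
host_cache: dict[str, bytes] = {}     # capacity tier (e.g., DRAM / SSD)
DEVICE_CAPACITY = 2

def put(block_id: str, data: bytes) -> None:
    if len(device_cache) >= DEVICE_CAPACITY:
        victim = next(iter(device_cache))                 # evict the oldest block (real policies are smarter)
        host_cache[victim] = device_cache.pop(victim)     # offload instead of discarding
    device_cache[block_id] = data

def prefetch(block_id: str) -> None:
    # bring a block back to the fast tier (capacity check omitted for brevity)
    if block_id not in device_cache and block_id in host_cache:
        device_cache[block_id] = host_cache.pop(block_id)

put("req1/blk0", b"...")
put("req1/blk1", b"...")
put("req2/blk0", b"...")   # triggers an offload to the host tier
prefetch("req1/blk0")      # reload a reused block before scheduling its request
print(sorted(device_cache), sorted(host_cache))
```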
Algorithm-driven Acceleration
- Speculative decoding optimization to improve efficiency through multi-core parallelism.
- Dynamic load balancing of MoE experts to achieve efficient adjustment of expert distribution.
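The verify-and-accept logic at the heart of speculative decoding can be sketched as follows (a greedy toy variant; production implementations use probabilistic acceptance and run draft and target work in parallel on the device):

```python
# Schematic speculative decoding step: a cheap draft proposes k tokens, the target
# verifies them in one batched pass, and the longest matching prefix is kept.
def speculative_step(draft_propose, target_verify, context: list[int], k: int = 4):
    draft = draft_propose(context, k)           # k proposed tokens
    verified = target_verify(context, draft)    # target's tokens for the same positions
    accepted = []
    for d, t in zip(draft, verified):
        if d == t:
            accepted.append(d)                  # proposal matches: accepted "for free"
        else:
            accepted.append(t)                  # first mismatch: take the target token and stop
            break
    return context + accepted

# Toy stand-ins: the draft guesses four tokens, the target agrees on the first two.
ctx = [1, 2, 3]
print(speculative_step(lambda c, k: [4, 5, 9, 9],
                       lambda c, d: [4, 5, 6, 7],
                       ctx))
```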
├── xllm/
│   :  main source folder
│   ├── api_service/             # code for api services
│   ├── core/
│   │   :  xllm core features folder
│   │   ├── common/
│   │   ├── distributed_runtime/ # code for distributed and pd serving
│   │   ├── framework/           # code for execution orchestration
│   │   ├── kernels/             # adaptation for npu kernels
│   │   ├── layers/              # model layers impl
│   │   ├── runtime/             # code for worker and executor
│   │   ├── scheduler/           # code for batch and pd scheduler
│   │   └── util/
│   ├── models/                  # models impl
│   ├── processors/              # code for vlm pre-processing
│   ├── proto/                   # communication protocol
│   └── server/                  # xLLM server
├── examples/                    # examples of calling xLLM
├── tools/                       # code for npu time generation
└── xllm.cpp                     # entrypoint of xLLM
Supported models list:
- DeepSeek-V3/R1
- DeepSeek-R1-Distill-Qwen
- Kimi-k2
- Llama2/3
- MiniCPM-V
- Qwen2/2.5/QwQ
- Qwen2.5-VL
- Qwen3 / Qwen3-MoE
First, download the image we provide:
docker pull xllm/xllm-ai:xllm-0.6.0-dev-800I-A2-py3.11-openeuler24.03-lts
Then create the corresponding container:
sudo docker run -it --ipc=host -u 0 --privileged --name mydocker --network=host \
    --device=/dev/davinci0 --device=/dev/davinci_manager --device=/dev/devmm_svm --device=/dev/hisi_hdc \
    -v /var/queue_schedule:/var/queue_schedule \
    -v /mnt/cfs/9n-das-admin/llm_models:/mnt/cfs/9n-das-admin/llm_models \
    -v /usr/local/Ascend/driver:/usr/local/Ascend/driver \
    -v /usr/local/Ascend/add-ons/:/usr/local/Ascend/add-ons/ \
    -v /usr/local/sbin/npu-smi:/usr/local/sbin/npu-smi \
    -v /usr/local/sbin/:/usr/local/sbin/ \
    -v /var/log/npu/conf/slog/slog.conf:/var/log/npu/conf/slog/slog.conf \
    -v /var/log/npu/slog/:/var/log/npu/slog \
    -v /export/home:/export/home -w /export/home \
    -v ~/.ssh:/root/.ssh \
    -v /var/log/npu/profiling/:/var/log/npu/profiling \
    -v /var/log/npu/dump/:/var/log/npu/dump \
    -v /home/:/home/ -v /runtime/:/runtime/ \
    xllm/xllm-ai:xllm-0.6.0-dev-800I-A2-py3.11-openeuler24.03-lts
Clone the official repo and initialize the submodules:
git clone https://github.com/jd-opensource/xllm
cd xllm
git submodule init
git submodule update
When compiling, vcpkg will be downloaded by default. Alternatively, you can download vcpkg in advance and then set the environment variable:
git clone https://github.com/microsoft/vcpkg.git
export VCPKG_ROOT=/your/path/to/vcpkg
Install python dependencies:
cd xllm
pip install -r cibuild/requirements-dev.txt -i https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple
pip install --upgrade setuptools wheel
Compile to generate the executable build/xllm/core/server/xllm under build/:
python setup.py build
Or, compile directly using the following command to generate the whl package under dist/:
python setup.py bdist_wheel
Run the following command to start the xLLM engine:
./build/xllm/core/server/xllm \   # launch the xllm server
  --model=/path/to/your/llm \     # model path (replace with your own path)
  --backend=llm \                 # indicate the LLM backend
  --port=9977 \                   # set the service port to 9977
  --max_memory_utilization 0.90   # set the maximum utilization of device memory
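Once the server is up, you can send it a request. The snippet below is a hypothetical client call that assumes an OpenAI-style completions endpoint on the configured port; check the examples/ directory for the request formats xLLM actually supports.

```python
# Hypothetical client call; endpoint path and field names are assumptions.
import requests

resp = requests.post(
    "http://127.0.0.1:9977/v1/completions",
    json={
        "model": "your-model-name",   # placeholder, replace with your deployed model
        "prompt": "Hello, xLLM!",
        "max_tokens": 32,
    },
    timeout=60,
)
print(resp.json())
```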
There are several ways you can contribute to xLLM:
- Reporting Issues (Bugs & Errors)
- Suggesting Enhancements
- Improving Documentation
- Fork the repository
- Add your changes to the documentation
- Send your pull request
- Writing Code
- Fork the repository
- Create a new branch
- Add your feature or improvement
- Send your pull request
We appreciate all kinds of contributions! 🎉🎉🎉 If you have questions about development, please check our documentation: Document
If you encounter any issues along the way, you are welcome to submit reproducible steps and log snippets in the project's Issues area, or contact the xLLM Core team directly via your internal Slack.
Feel free to contact us:
This project was made possible thanks to the following open-source projects:
- ScaleLLM - xLLM draws inspiration from ScaleLLM's graph construction method and references its runtime execution.
- MindIE - xLLM kernels are developed based on MindIE's kernel optimizations and ATB.
- Mooncake - xLLM's hybrid KV cache management is built on Mooncake.
- brpc - xLLM's high-performance HTTP service is built on brpc.
- tokenizers-cpp - xLLM's C++ tokenizer is built on tokenizers-cpp.
- safetensors - xLLM relies on safetensors' C-binding capabilities.
- Partial JSON Parser - xLLM's C++ partial JSON parser is implemented with insights from the Python and Go implementations.
Thanks to all the following developers who have contributed to xLLM.