[WIP] Disaggregated prefilling support X prefill + Y decode #9537

zeroorhero · 2024-10-21T02:40:59Z

Hi,

I am implementing disaggregated prefilling with X prefill + Y decode instances, based on #8498. Currently, this PR implements the one prefill + one decode form. I think that implementing the X prefill + Y decode form requires a KV database. Fortunately, I have already integrated valkey (Redis over RDMA) in #8724. As the next step, I will introduce this KV database and then run tests. Below I describe my design; suggestions are welcome.

New components:

PDEngine (Prefill Decode Engine): Similar to MQLLMEngine, it consists of a client and a server.
RabbitMQ: Carries requests between prefill nodes and decode nodes, in two directions. The request direction goes from prefill nodes to decode nodes and uses a single RabbitMQ queue: all prefill nodes are producers and all decode nodes are consumers. The result direction goes from decode nodes back to prefill nodes: each prefill node has its own RabbitMQ queue that receives the results of the requests that node sent, so here prefill nodes are consumers and decode nodes are producers. (How does a decode node find the originating prefill node? With a somewhat hacky method: when a prefill node sends a request to a decode node, it appends a string of node information to the request's req-id, and the decode node uses this information to route the result back to the corresponding node.)
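The req-id tagging described above could look roughly like the following. This is only a minimal sketch: the `@` separator, the helper names, and the `results.<node>` queue naming are assumptions made for illustration, not what the PR actually uses.

```python
# Hypothetical helpers illustrating "append node info to the req-id".
# The separator and queue-name format are assumptions, not taken from the PR.
SEPARATOR = "@"


def tag_request_id(req_id: str, prefill_node: str) -> str:
    """Append the originating prefill node's identifier to the request id."""
    if SEPARATOR in prefill_node:
        raise ValueError("node identifier must not contain the separator")
    return f"{req_id}{SEPARATOR}{prefill_node}"


def split_request_id(tagged_id: str) -> tuple[str, str]:
    """Recover (original req_id, prefill node) from a tagged request id."""
    # rpartition splits on the LAST separator, so a req_id that itself
    # contains the separator is still recovered correctly.
    req_id, sep, node = tagged_id.rpartition(SEPARATOR)
    if not sep:
        raise ValueError(f"request id {tagged_id!r} carries no node information")
    return req_id, node


def result_queue_name(tagged_id: str) -> str:
    """Name of the per-prefill-node result queue a decode node should publish to."""
    _, node = split_request_id(tagged_id)
    return f"results.{node}"
```

Splitting on the last separator keeps the scheme robust even if the original req-id happens to contain the separator character.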

Data flow:

First, the API server of the prefill node accepts the request and passes it to the engine client in the prefill node. The engine client forwards the request to the engine server in the prefill node over zmq (similar to MQLLMEngine). At this stage, max_tokens is set to 1 so that only the prefill operation runs, and the result is returned to the engine client of the prefill node.
Then, the engine client in the prefill node forwards the request to the engine server in a decode node through the globally unique request queue in RabbitMQ. The req-id of this request carries the information of the source prefill node.
After the engine server in the decode node completes inference, it uses the source prefill node information contained in the request to send the result to that prefill node's RabbitMQ queue.
The engine client of that prefill node consumes the result and returns it.
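The four steps above can be simulated in a single process. The sketch below uses `queue.Queue` objects as stand-ins for the RabbitMQ queues and a placeholder string instead of real inference; the `Request` class, the function names, the node names, and the `@` separator are all hypothetical and only illustrate the routing.

```python
import queue
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Request:
    req_id: str
    prompt: str
    max_tokens: int


# One shared queue for the request direction (prefill -> decode).
request_queue: "queue.Queue[Request]" = queue.Queue()
# One result queue per prefill node for the return direction (decode -> prefill).
result_queues = {"prefill-0": queue.Queue(), "prefill-1": queue.Queue()}


def prefill_step(node: str, req: Request) -> Request:
    """Steps 1-2: run prefill with max_tokens=1, then enqueue the tagged request."""
    prefill_only = replace(req, max_tokens=1)  # prefill pass only
    # A real engine server would process `prefill_only` here.
    tagged = replace(req, req_id=f"{req.req_id}@{node}")
    request_queue.put(tagged)
    return prefill_only


def decode_step() -> None:
    """Step 3: any decode node consumes a request and routes the result back."""
    req = request_queue.get()
    _, _, node = req.req_id.rpartition("@")
    output = f"<up to {req.max_tokens} tokens for {req.prompt!r}>"  # placeholder
    result_queues[node].put((req.req_id, output))


def collect_result(node: str):
    """Step 4: the prefill node's engine client consumes its own result queue."""
    return result_queues[node].get()
```

For example, calling `prefill_step("prefill-1", Request("r1", "hello", 8))` followed by `decode_step()` leaves exactly one result in the `prefill-1` queue and none in the `prefill-0` queue, whichever decode worker handled the request.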

Benchmark

python3 benchmark_serving.py --backend vllm --dataset-name random --model /root/lcq/model/Llama-2-7b-hf --tokenizer /root/lcq/model/Llama-2-7b-hf --random-input-len 128 --random-output-len 8 --request-rate 4 --num-prompts 64

The performance is basically the same.


Signed-off-by: Changqi Lu <luchangqi.123@bytedance.com>

github-actions · 2024-10-21T02:41:11Z

👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs would not trigger full CI run by default. Instead, it would only run fastcheck CI which starts running only a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on Buildkite UI (linked in the PR checks section) and unblock them. If you do not have permission to unblock, ping simon-mo or khluu to add you in our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can do one of these:

Add ready label to the PR
Enable auto-merge.

🚀

KuntaiDu added 30 commits July 18, 2024 20:08

add sleep when initializing parallel state

cb6d6a5

only log when rank%4==0

fe8fb47

only log when rank%4==0

cc89bfb

bug fix

531bdf3

also only log when rank=4 in custom all reduce

1804656

add debuging statement around broadcast

81c8640

debug init_world_group

5ba142c

put the log inside a text file

cc939cf

init DISAGG first

8ac9266

init DISAGG before global

58849fa

put it behind world_size

08797e2

add more debug information in pynccl

4ff4cd6

typo fix

b09e4e6

more debug

583de97

more debug info

74bcfff

put every output

2175825

remove unnecessary sleep

3e07770

add sucess statement

a22e5cd

add debug statement

2c0c27d

log rank in success message

a783787

sleep based on rank to avoid message overlapping

79f0b06

increase torch debug level

b17f20f

sleep

025f209

set gloo debugging level to trace

32292f1

reduce debugging commands

389fb24

avoid initializing NCCL first

1b38b29

check

bb8c08a

locate the hanging line

25a7cf3

add rank to CPU group

999bd72

narrow case

3428ea6

KuntaiDu and others added 29 commits September 20, 2024 01:13

remove empty file

1d7a1c9

fix bug when world_size == -1

10ad09c

adjust comments

38e3a57

make yapf and ruff happy

e2bd481

relaunch CI

4979337

change get_open_port so that it is easier to understand

a2007dc

adjust comment

ce434f5

make format checker happy

f224c71

adjust model runner docstring

5d9b007

make format checker happy

6255dca

change data == [] to not data (thanks Cody)

71ae275

fix misleading to available

80164ea

add new line and run format checker

52c2d10

add docstring for lookup buffer

09478ef

align docstring syntax

06cb15c

add docstring for abstract classes

7c11a39

put assertion at the end of the function

37bac34

add fp8 support to pipe

111abb4

adjust docstrings

394afaa

bug fix: check isinstance(torch.Tensor) before checking NOne

76019f1

make format check happy

93ec62b

Merge branch 'main' into kuntai-disagg-refactor

87b82cc

Adjust to latest changes of kv_caches: it is now always a tensor.

c5bdf64

debug

596eb64

bug fix: kv_caches will be list of torch.tensor([]) in profile run.

683bd9c

Merge branch 'vllm-project:main' into kuntai-disagg-refactor

81aa825

Relax server start timeout limit

521daba

Merge branch 'kuntai-disagg-refactor' of https://github.com/KuntaiDu/…

516f9ca

…vllm into kuntai-disagg-refactor

[wip] disaggregated prefilling support X prefill + Y decode.

d9db3e7

Signed-off-by: Changqi Lu <luchangqi.123@bytedance.com>
