[WIP] Disaggregated prefilling support X prefill + Y decode #9537

Draft
wants to merge 297 commits into
base: main

@zeroorhero zeroorhero commented Oct 21, 2024

Hi,

I am implementing disaggregated prefilling with X prefill + Y decode nodes, building on #8498. This PR currently implements the one prefill + one decode case. I think the X prefill + Y decode case requires a KV database. Fortunately, I have already integrated valkey (redis over rdma) in #8724. As the next step, I will introduce this KV database and then run tests. Below I describe my design; suggestions are welcome.

Splitwise design

New components:

  • PDEngine (Prefill Decode Engine): Similar to MQLLMEngine, it consists of a client and a server.
  • RabbitMQ: Transports requests between prefill nodes and decode nodes, in two directions. In the request direction, from prefill nodes to decode nodes, there is a single RabbitMQ queue: all prefill nodes are producers and all decode nodes are consumers. In the result direction, from decode nodes back to prefill nodes, each prefill node has its own RabbitMQ queue that receives the results of the requests it sent: here decode nodes are producers and prefill nodes are consumers. (How does a decode node find the originating prefill node? With a somewhat hacky trick: when a prefill node sends a request to a decode node, it appends a string of node information to the request's req-id. The decode node then uses this information to route the result back to the corresponding node.)
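The req-id trick described above can be sketched as a pair of helpers. This is only an illustration under assumptions: the `@` separator, the function names, and the form of the node information string are hypothetical and are not taken from the PR.

```python
from typing import Tuple

# Hypothetical delimiter; the PR does not say how the node info is appended.
SEPARATOR = "@"


def tag_request_id(req_id: str, node_info: str) -> str:
    """Append the source prefill node's info to the request id."""
    if SEPARATOR in node_info:
        # Keeping the separator out of node_info makes the split unambiguous.
        raise ValueError("node_info must not contain the separator")
    return f"{req_id}{SEPARATOR}{node_info}"


def split_request_id(tagged_id: str) -> Tuple[str, str]:
    """Recover (original req_id, node_info) from a tagged request id."""
    # Split on the LAST separator so a req_id that itself contains
    # the separator is still recovered intact.
    req_id, sep, node_info = tagged_id.rpartition(SEPARATOR)
    if not sep:
        raise ValueError("request id carries no node info")
    return req_id, node_info
```

A decode node would call `split_request_id` on the incoming request and use the returned node information to pick the result queue to publish to.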

Data flow:

  • First, the API server of the prefill node accepts the request and passes it to the engine client in the prefill node. The engine client forwards the request to the engine server in the prefill node through zmq (similar to MQLLMEngine). At this stage max_tokens is set to 1 so that only the prefill operation runs, and the result is returned to the engine client of the prefill node.
  • Then, the engine client in the prefill node forwards the request to the engine server in the decode node through the globally unique request RabbitMQ queue. The req-id of this request carries the information of the source prefill node.
  • After the engine server in the decode node completes inference, it sends the result to the RabbitMQ queue of the original prefill node, which it identifies from the information carried in the request.
  • The engine client of that prefill node consumes the result and then returns it.
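The data flow above can be simulated end to end with in-process queues. This is a rough sketch, not the PR's implementation: Python's `queue.Queue` stands in for RabbitMQ, the node names, the `@` separator, and all function names are hypothetical, and the actual prefill and decode computation is replaced by placeholder strings.

```python
import queue
from dataclasses import dataclass


@dataclass
class Request:
    req_id: str
    prompt: str
    max_tokens: int


# Stand-in for the single, globally shared request queue.
request_queue: "queue.Queue[Request]" = queue.Queue()
# Stand-in for the per-prefill-node result queues (node names are made up).
result_queues = {"prefill-0": queue.Queue(), "prefill-1": queue.Queue()}


def prefill_and_forward(node: str, req: Request) -> Request:
    """Steps 1-2: run prefill only, then publish to the shared queue."""
    # Prefill pass: same request, but max_tokens forced to 1.
    prefill_req = Request(req.req_id, req.prompt, max_tokens=1)
    # (A real engine would run prefill_req here and keep the KV cache.)
    # Forward the original request, tagged with the source node.
    request_queue.put(Request(f"{req.req_id}@{node}", req.prompt, req.max_tokens))
    return prefill_req


def decode_once() -> None:
    """Step 3: a decode node consumes one request and routes the result back."""
    req = request_queue.get()
    req_id, _, node = req.req_id.rpartition("@")
    output = f"decoded({req.prompt}, max_tokens={req.max_tokens})"  # placeholder
    result_queues[node].put((req_id, output))


def collect(node: str):
    """Step 4: the prefill node's engine client consumes its result."""
    return result_queues[node].get()
```

Calling `prefill_and_forward("prefill-1", Request("r1", "hello", 8))`, then `decode_once()`, then `collect("prefill-1")` walks one request through all four steps; the other prefill node's result queue stays empty.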

Benchmark

python3 benchmark_serving.py --backend vllm --dataset-name random --model /root/lcq/model/Llama-2-7b-hf --tokenizer /root/lcq/model/Llama-2-7b-hf --random-input-len 128 --random-output-len 8 --request-rate 4 --num-prompts 64

The performance is basically the same.


👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which covers a small, essential subset of CI tests to catch errors quickly. You can run other CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can do one of these:

  • Add ready label to the PR
  • Enable auto-merge.

🚀

4 participants