
[ Frontend ] Multiprocessing for OpenAI Server with zeromq #6883

Merged
merged 84 commits into vllm-project:main on Aug 3, 2024

Conversation

@robertgshaw2-neuralmagic (Collaborator) commented Jul 29, 2024

RFC: #6797

SUMMARY:

  • Use multiprocessing for the OpenAI server so that it does not contend with the AsyncLLMEngine
  • Use zeromq for networking between the OAI process and the AsyncLLMEngine process (a minimal sketch of this pattern follows this list)
  • Currently uses pickle for Python serialization; this will be switched to protobufs in a separate PR
  • Gives ~20% TPOT improvement at QPS 10 for llama-8b-instruct on an H100 (550 input tokens, 150 output tokens)
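
As referenced in the list above, here is a minimal, self-contained sketch of the pattern this PR adopts. It is not vLLM's actual code: the function names, the fixed port, and the REQ/REP socket types are illustrative stand-ins for the real RPC client/server, which streams results and picks an open port.

```python
import pickle
from multiprocessing import Process

import zmq  # pyzmq


def engine_process(port: int) -> None:
    """Stand-in for the process that owns AsyncLLMEngine."""
    ctx = zmq.Context()
    sock = ctx.socket(zmq.REP)
    sock.bind(f"tcp://127.0.0.1:{port}")
    while True:
        request = pickle.loads(sock.recv())  # pickle-based serialization, as in this PR
        if request == "shutdown":
            sock.send(pickle.dumps("ok"))
            break
        # The real engine would run generation here and stream results back.
        sock.send(pickle.dumps({"echo": request}))
    sock.close()
    ctx.term()


if __name__ == "__main__":
    port = 5570  # illustrative fixed port; the PR asks the OS for an open port
    engine = Process(target=engine_process, args=(port,))
    engine.start()

    # API-server side: talk to the engine process over zeromq instead of
    # calling AsyncLLMEngine inside the same process.
    ctx = zmq.Context()
    sock = ctx.socket(zmq.REQ)
    sock.connect(f"tcp://127.0.0.1:{port}")
    sock.send(pickle.dumps({"prompt": "hello"}))
    print(pickle.loads(sock.recv()))
    sock.send(pickle.dumps("shutdown"))
    sock.recv()
    sock.close()
    ctx.term()
    engine.join()
```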

TODO:

  • enable aborting
  • enable do_log_stats
  • error handling
  • refactor to use single socket
  • use open port
  • turn on the other OAI endpoints
  • add flag to disable mp
  • tokenizer
  • guided decoding
  • tracing
  • make CI green

PERFORMANCE:

  • The zeromq-based protocol delivers a speedup in the high-QPS scenario:
  • Llama-3-8B-Instruct, 1xH100, 550 input tokens | 150 output tokens, 10 QPS, chunked prefill
| Metric | main | grpc | zmq + pickle (this PR) | zmq + protobuf |
| --- | --- | --- | --- | --- |
| TPOT mean (ms) | 70.74 | 64.15 | 57.35 | 54.54 |
| TTFT mean (ms) | 687.69 | 450.05 | 298.09 | 269.01 |

A clear win, and clear that we should move forward with zeromq.


Benchmark script for reference:

MODEL="meta-llama/Meta-Llama-3-8B-Instruct"
TOTAL_SECONDS=60
QPS_RATES=("10")

for QPS in ${QPS_RATES[@]}; do
    NUM_PROMPTS=$((TOTAL_SECONDS * QPS))
    echo "===== RUNNING NUM_PROMPTS = $NUM_PROMPTS QPS = $QPS ====="

    python3 benchmarks/benchmark_serving.py \
        --model $MODEL \
        --dataset-name sonnet --sonnet-input-len 550 --sonnet-output-len 150 --dataset-path benchmarks/sonnet.txt \
        --num-prompts $NUM_PROMPTS --request-rate $QPS
done

FOLLOW UP WORK:

  • Refactor how logits processors are handled. They should no longer belong to the API layer, but rather live below the engine. Currently there are hacks to make them serializable.
  • Switch to using a protobuf communication layer for the RPC Server
  • Move detokenization to the OAI layer
  • Profile using N uvicorn processes + expose a flag
  • Support embedding models via RPC

joerunde and others added 30 commits July 25, 2024 11:44
@simon-mo (Collaborator) left a comment

LGTM overall. This is a good direction. Can we check one edge case?

  • Send a request that generates 2K tokens
  • Drop the connection
  • Check that the generate coroutine is properly cancelled, that abort is called, and that no coroutines leak on either the client or the server (by printing something out...); see the sketch after this list
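
A hedged sketch of how that check could be reasoned about (illustrative only; `engine.abort`, the request id, and the wrapper are hypothetical stand-ins, not vLLM's exact API): wrap the streaming generator so that any early exit, including a dropped connection, triggers an abort on the engine side.

```python
from typing import Any, AsyncIterator


async def stream_with_abort(engine: Any, request_id: str,
                            results: AsyncIterator[str]) -> AsyncIterator[str]:
    """Yield streamed chunks; abort the request if the stream exits early."""
    finished = False
    try:
        async for chunk in results:
            yield chunk
        finished = True
    finally:
        if not finished:
            # Reached via cancellation or generator close (e.g. the HTTP client
            # dropped the connection), so ask the engine to abort the request
            # and free its state; no coroutine should be left running.
            await engine.abort(request_id)
```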

```python
) -> PreTrainedTokenizer:
    """Get the appropriate Tokenizer for the request"""

async def is_tracing_enabled(self) -> bool:
```
@robertgshaw2-neuralmagic (Collaborator, Author):
@joerunde why are these pass?

@robertgshaw2-neuralmagic (Collaborator, Author):

> LGTM overall. This is a good direction. Can we check one edge case?
>
>   • Send a request that generates 2K tokens
>   • Drop the connection
>   • Check that the generate coroutine is properly cancelled, that abort is called, and that no coroutines leak on either the client or the server (by printing something out...)

Do you want this in automation?

@robertgshaw2-neuralmagic (Collaborator, Author) commented Aug 3, 2024

TPOT speedup (mean TPOT in ms):

| QPS | 2 | 6 | 10 | 14 |
| --- | --- | --- | --- | --- |
| main | 14.55 | 37.02 | 72.31 | 82.56 |
| PR | 13.34 | 31.88 | 61.03 | 75.18 |

Nice win + more to go in the follow-ups (especially protobufs)

@simon-mo (Collaborator) left a comment

:shipit:

@simon-mo merged commit ed812a7 into vllm-project:main on Aug 3, 2024
63 checks passed
```python
async def _send_one_way_rpc_request(self, request: RPC_REQUEST_TYPE,
                                    error_message: str):
    """Send one-way RPC request to trigger an action."""
    with self.socket() as socket:
```
Member:
I'm trying to understand this. We create a socket every time we call this function?

Collaborator (Author):
Yes. I believe this is a common paradigm for zeromq-based clients.

Also, this function is mostly used during setup.

Collaborator (Author):
Also, this does not mean a new Unix socket is created. This is a zeromq socket; how zeromq handles it internally is opaque to us.
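
For context, a per-call socket helper along the lines of `self.socket()` might look like the sketch below (a hedged illustration, not the PR's exact code). The zmq socket object is a lightweight, in-process handle, so creating one per RPC is inexpensive.

```python
from contextlib import contextmanager

import zmq  # pyzmq


class RPCClientSketch:
    """Illustrative client that opens a short-lived zmq socket per call."""

    def __init__(self, path: str):
        # One shared context per process; sockets are created on demand.
        self.context = zmq.Context()
        self.path = path  # e.g. "tcp://127.0.0.1:5570"

    @contextmanager
    def socket(self):
        sock = self.context.socket(zmq.DEALER)
        try:
            sock.connect(self.path)
            yield sock
        finally:
            # Closes the lightweight zmq handle; the underlying transport is
            # managed by zeromq itself and is opaque to the caller.
            sock.close(linger=0)
```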

Comment on lines +111 to +115
```python
port = get_open_port(envs.VLLM_RPC_PORT)
rpc_server_process = Process(target=run_rpc_server,
                             args=(engine_args,
                                   UsageContext.OPENAI_API_SERVER,
                                   port))
```
Member:
why do we need a new env var here? isn't get_open_port() enough?

Collaborator (Author):
I would be open to reverting this. We will just need to check how it interacts with TP (tensor parallelism).
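
To make the trade-off concrete, here is a rough sketch (not vLLM's actual `get_open_port`) of the two behaviours being discussed: prefer a port from the env var if it is free, otherwise let the OS pick one.

```python
import socket
from typing import Optional


def get_open_port_sketch(preferred: Optional[int] = None) -> int:
    """Return a usable TCP port, trying `preferred` first if given."""
    if preferred is not None:
        try:
            with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
                s.bind(("", preferred))
            return preferred
        except OSError:
            pass  # preferred port is busy; fall back to an OS-assigned one
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("", 0))  # port 0 tells the kernel to pick any free port
        return s.getsockname()[1]
```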

@@ -0,0 +1,715 @@
"""
Repeat of tests in test_completion.py with the non-mp backend.
Member:
Why do we copy this file instead of adding a few lines to pass a new arg to the server, e.g. via @pytest.mark.parametrize?

Collaborator (Author):
Because we need to modify the default server args, which is a fixture for the whole file, so unfortunately I think this has to be a separate file.
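
Hypothetical sketch of the alternative being weighed here (names such as `default_server_args` and the disable-multiprocessing flag are illustrative, not necessarily vLLM's): parametrizing the module-scoped fixture itself would avoid the copied file, at the cost of re-running every test in the module for each backend.

```python
import pytest


@pytest.fixture(scope="module", params=["mp", "no-mp"])
def default_server_args(request):
    """Server args shared by every test in this module, once per backend."""
    args = ["--model", "meta-llama/Meta-Llama-3-8B-Instruct"]
    if request.param == "no-mp":
        # Hypothetical flag name for disabling the new multiprocessing frontend.
        args.append("--disable-frontend-multiprocessing")
    return args
```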

@youkaichao (Member) left a comment
Thanks for the great work! I just left some nit comments about things I don't understand. If you have time, could you please briefly draw a diagram showing how many processes we use and how they interact with each other?
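
For reference, a rough sketch of the process layout as this PR describes it (details and exact names may differ):

```
uvicorn / OpenAI API server process
        |
        |  zeromq RPC: pickled requests and results over a port from get_open_port
        v
RPC server process (run_rpc_server), which owns the AsyncLLMEngine
        |
        v
existing vLLM worker processes / GPU execution (unchanged by this PR)
```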

@SolitaryThinker mentioned this pull request on Aug 9, 2024
sfc-gh-mkeralapura pushed a commit to sfc-gh-mkeralapura/vllm that referenced this pull request Aug 12, 2024
…oject#6883)

Signed-off-by: Joe Runde <Joseph.Runde@ibm.com>
Co-authored-by: Joe Runde <Joseph.Runde@ibm.com>
Co-authored-by: Joe Runde <joe@joerun.de>
Co-authored-by: Nick Hill <nickhill@us.ibm.com>
Co-authored-by: Simon Mo <simon.mo@hey.com>
kylesayrs pushed a commit to neuralmagic/vllm that referenced this pull request Aug 17, 2024
…oject#6883)

Signed-off-by: Joe Runde <Joseph.Runde@ibm.com>
Co-authored-by: Joe Runde <Joseph.Runde@ibm.com>
Co-authored-by: Joe Runde <joe@joerun.de>
Co-authored-by: Nick Hill <nickhill@us.ibm.com>
Co-authored-by: Simon Mo <simon.mo@hey.com>
@mgoin mentioned this pull request on Sep 19, 2024
Labels: ready