
[ Frontend ] Multiprocessing for OpenAI Server with zeromq #6883

Merged
merged 84 commits into vllm-project:main on Aug 3, 2024

Conversation

@robertgshaw2-neuralmagic (Collaborator) commented Jul 29, 2024

RFC: #6797

SUMMARY:

  • Use multiprocessing for the OpenAI server so that it does not contend with the AsyncLLMEngine
  • Use zeromq for networking between the OAI process and the AsyncLLMEngine process (a minimal sketch of this pattern follows this list)
  • Currently uses pickle for Python serialization; this will be switched to protobufs in a separate PR
  • Gives ~20% TPOT improvement at QPS 10 for llama-8b-instruct on an H100 (550 input tokens, 150 output tokens)
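
As referenced in the list above, here is a minimal, self-contained sketch of the pattern this PR adopts. It is not vLLM's actual code: the function names, the fixed port, and the REQ/REP socket types are illustrative stand-ins for the real RPC client/server, which streams results and picks an open port.

```python
import pickle
from multiprocessing import Process

import zmq  # pyzmq


def engine_process(port: int) -> None:
    """Stand-in for the process that owns AsyncLLMEngine."""
    ctx = zmq.Context()
    sock = ctx.socket(zmq.REP)
    sock.bind(f"tcp://127.0.0.1:{port}")
    while True:
        request = pickle.loads(sock.recv())  # pickle-based serialization, as in this PR
        if request == "shutdown":
            sock.send(pickle.dumps("ok"))
            break
        # The real engine would run generation here and stream results back.
        sock.send(pickle.dumps({"echo": request}))
    sock.close()
    ctx.term()


if __name__ == "__main__":
    port = 5570  # illustrative fixed port; the PR asks the OS for an open port
    engine = Process(target=engine_process, args=(port,))
    engine.start()

    # API-server side: talk to the engine process over zeromq instead of
    # calling AsyncLLMEngine inside the same process.
    ctx = zmq.Context()
    sock = ctx.socket(zmq.REQ)
    sock.connect(f"tcp://127.0.0.1:{port}")
    sock.send(pickle.dumps({"prompt": "hello"}))
    print(pickle.loads(sock.recv()))
    sock.send(pickle.dumps("shutdown"))
    sock.recv()
    sock.close()
    ctx.term()
    engine.join()
```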

TODO:

  • enable aborting
  • enable do_log_stats
  • error handling
  • refactor to use single socket
  • use open port
  • turn on the other OAI endpoints
  • add flag to disable mp
  • tokenizer
  • guided decoding
  • tracing
  • make CI green

PERFORMANCE:

  • The zeromq-based protocol delivers a speedup in the high-QPS scenario:
  • Llama-3-8B-Instruct, 1xH100, 550 input tokens | 150 output tokens, 10 QPS, chunked prefill
| Metric | main | grpc | zmq + pickle (this PR) | zmq + protobuf |
| --- | --- | --- | --- | --- |
| TPOT mean (ms) | 70.74 | 64.15 | 57.35 | 54.54 |
| TTFT mean (ms) | 687.69 | 450.05 | 298.09 | 269.01 |

A clear win, and clear that we should move forward with zeromq.


Benchmark script for reference:

MODEL="meta-llama/Meta-Llama-3-8B-Instruct"
TOTAL_SECONDS=60
QPS_RATES=("10")

for QPS in ${QPS_RATES[@]}; do
    NUM_PROMPTS=$((TOTAL_SECONDS * QPS))
    echo "===== RUNNING NUM_PROMPTS = $NUM_PROMPTS QPS = $QPS ====="

    python3 benchmarks/benchmark_serving.py \
        --model $MODEL \
        --dataset-name sonnet --sonnet-input-len 550 --sonnet-output-len 150 --dataset-path benchmarks/sonnet.txt \
        --num-prompts $NUM_PROMPTS --request-rate $QPS
done

FOLLOW UP WORK:

  • Refactor how logits processors are handled. They should no longer belong to the API layer, but rather live below the engine. Currently there are hacks to make them serializable.
  • Switch to using a protobuf communication layer for the RPC Server
  • Move detokenization to the OAI layer
  • Profile using N uvicorn processes + expose a flag
  • Support embedding models via RPC

joerunde and others added 30 commits July 25, 2024 11:44
@simon-mo (Collaborator) left a comment

LGTM overall. This is a good direction. Can we check one edge case?

  • Send a request that generates 2K tokens
  • Drop the connection
  • Check that the generate coroutine is properly cancelled, that abort is called, and that no coroutines leak on either the client or the server (by printing something out...); see the sketch after this list
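
A hedged sketch of how that check could be reasoned about (illustrative only; `engine.abort`, the request id, and the wrapper are hypothetical stand-ins, not vLLM's exact API): wrap the streaming generator so that any early exit, including a dropped connection, triggers an abort on the engine side.

```python
from typing import Any, AsyncIterator


async def stream_with_abort(engine: Any, request_id: str,
                            results: AsyncIterator[str]) -> AsyncIterator[str]:
    """Yield streamed chunks; abort the request if the stream exits early."""
    finished = False
    try:
        async for chunk in results:
            yield chunk
        finished = True
    finally:
        if not finished:
            # Reached via cancellation or generator close (e.g. the HTTP client
            # dropped the connection), so ask the engine to abort the request
            # and free its state; no coroutine should be left running.
            await engine.abort(request_id)
```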

```python
) -> PreTrainedTokenizer:
    """Get the appropriate Tokenizer for the request"""

async def is_tracing_enabled(self) -> bool:
```
@robertgshaw2-neuralmagic (Collaborator, Author):
@joerunde why are these pass?

@robertgshaw2-neuralmagic (Collaborator, Author):

> LGTM overall. This is a good direction. Can we check one edge case?
>
>   • Send a request that generates 2K tokens
>   • Drop the connection
>   • Check that the generate coroutine is properly cancelled, that abort is called, and that no coroutines leak on either the client or the server (by printing something out...)

Do you want this in automation?

@robertgshaw2-neuralmagic (Collaborator, Author) commented Aug 3, 2024

TPOT speedup (mean TPOT in ms):

| QPS | 2 | 6 | 10 | 14 |
| --- | --- | --- | --- | --- |
| main | 14.55 | 37.02 | 72.31 | 82.56 |
| PR | 13.34 | 31.88 | 61.03 | 75.18 |

Nice win + more to go in the follow-ups (especially protobufs)

@simon-mo (Collaborator) left a comment

:shipit:

@simon-mo merged commit ed812a7 into vllm-project:main on Aug 3, 2024
63 checks passed
```python
async def _send_one_way_rpc_request(self, request: RPC_REQUEST_TYPE,
                                    error_message: str):
    """Send one-way RPC request to trigger an action."""
    with self.socket() as socket:
```
Member:
I'm trying to understand this. We create a socket every time we call this function?

Collaborator (Author):
Yes. I believe this is a common paradigm for zeromq-based clients.

Also, this function is mostly used during setup.

Collaborator (Author):
Also, this does not mean a new Unix socket is created. This is a zeromq socket; how zeromq handles it internally is opaque to us.
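
For context, a per-call socket helper along the lines of `self.socket()` might look like the sketch below (a hedged illustration, not the PR's exact code). The zmq socket object is a lightweight, in-process handle, so creating one per RPC is inexpensive.

```python
from contextlib import contextmanager

import zmq  # pyzmq


class RPCClientSketch:
    """Illustrative client that opens a short-lived zmq socket per call."""

    def __init__(self, path: str):
        # One shared context per process; sockets are created on demand.
        self.context = zmq.Context()
        self.path = path  # e.g. "tcp://127.0.0.1:5570"

    @contextmanager
    def socket(self):
        sock = self.context.socket(zmq.DEALER)
        try:
            sock.connect(self.path)
            yield sock
        finally:
            # Closes the lightweight zmq handle; the underlying transport is
            # managed by zeromq itself and is opaque to the caller.
            sock.close(linger=0)
```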

Comment on lines +111 to +115
```python
port = get_open_port(envs.VLLM_RPC_PORT)
rpc_server_process = Process(target=run_rpc_server,
                             args=(engine_args,
                                   UsageContext.OPENAI_API_SERVER,
                                   port))
```
Member:
why do we need a new env var here? isn't get_open_port() enough?

Collaborator (Author):
I would be open to reverting this. We will just need to check how it interacts with TP (tensor parallelism).
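
To make the trade-off concrete, here is a rough sketch (not vLLM's actual `get_open_port`) of the two behaviours being discussed: prefer a port from the env var if it is free, otherwise let the OS pick one.

```python
import socket
from typing import Optional


def get_open_port_sketch(preferred: Optional[int] = None) -> int:
    """Return a usable TCP port, trying `preferred` first if given."""
    if preferred is not None:
        try:
            with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
                s.bind(("", preferred))
            return preferred
        except OSError:
            pass  # preferred port is busy; fall back to an OS-assigned one
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("", 0))  # port 0 tells the kernel to pick any free port
        return s.getsockname()[1]
```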

@@ -0,0 +1,715 @@
"""
Repeat of tests in test_completion.py with the non-mp backend.
Member:
Why do we copy this file instead of adding a few lines to pass a new arg to the server, e.g. via @pytest.mark.parametrize?

Collaborator (Author):
Because we need to modify the default server args, which is a fixture for the whole file, so unfortunately I think this has to be a separate file.
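
Hypothetical sketch of the alternative being weighed here (names such as `default_server_args` and the disable-multiprocessing flag are illustrative, not necessarily vLLM's): parametrizing the module-scoped fixture itself would avoid the copied file, at the cost of re-running every test in the module for each backend.

```python
import pytest


@pytest.fixture(scope="module", params=["mp", "no-mp"])
def default_server_args(request):
    """Server args shared by every test in this module, once per backend."""
    args = ["--model", "meta-llama/Meta-Llama-3-8B-Instruct"]
    if request.param == "no-mp":
        # Hypothetical flag name for disabling the new multiprocessing frontend.
        args.append("--disable-frontend-multiprocessing")
    return args
```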

@youkaichao (Member) left a comment
Thanks for the great work! I just left some nit comments about things I don't understand. If you have time, could you please briefly draw a diagram showing how many processes we use and how they interact with each other?
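
For reference, a rough sketch of the process layout as this PR describes it (details and exact names may differ):

```
uvicorn / OpenAI API server process
        |
        |  zeromq RPC: pickled requests and results over a port from get_open_port
        v
RPC server process (run_rpc_server), which owns the AsyncLLMEngine
        |
        v
existing vLLM worker processes / GPU execution (unchanged by this PR)
```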

@SolitaryThinker mentioned this pull request on Aug 9, 2024
sfc-gh-mkeralapura pushed a commit to sfc-gh-mkeralapura/vllm that referenced this pull request Aug 12, 2024
…oject#6883)

Signed-off-by: Joe Runde <Joseph.Runde@ibm.com>
Co-authored-by: Joe Runde <Joseph.Runde@ibm.com>
Co-authored-by: Joe Runde <joe@joerun.de>
Co-authored-by: Nick Hill <nickhill@us.ibm.com>
Co-authored-by: Simon Mo <simon.mo@hey.com>
kylesayrs pushed a commit to neuralmagic/vllm that referenced this pull request Aug 17, 2024
…oject#6883)

Signed-off-by: Joe Runde <Joseph.Runde@ibm.com>
Co-authored-by: Joe Runde <Joseph.Runde@ibm.com>
Co-authored-by: Joe Runde <joe@joerun.de>
Co-authored-by: Nick Hill <nickhill@us.ibm.com>
Co-authored-by: Simon Mo <simon.mo@hey.com>
@mgoin mentioned this pull request on Sep 19, 2024
Labels: ready