Add vllm serve to wrap vllm.entrypoints.openai.api_server #4167

Closed · wants to merge 5 commits
4 changes: 2 additions & 2 deletions benchmarks/benchmark_serving.py
@@ -2,8 +2,8 @@

 On the server side, run one of the following commands:
     vLLM OpenAI API server
-    python -m vllm.entrypoints.openai.api_server \
-        --model <your_model> --swap-space 16 \
+    vllm serve <your_model> \
+        --swap-space 16 \
         --disable-log-requests

     (TGI backend)
6 changes: 2 additions & 4 deletions docs/source/getting_started/quickstart.rst
@@ -73,15 +73,13 @@ Start the server:

 .. code-block:: console

-    $ python -m vllm.entrypoints.openai.api_server \
-    $ --model facebook/opt-125m
+    $ vllm serve facebook/opt-125m

 By default, the server uses a predefined chat template stored in the tokenizer. You can override this template by using the ``--chat-template`` argument:

 .. code-block:: console

-    $ python -m vllm.entrypoints.openai.api_server \
-    $ --model facebook/opt-125m \
+    $ vllm serve facebook/opt-125m \
     $ --chat-template ./examples/template_chatml.jinja

 This server can be queried in the same format as OpenAI API. For example, list the models:
2 changes: 1 addition & 1 deletion docs/source/models/adding_model.rst
@@ -110,7 +110,7 @@ Just add the following lines in your code:

     from your_code import YourModelForCausalLM
     ModelRegistry.register_model("YourModelForCausalLM", YourModelForCausalLM)

-If you are running api server with `python -m vllm.entrypoints.openai.api_server args`, you can wrap the entrypoint with the following code:
+If you are running api server with `vllm serve args`, you can wrap the entrypoint with the following code:

 .. code-block:: python
3 changes: 1 addition & 2 deletions docs/source/models/lora.rst
@@ -58,8 +58,7 @@ LoRA adapted models can also be served with the Open-AI compatible vLLM server.

 .. code-block:: bash

-    python -m vllm.entrypoints.openai.api_server \
-    --model meta-llama/Llama-2-7b-hf \
+    vllm serve meta-llama/Llama-2-7b-hf \
     --enable-lora \
     --lora-modules sql-lora=~/.cache/huggingface/hub/models--yard1--llama-2-7b-sql-lora-test/
3 changes: 1 addition & 2 deletions docs/source/serving/distributed_serving.rst
@@ -21,8 +21,7 @@ To run multi-GPU serving, pass in the :code:`--tensor-parallel-size` argument wh

 .. code-block:: console

-    $ python -m vllm.entrypoints.api_server \
-    $ --model facebook/opt-13b \
+    $ vllm serve facebook/opt-13b \
     $ --tensor-parallel-size 4

 To scale vLLM beyond a single machine, start a `Ray runtime <https://docs.ray.io/en/latest/ray-core/starting-ray.html>`_ via CLI before running vLLM:
5 changes: 2 additions & 3 deletions docs/source/serving/openai_compatible_server.md
@@ -4,7 +4,7 @@ vLLM provides an HTTP server that implements OpenAI's [Completions](https://plat

 You can start the server using Python, or using [Docker](deploying_with_docker.rst):
 ```bash
-python -m vllm.entrypoints.openai.api_server --model mistralai/Mistral-7B-Instruct-v0.2 --dtype auto --api-key token-abc123
+vllm serve mistralai/Mistral-7B-Instruct-v0.2 --dtype auto --api-key token-abc123
 ```

 To call the server, you can use the official OpenAI Python client library, or any other HTTP client.
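
As an illustration of the client call mentioned above (not part of this diff), a minimal sketch using the official OpenAI Python client against the server started with the command in this hunk; the base URL assumes the default host and port, and the API key matches the `--api-key token-abc123` example:

```python
# Minimal sketch, not part of this PR: querying the vLLM OpenAI-compatible
# server started above. Assumes the default http://localhost:8000 address and
# the example --api-key value.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="token-abc123")

completion = client.chat.completions.create(
    model="mistralai/Mistral-7B-Instruct-v0.2",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(completion.choices[0].message.content)
```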
@@ -95,8 +95,7 @@ template, or the template in string form. Without a chat template, the server wi
 and all chat requests will error.

 ```bash
-python -m vllm.entrypoints.openai.api_server \
-  --model ... \
+vllm serve ... \
   --chat-template ./path-to-chat-template.jinja
 ```

Review comment (Member): Based on #4709, the :prog: value under CLI args (line 111) should be updated to vllm serve.
5 changes: 5 additions & 0 deletions setup.py
@@ -410,4 +410,9 @@ def _read_requirements(filename: str) -> List[str]:
     },
     cmdclass={"build_ext": cmake_build_ext} if not _is_neuron() else {},
     package_data=package_data,
+    entry_points={
+        "console_scripts": [
+            "vllm=vllm.scripts:main",
+        ],
+    },
 )
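
For context (not part of the diff): a `console_scripts` entry point is what turns `vllm.scripts:main` into an executable `vllm` command at install time. The launcher that pip generates is roughly equivalent to the sketch below:

```python
# Rough approximation of the auto-generated launcher for the
# "vllm=vllm.scripts:main" entry point declared above; illustrative only.
import sys

from vllm.scripts import main

if __name__ == "__main__":
    sys.exit(main())
```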
3 changes: 1 addition & 2 deletions tests/entrypoints/test_openai_server.py
@@ -82,7 +82,7 @@ def __init__(self, args):
         env = os.environ.copy()
         env["PYTHONUNBUFFERED"] = "1"
         self.proc = subprocess.Popen(
-            ["python3", "-m", "vllm.entrypoints.openai.api_server"] + args,
+            ["vllm", "serve"] + args,
             env=env,
             stdout=sys.stdout,
             stderr=sys.stderr,
@@ -123,7 +123,6 @@ def zephyr_lora_files():
 def server(zephyr_lora_files):
     ray.init()
     server_runner = ServerRunner.remote([
-        "--model",
         MODEL_NAME,
         # use half precision for speed and memory savings in CI environment
         "--dtype",
59 changes: 36 additions & 23 deletions vllm/entrypoints/openai/api_server.py
@@ -7,7 +7,7 @@

 import fastapi
 import uvicorn
-from fastapi import Request
+from fastapi import APIRouter, Request
 from fastapi.exceptions import RequestValidationError
 from fastapi.middleware.cors import CORSMiddleware
 from fastapi.responses import JSONResponse, Response, StreamingResponse
@@ -26,6 +26,8 @@

 TIMEOUT_KEEP_ALIVE = 5  # seconds

+engine: AsyncLLMEngine = None

Review comment (Collaborator): shouldn't this be optional?

+engine_args: AsyncEngineArgs = None
 openai_serving_chat: OpenAIServingChat = None
 openai_serving_completion: OpenAIServingCompletion = None
 logger = init_logger(__name__)
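
Responding to the reviewer's question above, one possible shape (hypothetical, not what this PR does) would be to mark the globals as `Optional` so the uninitialized state is explicit:

```python
# Hypothetical variant of the annotations above, not part of this PR:
# Optional[...] documents that these globals stay None until run_server()
# fills them in.
from typing import Optional

from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.engine.async_llm_engine import AsyncLLMEngine

engine: Optional[AsyncLLMEngine] = None
engine_args: Optional[AsyncEngineArgs] = None
```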
@@ -45,45 +47,33 @@ async def _force_log():
     yield


-app = fastapi.FastAPI(lifespan=lifespan)
-
-
-def parse_args():
-    parser = make_arg_parser()
-    return parser.parse_args()
-
+router = APIRouter()

 # Add prometheus asgi middleware to route /metrics requests
 metrics_app = make_asgi_app()
-app.mount("/metrics", metrics_app)
-
-
-@app.exception_handler(RequestValidationError)
-async def validation_exception_handler(_, exc):
-    err = openai_serving_chat.create_error_response(message=str(exc))
-    return JSONResponse(err.model_dump(), status_code=HTTPStatus.BAD_REQUEST)
+router.mount("/metrics", metrics_app)


-@app.get("/health")
+@router.get("/health")
 async def health() -> Response:
     """Health check."""
     await openai_serving_chat.engine.check_health()
     return Response(status_code=200)


-@app.get("/v1/models")
+@router.get("/v1/models")
 async def show_available_models():
     models = await openai_serving_chat.show_available_models()
     return JSONResponse(content=models.model_dump())


-@app.get("/version")
+@router.get("/version")
 async def show_version():
     ver = {"version": vllm.__version__}
     return JSONResponse(content=ver)


-@app.post("/v1/chat/completions")
+@router.post("/v1/chat/completions")
 async def create_chat_completion(request: ChatCompletionRequest,
                                  raw_request: Request):
     generator = await openai_serving_chat.create_chat_completion(
@@ -98,7 +88,7 @@ async def create_chat_completion(request: ChatCompletionRequest,
     return JSONResponse(content=generator.model_dump())


-@app.post("/v1/completions")
+@router.post("/v1/completions")
 async def create_completion(request: CompletionRequest, raw_request: Request):
     generator = await openai_serving_completion.create_completion(
         request, raw_request)
@@ -112,8 +102,10 @@ async def create_completion(request: CompletionRequest, raw_request: Request):
     return JSONResponse(content=generator.model_dump())


-if __name__ == "__main__":
-    args = parse_args()
+def build_app(args):
+    app = fastapi.FastAPI(lifespan=lifespan)
+    app.include_router(router)
+    app.root_path = args.root_path

     app.add_middleware(
         CORSMiddleware,
@@ -123,6 +115,12 @@ async def create_completion(request: CompletionRequest, raw_request: Request):
         allow_headers=args.allowed_headers,
     )

+    @app.exception_handler(RequestValidationError)
+    async def validation_exception_handler(_, exc):
+        err = openai_serving_chat.create_error_response(message=str(exc))
+        return JSONResponse(err.model_dump(),
+                            status_code=HTTPStatus.BAD_REQUEST)
+
     if token := os.environ.get("VLLM_API_KEY") or args.api_key:

         @app.middleware("http")
@@ -146,13 +144,21 @@ async def authentication(request: Request, call_next):
            raise ValueError(f"Invalid middleware {middleware}. "
                             f"Must be a function or a class.")

+    return app
+
+
+def run_server(args):
+    app = build_app(args)
+
     logger.info(f"vLLM API server version {vllm.__version__}")
     logger.info(f"args: {args}")

     if args.served_model_name is not None:
         served_model_names = args.served_model_name
     else:
         served_model_names = [args.model]

+    global engine_args, engine, openai_serving_chat, openai_serving_completion
     engine_args = AsyncEngineArgs.from_cli_args(args)
     engine = AsyncLLMEngine.from_engine_args(
         engine_args, usage_context=UsageContext.OPENAI_API_SERVER)
@@ -163,7 +169,6 @@ async def authentication(request: Request, call_next):
     openai_serving_completion = OpenAIServingCompletion(
         engine, served_model_names, args.lora_modules)

-    app.root_path = args.root_path
     uvicorn.run(app,
                 host=args.host,
                 port=args.port,
@@ -173,3 +178,11 @@ async def authentication(request: Request, call_next):
                 ssl_certfile=args.ssl_certfile,
                 ssl_ca_certs=args.ssl_ca_certs,
                 ssl_cert_reqs=args.ssl_cert_reqs)
+
+
+if __name__ == "__main__":
+    # NOTE(simon):
+    # This section should be in sync with vllm/scripts.py for CLI entrypoints.

Review comment (Collaborator): any way to add a simple regression test for this?

+    parser = make_arg_parser()
+    args = parser.parse_args()
+    run_server(args)
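
A side effect of the `build_app`/`run_server` split above is that the FastAPI app can be constructed without launching uvicorn. A rough sketch (not part of this PR) of how that might be exercised, for example in a test; note that only routing and middleware wiring is covered, since the engine globals are populated by `run_server`:

```python
# Rough sketch, not part of this PR: build the app without starting uvicorn.
# The engine-backed endpoints still need run_server() to set up the globals.
from vllm.entrypoints.openai.api_server import build_app
from vllm.entrypoints.openai.cli_args import make_arg_parser

parser = make_arg_parser()
args = parser.parse_args(["--model", "facebook/opt-125m"])
app = build_app(args)
print([route.path for route in app.routes])  # e.g. /health, /v1/models, ...
```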
7 changes: 4 additions & 3 deletions vllm/entrypoints/openai/cli_args.py
@@ -22,9 +22,10 @@ def __call__(self, parser, namespace, values, option_string=None):
         setattr(namespace, self.dest, lora_list)


-def make_arg_parser():
-    parser = argparse.ArgumentParser(
-        description="vLLM OpenAI-Compatible RESTful API server.")
+def make_arg_parser(parser=None):
+    if parser is None:
+        parser = argparse.ArgumentParser(
+            description="vLLM OpenAI-Compatible RESTful API server.")
     parser.add_argument("--host", type=str, default=None, help="host name")
     parser.add_argument("--port", type=int, default=8000, help="port number")
     parser.add_argument(
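
The new optional `parser` parameter is what lets the `vllm` CLI below attach these same flags to its `serve` subcommand instead of owning a standalone parser. A small sketch of both call styles (illustrative only):

```python
# Illustrative only: the two ways make_arg_parser() can now be called.
import argparse

from vllm.entrypoints.openai.cli_args import make_arg_parser

# 1) Standalone, as before: build a fresh top-level parser.
standalone_parser = make_arg_parser()

# 2) Attach the same options to a subparser owned by a larger CLI,
#    which is how `vllm serve` picks them up (see vllm/scripts.py below).
root = argparse.ArgumentParser(prog="vllm")
subcommands = root.add_subparsers()
serve_parser = subcommands.add_parser("serve")
make_arg_parser(serve_parser)
```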
29 changes: 29 additions & 0 deletions vllm/scripts.py
@@ -0,0 +1,29 @@
+# The CLI entrypoint to vLLM.
+import argparse
+
+from vllm.entrypoints.openai.api_server import run_server
+from vllm.entrypoints.openai.cli_args import make_arg_parser
+
+
+def main():
+    parser = argparse.ArgumentParser(description="vLLM CLI")
+    subparsers = parser.add_subparsers()
+
+    serve_parser = subparsers.add_parser(
+        "serve",
+        help="Start the vLLM OpenAI Compatible API server",
+        usage="vllm serve <model_tag> [options]")
+    make_arg_parser(serve_parser)
+    # Override the `--model` optional argument, make it positional.
+    serve_parser.add_argument("model", type=str, help="The model tag to serve")

Review comment (Collaborator): what's happening if vllm serve --model ?

+    serve_parser.set_defaults(func=run_server)
+
+    args = parser.parse_args()
+    if hasattr(args, "func"):

Review comment (Collaborator): this part of code is confusing. Add a comment to explain what this does?

+        args.func(args)
+    else:
+        parser.print_help()
+
+
+if __name__ == "__main__":
+    main()
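
To address the reviewer's point about the dispatch being confusing, here is a self-contained illustration (not vLLM code) of the same argparse pattern: `set_defaults(func=...)` attaches a handler to each subcommand, and `hasattr(args, "func")` distinguishes an invocation with a subcommand from a bare one:

```python
# Self-contained illustration of the dispatch pattern used in vllm/scripts.py;
# the "hello" subcommand is made up for demonstration purposes.
import argparse


def hello(args: argparse.Namespace) -> None:
    print(f"hello, {args.name}")


def main() -> None:
    parser = argparse.ArgumentParser(prog="demo")
    subparsers = parser.add_subparsers()

    hello_parser = subparsers.add_parser("hello")
    hello_parser.add_argument("name")
    hello_parser.set_defaults(func=hello)  # handler for this subcommand

    args = parser.parse_args()
    if hasattr(args, "func"):  # a subcommand was chosen -> run its handler
        args.func(args)
    else:  # bare "demo" with no subcommand -> show usage instead of crashing
        parser.print_help()


if __name__ == "__main__":
    main()
```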