
Multi-process RESTful API #328

Merged: 3 commits, Nov 28, 2023

Conversation

@mrwyattii (Contributor) commented Nov 27, 2023

Resolves #325, #324, #323, #314

Our RESTful API currently processes only a single request at a time. For example, consider running the following script against our current main:

import mii
import json
import subprocess
import time

# Stand up a MII deployment
model = "meta-llama/Llama-2-7b-hf"
client = mii.serve(
    model,
    deployment_name="test-dep",
    tensor_parallel=1,
    enable_restful_api=True,
    restful_api_port=8000,
)

# Define some queries
queries = [
    "Hello world!",
    "My name is",
    "DeepSpeed is",
    "Seattle is",
    "One day",
    "I like to",
    "My favorite food is",
    "The world is",
]

# Run with Python API
gen_tokens = 0
start_time = time.time()
outputs = client.generate(queries, ignore_eos=True, max_length=128)
end_time = time.time()
python_time = end_time - start_time
for output in outputs:
    gen_tokens += output.generated_length


# Run with RESTful API
procs = []
start_time = time.time()
for i in range(len(queries)):
    p = subprocess.Popen(
        [
            "curl",
            "-X",
            "POST",
            "-H",
            "Content-Type: application/json",
            "-d",
            f'{{"prompts": "{queries[i]}", "ignore_eos": true, "max_length": 128}}',
            "http://localhost:8000/mii/test-dep",
        ],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
    )
    procs.append(p)

# Check the outputs, verify we have generated text
rest_gen_tokens = 0
for p in procs:
    output, error = p.communicate()
    output = json.loads(output.decode("utf-8"))
    assert "generated_text" in output[0], "No generated text"
    rest_gen_tokens += output[0]["generated_length"]
end_time = time.time()
rest_time = end_time - start_time

assert rest_gen_tokens == gen_tokens, "RESTful API generated different number of tokens"

# Print results
print("Python API Results:")
print(f"\tTotal Time: {python_time:0.2f} seconds")
print(f"\tTotal Generated Tokens: {gen_tokens}")
print(f"\tTokens per second: {gen_tokens/python_time:0.2f}")
print("RESTful API Results:")
print(f"\tTotal Time: {rest_time:0.2f} seconds")
print(f"\tTotal Generated Tokens: {rest_gen_tokens}")
print(f"\tTokens per second: {rest_gen_tokens/rest_time:0.2f}")

client.terminate_server()

We see the following output:

Python API Results:
        Total Time: 3.14 seconds
        Total Generated Tokens: 993
        Tokens per second: 316.61
RESTful API Results:
        Total Time: 21.45 seconds
        Total Generated Tokens: 993
        Tokens per second: 46.29

With this PR, the RESTful API performance matches the Python API:

Python API Results:
        Total Time: 3.13 seconds
        Total Generated Tokens: 993
        Tokens per second: 316.92
RESTful API Results:
        Total Time: 3.19 seconds
        Total Generated Tokens: 993
        Tokens per second: 311.39

We use a default of 32 processes for serving the RESTful API. This can be changed with mii.serve(..., restful_processes=8). In our benchmarks, using more than 32 processes did not improve performance.
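The multi-process dispatch idea can be sketched with the standard library alone. This is not the PR's actual implementation: handle_request, the prompt list, and the worker count below are placeholders standing in for the real request-forwarding logic. A pool of worker processes handles incoming requests concurrently, so one in-flight request no longer serializes the rest of the queue:

```python
from concurrent.futures import ProcessPoolExecutor

def handle_request(prompt: str) -> str:
    # Stand-in for forwarding one prompt to the inference server;
    # here we just echo to illustrate the dispatch pattern.
    return f"generated: {prompt}"

if __name__ == "__main__":
    prompts = ["Hello world!", "My name is", "DeepSpeed is"]
    # Requests are mapped across a pool of worker processes instead of
    # being handled one at a time by a single process.
    with ProcessPoolExecutor(max_workers=4) as pool:
        results = list(pool.map(handle_request, prompts))
    assert len(results) == len(prompts)
```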

This PR also fixes a bug with older versions of Flask, where the object returned by the RESTful API could not be parsed into a Python dict with json.loads.
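The client-side expectation behind that fix can be illustrated with the standard json module alone. The payload below is a made-up example, not MII's actual response schema: whatever the server returns, the body must round-trip cleanly through json.loads on the client.

```python
import json

# Hypothetical response payload; MII's real response schema may differ.
payload = [{"generated_text": "Hello world! I am", "generated_length": 128}]

# Serializing explicitly guarantees the client can parse the body back
# into Python objects with json.loads, regardless of how a given web
# framework version would have stringified the object on its own.
body = json.dumps(payload)
parsed = json.loads(body)
assert parsed[0]["generated_length"] == 128
```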

@mrwyattii mrwyattii changed the title Multi-threaded RESTful API Multi-process RESTful API Nov 27, 2023
@mrwyattii mrwyattii marked this pull request as ready for review November 27, 2023 23:27

Successfully merging this pull request may close these issues.

Low throughput (0.61 reqs/sec) when served with RESTful API