
can not test with restful_api #308

Open
irasin opened this issue Nov 15, 2023 · 14 comments

@irasin

irasin commented Nov 15, 2023

Great job!

But I have some problems with the restful_api test, hope to get some help here.

Tested with commit: ddbc6fc
GPU: NVIDIA A10

launch service

import mii

model_name_or_path = "/dataset/huggyllama/llama-7b"
max_model_length = 2048

mii.serve(
    model_name_or_path=model_name_or_path,
    max_length=max_model_length,
    deployment_name="mii_test",
    tensor_parallel=1,
    replica_num=1,
    enable_restful_api=True,
    restful_api_port=8000,
    )

test with curl

curl --header "Content-Type: application/json" --request POST  -d '{"prompts": "[DeepSpeed is]", "max_length": 128}' http://127.0.0.1:8000/mii/mii_test

And I got this error:

[2023-11-15 10:54:46,067] ERROR in app: Exception on /mii/mii_test [POST]
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/flask/app.py", line 867, in full_dispatch_request
    rv = self.dispatch_request()
  File "/usr/local/lib/python3.10/dist-packages/flask/app.py", line 852, in dispatch_request
    return self.ensure_sync(self.view_functions[rule.endpoint])(**view_args)
  File "/usr/local/lib/python3.10/dist-packages/flask_restful/__init__.py", line 489, in wrapper
    resp = resource(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/flask/views.py", line 109, in view
    return current_app.ensure_sync(self.dispatch_request)(**kwargs)
  File "/usr/local/lib/python3.10/dist-packages/flask_restful/__init__.py", line 604, in dispatch_request
    resp = meth(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/mii/grpc_related/restful_gateway.py", line 31, in post
KeyError: 'request'
127.0.0.1 - - [15/Nov/2023 10:54:46] "POST /mii/mii_test HTTP/1.1" 500 -

And the Python script gives the same result:

import json
import requests
url = f"http://localhost:8000/mii/mii_test"
params = {"prompts": ["DeepSpeed is", "Seattle is"], "max_length": 128}
json_params = json.dumps(params)
output = requests.post(
    url, data=json_params, headers={"Content-Type": "application/json"}
)

print(output)

Just wondering if I'm missing any hyperparameter settings?

@ChristineSeven

Same issue here, I don't know where it's going wrong.

@mrwyattii
Contributor

Hi @irasin, it looks like you are using an older version of MII. Your error message for line 31 of mii/grpc_related/restful_gateway.py indicates it is trying to get the request key from the dictionary, but this was changed in ddbc6fc.

Can you please update to the latest source build of DeepSpeed and DeepSpeed-MII?

pip uninstall deepspeed deepspeed-mii -y
pip install git+https://github.com/microsoft/deepspeed.git
pip install git+https://github.com/microsoft/deepspeed-mii.git
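
(To double-check that the new source builds are the ones actually being imported, a quick sanity check, assuming both packages expose __version__:)

import deepspeed
import mii

# Both should report the freshly installed source builds.
print("deepspeed:", deepspeed.__version__)
print("deepspeed-mii:", mii.__version__)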

@ChristineSeven

ChristineSeven commented Nov 17, 2023

@mrwyattii yes, that solved it! But when I make requests, another issue came up. Would you help check this?
Exception in thread Thread-1:
Traceback (most recent call last):
  File "/usr/lib/python3.8/threading.py", line 932, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.8/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/lib/python3.8/dist-packages/mii/batching/ragged_batching.py", line 812, in call
    self.generate()
  File "/usr/local/lib/python3.8/dist-packages/mii/batching/utils.py", line 31, in wrapper
    return func(self, *args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/mii/batching/ragged_batching.py", line 379, in generate
    next_token_logits = self.put(
  File "/usr/local/lib/python3.8/dist-packages/mii/batching/ragged_batching.py", line 717, in put
    return self.inference_engine.put(uids, tokenized_input)
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/inference/v2/engine_v2.py", line 127, in put
    self.model.maybe_allocate_kv(host_seq_desc, tokens.numel())
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/inference/v2/model_implementations/inference_transformer_base.py", line 357, in maybe_allocate_kv
    sequence.extend_kv_cache(new_blocks)
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/inference/v2/ragged/sequence_descriptor.py", line 259, in extend_kv_cache
    shadow_alloc_group[cur_blocks:cur_blocks + new_blocks].copy_(new_group_ids)
RuntimeError: The size of tensor a (8) must match the size of tensor b (12) at non-singleton dimension 0

@irasin
Author

irasin commented Nov 20, 2023

> Hi @irasin, it looks like you are using an older version of MII. Your error message for line 31 of mii/grpc_related/restful_gateway.py indicates it is trying to get the request key from the dictionary, but this was changed in ddbc6fc. Can you please update to the latest source build of DeepSpeed and DeepSpeed-MII? [...]

Hi, @mrwyattii, many thanks for your reply.

After using the latest source builds of DeepSpeed and DeepSpeed-MII, the RESTful API works now.
But perhaps because the result contains some escape characters, it cannot be parsed into JSON format. Here is an example:

  • launch service
import mii

model_name_or_path = "/dataset/huggyllama/llama-7b"
max_model_length = 2048


mii.serve(
    model_name_or_path=model_name_or_path,
    max_length=max_model_length,
    deployment_name="mii_test",
    tensor_parallel=1,
    replica_num=1,
    enable_restful_api=True,
    restful_api_port=8000,
    )
  • test with python
import json
import requests
url = f"http://localhost:8000/mii/mii_test"
params = {"prompts": ["DeepSpeed is", "Seattle is a place"], "max_length": 128}
json_params = json.dumps(params)
output = requests.post(
    url, data=json_params, headers={"Content-Type": "application/json"}
)

text = output.text
print(text)
json_res = json.loads(text) 
assert isinstance(json_res, str) ## it's still a string because some escape characters?
print(json_res)

The result is as below; json_res is still a string.

"{\n  \"response\": [\n    \"the solution for low speed, high-current IGBT switching applications that involve controlling high power from a series of IGBT modules, such as output inverters for PV, wind, motor drives, UPS, or Xenon lighting applications.\\nThe platform provides an open and modular solution for achieving fast switching times, meeting the rapid rise in demand for higher power modules. This is enabled through the modular design of the DeepSpeed core, which offers high-speed operation, reducing the number of components and improving size and cost.\\nDeepSpeed is fully compli\",\n    \"I've had the pleasure of knowing, through the virtual ether, for over 15 years. I have also been fortunate enough to visit Seattle on several occasions over the years as well as being able to collaborate and visit artists' studios in the Northwest. When opportunity knocked and the folks at Art Informel extended an invitation to show at their space, I felt the stars were aligned, that this was meant to be. I hope you'll join me in Seattle for the opening this Saturday, December 10th, from 5-9PM, at\"\n  ]\n}"

{
  "response": [
    "the solution for low speed, high-current IGBT switching applications that involve controlling high power from a series of IGBT modules, such as output inverters for PV, wind, motor drives, UPS, or Xenon lighting applications.\nThe platform provides an open and modular solution for achieving fast switching times, meeting the rapid rise in demand for higher power modules. This is enabled through the modular design of the DeepSpeed core, which offers high-speed operation, reducing the number of components and improving size and cost.\nDeepSpeed is fully compli",
    "I've had the pleasure of knowing, through the virtual ether, for over 15 years. I have also been fortunate enough to visit Seattle on several occasions over the years as well as being able to collaborate and visit artists' studios in the Northwest. When opportunity knocked and the folks at Art Informel extended an invitation to show at their space, I felt the stars were aligned, that this was meant to be. I hope you'll join me in Seattle for the opening this Saturday, December 10th, from 5-9PM, at"
  ]
}
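
A minimal client-side workaround (a sketch, assuming the body really is a JSON-encoded string that itself contains JSON) is to decode twice:

import json
import requests

url = "http://localhost:8000/mii/mii_test"
params = {"prompts": ["DeepSpeed is", "Seattle is a place"], "max_length": 128}
output = requests.post(url, json=params)

# First decode yields a str (the escaped payload), second decode yields the dict.
json_res = json.loads(output.text)
if isinstance(json_res, str):
    json_res = json.loads(json_res)
print(json_res["response"])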

Hoping to get an answer again.

@mrwyattii
Contributor

> @mrwyattii yes, that solved it! But when I make requests, another issue came up. Would you help check this? [...] RuntimeError: The size of tensor a (8) must match the size of tensor b (12) at non-singleton dimension 0

@ChristineSeven can you share the full script that you are using to deploy MII? Specifically, I would like to know what model, tensor parallel settings, etc.

@mrwyattii
Contributor

@irasin can you please try the following instead?

import json
import requests
url = f"http://localhost:8000/mii/mii_test"
params = {"prompts": ["DeepSpeed is", "Seattle is a place"], "max_length": 128}
json_params = json.dumps(params)
output = requests.post(
    url, data=json_params, headers={"Content-Type": "application/json"}
)

print(output.json())

@irasin
Author

irasin commented Nov 22, 2023

> @irasin can you please try the following instead? [...] print(output.json())

Hi, @mrwyattii , the results are the same.

@cableyang

I also face the same issue, internal error. Please help us.

@mrwyattii
Contributor

> @irasin can you please try the following instead? [...] print(output.json())
>
> Hi, @mrwyattii , the results are the same.

@irasin Is your Flask version <3.0.0? If so, I think I have the solution in #328. Can you try with that PR? You can install it with pip install git+https://github.com/Microsoft/DeepSpeed-MII@mrwyattii/threaded-rest-api

@mrwyattii
Contributor

> I also face the same issue, internal error. Please help us.

@cableyang can you please share the full script that you are running so that I can try to reproduce the error? Thanks

@mrwyattii mrwyattii self-assigned this Nov 27, 2023
@irasin
Author

irasin commented Nov 28, 2023

> @irasin Is your Flask version <3.0.0? If so, I think I have the solution in #328. Can you try with that PR? You can install it with pip install git+https://github.com/Microsoft/DeepSpeed-MII@mrwyattii/threaded-rest-api

With the latest DeepSpeed-MII commit, I can get JSON-formatted output now. Thanks a lot, @mrwyattii

BTW, I wonder where I can get the benchmark scripts you used in the Performance Evaluation of https://github.com/microsoft/DeepSpeed/blob/master/blogs/deepspeed-fastgen/README.md.
It seems that DeepSpeed-MII has much higher throughput and lower latency than vLLM, which is amazing. If possible, I would like to test some other models in my local env.

I tested with the benchmark_server.py script in the vLLM repo, which sends 1000 requests to the server at the same time, and I keep getting SYN flood error messages in the dmesg output like

[1021332.329430] TCP: request_sock_TCP: Possible SYN flooding on port 8000. Dropping request.  Check SNMP counters.

I'm curious whether there is any limit on the maximum number of connections on the server side or in the restful_api.
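
(As a client-side mitigation sketch, assuming the flooding simply comes from opening all 1000 connections at once, the benchmark could cap in-flight requests with a semaphore:)

import asyncio
import aiohttp

# Hypothetical throttled sender: at most max_concurrency requests are in flight,
# so the server's TCP listen backlog is not overwhelmed.
async def send_all(url: str, payloads: list, max_concurrency: int = 32):
    sem = asyncio.Semaphore(max_concurrency)

    async def send_one(session, payload):
        async with sem:  # wait for a free slot before opening the connection
            async with session.post(url, json=payload) as resp:
                return await resp.text()

    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(*(send_one(session, p) for p in payloads))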

@ChristineSeven

ChristineSeven commented Nov 28, 2023

> @ChristineSeven can you share the full script that you are using to deploy MII? Specifically, I would like to know what model, tensor parallel settings, etc.

Sorry for the late reply.

import argparse
import asyncio
import json
import random
import time
from typing import AsyncGenerator, List, Tuple, Union

import aiohttp
import numpy as np
import codecs
from time import sleep

global token_num
token_num=0


def sample_requests() -> List[Tuple[str, dict]]:
    # Load the dataset.
    content_list = []
    num_all=0
    with open("457.json","r",encoding='utf-8') as f:
        lines = f.readlines()
        print(len(lines))
        for line in lines:
            if line:
                data = json.loads(line)
                content_list.append(data)
    print(num_all)
    print(len(content_list))
    print(content_list[0])
    print("read data set finish")
    prompts = [content['question'] for content in content_list]
    
    tokenized_dataset = []
    for i in range(len(content_list)):
        tokenized_dataset.append((prompts[i], content_list[i]))

    return tokenized_dataset



async def send_request(
    prompt: str,
    origin_json: dict
) -> None:
    global token_num
    request_start_time = time.time()
    headers = {"Content-Type": "application/json", "User-Agent": "Benchmark Client"}
    url = "http://10.10.10.10:28093/mii/mistral-deployment" 
    output_list = []
    params = {"prompts": [prompt], "max_length": 4096}
    json_params = json.dumps(params)

    timeout = aiohttp.ClientTimeout(total=3 * 3600)
    async with aiohttp.ClientSession(timeout=timeout) as session:
        while True:
            async with session.post(url, headers=headers, data=json_params) as response:
                chunks = []
                async for chunk, _ in response.content.iter_chunks():
                    chunks.append(chunk)
            output = b"".join(chunks).decode("utf-8")
            print(output)
            try:
                # the response body is a JSON string, parse it directly
                result = json.loads(output)
                origin_json['model_answer'] = result['response'][0]
            except (json.JSONDecodeError, KeyError, IndexError):
                origin_json['model_answer'] = ''
            token_num+=1
            print(token_num)
            if "error" not in output:
                break
    return origin_json
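

# NOTE: get_request is not defined in the snippet as posted; a minimal assumed
# implementation that simply yields each request is sketched here so the script runs.
async def get_request(input_requests):
    for request in input_requests:
        yield request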


async def batchmark(
    input_requests: List[Tuple[str, dict]],
) -> None:
    tasks: List[asyncio.Task] = []
    async for request in get_request(input_requests):
        prompt, origin_json = request
        task = asyncio.create_task(send_request(prompt,
                                                origin_json))
        tasks.append(task)
    results=await asyncio.gather(*tasks)
    return results


def main(args: argparse.Namespace):
    print(args)
    random.seed(args.seed)
    np.random.seed(args.seed)
    input_requests = sample_requests()

    batch_start_time = time.time()
    for i in range(0, len(input_requests), 50):
        total_results=asyncio.run(batchmark(input_requests[i:i+50]))
        with open('457_deepspeed_out.json', 'a+', encoding='utf-8') as f1:
            for origin_json in total_results:            
                json_data = json.dumps(origin_json, ensure_ascii=False)
                f1.write(json_data + "\n")
                f1.flush()

    batch_end_time = time.time()
    print(batch_end_time-batch_start_time)

if __name__ == "__main__":
    parser = argparse.ArgumentParser(
        description="Batchmark the online serving throughput.")
    parser.add_argument("--seed", type=int, default=0)
    args = parser.parse_args()
    main(args)

@ChristineSeven

ChristineSeven commented Nov 28, 2023

the server code is like this:

client = mii.serve(
    "mistralai/Mistral-7B-v0.1",
    deployment_name="mistral-deployment",
    enable_restful_api=True,
    restful_api_port=28080,
)

@mrwyattii
Contributor

mrwyattii commented Nov 28, 2023

> With the latest DeepSpeed-MII commit, I can get JSON-formatted output now. Thanks a lot, @mrwyattii
>
> BTW, I wonder where I can get the benchmark scripts you used in the Performance Evaluation of https://github.com/microsoft/DeepSpeed/blob/master/blogs/deepspeed-fastgen/README.md. [...] I'm curious whether there is any limit on the maximum number of connections on the server side or in the restful_api.

@irasin

The benchmarks we ran to collect data for our FastGen blog post can be found here: https://github.com/microsoft/DeepSpeedExamples/tree/master/benchmarks/inference/mii

Note that we did not use the RESTful API in our benchmarks and instead used the Python API (i.e., mii.client). I imagine that sending 1000 requests at once is overloading the Flask server we stand up for the RESTful API. I will investigate how we might better handle a large number of requests like this.
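
For reference, a minimal sketch of hitting the deployment through the Python API instead (assuming the "mii_test" deployment from earlier in this thread and the documented mii.client / generate interface):

import mii

# Connect to an already-running deployment by name (no REST gateway involved).
client = mii.client("mii_test")

# Generate completions directly over gRPC.
responses = client.generate(["DeepSpeed is", "Seattle is a place"], max_new_tokens=128)
for r in responses:
    print(r.generated_text)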
