[Bug]: Successfully deployed embedding model 'gte-Qwen2-7B-instruct', but got "TypeError: 'async for' requires an object with __aiter__ method, got coroutine" when calling it #7389
Comments
Same error.
I've got the same issue with the dunzhang/stella_en_1.5B_v5 model, which is based on Qwen2. I use Poetry with Python 3.10 inside this Docker image:

My CUDA setup (commands run from within the Docker image):

My pyproject.toml file:

I ran it using this script:

Running this works fine and gives me this output:

Taking a look at the warnings and trying to run it with

I've tried running different Qwen2-architecture models and get the same result when calling the
Maybe this bug is caused by line 128:

```python
generator = self.async_engine_client.encode(
    {"prompt_token_ids": prompt_inputs["prompt_token_ids"]},
    pooling_params,
    request_id_item,
    lora_request=lora_request,
)
```

I added an `await`:

```python
generator = await self.async_engine_client.encode(
    {"prompt_token_ids": prompt_inputs["prompt_token_ids"]},
    pooling_params,
    request_id_item,
    lora_request=lora_request,
)
```
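The error and the fix can be reproduced outside vLLM. Below is a minimal sketch (all names hypothetical) where `encode()` is a coroutine that returns an async generator, matching the call site above:

```python
import asyncio

async def encode(prompt):
    # Hypothetical stand-in: a coroutine that returns an async generator,
    # like the engine client's encode() at the call site above.
    async def results():
        yield f"embedding({prompt})"
    return results()

async def broken():
    # 'async for' over the coroutine object itself -> TypeError
    async for _ in encode("hi"):
        pass

async def fixed():
    out = []
    async for item in await encode("hi"):  # await first, then iterate
        out.append(item)
    return out

try:
    asyncio.run(broken())
except TypeError as e:
    print(e)  # 'async for' requires an object with __aiter__ method, got coroutine
print(asyncio.run(fixed()))  # ['embedding(hi)']
```

This is why adding `await` before `self.async_engine_client.encode(...)` makes the `async for` downstream work.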
However, I met another problem after solving this bug: I encountered `NotImplementedError("Embeddings not supported with multiprocessing backend")` in the `encode` function of class `AsyncEngineRPCClient`, in file `vllm/entrypoints/openai/rpc/client.py`.
```python
async def encode(self, *args,
                 **kwargs) -> AsyncIterator[EmbeddingRequestOutput]:
    raise NotImplementedError(
        "Embeddings not supported with multiprocessing backend")
```

Therefore, I looked at some feature-request issues for embedding support, e.g., #5600, #5950, and #6947, and the conclusion was that the embedding feature has been supported for no model except
I have an urgent need to use the embedding feature, which is also a crucial function as mentioned in #5950. If the vLLM team could add this feature, I would be extremely grateful. Additionally, my use case does not have the capacity to run a model as large as e5-mistral-7b-instruct, which is 7B in size.
Same error running 'TheBloke/Mistral-7B-Instruct-v0.2-AWQ' with the OpenAI-compatible server Docker image:

```yaml
services:
  vllm-service:
    container_name: vllm_mistral
    image: vllm/vllm-openai:latest
    ports:
      - "8000:8000"
    deploy: # Enable GPU resources
      resources:
        reservations:
          devices:
            - capabilities: ["gpu"]
    volumes:
      - vllm-volume:/root/.cache/huggingface
    command:
      --model TheBloke/openinstruct-mistral-7B-AWQ
      --quantization awq
      --max-model-len 2048
```
Qwen2 is not supported for embeddings at the moment. We need to improve the error message here.
Your current environment
The output of `python collect_env.py`
🐛 Describe the bug
Deployed the embedding model 'gte-Qwen2-7B-instruct' successfully via the command:

```shell
python -m vllm.entrypoints.openai.api_server --served-model-name gte-Qwen2-7B-instruct --model /data1/iic/gte_Qwen2-7B-instruct --port 9990 --gpu-memory-utilization 0.3
```

It ran well and produced the following logs:

```
INFO 08-10 16:06:52 config.py:820] Chunked prefill is enabled with max_num_batched_tokens=512.
INFO 08-10 16:06:52 llm_engine.py:174] Initializing an LLM engine (v0.5.4) with config: model='/data1/iic/gte_Qwen2-7B-instruct', speculative_config=None, tokenizer='/data1/iic/gte_Qwen2-7B-instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=131072, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None), seed=0, served_model_name=gte-Qwen2-7B-instruct, use_v2_block_manager=False, enable_prefix_caching=False)
[rank0]:[W810 16:07:03.460779226 ProcessGroupGloo.cpp:712] Warning: Unable to resolve hostname to a (local) address. Using the loopback address as fallback. Manually set the network interface to bind to with GLOO_SOCKET_IFNAME. (function operator())
INFO 08-10 16:07:03 model_runner.py:720] Starting to load model /data/iic/gte_Qwen2-7B-instruct...
Loading safetensors checkpoint shards: 0% Completed | 0/7 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 14% Completed | 1/7 [00:00<00:04, 1.44it/s]
Loading safetensors checkpoint shards: 29% Completed | 2/7 [00:01<00:03, 1.33it/s]
Loading safetensors checkpoint shards: 43% Completed | 3/7 [00:02<00:03, 1.26it/s]
Loading safetensors checkpoint shards: 57% Completed | 4/7 [00:03<00:02, 1.27it/s]
Loading safetensors checkpoint shards: 71% Completed | 5/7 [00:03<00:01, 1.54it/s]
Loading safetensors checkpoint shards: 86% Completed | 6/7 [00:04<00:00, 1.39it/s]
Loading safetensors checkpoint shards: 100% Completed | 7/7 [00:05<00:00, 1.25it/s]
Loading safetensors checkpoint shards: 100% Completed | 7/7 [00:05<00:00, 1.31it/s]
INFO 08-10 16:07:09 model_runner.py:732] Loading model weights took 14.2655 GB
INFO 08-10 16:07:09 gpu_executor.py:102] # GPU blocks: 9476, # CPU blocks: 4681
INFO 08-10 16:07:12 model_runner.py:1024] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 08-10 16:07:12 model_runner.py:1028] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing gpu_memory_utilization or enforcing eager mode. You can also reduce the max_num_seqs as needed to decrease memory usage.
INFO 08-10 16:07:26 model_runner.py:1225] Graph capturing finished in 14 secs.
WARNING 08-10 16:07:27 serving_embedding.py:171] embedding_mode is False. Embedding API will not work.
INFO 08-10 16:07:27 launcher.py:14] Available routes are:
INFO 08-10 16:07:27 launcher.py:22] Route: /openapi.json, Methods: HEAD, GET
INFO 08-10 16:07:27 launcher.py:22] Route: /docs, Methods: HEAD, GET
INFO 08-10 16:07:27 launcher.py:22] Route: /docs/oauth2-redirect, Methods: HEAD, GET
INFO 08-10 16:07:27 launcher.py:22] Route: /redoc, Methods: HEAD, GET
INFO 08-10 16:07:27 launcher.py:22] Route: /health, Methods: GET
INFO 08-10 16:07:27 launcher.py:22] Route: /tokenize, Methods: POST
INFO 08-10 16:07:27 launcher.py:22] Route: /detokenize, Methods: POST
INFO 08-10 16:07:27 launcher.py:22] Route: /v1/models, Methods: GET
INFO 08-10 16:07:27 launcher.py:22] Route: /version, Methods: GET
INFO 08-10 16:07:27 launcher.py:22] Route: /v1/chat/completions, Methods: POST
INFO 08-10 16:07:27 launcher.py:22] Route: /v1/completions, Methods: POST
INFO 08-10 16:07:27 launcher.py:22] Route: /v1/embeddings, Methods: POST
INFO:     Started server process [319217]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:9990 (Press CTRL+C to quit)
```
Then I called it with the following request body:

```json
{ "input": "Your text string goes here", "model": "gte_Qwen2-7B-instruct" }
```
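For reference, the request can be reproduced from the command line. Note that the server was started with `--served-model-name gte-Qwen2-7B-instruct` (hyphen) while the request body above uses `gte_Qwen2-7B-instruct` (underscore), so the model name may be worth double-checking as well. A sketch, assuming the host and port from the logs:

```shell
# Model name here matches the --served-model-name from the deploy command,
# not the underscore variant used in the request body above.
curl -s http://localhost:9990/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{"input": "Your text string goes here", "model": "gte-Qwen2-7B-instruct"}'
```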
An error occurred; the error info was:

```
INFO:     10.136.102.114:62632 - "POST /v1/embeddings HTTP/1.1" 500 Internal Server Error
ERROR:    Exception in ASGI application
Traceback (most recent call last):
  File "/home/yangjie/.conda/qwen4jay/lib/python3.10/site-packages/uvicorn/protocols/http/httptools_impl.py", line 419, in run_asgi
    result = await app(  # type: ignore[func-returns-value]
  File "/home/yangjie/.conda/qwen4jay/lib/python3.10/site-packages/uvicorn/middleware/proxy_headers.py", line 84, in __call__
    return await self.app(scope, receive, send)
  File "/home/yangjie/.conda/qwen4jay/lib/python3.10/site-packages/fastapi/applications.py", line 1054, in __call__
    await super().__call__(scope, receive, send)
  File "/home/yangjie/.conda/qwen4jay/lib/python3.10/site-packages/starlette/applications.py", line 123, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/home/yangjie/.conda/qwen4jay/lib/python3.10/site-packages/starlette/middleware/errors.py", line 186, in __call__
    raise exc
  File "/home/yangjie/.conda/qwen4jay/lib/python3.10/site-packages/starlette/middleware/errors.py", line 164, in __call__
    await self.app(scope, receive, _send)
  File "/home/yangjie/.conda/qwen4jay/lib/python3.10/site-packages/starlette/middleware/cors.py", line 85, in __call__
    await self.app(scope, receive, send)
  File "/home/yangjie/.conda/qwen4jay/lib/python3.10/site-packages/starlette/middleware/exceptions.py", line 65, in __call__
    await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
  File "/home/yangjie/.conda/qwen4jay/lib/python3.10/site-packages/starlette/_exception_handler.py", line 64, in wrapped_app
    raise exc
  File "/home/yangjie/.conda/qwen4jay/lib/python3.10/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    await app(scope, receive, sender)
  File "/home/yangjie/.conda/qwen4jay/lib/python3.10/site-packages/starlette/routing.py", line 756, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/home/yangjie/.conda/qwen4jay/lib/python3.10/site-packages/starlette/routing.py", line 776, in app
    await route.handle(scope, receive, send)
  File "/home/yangjie/.conda/qwen4jay/lib/python3.10/site-packages/starlette/routing.py", line 297, in handle
    await self.app(scope, receive, send)
  File "/home/yangjie/.conda/qwen4jay/lib/python3.10/site-packages/starlette/routing.py", line 77, in app
    await wrap_app_handling_exceptions(app, request)(scope, receive, send)
  File "/home/yangjie/.conda/qwen4jay/lib/python3.10/site-packages/starlette/_exception_handler.py", line 64, in wrapped_app
    raise exc
  File "/home/yangjie/.conda/qwen4jay/lib/python3.10/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    await app(scope, receive, sender)
  File "/home/yangjie/.conda/qwen4jay/lib/python3.10/site-packages/starlette/routing.py", line 72, in app
    response = await func(request)
  File "/home/yangjie/.conda/qwen4jay/lib/python3.10/site-packages/fastapi/routing.py", line 278, in app
    raw_response = await run_endpoint_function(
  File "/home/yangjie/.conda/qwen4jay/lib/python3.10/site-packages/fastapi/routing.py", line 191, in run_endpoint_function
    return await dependant.call(**values)
  File "/data1/yangjie/vllm/vllm/entrypoints/openai/api_server.py", line 218, in create_embedding
    generator = await openai_serving_embedding.create_embedding(
  File "/data1/yangjie/vllm/vllm/entrypoints/openai/serving_embedding.py", line 147, in create_embedding
    async for i, res in result_generator:
  File "/data1/yangjie/vllm/vllm/utils.py", line 346, in consumer
    raise e
  File "/data1/yangjie/vllm/vllm/utils.py", line 337, in consumer
    raise item
  File "/data1/yangjie/vllm/vllm/utils.py", line 312, in producer
    async for item in iterator:
TypeError: 'async for' requires an object with __aiter__ method, got coroutine
```
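The last frames show the mechanics of the failure: `producer` runs `async for item in iterator`, but the `iterator` it received is still an un-awaited coroutine, and coroutine objects implement `__await__`, not `__aiter__`. A standalone sketch (names hypothetical) that inspects the two objects:

```python
import asyncio
import inspect

async def encode():
    # Hypothetical stand-in for the coroutine-style encode() in the traceback.
    async def results():
        yield "embedding"
    return results()

async def main():
    coro = encode()
    print(inspect.iscoroutine(coro))   # True: it's a coroutine object
    print(hasattr(coro, "__aiter__"))  # False: hence the TypeError
    agen = await coro
    print(hasattr(agen, "__aiter__"))  # True: the awaited result is iterable
    async for item in agen:
        print(item)                    # embedding

asyncio.run(main())
```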
In contrast, with the model Qwen2-7B everything is fine: I can deploy Qwen2-7B successfully with the same command, and it returns the desired result when called.
I don't know why I get an error when calling my embedding model deployed by vLLM.
Thanks for helping!