
[Bug] baichuan-13b-chat Service exception after long run #677

Closed
Tomorrowxxy opened this issue Aug 5, 2023 · 5 comments
Labels
bug Something isn't working

Comments

@Tomorrowxxy

Start command

python -m vllm.entrypoints.openai.api_server --model baichuan-inc/Baichuan-13B-Chat --host 0.0.0.0 --port 8777 --trust-remote-code --dtype half

After about 12 hours of operation, the inference service stopped working

GPU: V100
CUDA: 11.4

Screenshot of the problem: Xnip2023-08-05_12-03-19
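For what it's worth, a small polling loop like the sketch below (a diagnostic idea, not part of vLLM; the nvidia-smi query flags are the standard ones, and the log path is just an example) could record GPU memory over time and show what the card looks like right before the service stalls:

```python
# Diagnostic sketch (not part of vLLM): poll nvidia-smi once a minute and
# append GPU memory/utilization readings to a log file, so the state right
# before the service stalls is captured.
import subprocess
import time
from datetime import datetime

LOG_PATH = "gpu_mem.log"  # hypothetical log location

def gpu_memory_snapshot() -> str:
    out = subprocess.run(
        ["nvidia-smi",
         "--query-gpu=memory.used,memory.total,utilization.gpu",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return out.stdout.strip()

if __name__ == "__main__":
    while True:
        with open(LOG_PATH, "a") as f:
            f.write(f"{datetime.now().isoformat()} {gpu_memory_snapshot()}\n")
        time.sleep(60)
```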

@zhuohan123 added the bug (Something isn't working) label on Aug 8, 2023
@zhuohan123 (Member)

Can you describe in more detail what exactly happened? For example, do all future requests fail, or does just one specific request fail?

From the screenshot, it looks like the client may have disconnected, and the server therefore stopped the running request.
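If that theory holds, one way to rule it out is to make sure the calling side keeps the connection open for the whole generation. A minimal client sketch against the OpenAI-compatible server started above (port 8777 taken from the start command; the prompt and max_tokens values are arbitrary):

```python
# Minimal client sketch: call the OpenAI-compatible completions endpoint
# with a generous read timeout, so a slow generation is not cut off
# client-side (which the server would otherwise log as an aborted request).
import requests

resp = requests.post(
    "http://localhost:8777/v1/completions",
    json={
        "model": "baichuan-inc/Baichuan-13B-Chat",
        "prompt": "Hello, how are you?",
        "max_tokens": 128,
    },
    timeout=(5, 600),  # 5 s to connect, up to 10 min to read the response
)
resp.raise_for_status()
print(resp.json()["choices"][0]["text"])
```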

@Tomorrowxxy (Author) commented Aug 8, 2023

@zhuohan123

  • All future requests fail: after a "Received request" line there is no "Avg prompt throughput: xxxx" or other inference-related log output, and the request is immediately logged as "Aborted request".
  • In other words, the machine is no longer processing any inference requests; I have to manually kill the process and restart the service.
  • This happened after running for about 6 hours (screenshots attached).

@Tomorrowxxy (Author)


I think this is caused by insufficient CUDA memory. As shown in the screenshot, GPU memory occupancy reached 95%, after which no further inference was performed.
I tried to start baichuan-13b with gpu_memory_utilization = 0.8 on the V100, but unfortunately it failed to start; it only starts with 0.9.
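For reference, the same knob is also exposed on vLLM's offline engine, so the effect of a lower value can be checked without the API server. A minimal sketch (the 0.85 value is just an example between the 0.8 that failed and the 0.9 that works):

```python
# Sketch: load the model through vLLM's offline LLM class with an explicit
# gpu_memory_utilization, to see how much headroom a given value leaves.
from vllm import LLM, SamplingParams

llm = LLM(
    model="baichuan-inc/Baichuan-13B-Chat",
    trust_remote_code=True,
    dtype="half",
    gpu_memory_utilization=0.85,  # example value between 0.8 (failed) and 0.9
)

outputs = llm.generate(["Hello!"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```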

@Tomorrowxxy (Author)

Screenshot: WX20230818-153722

After running for a while, inference stops, although the vLLM service process is still alive.
This happens frequently; how should it be solved? @zhuohan123
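Until the root cause is found, a simple external health check could at least flag when the server stops responding, so it can be restarted promptly rather than being discovered hours later. A sketch (the /v1/models endpoint is served by the OpenAI-compatible server; the restart itself is left to whatever supervisor is in use):

```python
# Watchdog sketch: probe the OpenAI-compatible server periodically and
# print a warning when it stops answering, so the process can be restarted.
import time
import requests

URL = "http://localhost:8777/v1/models"  # port from the start command

while True:
    try:
        requests.get(URL, timeout=10).raise_for_status()
    except Exception as exc:
        print(f"vLLM server appears unhealthy: {exc} -- consider restarting it")
    time.sleep(300)  # check every 5 minutes
```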

@Tomorrowxxy mentioned this issue on Aug 18, 2023
@xiaocode337317439

+1

@hmellor closed this as not planned (won't fix, can't repro, duplicate, or stale) on Mar 25, 2024