
[Bug]: Error Running Llama 3.2 1B on CPU #9037

Open · 1 task done
kunalmohan opened this issue Oct 3, 2024 · 9 comments
Labels: bug, stale

Comments

kunalmohan commented Oct 3, 2024

Your current environment

The output of `python collect_env.py`
PyTorch version: 2.4.0+cpu
Is debug build: False
CUDA used to build PyTorch: None
ROCM used to build PyTorch: N/A

OS: Ubuntu 22.04.5 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: Could not collect
CMake version: version 3.30.4
Libc version: glibc-2.35

Python version: 3.10.14 (main, May  6 2024, 19:42:50) [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-6.5.0-1025-azure-x86_64-with-glibc2.35
Is CUDA available: False
CUDA runtime version: No CUDA
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:                       x86_64
CPU op-mode(s):                     32-bit, 64-bit
Address sizes:                      48 bits physical, 48 bits virtual
Byte Order:                         Little Endian
CPU(s):                             4
On-line CPU(s) list:                0-3
Vendor ID:                          AuthenticAMD
Model name:                         AMD EPYC 7763 64-Core Processor
CPU family:                         25
Model:                              1
Thread(s) per core:                 2
Core(s) per socket:                 2
Socket(s):                          1
Stepping:                           1
BogoMIPS:                           4890.86
Flags:                              fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy svm cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext invpcid_single vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves clzero xsaveerptr rdpru arat npt nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold v_vmsave_vmload umip vaes vpclmulqdq rdpid fsrm
Virtualization:                     AMD-V
Hypervisor vendor:                  Microsoft
Virtualization type:                full
L1d cache:                          64 KiB (2 instances)
L1i cache:                          64 KiB (2 instances)
L2 cache:                           1 MiB (2 instances)
L3 cache:                           32 MiB (1 instance)
NUMA node(s):                       1
NUMA node0 CPU(s):                  0-3
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit:        Not affected
Vulnerability L1tf:                 Not affected
Vulnerability Mds:                  Not affected
Vulnerability Meltdown:             Not affected
Vulnerability Mmio stale data:      Not affected
Vulnerability Retbleed:             Not affected
Vulnerability Spec rstack overflow: Vulnerable: Safe RET, no microcode
Vulnerability Spec store bypass:    Vulnerable
Vulnerability Spectre v1:           Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:           Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected
Vulnerability Srbds:                Not affected
Vulnerability Tsx async abort:      Not affected

Versions of relevant libraries:
[pip3] intel_extension_for_pytorch==2.4.0
[pip3] numpy==1.26.4
[pip3] nvidia-cublas-cu12==12.1.3.1
[pip3] nvidia-cuda-cupti-cu12==12.1.105
[pip3] nvidia-cuda-nvrtc-cu12==12.1.105
[pip3] nvidia-cuda-runtime-cu12==12.1.105
[pip3] nvidia-cudnn-cu12==9.1.0.70
[pip3] nvidia-cufft-cu12==11.0.2.54
[pip3] nvidia-curand-cu12==10.3.2.106
[pip3] nvidia-cusolver-cu12==11.4.5.107
[pip3] nvidia-cusparse-cu12==12.1.0.106
[pip3] nvidia-ml-py==12.560.30
[pip3] nvidia-nccl-cu12==2.20.5
[pip3] nvidia-nvjitlink-cu12==12.6.77
[pip3] nvidia-nvtx-cu12==12.1.105
[pip3] pyzmq==26.2.0
[pip3] torch==2.4.0+cpu
[pip3] torchaudio==2.4.0+cpu
[pip3] torchvision==0.19.0+cpu
[pip3] transformers==4.46.0.dev0
[pip3] triton==3.0.0
[conda] intel-extension-for-pytorch 2.4.0                    pypi_0    pypi
[conda] numpy                     1.26.4                   pypi_0    pypi
[conda] nvidia-cublas-cu12        12.1.3.1                 pypi_0    pypi
[conda] nvidia-cuda-cupti-cu12    12.1.105                 pypi_0    pypi
[conda] nvidia-cuda-nvrtc-cu12    12.1.105                 pypi_0    pypi
[conda] nvidia-cuda-runtime-cu12  12.1.105                 pypi_0    pypi
[conda] nvidia-cudnn-cu12         9.1.0.70                 pypi_0    pypi
[conda] nvidia-cufft-cu12         11.0.2.54                pypi_0    pypi
[conda] nvidia-curand-cu12        10.3.2.106               pypi_0    pypi
[conda] nvidia-cusolver-cu12      11.4.5.107               pypi_0    pypi
[conda] nvidia-cusparse-cu12      12.1.0.106               pypi_0    pypi
[conda] nvidia-ml-py              12.560.30                pypi_0    pypi
[conda] nvidia-nccl-cu12          2.20.5                   pypi_0    pypi
[conda] nvidia-nvjitlink-cu12     12.6.77                  pypi_0    pypi
[conda] nvidia-nvtx-cu12          12.1.105                 pypi_0    pypi
[conda] pyzmq                     26.2.0                   pypi_0    pypi
[conda] torch                     2.4.0+cpu                pypi_0    pypi
[conda] torchaudio                2.4.0+cpu                pypi_0    pypi
[conda] torchvision               0.19.0+cpu               pypi_0    pypi
[conda] transformers              4.46.0.dev0              pypi_0    pypi
[conda] triton                    3.0.0                    pypi_0    pypi
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.6.3.dev65+g7f60520d
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
Could not collect

Model Input Dumps

No response

🐛 Describe the bug

I installed vLLM for CPU as described in https://docs.vllm.ai/en/latest/getting_started/cpu-installation.html#

I get an error when running Llama-3.2-1B on CPU.
Command:

vllm serve meta-llama/Llama-3.2-1B-Instruct --port 8080 --api-key qwerty123  --max-model-len=30000

Error:

INFO 10-03 07:38:10 logger.py:36] Received request chat-3361fcfb910c48b08a00e19e7f79c6e0: prompt: '<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nCutting Knowledge Date: December 2023\nToday Date: 03 Oct 2024\n\n<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nExplain the significance of AI in modern technology.<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n', params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.7, top_p=0.9, top_k=-1, min_p=0.0, seed=None, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=1000, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None), guided_decoding=GuidedDecodingParams(json=None, regex=None, choice=None, grammar=None, json_object=None, backend=None, whitespace_pattern=None), prompt_token_ids: [128000, 128006, 9125, 128007, 271, 38766, 1303, 33025, 2696, 25, 6790, 220, 2366, 18, 198, 15724, 2696, 25, 220, 2839, 5020, 220, 2366, 19, 271, 128009, 128006, 882, 128007, 271, 849, 21435, 279, 26431, 315, 15592, 304, 6617, 5557, 13, 128009, 128006, 78191, 128007, 271], lora_request: None, prompt_adapter_request: None.
INFO 10-03 07:38:12 engine.py:288] Added request chat-3361fcfb910c48b08a00e19e7f79c6e0.
ERROR 10-03 07:38:34 client.py:247] RuntimeError('Engine loop has died')
ERROR 10-03 07:38:34 client.py:247] Traceback (most recent call last):
ERROR 10-03 07:38:34 client.py:247]   File "/home/azureuser/miniconda3/envs/myenv/lib/python3.10/site-packages/vllm-0.6.3.dev65+g7f60520d.cpu-py3.10-linux-x86_64.egg/vllm/engine/multiprocessing/client.py", line 147, in run_heartbeat_loop
ERROR 10-03 07:38:34 client.py:247]     await self._check_success(
ERROR 10-03 07:38:34 client.py:247]   File "/home/azureuser/miniconda3/envs/myenv/lib/python3.10/site-packages/vllm-0.6.3.dev65+g7f60520d.cpu-py3.10-linux-x86_64.egg/vllm/engine/multiprocessing/client.py", line 311, in _check_success
ERROR 10-03 07:38:34 client.py:247]     raise response
ERROR 10-03 07:38:34 client.py:247] RuntimeError: Engine loop has died
CRITICAL 10-03 07:38:38 launcher.py:99] MQLLMEngine is already dead, terminating server process
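
For context, the failing request is an ordinary chat completion call against the OpenAI-compatible endpoint, matching the SamplingParams in the log above (temperature=0.7, top_p=0.9, max_tokens=1000). A minimal client sketch follows; the `openai` package and the exact call shape are assumptions, since the issue does not include the client code:

from openai import OpenAI

# Hypothetical client matching the serve command above (--port 8080,
# --api-key qwerty123); the actual client used is not shown in the issue.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="qwerty123")

response = client.chat.completions.create(
    model="meta-llama/Llama-3.2-1B-Instruct",
    messages=[{
        "role": "user",
        "content": "Explain the significance of AI in modern technology.",
    }],
    temperature=0.7,
    top_p=0.9,
    max_tokens=1000,
)
print(response.choices[0].message.content)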

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
kunalmohan added the bug label Oct 3, 2024
kunalmohan (Author) commented:

Same issue with vLLM version 0.6.2+cpu.

DarkLight1337 (Member) commented:

Can you check whether it's fixed by #9044?

DarkLight1337 (Member) commented:

If the issue persists, please show the full stack trace of the error.

kunalmohan (Author) commented:

Nope, it does not fix the issue. I've pasted the complete stack trace in the issue description. Pasting it again:

INFO:     Uvicorn running on http://0.0.0.0:8081 (Press CTRL+C to quit)
INFO 10-04 06:33:09 logger.py:36] Received request chat-32e960f7889940598a1ac749e5726a01: prompt: '<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nCutting Knowledge Date: December 2023\nToday Date: 04 Oct 2024\n\n<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nExplain the significance of AI in modern technology.<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n', params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.7, top_p=0.9, top_k=-1, min_p=0.0, seed=None, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=1000, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None), guided_decoding=GuidedDecodingParams(json=None, regex=None, choice=None, grammar=None, json_object=None, backend=None, whitespace_pattern=None), prompt_token_ids: [128000, 128006, 9125, 128007, 271, 38766, 1303, 33025, 2696, 25, 6790, 220, 2366, 18, 198, 15724, 2696, 25, 220, 2371, 5020, 220, 2366, 19, 271, 128009, 128006, 882, 128007, 271, 849, 21435, 279, 26431, 315, 15592, 304, 6617, 5557, 13, 128009, 128006, 78191, 128007, 271], lora_request: None, prompt_adapter_request: None.
INFO 10-04 06:33:15 engine.py:288] Added request chat-32e960f7889940598a1ac749e5726a01.
ERROR 10-04 06:33:36 client.py:247] RuntimeError('Engine loop has died')
ERROR 10-04 06:33:36 client.py:247] Traceback (most recent call last):
ERROR 10-04 06:33:36 client.py:247]   File "/home/azureuser/miniconda3/envs/myenv/lib/python3.10/site-packages/vllm-0.6.3.dev65+g7f60520d.d20241004.cpu-py3.10-linux-x86_64.egg/vllm/engine/multiprocessing/client.py", line 147, in run_heartbeat_loop
ERROR 10-04 06:33:36 client.py:247]     await self._check_success(
ERROR 10-04 06:33:36 client.py:247]   File "/home/azureuser/miniconda3/envs/myenv/lib/python3.10/site-packages/vllm-0.6.3.dev65+g7f60520d.d20241004.cpu-py3.10-linux-x86_64.egg/vllm/engine/multiprocessing/client.py", line 311, in _check_success
ERROR 10-04 06:33:36 client.py:247]     raise response
ERROR 10-04 06:33:36 client.py:247] RuntimeError: Engine loop has died
INFO 10-04 06:33:43 metrics.py:351] Avg prompt throughput: 1.2 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
CRITICAL 10-04 06:33:45 launcher.py:99] MQLLMEngine is already dead, terminating server process
INFO:     xx.xx.xx.xx:xx - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error
INFO:     Shutting down
INFO:     Waiting for application shutdown.
INFO:     Application shutdown complete.
INFO:     Finished server process [1063]

DarkLight1337 (Member) commented:

@robertgshaw2-neuralmagic it seems that MQLLMEngine is hiding the stack trace... can we avoid this to make debugging easier?

bpucla commented Oct 8, 2024

Any update on this issue? I have the same issue (exactly the same stack trace) with Llama 2 inference with vLLM 0.6.1 on GPU.

Kepontry commented Oct 9, 2024

@bpucla Hi, have you tried #9044? That pull request solved my problem. I'm on vLLM 0.6.2+cpu.

vrtust commented Oct 16, 2024

Same error and log here.
Setting VLLM_RPC_TIMEOUT=100000 solved it, like so:

VLLM_RPC_TIMEOUT=100000 python -m vllm.entrypoints.openai.api_server --port 8080 --served-model-name Qwen2-7B-Instruct --model /workspace/model/Qwen2-7B-Instruct --tensor-parallel-size 1 --dtype half
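
Applied to the serve command from this issue, the same workaround could be scripted from Python as below. This is a sketch only: it assumes VLLM_RPC_TIMEOUT is read from the environment at startup (in recent vLLM versions the value is interpreted as milliseconds, so 100000 is ~100 s; check your release).

import os
import subprocess

# Launch the server with a larger RPC heartbeat timeout so the client does
# not declare the engine dead while the CPU backend is still busy with a
# long prefill. Sketch only; the flags mirror the original report.
env = dict(os.environ, VLLM_RPC_TIMEOUT="100000")
subprocess.run(
    [
        "vllm", "serve", "meta-llama/Llama-3.2-1B-Instruct",
        "--port", "8080",
        "--api-key", "qwerty123",
        "--max-model-len", "30000",
    ],
    env=env,
    check=True,
)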

github-actions (bot) commented:

This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!

github-actions bot added the stale label Jan 15, 2025