[Bug]: vLLM 0.6.2 UserWarning: resource_tracker: There appear to be 1 leaked shared_memory objects to clean up at shutdown #8933
LOG_DIR="/raid/xinference/modelscope/hub/qwen/Qwen2-72B-Instruct/logs";
Please analyze my logs:

INFO 09-29 08:48:42 logger.py:36] Received request chat-312348aecf9e42ec91de81da41e091db: prompt: '<|im_start|>system\n你是一位商品名词分析专家,请从描述内容中分析是否包含与输入的产品关键词含义相同。请使用Json 的格式返回如下结果,结果的值只允许有包含和不包含。格式 如下:{“result”:"结果"}<|im_end|>\n<|im_start|>user\nFIBER CABLE是否包含在哪下内容中,内容如下:None<|im_end|>\n<|im_start|>assistant\n', params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.7, top_p=1.0, top_k=-1, min_p=0.0, seed=None, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=None, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None), prompt_token_ids: [151644, 8948, 198, 56568, 109182, 45943, 113046, 101042, 101057, 37945, 45181, 53481, 43815, 15946, 101042, 64471, 102298, 57218, 31196, 104310, 105291, 109091, 102486, 1773, 14880, 37029, 5014, 43589, 68805, 31526, 104506, 59151, 3837, 59151, 9370, 25511, 91680, 102496, 18830, 102298, 33108, 16530, 102298, 1773, 68805, 69372, 16872, 5122, 90, 2073, 1382, 854, 2974, 59151, 9207, 151645, 198, 151644, 872, 198, 37, 3256, 640, 356, 3494, 64471, 102298, 109333, 16872, 43815, 15946, 3837, 43815, 104506, 5122, 4064, 151645, 198, 151644, 77091, 198], lora_request: None, prompt_adapter_request: None.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
[rank3]:[E ProcessGroupNCCL.cpp:563] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=29100340, OpType=GATHER, NumelIn=38016, NumelOut=0, Timeout(ms)=600000) ran for 600051 milliseconds before timing out.
[rank1]:[E ProcessGroupNCCL.cpp:563] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=29100340, OpType=GATHER, NumelIn=38016, NumelOut=0, Timeout(ms)=600000) ran for 600053 milliseconds before timing out.
[rank0]:[E ProcessGroupNCCL.cpp:563] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=29100340, OpType=GATHER, NumelIn=38016, NumelOut=38016, Timeout(ms)=600000) ran for 600079 milliseconds before timing out.
/raid/demo/anaconda3/envs/vllm/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 1 leaked shared_memory objects to clean up at shutdown
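For context on the warning itself, a minimal sketch of how the stdlib resource tracker comes to report a "leaked" segment (this is unrelated to how vLLM manages its own shared memory internally): a `SharedMemory` block that is never `unlink()`ed is flagged at interpreter shutdown, while `close()` plus `unlink()` cleans it up silently.

```python
from multiprocessing import shared_memory

# Create a named shared-memory segment and write into it.
seg = shared_memory.SharedMemory(create=True, size=64)
seg.buf[:5] = b"hello"
data = bytes(seg.buf[:5])

# Proper cleanup: close() drops this process's mapping, unlink()
# removes the segment itself. Skipping unlink() before exit is what
# triggers "There appear to be 1 leaked shared_memory objects ...".
seg.close()
seg.unlink()
print(data)  # b'hello'
```

In a multi-process server the warning often just means a worker died (for example after an NCCL watchdog abort) before it could unlink its segments, so it is usually a symptom of the crash above rather than the root cause.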
Same here.
This problem occurs once multiple GPUs are used.
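One detail worth noting from the logs above: `Timeout(ms)=600000` is a 10-minute collective timeout, after which the NCCL watchdog aborts the ranks. As a hypothetical workaround sketch (whether and where vLLM exposes this may differ by version), `torch.distributed.init_process_group` accepts a `timeout` argument that raises this limit; the torch call is kept in a comment so the sketch runs without a GPU setup:

```python
from datetime import timedelta

# Hypothetical mitigation: pass a larger timeout when creating the
# process group, e.g.
#
#   torch.distributed.init_process_group(
#       backend="nccl",
#       timeout=timedelta(minutes=30),
#   )
#
# The 600000 ms in the watchdog message corresponds to 10 minutes:
log_timeout = timedelta(minutes=10)
print(log_timeout.total_seconds() * 1000)  # 600000.0 — matches Timeout(ms)=600000
```

Raising the timeout only buys time; if a rank is genuinely hung (not just slow), the GATHER will still never complete and the watchdog will fire at the new limit instead.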
+1 |
I'm facing a new bug now:

Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:43 (most recent call first):
INFO 10-16 11:05:09 logger.py:36] Received request chat-af9601cd9bc24371bb0b11c4a36bc55f: prompt: '<|im_start|>system\n<|im_end|>\n<|im_start|>user\n\n## Role: 专业AI翻译助手\n- description: 你是一个专业的AI翻译助手,精通中文。\n\n## Skills\n1. 精通多种语言的翻译技巧\n2. 准确理解和传达原文含义\n3. 严格遵循输出格式要求\n4. 专业术语和品牌名称的处理能力\n\n## Rules\n1. 仅输出原文和翻译,不添加任何解释或评论。\n2. 严格按照指定的JSON格式提供翻译结果。\n3. 保持专业术语和品牌名称不变,除非有官方的中文译名。\n4. 确保翻译准确传达原文含义,同时保持中文的自然流畅。\n5. 如遇无法翻译的内容,保持原文不变,不做额外说明。\n6. 禁止在输出中包含任何额外的解释、注释或元数据。\n\n## Workflow\n1. 接收翻译任务:目标语言为中文,待翻译文本为"BATERIA RECARGABLE PARA FUENTE DE VOLTAJE UNINTERRUPTIBLE"。\n2. 分析文本,识别专业术语和品牌名称。\n3. 进行翻译,遵循翻译规则。\n4. 按照指定的JSON格式输出结果。\n\n## OutputFormat\n{\n "翻译后": ""\n}\n\n## Init\n你的任务是将给定的文本"BATERIA RECARGABLE PARA FUENTE DE VOLTAJE UNINTERRUPTIBLE"准确翻译成中文。请直接开始翻译工作,只返回要求的JSON格式结果。<|im_end|>\n<|im_start|>assistant\n', params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.5, top_p=1.0, top_k=-1, min_p=0.0, seed=None, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=7868, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None), prompt_token_ids: [151644, 8948, 198, 151645, 198, 151644, 872, 271, 565, 15404, 25, 220, 99878, 15469, 105395, 110498, 198, 12, 4008, 25, 220, 56568, 101909, 104715, 15469, 105395, 110498, 3837, 114806, 104811, 3407, 565, 30240, 198, 16, 13, 10236, 110, 122, 31935, 101312, 102064, 9370, 105395, 102118, 198, 17, 13, 65727, 228, 33956, 115167, 107707, 103283, 109091, 198, 18, 13, 220, 100470, 106466, 66017, 68805, 101882, 198, 19, 13, 220, 99878, 116925, 33108, 100135, 29991, 9370, 54542, 99788, 271, 565, 22847, 198, 16, 13, 220, 99373, 66017, 103283, 33108, 105395, 3837, 16530, 42855, 99885, 104136, 57191, 85641, 8997, 17, 13, 220, 110439, 105146, 9370, 5370, 68805, 99553, 105395, 59151, 8997, 18, 13, 220, 100662, 99878, 116925, 33108, 100135, 29991, 105928, 3837, 106781, 18830, 100777, 9370, 104811, 102610, 13072, 8997, 19, 13, 10236, 94, 106, 32463, 105395, 102188, 107707, 103283, 109091, 3837, 91572, 100662, 104811, 9370, 99795, 110205, 8997, 20, 13, 69372, 99688, 101068, 105395, 104597, 3837, 100662, 103283, 105928, 3837, 109513, 108593, 66394, 8997, 21, 13, 10236, 99, 223, 81433, 18493, 66017, 15946, 102298, 99885, 108593, 9370, 104136, 5373, 25074, 68862, 57191, 23305, 20074, 3407, 565, 60173, 198, 16, 13, 46602, 98, 50009, 105395, 88802, 5122, 100160, 102064, 17714, 104811, 3837, 74193, 105395, 108704, 17714, 63590, 19157, 5863, 74136, 7581, 3494, 50400, 95512, 93777, 3385, 68226, 15204, 40329, 6643, 3221, 48584, 13568, 1, 8997, 17, 13, 58657, 97771, 108704, 3837, 102450, 99878, 116925, 33108, 100135, 29991, 8997, 18, 13, 32181, 249, 22243, 105395, 3837, 106466, 105395, 104190, 8997, 19, 13, 6567, 234, 231, 99331, 105146, 9370, 5370, 68805, 66017, 59151, 3407, 565, 9258, 4061, 198, 515, 220, 330, 105395, 33447, 788, 8389, 630, 565, 15690, 198, 103929, 88802, 20412, 44063, 89012, 22382, 9370, 108704, 63590, 19157, 5863, 74136, 7581, 3494, 50400, 95512, 93777, 3385, 68226, 15204, 40329, 6643, 3221, 48584, 13568, 1, 102188, 105395, 12857, 104811, 1773, 14880, 101041, 55286, 105395, 99257, 3837, 91680, 31526, 101882, 9370, 5370, 68805, 59151, 1773, 151645, 198, 151644, 77091, 198], lora_request: None, prompt_adapter_request: None.
During handling of the above exception, another exception occurred:
Upgrading to v0.6.3 seems to solve this error.
I have the same issue with vLLM 0.6.4.post1. I was working with 4 A100 GPUs, and generation slows down until it hangs completely. I'm not sure whether the hang is related to the memory leak.
WARNING 01-12 08:36:16 config.py:1656] Casting torch.bfloat16 to torch.float16.
INFO 01-12 08:36:24 model_runner.py:1025] Loading model weights took 3.7710 GB
Current thread 0x00007fe6efeb8280 (most recent call first):
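The `Current thread 0x... (most recent call first):` line looks like output from Python's `faulthandler` module, which dumps every thread's stack when the process dies abnormally (or on request). The same dump can be triggered on demand to see where a hung worker is sitting; a minimal sketch:

```python
import faulthandler
import os
import tempfile

# Ask faulthandler to write all thread stacks to a file, the same
# format that appears in the vLLM log above.
path = os.path.join(tempfile.mkdtemp(), "stacks.txt")
with open(path, "w") as f:
    faulthandler.dump_traceback(file=f, all_threads=True)

report = open(path).read()
print(report.splitlines()[0])  # e.g. "Current thread 0x... (most recent call first):"
```

In a live deployment you would more likely attach an external sampler to the hung PID, but `faulthandler.enable()` at startup is a cheap way to get these dumps automatically on fatal signals.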
Your current environment
Model Input Dumps
No response
🐛 Describe the bug
(demo_vllm) demo@dgx03:/raid/xinference/modelscope/hub/qwen/Qwen2-72B-Instruct/logs$ tail -f vllm_20240927.log
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f5cf7ba9897 in /raid/demo/anaconda3/envs/vllm/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7f5cf8e82c62 in /raid/demo/anaconda3/envs/vllm/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f5cf8e87a80 in /raid/demo/anaconda3/envs/vllm/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f5cf8e88dcc in /raid/demo/anaconda3/envs/vllm/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0xdbbf4 (0x7f5d44931bf4 in /raid/demo/anaconda3/envs/vllm/bin/../lib/libstdc++.so.6)
frame #5: + 0x8609 (0x7f5d4618b609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #6: clone + 0x43 (0x7f5d45f56353 in /lib/x86_64-linux-gnu/libc.so.6)
/raid/demo/anaconda3/envs/vllm/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 1 leaked shared_memory objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '