[Usage]: How to expand the inference context length (e.g. 128k, 256k) on multi-modality models #11337
Comments
Setting |
@DarkLight1337 Thanks for your timely reply, the script is below.
context_length = 128000
llm = LLM(
    "/mnt/hwfile/mllm/weixilin/cache/Qwen2-VL-7B-Instruct",
    max_model_len=context_length,
    limit_mm_per_prompt={"video": 10},
    tensor_parallel_size=2,
) |
Did you call any CUDA-related functions before initializing vLLM? You should remove those calls if possible. |
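A minimal sanity check for that advice, sketched on the assumption that nothing should have touched the GPU before the engine is built, is to ask PyTorch whether a CUDA context already exists:

import torch
from vllm import LLM

# torch.cuda.is_initialized() reports whether this process has already created
# a CUDA context (e.g. via tensor.cuda() or torch.cuda.current_device()).
assert not torch.cuda.is_initialized(), "CUDA was initialized before vLLM"

llm = LLM(
    "/mnt/hwfile/mllm/weixilin/cache/Qwen2-VL-7B-Instruct",
    max_model_len=128000,
    limit_mm_per_prompt={"video": 10},
    tensor_parallel_size=2,
)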
from transformers import Qwen2VLForConditionalGeneration, AutoTokenizer, AutoProcessor
from qwen_vl_utils import process_vision_info
from PIL import Image
import requests
import torch
from torchvision import io
from typing import Dict
from vllm import LLM, SamplingParams
import tempfile
from petrel_client.client import Client
import copy
from torchvision.transforms.functional import to_pil_image
from torchvision import io, transforms
from torchvision.transforms import InterpolationMode
from transformers import AutoConfig

client = Client()

sampling_params = SamplingParams(
    best_of=1,
    temperature=0.0,
    top_p=1,
    top_k=-1,
    max_tokens=128,
    presence_penalty=0,
    frequency_penalty=0,
)

context_length = 128000
llm = LLM(
    "/mnt/hwfile/mllm/weixilin/cache/Qwen2-VL-7B-Instruct",
    max_model_len=context_length,
    limit_mm_per_prompt={"video": 10},
    tensor_parallel_size=2,
)

There are no CUDA-related calls, so the problem doesn't seem to be here. |
Hmm, can you wrap your code under an if __name__ == "__main__": block? |
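A minimal sketch of that guard, assuming it is applied to the script above so that tensor-parallel worker processes do not re-run the module-level initialization:

from vllm import LLM, SamplingParams

def main():
    context_length = 128000
    llm = LLM(
        "/mnt/hwfile/mllm/weixilin/cache/Qwen2-VL-7B-Instruct",
        max_model_len=context_length,
        limit_mm_per_prompt={"video": 10},
        tensor_parallel_size=2,
    )
    sampling_params = SamplingParams(temperature=0.0, max_tokens=128)
    # ... build the multimodal prompt and call llm.generate(...) here ...

if __name__ == "__main__":
    # The guard keeps child processes from re-executing the engine
    # construction when the module is re-imported.
    main()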
Nothing has changed. |
@youkaichao any idea about this? |
What's the error message? We need the full log and environment to know what's happening. |
The problem has been resolved, thanks. |
For future reference, how did you solve it? |
I launched the model as an API server instead of passing these parameters. |
But I've encountered a new issue. When running as an API server and sending a video, I get openai.BadRequestError: Error code: 400 - {'object': 'error', 'message': 'Unknown part type: video_url', 'type': 'BadRequestError', 'param': None, 'code': 400}. I only found the same question at QwenLM/Qwen2.5-VL#539, but there is no solution so far. |
What do you mean exactly by that? |
ref: https://docs.vllm.ai/en/latest/serving/openai_compatible_server.html#vllm-serve |
Can you show the code you used to send the request? |
"""
vllm serve /mnt/hwfile/mllm/weixilin/cache/Qwen2-VL-7B-Instruct --trust-remote-code --tensor-parallel-size=2 --pipeline-parallel-size=1 --served-model-name=qwen2-vl-test --enable-prefix-caching --host 0.0.0.0
"""
import base64
import requests
import time
from tqdm import tqdm
from PIL import Image
from qwen_vl_utils import process_vision_info
from io import BytesIO
import numpy as np
# Function to encode the image
def encode_image(image_path):
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode('utf-8')
from openai import OpenAI
client = OpenAI(
api_key="EMPTY",
base_url="http://10.140.1.149:8000/v1/",
)
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {
        "role": "user",
        "content": [
            {
                "type": "video",
                "video": "/mnt/hwfile/mllm/zhangpan/share/from/xiaoyi/MLVU/video/6_anomaly_reco/surveil_176.mp4",
                "min_pixels": 144*28*28,
                # "max_pixels": 1280 * 28 * 28,
                "total_pixels": (128000-1024) * 28 * 28,
                # "nframes": 3,
                # "total_pixels": 64 * 28 * 28,
            },
            {"type": "text", "text": "Describe the video."},
        ],
    },
]
image_inputs, video_inputs = process_vision_info(messages)
model = 'qwen2-vl-test'
video_prompt_pair = [
("/mnt/petrelfs/weixilin/projects/MLLM/Qwen2-VL/assets/vis.mp4", "请详细介绍一下这个视频中的内容"),
("/mnt/petrelfs/weixilin/projects/MLLM/Qwen2-VL/assets/vis.mp4", "请问视频中人物的性别,以及他在做什么?"),
]
def encode_image_base64(
    image: Image.Image,
    *,
    image_mode: str = "RGB",
    format: str = "JPEG",
) -> str:
    """
    Encode a pillow image to base64 format.
    By default, the image is converted into RGB format before being encoded.
    """
    buffered = BytesIO()
    image = image.convert(image_mode)
    image.save(buffered, format)
    return base64.b64encode(buffered.getvalue()).decode('utf-8')
def encode_video_base64(frames):
    """
    Encode a list of frames (tensors) into a single base64 string.
    """
    base64_frames = []
    frames = frames[0]
    for frame in frames:
        # Convert tensor to numpy array (if it's not already)
        if not isinstance(frame, np.ndarray):
            frame = frame.numpy()  # Assuming the tensor is a PyTorch tensor
        if frame.shape[0] == 3:  # Shape is (3, H, W)
            frame = np.transpose(frame, (1, 2, 0))
        if frame.dtype != np.uint8:
            # Normalize and convert to uint8
            frame = (frame * 255).clip(0, 255).astype(np.uint8)
        # Encode the frame as base64
        img_base64 = encode_image_base64(Image.fromarray(frame))
        base64_frames.append(img_base64)
    return ",".join(base64_frames)
def encode_image_from_path(image_path):
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode('utf-8')
image_path = "/mnt/petrelfs/weixilin/projects/MLLM/Qwen2-VL/assets/images/input_ids_length_distribution_330k.png"
for video_url, prompt in video_prompt_pair:
    chat_completion_from_url = client.chat.completions.create(
        messages=[{
            "role": "user",
            "content": [
                # {
                #     "type": "text",
                #     "text": prompt
                # },
                {
                    "type": "video_url",
                    "video_url": {
                        "url": f"data:video/jpeg;base64,{encode_video_base64(video_inputs)}"
                    }
                },
                # {"type": "image_url", "image_url": f"data:image/gif;base64,{encode_image_from_path(image_path)}"},
            ],
        }],
        model=model,
        max_tokens=1024,
    )
    result = chat_completion_from_url.choices[0].message.content
    print(f"For video: {video_url}")
    print(f"Prompt: {prompt}")
    print(f"Response: {result}")
    print()

Right now the code can only return a text response; if I pass video_url, it raises openai.BadRequestError: Error code: 400 - {'object': 'error', 'message': 'Unknown part type: video_url', 'type': 'BadRequestError', 'param': None, 'code': 400} |
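A possible workaround sketch, assuming the running server rejects video_url parts but accepts multiple image_url parts per message (an assumption, not something verified in this thread): send a few sampled frames as individual base64 images, reusing the encode_video_base64 helper above.

# Sketch: send sampled frames as separate images instead of one video part.
# Assumes the server allows several images per prompt (see --limit-mm-per-prompt).
frames_b64 = encode_video_base64(video_inputs).split(",")[:8]  # cap the frame count

content = [{"type": "text", "text": prompt}]
for b64 in frames_b64:
    content.append({
        "type": "image_url",
        "image_url": {"url": f"data:image/jpeg;base64,{b64}"},
    })

chat_completion = client.chat.completions.create(
    model=model,
    messages=[{"role": "user", "content": content}],
    max_tokens=1024,
)
print(chat_completion.choices[0].message.content)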
Can you try to use |
Your current environment
environment
torch: 2.5.0
vllm: 0.6.3.post2.dev171+g890ca360
problem
I have several A100 GPUs and a Qwen2-VL-7B model. I'm trying to expand the model's context length to 128k at inference time, but it causes OOM.
I have tried:
How would you like to use vllm
I want to run inference with Qwen2-VL-7B on a longer context (exceeding 128k) in my current working environment (several A100 80G GPUs).
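For reference, a sketch of the engine options that usually relieve memory pressure for long-context multimodal runs; the values below are illustrative assumptions, not settings confirmed to fit 128k on this hardware:

from vllm import LLM

llm = LLM(
    "/mnt/hwfile/mllm/weixilin/cache/Qwen2-VL-7B-Instruct",
    max_model_len=128000,
    tensor_parallel_size=4,              # spread weights and KV cache over more A100s
    gpu_memory_utilization=0.95,         # fraction of GPU memory vLLM may claim
    limit_mm_per_prompt={"video": 1},    # fewer multimodal items per prompt
    max_num_seqs=1,                      # schedule one long sequence at a time
)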