
[Usage]: How to extend the inference context length (e.g. to 128k or 256k) for multi-modality models #11337

Closed
Wiselnn570 opened this issue Dec 19, 2024 · 17 comments
Labels: usage (How to use vllm)

@Wiselnn570

Your current environment

  • environment
    torch: 2.5.0
    vllm: 0.6.3.post2.dev171+g890ca360

  • problem
    I have several A100 GPUs and a Qwen2-VL-7B model. I am trying to extend the model's context length to 128k at inference time, but it runs out of memory (OOM).

I have tried the following (a combined sketch follows this list):

  1. [Usage]: How to use ROPE scaling for llama3.1 and gemma2? #10537: setting max_num_seqs = 1 or 2, which still OOMs.
  2. [Usage]: Can't use vllm on a multiGPU node #10474: setting tensor_parallel_size=2, which hits the same issue and remains unsolved.
  3. [Bug]: Qwen/Qwen2-72B-Instruct 128k server down #5496: setting enable_chunked_prefill=True and max_num_batched_tokens=8192, which OOMs again.
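A minimal sketch of these options combined in the offline LLM API (values are illustrative; each of them still OOMs for me):

from vllm import LLM

# Sketch only: combining the memory-related options from the issues above.
llm = LLM(
    "/mnt/hwfile/mllm/weixilin/cache/Qwen2-VL-7B-Instruct",
    max_model_len=128000,               # target context length
    max_num_seqs=1,                     # limit concurrent sequences (#10537)
    tensor_parallel_size=2,             # shard the model across 2 GPUs (#10474)
    enable_chunked_prefill=True,        # chunk long prefills (#5496)
    max_num_batched_tokens=8192,
    limit_mm_per_prompt={"video": 10},
)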

How would you like to use vllm

I want to run inference with Qwen2-VL-7B at a longer context length (beyond 128k) on my current hardware (several A100 80G GPUs).

@DarkLight1337
Member

Setting tensor_parallel_size is probably the way to go. Can you show your code?

@Wiselnn570
Author

> Setting tensor_parallel_size is probably the way to go. Can you show your code?

@DarkLight1337 Thanks for the quick reply; the script is below.

context_length = 128000
llm = LLM("/mnt/hwfile/mllm/weixilin/cache/Qwen2-VL-7B-Instruct", 
            max_model_len=context_length,
            limit_mm_per_prompt={"video": 10},
            tensor_parallel_size=2
            )

@DarkLight1337
Member

Did you call any CUDA-related functions before initializing vLLM? You should remove those calls if possible.
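For example (illustrative only), code like the following before constructing the LLM initializes CUDA in the parent process and can interfere with vLLM's multi-GPU worker startup:

import torch
from vllm import LLM

# Illustrative anti-pattern: these calls touch CUDA before vLLM is created.
# torch.cuda.set_device(0)
# x = torch.zeros(1, device="cuda")

# Create the engine first, without any prior CUDA initialization.
llm = LLM("Qwen/Qwen2-VL-7B-Instruct", tensor_parallel_size=2)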

@Wiselnn570
Author

> Did you call any CUDA-related functions before initializing vLLM? You should remove those calls if possible.

from transformers import Qwen2VLForConditionalGeneration, AutoTokenizer, AutoProcessor
from qwen_vl_utils import process_vision_info
from PIL import Image
import requests
import torch
from torchvision import io
from typing import Dict
from vllm import LLM, SamplingParams
import tempfile
from petrel_client.client import Client
import copy
from torchvision.transforms.functional import to_pil_image
from torchvision import io, transforms
from torchvision.transforms import InterpolationMode
from transformers import AutoConfig


client = Client()
sampling_params = SamplingParams(
    best_of=1,
    temperature=0.0,
    top_p=1,
    top_k=-1,
    max_tokens=128,
    presence_penalty=0,
    frequency_penalty=0,
)


context_length = 128000


llm = LLM("/mnt/hwfile/mllm/weixilin/cache/Qwen2-VL-7B-Instruct", 
            max_model_len=context_length,
            limit_mm_per_prompt={"video": 10},
            tensor_parallel_size=2
            )

There are no CUDA-related calls, so the problem doesn't seem to be there.

@DarkLight1337
Member

Hmm, can you wrap your code under an if __name__ == "__main__" guard? Perhaps multiprocessing is messing this up.
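i.e. a minimal sketch like:

from vllm import LLM

def main():
    llm = LLM(
        "/mnt/hwfile/mllm/weixilin/cache/Qwen2-VL-7B-Instruct",
        max_model_len=128000,
        limit_mm_per_prompt={"video": 10},
        tensor_parallel_size=2,
    )
    # ... build prompts and call llm.generate(...) here ...

if __name__ == "__main__":
    # Guard so that the worker processes vLLM spawns do not re-run this
    # module-level code when they import the script.
    main()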

@Wiselnn570
Author

Wiselnn570 commented Dec 19, 2024

> Hmm, can you wrap your code under an if __name__ == "__main__" guard? Perhaps multiprocessing is messing this up.

Nothing has changed.

@DarkLight1337
Member

@youkaichao any idea about this?

@youkaichao
Member

What's the error message? We need the full log and environment to know what's happening.

@Wiselnn570
Author

Wiselnn570 commented Dec 20, 2024 via email

@DarkLight1337
Member

For future reference, how did you solve it?

@Wiselnn570
Author

> For future reference, how did you solve it?

I served the model as an OpenAI-compatible API server instead of passing these parameters to LLM() in Python.
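Roughly like this (a sketch of the command; flag spelling may vary across vLLM versions):

vllm serve /mnt/hwfile/mllm/weixilin/cache/Qwen2-VL-7B-Instruct \
    --max-model-len 128000 \
    --tensor-parallel-size 2 \
    --limit-mm-per-prompt video=10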

@Wiselnn570
Author

But I've encountered a new issue. When serving the model as an API server and sending a video, the request fails with:

openai.BadRequestError: Error code: 400 - {'object': 'error', 'message': 'Unknown part type: video_url', 'type': 'BadRequestError', 'param': None, 'code': 400}

I only found the same question at QwenLM/Qwen2.5-VL#539, but there is no solution there yet.

@DarkLight1337
Member

> When registering as an API server

What do you mean exactly by that?

@Wiselnn570
Author

> When registering as an API server
>
> What do you mean exactly by that?

ref: https://docs.vllm.ai/en/latest/serving/openai_compatible_server.html#vllm-serve

@DarkLight1337
Member

Can you show the code you used to send the request?

@Wiselnn570
Author

"""
vllm serve /mnt/hwfile/mllm/weixilin/cache/Qwen2-VL-7B-Instruct --trust-remote-code --tensor-parallel-size=2 --pipeline-parallel-size=1 --served-model-name=qwen2-vl-test --enable-prefix-caching --host 0.0.0.0
"""
import base64
import requests
import time
from tqdm import tqdm
from PIL import Image
from qwen_vl_utils import process_vision_info
from io import BytesIO
import numpy as np
# Function to encode the image
def encode_image(image_path):
  with open(image_path, "rb") as image_file:
    return base64.b64encode(image_file.read()).decode('utf-8')
  
from openai import OpenAI
client = OpenAI(
    api_key="EMPTY",
    base_url="http://10.140.1.149:8000/v1/",
)
messages = [
        {"role": "system", "content": "You are a helpful assistant."},
        {
            "role": "user",
            "content": [
                {
                    "type": "video",
                    "video": "/mnt/hwfile/mllm/zhangpan/share/from/xiaoyi/MLVU/video/6_anomaly_reco/surveil_176.mp4",
                    "min_pixels": 144*28*28,
                    # "max_pixels": 1280 * 28 * 28,
                    "total_pixels": (128000-1024) * 28 * 28,
                    # "nframes": 3,
                    # "total_pixels": 64 * 28 * 28,
                },
                {"type": "text", "text": "Describe the video."},
            ],
        },
    ]
image_inputs, video_inputs = process_vision_info(messages)

model = 'qwen2-vl-test'
video_prompt_pair = [
    ("/mnt/petrelfs/weixilin/projects/MLLM/Qwen2-VL/assets/vis.mp4", "请详细介绍一下这个视频中的内容"),
    ("/mnt/petrelfs/weixilin/projects/MLLM/Qwen2-VL/assets/vis.mp4", "请问视频中人物的性别,以及他在做什么?"),
]
def encode_image_base64(
    image: Image.Image,
    *,
    image_mode: str = "RGB",
    format: str = "JPEG",
) -> str:
    """
    Encode a pillow image to base64 format.

    By default, the image is converted into RGB format before being encoded.
    """
    buffered = BytesIO()
    image = image.convert(image_mode)
    image.save(buffered, format)
    return base64.b64encode(buffered.getvalue()).decode('utf-8')
def encode_video_base64(frames):
    """
    Encode a list of frames (tensors) into a single base64 string.
    """
    base64_frames = []
    frames = frames[0]  # process_vision_info returns a list of videos; take the first one
    for frame in frames:
        # Convert tensor to numpy array (if it's not already)
        if not isinstance(frame, np.ndarray):
            frame = frame.numpy()  # Assuming the tensor is a PyTorch tensor

        if frame.shape[0] == 3:  # Shape is (3, H, W)
            frame = np.transpose(frame, (1, 2, 0))
        
        if frame.dtype != np.uint8:
            # Normalize and convert to uint8
            frame = (frame * 255).clip(0, 255).astype(np.uint8)
        # Encode the frame as base64
        img_base64 = encode_image_base64(Image.fromarray(frame))
        base64_frames.append(img_base64)
    return ",".join(base64_frames)

def encode_image_from_path(image_path):
  with open(image_path, "rb") as image_file:
    return base64.b64encode(image_file.read()).decode('utf-8')
image_path = "/mnt/petrelfs/weixilin/projects/MLLM/Qwen2-VL/assets/images/input_ids_length_distribution_330k.png"
for video_url, prompt in video_prompt_pair:
    chat_completion_from_url = client.chat.completions.create(
        messages=[{
            "role": "user",
            "content": [
                # {
                #     "type": "text",
                #     "text": prompt
                # },
                # {
                {"type": "video_url",
                "video_url": {
                    "url": f"data:video/jpeg;base64,{encode_video_base64(video_inputs)}"
                }
                }
                # {"type": "image_url", "image_url": f"data:image/gif;base64,{encode_image_from_path(image_path)}"
            ],
        }],
        model=model,
        max_tokens=1024,
    )
    result = chat_completion_from_url.choices[0].message.content
    print(f"For video: {video_url}")
    print(f"Prompt: {prompt}")
    print(f"Response: {result}")
    print()

Right now the code only produces a response when I send text; passing a video_url part raises openai.BadRequestError: Error code: 400 - {'object': 'error', 'message': 'Unknown part type: video_url', 'type': 'BadRequestError', 'param': None, 'code': 400}

@DarkLight1337
Member

Can you try using a curl command or the requests library to send the request instead? Maybe the OpenAI Python client is too strict with its type checking.
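Something along these lines, for example (an untested sketch; it reuses encode_video_base64 and video_inputs from your script above, and you would adjust the host, port, and model name to your deployment):

import requests

# Build the same chat payload by hand and POST it directly to the
# OpenAI-compatible endpoint, bypassing the OpenAI Python client.
payload = {
    "model": "qwen2-vl-test",
    "max_tokens": 1024,
    "messages": [{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe the video."},
            {
                "type": "video_url",
                "video_url": {
                    "url": f"data:video/jpeg;base64,{encode_video_base64(video_inputs)}"
                },
            },
        ],
    }],
}

resp = requests.post(
    "http://10.140.1.149:8000/v1/chat/completions",
    json=payload,
    headers={"Authorization": "Bearer EMPTY"},
)
print(resp.status_code)
print(resp.json())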
