
[Usage]: How to extend the inference context length (e.g. to 128k or 256k) for multi-modality models #11337

Closed
Wiselnn570 opened this issue Dec 19, 2024 · 17 comments
Labels: usage (How to use vllm)

@Wiselnn570

Your current environment

  • environment
    torch: 2.5.0
    vllm: 0.6.3.post2.dev171+g890ca360

  • problem
    I have several A100 GPUs and a Qwen2-VL-7B model. I am trying to extend the model's context length to 128k at inference time, but it runs out of memory (OOM).

I have tried the following (a combined sketch follows this list):

  1. [Usage]: How to use ROPE scaling for llama3.1 and gemma2? #10537: setting max_num_seqs = 1 or 2, which still OOMs.
  2. [Usage]: Can't use vllm on a multiGPU node #10474: setting tensor_parallel_size=2, which hits the same issue and remains unsolved.
  3. [Bug]: Qwen/Qwen2-72B-Instruct 128k server down #5496: setting enable_chunked_prefill=True and max_num_batched_tokens=8192, which OOMs again.
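A minimal sketch of these options combined in the offline LLM API (values are illustrative; each of them still OOMs for me):

from vllm import LLM

# Sketch only: combining the memory-related options from the issues above.
llm = LLM(
    "/mnt/hwfile/mllm/weixilin/cache/Qwen2-VL-7B-Instruct",
    max_model_len=128000,               # target context length
    max_num_seqs=1,                     # limit concurrent sequences (#10537)
    tensor_parallel_size=2,             # shard the model across 2 GPUs (#10474)
    enable_chunked_prefill=True,        # chunk long prefills (#5496)
    max_num_batched_tokens=8192,
    limit_mm_per_prompt={"video": 10},
)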

How would you like to use vllm

I want to run inference with Qwen2-VL-7B at a longer context length (beyond 128k) on my current hardware (several A100 80G GPUs).

@DarkLight1337
Member

Setting tensor_parallel_size is probably the way to go. Can you show your code?

@Wiselnn570
Author

> Setting tensor_parallel_size is probably the way to go. Can you show your code?

@DarkLight1337 Thanks for the quick reply; the script is below.

context_length = 128000
llm = LLM("/mnt/hwfile/mllm/weixilin/cache/Qwen2-VL-7B-Instruct", 
            max_model_len=context_length,
            limit_mm_per_prompt={"video": 10},
            tensor_parallel_size=2
            )

@DarkLight1337
Member

Did you call any CUDA-related functions before initializing vLLM? You should remove those calls if possible.
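For example (illustrative only), code like the following before constructing the LLM initializes CUDA in the parent process and can interfere with vLLM's multi-GPU worker startup:

import torch
from vllm import LLM

# Illustrative anti-pattern: these calls touch CUDA before vLLM is created.
# torch.cuda.set_device(0)
# x = torch.zeros(1, device="cuda")

# Create the engine first, without any prior CUDA initialization.
llm = LLM("Qwen/Qwen2-VL-7B-Instruct", tensor_parallel_size=2)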

@Wiselnn570
Author

> Did you call any CUDA-related functions before initializing vLLM? You should remove those calls if possible.

from transformers import Qwen2VLForConditionalGeneration, AutoTokenizer, AutoProcessor
from qwen_vl_utils import process_vision_info
from PIL import Image
import requests
import torch
from torchvision import io
from typing import Dict
from vllm import LLM, SamplingParams
import tempfile
from petrel_client.client import Client
import copy
from torchvision.transforms.functional import to_pil_image
from torchvision import io, transforms
from torchvision.transforms import InterpolationMode
from transformers import AutoConfig


client = Client()
sampling_params = SamplingParams(
    best_of=1,
    temperature=0.0,
    top_p=1,
    top_k=-1,
    max_tokens=128,
    presence_penalty=0,
    frequency_penalty=0,
)


context_length = 128000


llm = LLM("/mnt/hwfile/mllm/weixilin/cache/Qwen2-VL-7B-Instruct", 
            max_model_len=context_length,
            limit_mm_per_prompt={"video": 10},
            tensor_parallel_size=2
            )

There are no CUDA-related calls, so the problem doesn't seem to be there.

@DarkLight1337
Member

Hmm, can you wrap your code under an if __name__ == "__main__" guard? Perhaps multiprocessing is messing this up.
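i.e. a minimal sketch like:

from vllm import LLM

def main():
    llm = LLM(
        "/mnt/hwfile/mllm/weixilin/cache/Qwen2-VL-7B-Instruct",
        max_model_len=128000,
        limit_mm_per_prompt={"video": 10},
        tensor_parallel_size=2,
    )
    # ... build prompts and call llm.generate(...) here ...

if __name__ == "__main__":
    # Guard so that the worker processes vLLM spawns do not re-run this
    # module-level code when they import the script.
    main()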

@Wiselnn570
Author

Wiselnn570 commented Dec 19, 2024

> Hmm, can you wrap your code under an if __name__ == "__main__" guard? Perhaps multiprocessing is messing this up.

Nothing has changed.

@DarkLight1337
Member

@youkaichao any idea about this?

@youkaichao
Member

What's the error message? We need the full log and environment to know what's happening.

@Wiselnn570
Author

Wiselnn570 commented Dec 20, 2024 via email

@DarkLight1337
Member

For future reference, how did you solve it?

@Wiselnn570
Author

> For future reference, how did you solve it?

I served the model as an OpenAI-compatible API server instead of passing these parameters to LLM() in Python.
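Roughly like this (a sketch of the command; flag spelling may vary across vLLM versions):

vllm serve /mnt/hwfile/mllm/weixilin/cache/Qwen2-VL-7B-Instruct \
    --max-model-len 128000 \
    --tensor-parallel-size 2 \
    --limit-mm-per-prompt video=10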

@Wiselnn570
Author

But I've encountered a new issue. When serving the model as an API server and sending a video, the request fails with:

openai.BadRequestError: Error code: 400 - {'object': 'error', 'message': 'Unknown part type: video_url', 'type': 'BadRequestError', 'param': None, 'code': 400}

I only found the same question at QwenLM/Qwen2.5-VL#539, but there is no solution there yet.

@DarkLight1337
Member

> When registering as an API server

What do you mean exactly by that?

@Wiselnn570
Author

> When registering as an API server
>
> What do you mean exactly by that?

ref: https://docs.vllm.ai/en/latest/serving/openai_compatible_server.html#vllm-serve

@DarkLight1337
Member

Can you show the code you used to send the request?

@Wiselnn570
Author

"""
vllm serve /mnt/hwfile/mllm/weixilin/cache/Qwen2-VL-7B-Instruct --trust-remote-code --tensor-parallel-size=2 --pipeline-parallel-size=1 --served-model-name=qwen2-vl-test --enable-prefix-caching --host 0.0.0.0
"""
import base64
import requests
import time
from tqdm import tqdm
from PIL import Image
from qwen_vl_utils import process_vision_info
from io import BytesIO
import numpy as np
# Function to encode the image
def encode_image(image_path):
  with open(image_path, "rb") as image_file:
    return base64.b64encode(image_file.read()).decode('utf-8')
  
from openai import OpenAI
client = OpenAI(
    api_key="EMPTY",
    base_url="http://10.140.1.149:8000/v1/",
)
messages = [
        {"role": "system", "content": "You are a helpful assistant."},
        {
            "role": "user",
            "content": [
                {
                    "type": "video",
                    "video": "/mnt/hwfile/mllm/zhangpan/share/from/xiaoyi/MLVU/video/6_anomaly_reco/surveil_176.mp4",
                    "min_pixels": 144*28*28,
                    # "max_pixels": 1280 * 28 * 28,
                    "total_pixels": (128000-1024) * 28 * 28,
                    # "nframes": 3,
                    # "total_pixels": 64 * 28 * 28,
                },
                {"type": "text", "text": "Describe the video."},
            ],
        },
    ]
image_inputs, video_inputs = process_vision_info(messages)

model = 'qwen2-vl-test'
video_prompt_pair = [
    ("/mnt/petrelfs/weixilin/projects/MLLM/Qwen2-VL/assets/vis.mp4", "请详细介绍一下这个视频中的内容"),
    ("/mnt/petrelfs/weixilin/projects/MLLM/Qwen2-VL/assets/vis.mp4", "请问视频中人物的性别,以及他在做什么?"),
]
def encode_image_base64(
    image: Image.Image,
    *,
    image_mode: str = "RGB",
    format: str = "JPEG",
) -> str:
    """
    Encode a pillow image to base64 format.

    By default, the image is converted into RGB format before being encoded.
    """
    buffered = BytesIO()
    image = image.convert(image_mode)
    image.save(buffered, format)
    return base64.b64encode(buffered.getvalue()).decode('utf-8')
def encode_video_base64(frames):
    """
    Encode a list of frames (tensors) into a single base64 string.
    """
    base64_frames = []
    frames = frames[0]  # process_vision_info returns a list of videos; take the first one
    for frame in frames:
        # Convert tensor to numpy array (if it's not already)
        if not isinstance(frame, np.ndarray):
            frame = frame.numpy()  # Assuming the tensor is a PyTorch tensor

        if frame.shape[0] == 3:  # Shape is (3, H, W)
            frame = np.transpose(frame, (1, 2, 0))
        
        if frame.dtype != np.uint8:
            # Normalize and convert to uint8
            frame = (frame * 255).clip(0, 255).astype(np.uint8)
        # Encode the frame as base64
        img_base64 = encode_image_base64(Image.fromarray(frame))
        base64_frames.append(img_base64)
    return ",".join(base64_frames)

def encode_image_from_path(image_path):
  with open(image_path, "rb") as image_file:
    return base64.b64encode(image_file.read()).decode('utf-8')
image_path = "/mnt/petrelfs/weixilin/projects/MLLM/Qwen2-VL/assets/images/input_ids_length_distribution_330k.png"
for video_url, prompt in video_prompt_pair:
    chat_completion_from_url = client.chat.completions.create(
        messages=[{
            "role": "user",
            "content": [
                # {
                #     "type": "text",
                #     "text": prompt
                # },
                # {
                {"type": "video_url",
                "video_url": {
                    "url": f"data:video/jpeg;base64,{encode_video_base64(video_inputs)}"
                }
                }
                # {"type": "image_url", "image_url": f"data:image/gif;base64,{encode_image_from_path(image_path)}"
            ],
        }],
        model=model,
        max_tokens=1024,
    )
    result = chat_completion_from_url.choices[0].message.content
    print(f"For video: {video_url}")
    print(f"Prompt: {prompt}")
    print(f"Response: {result}")
    print()

Right now the code only produces a response when I send text; passing a video_url part raises openai.BadRequestError: Error code: 400 - {'object': 'error', 'message': 'Unknown part type: video_url', 'type': 'BadRequestError', 'param': None, 'code': 400}

@DarkLight1337
Member

Can you try using a curl command or the requests library to send the request instead? Maybe the OpenAI Python client is too strict with its type checking.
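Something along these lines, for example (an untested sketch; it reuses encode_video_base64 and video_inputs from your script above, and you would adjust the host, port, and model name to your deployment):

import requests

# Build the same chat payload by hand and POST it directly to the
# OpenAI-compatible endpoint, bypassing the OpenAI Python client.
payload = {
    "model": "qwen2-vl-test",
    "max_tokens": 1024,
    "messages": [{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe the video."},
            {
                "type": "video_url",
                "video_url": {
                    "url": f"data:video/jpeg;base64,{encode_video_base64(video_inputs)}"
                },
            },
        ],
    }],
}

resp = requests.post(
    "http://10.140.1.149:8000/v1/chat/completions",
    json=payload,
    headers={"Authorization": "Bearer EMPTY"},
)
print(resp.status_code)
print(resp.json())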
