[Model]: Llava-Next-Video support #6571

TKONIY · 2024-07-19T10:13:55Z

The model to consider.

LLaVA-NeXT-Video* (LlavaNextVideoForConditionalGeneration)

The closest model vllm already supports.

Llava-Next (LlavaNextForConditionalGeneration)

What's your difficulty of supporting the model you want?

Implement the video processor.
Implement the merging of video embedding and text embedding.
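
To illustrate the second item, here is a minimal sketch of merging video embeddings into the text embedding sequence. It is not the vLLM or Transformers implementation: the function name `merge_video_embeddings`, the placeholder id `VIDEO_TOKEN_ID`, and the use of plain Python lists in place of tensors are all assumptions made for illustration.

```python
from typing import List, Sequence

# Hypothetical placeholder id; the real value comes from the model's tokenizer.
VIDEO_TOKEN_ID = -1


def merge_video_embeddings(
    input_ids: Sequence[int],
    text_embeds: Sequence[Sequence[float]],
    video_embeds: Sequence[Sequence[float]],
    video_token_id: int = VIDEO_TOKEN_ID,
) -> List[List[float]]:
    """Replace the embedding at each video placeholder position with the
    next video embedding, keeping all text positions unchanged."""
    if len(input_ids) != len(text_embeds):
        raise ValueError("input_ids and text_embeds must have the same length")

    # Positions in the prompt that were reserved for video features.
    positions = [i for i, tok in enumerate(input_ids) if tok == video_token_id]
    if len(positions) != len(video_embeds):
        raise ValueError(
            f"{len(positions)} placeholder tokens but "
            f"{len(video_embeds)} video embeddings"
        )

    merged = [list(row) for row in text_embeds]  # copy, do not mutate the input
    for pos, emb in zip(positions, video_embeds):
        merged[pos] = list(emb)
    return merged
```

The length check matters here: if the number of placeholder tokens does not match the number of video embeddings, the merge would otherwise silently misalign text and video features.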

ywang96 · 2024-07-19T16:44:23Z

Do you plan to make a PR for this? FYI, support for multi-image input (which is essentially what video Llava is doing) is indeed on our Q3 roadmap, so it would be great if we could collaborate on the effort.

TKONIY · 2024-07-20T01:59:42Z

Yes, but I haven't finished yet. I am working on it.

TKONIY · 2024-08-14T19:26:22Z

I will make a PR this week. It will support a dynamic number of input frames, which is important but not supported by SGLang.

SGLang requires a num_frames parameter to be set when launching a llava-next-video model, and simply assumes that every input video contains exactly num_frames frames. If an input video has fewer than num_frames frames, its embedding is padded with incorrect values. If it has more than num_frames frames, its embedding is truncated. In both cases the results are unexpected.
In vLLM, by contrast, the embedding length will be calculated from each incoming request, so videos with different numbers of frames are supported.
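
A rough sketch of the per-request calculation described above, under stated assumptions: the defaults (336-pixel frames, 14-pixel patches, pooling stride 2) are illustrative values, and the helper names `tokens_per_frame` and `video_placeholder_count` are hypothetical rather than taken from the PR.

```python
import math


def tokens_per_frame(
    image_size: int = 336,   # assumed frame resolution fed to the vision tower
    patch_size: int = 14,    # assumed ViT patch size
    pool_stride: int = 2,    # assumed spatial pooling stride after the tower
) -> int:
    """Number of embedding positions one frame occupies after pooling."""
    patches_per_side = image_size // patch_size
    pooled_per_side = math.ceil(patches_per_side / pool_stride)
    return pooled_per_side ** 2


def video_placeholder_count(num_frames: int, **frame_kwargs: int) -> int:
    """Embedding length for one video, derived from its own frame count
    instead of a fixed num_frames set at launch time."""
    if num_frames <= 0:
        raise ValueError("a video must contain at least one frame")
    return num_frames * tokens_per_frame(**frame_kwargs)


# Two requests with different frame counts get different embedding lengths.
for frames in (8, 32):
    print(frames, video_placeholder_count(frames))
```

Because the count is computed per request, no padding or truncation is needed when videos of different lengths arrive.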

TKONIY added the new model label Jul 19, 2024

TKONIY mentioned this issue Jul 19, 2024

[New Model]: LLaVA-NeXT-Video support #5124

Closed

DarkLight1337 mentioned this issue Jul 19, 2024

[RFC]: Multi-modality Support on vLLM #4194

Open

86 tasks

TKONIY changed the title ~~[New Model]: Proposal for implementing Llava-Next-Video~~ [RFC]: Proposal for implementing support for video input Aug 15, 2024

TKONIY changed the title ~~[RFC]: Proposal for implementing support for video input~~ [Model]: Llava-Next-Video support Aug 15, 2024

DarkLight1337 mentioned this issue Aug 15, 2024

[model] Support for Llava-Next-Video model #7559

Merged

9 tasks

ywang96 mentioned this issue Aug 21, 2024

[Doc] Section for Multimodal Language Models #7719

Merged

ywang96 mentioned this issue Aug 29, 2024

[Model][VLM] Add Qwen2-VL model support #7905

Merged

youkaichao closed this as completed in #7559 Sep 11, 2024

This was referenced Sep 12, 2024

[New Model]: Adding MiniGPT4_video model #6805

Closed

[New Model]: LLaVA-OneVision #7420

Closed
