[Model]: Llava-Next-Video support #6571

Closed
TKONIY opened this issue Jul 19, 2024 · 3 comments · Fixed by #7559
Labels
new model Requests to new models

Comments

@TKONIY
Contributor

TKONIY commented Jul 19, 2024

The model to consider.

LLaVA-NeXT-Video* (LlavaNextVideoForConditionalGeneration)

The closest model vLLM already supports.

Llava-Next (LlavaNextForConditionalGeneration)

What's your difficulty of supporting the model you want?

  • Implement the video processor.
  • Implement the merging of video embedding and text embedding.
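As a rough illustration of the second item, merging can be pictured as replacing placeholder positions in the text embedding sequence with rows from the video embedding. This is only a sketch in plain Python: `merge_embeddings` and `video_token_id` are hypothetical names, not vLLM or Transformers APIs, and a real implementation would operate on tensors.

```python
from typing import List, Sequence

Vector = List[float]


def merge_embeddings(
    token_ids: Sequence[int],
    text_embeds: Sequence[Vector],
    video_embeds: Sequence[Vector],
    video_token_id: int,  # hypothetical placeholder id, chosen by the caller
) -> List[Vector]:
    """Replace each placeholder position with the next video embedding row."""
    if len(token_ids) != len(text_embeds):
        raise ValueError("token_ids and text_embeds must have the same length")

    num_slots = sum(1 for tok in token_ids if tok == video_token_id)
    if num_slots != len(video_embeds):
        # The number of placeholders must match the video embedding length,
        # otherwise the merge would silently misalign.
        raise ValueError(
            f"{num_slots} placeholder tokens but {len(video_embeds)} video rows"
        )

    video_iter = iter(video_embeds)
    merged: List[Vector] = []
    for tok, emb in zip(token_ids, text_embeds):
        merged.append(next(video_iter) if tok == video_token_id else emb)
    return merged
```

The length check is the important part: the prompt has to contain exactly as many placeholder tokens as the video processor produces embedding rows.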
@ywang96
Member

ywang96 commented Jul 19, 2024

Do you plan to make a PR for this? FYI, support for multi-image input (which is essentially what video LLaVA is doing) is on our Q3 roadmap, so it would be great if we could collaborate on the effort.

@TKONIY
Contributor Author

TKONIY commented Jul 20, 2024

Yes, but I haven't finished yet. I am working on it.

@TKONIY
Contributor Author

TKONIY commented Aug 14, 2024

I will make a PR this week. It will support a dynamic number of input frames, which is important but is not supported by SGLang.

  • SGLang requires a num_frames parameter to be set when launching a llava-next-video model, and simply assumes that every input video contains exactly num_frames frames. If a video has fewer than num_frames frames, its embedding is padded with incorrect values. If a video has more than num_frames frames, its embedding is truncated. In both cases the results are unexpected.

  • Instead, in vLLM, the embedding length will be calculated from each incoming request, so videos with different numbers of frames are supported.
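
To make the difference concrete, here is a minimal sketch of the two approaches. It is not code from SGLang or vLLM: the function names are hypothetical, and `TOKENS_PER_FRAME = 144` is an illustrative assumption rather than a value taken from this thread.

```python
from typing import List, Tuple

# Assumption for illustration only: number of embedding tokens per video frame.
TOKENS_PER_FRAME = 144


def fixed_length_effect(
    input_frames: int,
    num_frames: int,
    tokens_per_frame: int = TOKENS_PER_FRAME,
) -> Tuple[int, int, int]:
    """Return (kept, padded, dropped) token counts when the embedding length
    is fixed to num_frames at launch time."""
    kept = min(input_frames, num_frames) * tokens_per_frame
    padded = max(num_frames - input_frames, 0) * tokens_per_frame
    dropped = max(input_frames - num_frames, 0) * tokens_per_frame
    return kept, padded, dropped


def per_request_lengths(
    frame_counts: List[int],
    tokens_per_frame: int = TOKENS_PER_FRAME,
) -> List[int]:
    """Compute the embedding length of each request from its own frame count,
    so nothing is padded or truncated."""
    return [n * tokens_per_frame for n in frame_counts]
```

With a fixed `num_frames` of 16, an 8-frame video would get as many padded tokens as real ones, and a 32-frame video would lose half of its tokens. Computing the length per request avoids both cases.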

@TKONIY TKONIY changed the title [New Model]: Proposal for implementing Llava-Next-Video [RFC]: Proposal for implementing support for video input Aug 15, 2024
@TKONIY TKONIY changed the title [RFC]: Proposal for implementing support for video input [Model]: Llava-Next-Video support Aug 15, 2024