[Feature]: phi-3.5 is a strong model for its size, including vision support. Has multi-image support, but vllm does not support #7740

pseudotensor · 2024-08-21T15:23:37Z

🚀 The feature, motivation and pitch

phi-3.5 is a strong model for its size, including strong multi-image vision support. But vllm does not support the multi-image case.

vllm/vllm/model_executor/models/phi3v.py

Lines 421 to 425 in 03b7bfb

    
           elif len(re.findall(r"(<\|image_\d+\|>)+", prompt)) > 1: 
        
               logger.warning("Multiple image input is not supported yet, " 
        
                              "so any extra image tokens will be treated " 
        
                              "as plain text.")

Alternatives

Only other models

Additional context

No response

DarkLight1337 · 2024-08-21T15:29:19Z

@Isotr0py are you interested in implementing this?

Isotr0py · 2024-08-21T16:17:31Z

Of course, I'm just working on implementing this feature.
I will create a PR once it's nearly to be finished.

pseudotensor added the feature request label Aug 21, 2024

pseudotensor mentioned this issue Aug 21, 2024

[Model] Fix Phi-3.5-vision-instruct 'num_crops' issue #7710

Merged

DarkLight1337 mentioned this issue Aug 21, 2024

[RFC]: Multi-modality Support Refactoring #4194

Open

55 tasks

Isotr0py mentioned this issue Aug 22, 2024

[Model][VLM] Support multi-images inputs for Phi-3-vision models #7783

Merged

2 tasks

DarkLight1337 closed this as completed in #7783 Aug 25, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Feature]: phi-3.5 is a strong model for its size, including vision support. Has multi-image support, but vllm does not support #7740

[Feature]: phi-3.5 is a strong model for its size, including vision support. Has multi-image support, but vllm does not support #7740

pseudotensor commented Aug 21, 2024

DarkLight1337 commented Aug 21, 2024

Isotr0py commented Aug 21, 2024

[Feature]: phi-3.5 is a strong model for its size, including vision support. Has multi-image support, but vllm does not support #7740

[Feature]: phi-3.5 is a strong model for its size, including vision support. Has multi-image support, but vllm does not support #7740

Comments

pseudotensor commented Aug 21, 2024

🚀 The feature, motivation and pitch

Alternatives

Additional context

DarkLight1337 commented Aug 21, 2024

Isotr0py commented Aug 21, 2024