[Core] Support image processor #4197

Merged: 130 commits, Jun 3, 2024

DarkLight1337 (Member) commented Apr 19, 2024

I have implemented a plugin architecture (MultiModalPlugin) over MultiModalData that defines how each modality type is preprocessed before being passed to the model as keyword arguments. This preserves the contract between the output of the HuggingFace processor and the input to the HuggingFace model. As long as those keyword arguments do not conflict with the ones vLLM already uses, I think this is a good way to make the framework flexible enough to support other multi-modal architectures.

FIX #4054 (the data is now automatically moved to the model's device)
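
To make the plugin idea above concrete, here is a minimal, dependency-free sketch of one way such a registry could look. It is not the code in this PR: `PluginRegistry`, `ImagePixelPlugin`, the method names, and the `RESERVED_KWARGS` set are all hypothetical, chosen only to illustrate "one plugin per data type, output is a dict of model keyword arguments, and those keywords must not clash with the ones the engine already passes".

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, Type


class MultiModalData:
    """Base class for per-modality inputs (class name taken from the PR text)."""


class ImagePixelData(MultiModalData):
    def __init__(self, image: Any) -> None:
        self.image = image


class MultiModalPlugin(ABC):
    """Hypothetical interface: converts one data type into model kwargs."""

    @abstractmethod
    def get_data_type(self) -> Type[MultiModalData]:
        ...

    @abstractmethod
    def process(self, data: MultiModalData) -> Dict[str, Any]:
        ...


class ImagePixelPlugin(MultiModalPlugin):
    def get_data_type(self) -> Type[MultiModalData]:
        return ImagePixelData

    def process(self, data: MultiModalData) -> Dict[str, Any]:
        # A real plugin would run an image processor here; this one just
        # forwards the image under the keyword the model expects.
        return {"pixel_values": data.image}


# Assumption: keyword names the engine already passes to the model.
RESERVED_KWARGS = {"input_ids", "positions", "kv_caches"}


class PluginRegistry:
    """Hypothetical registry mapping each data type to its plugin."""

    def __init__(self) -> None:
        self._plugins: Dict[Type[MultiModalData], MultiModalPlugin] = {}

    def register(self, plugin: MultiModalPlugin) -> None:
        self._plugins[plugin.get_data_type()] = plugin

    def to_model_kwargs(self, data: MultiModalData) -> Dict[str, Any]:
        plugin = self._plugins.get(type(data))
        if plugin is None:
            raise NotImplementedError(f"No plugin for {type(data).__name__}")
        kwargs = plugin.process(data)
        clash = RESERVED_KWARGS.intersection(kwargs)
        if clash:
            raise ValueError(f"Plugin kwargs clash with engine kwargs: {clash}")
        return kwargs
```

Usage would be along the lines of `registry.register(ImagePixelPlugin())` followed by `registry.to_model_kwargs(ImagePixelData(image))`, with the returned dict splatted into the model's forward call.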

Related Contributions

This PR is part of #3978.

This PR also implements Proposals 1 and 3 of #4194.

Features

  1. Refactor of MultiModalData
    • Image input is now split into two subclasses:
      • ImageFeatureData represents the image features of LLaVA after being passed through the vision tower, but before the multi-modal projection is applied.
      • ImagePixelData represents the raw image (as a PIL.Image). AutoImageProcessor from HuggingFace is loaded from config.json and pre-processes input images before they are passed to the model as pixel_values. As with the tokenizer, you can override the default and specify the image processor version via EngineConfig. You can also disable image preprocessing altogether, which is useful for passing in images that have already been preprocessed.
    • The LLaVA implementation has been updated accordingly to accept the new inputs.
  2. A new documentation page for using VLMs can be found under (dev/multimodal).
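
As a rough illustration of the two image input types described in item 1, the sketch below shows how each could be turned into model keyword arguments, including the "preprocessing disabled" case. It is an assumption-laden sketch, not the PR's implementation: the function `image_to_model_kwargs`, the `image_features` keyword, and the use of a plain callable in place of a real HuggingFace image processor are all hypothetical.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, Optional, Union


@dataclass
class ImagePixelData:
    # A raw image (PIL.Image in the PR), or an already preprocessed
    # tensor when image preprocessing is disabled.
    image: Any


@dataclass
class ImageFeatureData:
    # Vision-tower output, before the multi-modal projection is applied.
    image_features: Any


# Stand-in for an image processor: takes an image, returns model kwargs.
ImageProcessor = Callable[[Any], Dict[str, Any]]


def image_to_model_kwargs(
    data: Union[ImagePixelData, ImageFeatureData],
    image_processor: Optional[ImageProcessor] = None,
) -> Dict[str, Any]:
    """Map an image input to the keyword arguments passed to the model."""
    if isinstance(data, ImageFeatureData):
        # Features skip pixel preprocessing entirely.
        return {"image_features": data.image_features}
    if isinstance(data, ImagePixelData):
        if image_processor is None:
            # Preprocessing disabled: the caller supplies ready pixel values.
            return {"pixel_values": data.image}
        return image_processor(data.image)
    raise TypeError(f"Unsupported image input: {type(data).__name__}")
```

Passing `image_processor=None` here plays the role of disabling image preprocessing; in that case the caller is responsible for supplying data in the format the model expects.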

Compatibility Changes

  • pillow will be promoted from a dev dependency to a common dependency, since it is needed to process the images.

- Also add docs for basic VLM usage
- Other data types may need to be of different dtype from that of the model
DarkLight1337 (Member, Author) commented Apr 19, 2024

The LLaVA test passes on my end (with both outputs matching the HF output shown in CI). Does anyone have a clue what might cause it to fail in CI? Perhaps a case of floating-point error in GPU computation?

DarkLight1337 force-pushed the mm-data-processor branch 3 times, most recently from a92952e to 2d57f27 (April 22, 2024 11:40)
DarkLight1337 force-pushed the mm-data-processor branch 7 times, most recently from acc378d to b60e5f8 (April 22, 2024 15:01)
ywang96 (Member) commented May 31, 2024

Per offline discussion - waiting for #5118 to be merged first.

ywang96 (Member) commented Jun 3, 2024

@DarkLight1337 Could you resolve the merge conflicts? Once that's done I think this PR is ready to merge.

ywang96 (Member) left a comment

I did a final pass and left a note, but everything LGTM! Thank you for the hard work on this. @DarkLight1337

vllm/model_executor/models/llava.py (resolved)
@ywang96 ywang96 enabled auto-merge (squash) June 3, 2024 03:25
@zhuohan123 zhuohan123 disabled auto-merge June 3, 2024 05:56
@zhuohan123 zhuohan123 merged commit 7a64d24 into vllm-project:main Jun 3, 2024
63 of 65 checks passed
@DarkLight1337 DarkLight1337 deleted the mm-data-processor branch June 3, 2024 05:58
blinkbear pushed a commit to blinkbear/vllm that referenced this pull request Jun 6, 2024
robertgshaw2-neuralmagic pushed a commit to neuralmagic/nm-vllm that referenced this pull request Jun 11, 2024
joerunde pushed a commit to joerunde/vllm that referenced this pull request Jun 17, 2024
xjpang pushed a commit to xjpang/vllm that referenced this pull request Jun 27, 2024
xjpang pushed a commit to xjpang/vllm that referenced this pull request Jul 8, 2024
xjpang pushed a commit to xjpang/vllm that referenced this pull request Jul 24, 2024
Temirulan pushed a commit to Temirulan/vllm-whisper that referenced this pull request Sep 6, 2024
Development

Successfully merging this pull request may close these issues.

[Bug]: Incorrect Data Type Conversion for MultiModalData
3 participants