Fix Llava-NeXT / Llava-NeXT Video / Llava-OneVision's token unpadding mismatch #35779
Conversation
@zucchini-nlp I got the same issue: `ValueError: Image features and image tokens do not match: tokens: 4589, features 4588` when using Llava-v1.6-vicuna-7b-hf (a Llava-NeXT model) on transformers 4.47.0. I followed PR #35779 and modified processing_llava_next.py as shown in the following code, but it doesn't work. Is there any difference between Llava-NeXT and Llava-OneVision? I thought they might share the same issue. `original_aspect_ratio = width / height`
Great, thanks a lot for fixing!
Unpadding is having too many precision-related bugs 🥲 Can you propagate the changes to llava-next and llava-next-video, which use the same pad-unpad method? And then add `# Copied from transformers.models.llava_next.processing_llava_next._get_unpadded_features` in the other models, so it always stays the same, since we already have the model side copied and identical.
There have been some other fixes (e.g. #34779 for #34625) for this problem integrated in
Right, it cannot be 100% copied from llava-next. Okay, you can then leave "Adapted from ..." so that anyone reading the code can refer to llava-next as the source of truth.
Thanks, LGTM!
Is there anything else I need to do? @zucchini-nlp
Sorry, yeah, let's merge it. I don't think we need one more review for this.
Quick question: why do we have a 7 hardcoded? A comment would be nice to know why it's seven and not 12983.
Yeah, it's hardcoded; it comes from an old issue, AFAIR, when the rounding happened at the wrong decimal and we needed more precision. So 7 was an arbitrarily large number to keep precision. The same rounding is present in the modeling code, and the idea is to keep the model and processor code identical to prevent precision errors.
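The precision point above can be made concrete with a toy example (the numbers are made up, not taken from any real checkpoint): dividing two integers and multiplying back can land just below the exact integer result, so a plain `int()` truncation drops a whole unit, while rounding to 7 decimals first recovers it.

```python
# Toy illustration of the unpadding precision issue (hypothetical sizes).
height, width = 100, 100   # original image size (made up)
current_width = 29         # target width after scaling (made up)

scale = current_width / width   # exactly 0.29 mathematically, stored as 0.28999999999999998...
product = height * scale        # 28.999999999999996 instead of 29.0

print(int(product))             # 28  <- truncation loses a whole row of tokens
print(int(round(product, 7)))   # 29  <- rounding to 7 decimals first matches the exact result
```

This is exactly the kind of off-by-one that surfaces as "tokens: 4589, features 4588".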
… mismatch (huggingface#35779)
* Fix Llava OneVision's token padding
* Fix Llava next and Llava next video's token unpadding for consistency
Hey @sheryc @zucchini-nlp I'm still getting this error even after installing the latest version (4.49.0.dev0) by running
The mismatch between the number of features and tokens is always very small, which makes me think this is still related to these rounding issues somehow. I'm trying to finetune either. Not sure if this helps, but here is some pseudo-code that describes the prompt/image formatting and tokenization process that I'm performing. Images are PIL objects.

```python
messages = [
    {
        "role": "user",
        "content": [
            {"index": 0, "text": None, "type": "image"},
            {"index": 1, "text": None, "type": "image"},
            {"index": None, "text": instruction, "type": "text"},
        ],
    },
    {
        "role": "assistant",
        "content": [
            {"index": None, "text": label, "type": "text"},
        ],
    },
]
prompt = self.processor.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=False,
)
inputs = self.processor(
    text=prompt,
    images=[image1, image2],
    padding=True,
    return_tensors="pt",
)
```
Hi @baraujo98, the models you used are Llava models. This PR only fixes the problem for Llava-NeXT, Llava-NeXT Video, and Llava-OneVision, whose processing is a bit different from Llava's. I think it's worth a new issue? Also, could you add a minimal code snippet to reproduce the error?
@sheryc I believe that I will try to write a minimal example nonetheless.
Yes you're right, sorry |
@baraujo98 right, I see the config on the hub wasn't updated with the
Thank you very much for the tip @zucchini-nlp. Unsloth doesn't allow me to load revisions of the model for some reason, but I was able to fix the issue for now by running. I opened a similar PR for the second model I mentioned: https://huggingface.co/unsloth/llava-v1.6-mistral-7b-hf-bnb-4bit/discussions/1
What does this PR do?
Fixes #35775 and the further discussions in #34625.
This PR fixes the vision feature / vision token mismatch issue for LLaVA-OneVision models. Specifically, the unpadded-token calculation in the Processor and in the model differs by a rounding function, where a precision issue can create a dimension mismatch. The PR adds the rounding function to the Processor so its behavior matches LlavaOnevision's and LlavaNext's model-side behavior.
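For context, the model-side counterpart that the processor has to mirror is the unpadding of the vision feature map. Below is a sketch adapted from LlavaNext's `unpad_image` (shown with numpy instead of torch so it runs standalone; the real implementation slices a torch tensor the same way):

```python
import numpy as np

def unpad_image(tensor, original_size):
    # Sketch adapted from transformers.models.llava_next.modeling_llava_next.unpad_image.
    original_height, original_width = original_size
    current_height, current_width = tensor.shape[1:]

    original_aspect_ratio = original_width / original_height
    current_aspect_ratio = current_width / current_height

    if original_aspect_ratio > current_aspect_ratio:
        # Padding was added top/bottom: compute the unpadded height.
        scale_factor = current_width / original_width
        new_height = int(round(original_height * scale_factor, 7))  # same rounding as the processor
        padding = (current_height - new_height) // 2
        tensor = tensor[:, padding : current_height - padding]
    else:
        # Padding was added left/right: compute the unpadded width.
        scale_factor = current_height / original_height
        new_width = int(round(original_width * scale_factor, 7))
        padding = (current_width - new_width) // 2
        tensor = tensor[:, :, padding : current_width - padding]
    return tensor
```

Because the same `int(round(..., 7))` appears here and in the processor, both sides count the same number of unpadded positions.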
Before submitting
- Did you read the contributor guideline, Pull Request section?
- Was this discussed/approved via a GitHub issue? Please add a link to it if that's the case. (LLaVA-OneVision image features and image tokens mismatch #35775)
- Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings. (No new functions introduced.)
Who can review?
@zucchini-nlp
Edit: updated the title to reflect the latest changes in this PR.