
Fix Llava-NeXT / Llava-NeXT Video / Llava-OneVision's token unpadding mismatch #35779

Merged: 2 commits into huggingface:main, Jan 24, 2025

Conversation

@sheryc (Contributor) commented Jan 19, 2025

What does this PR do?

Fixes #35775 and addresses the follow-up discussion in #34625.

This PR fixes the vision feature / vision token mismatch issue for LLaVA OneVision models. Specifically, the unpadded-token calculation in the Processor and in the model differs by a rounding function, so a floating-point precision issue can create a dimension mismatch. This PR adds the rounding function to the Processor so that it matches the LlavaOnevision and LlavaNext modeling code.
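
In short, the processor now rounds before truncating, the same way the modeling code does. A minimal sketch of the change (the actual diff lives in the processors' _get_unpadded_features helper; the "before" line paraphrases the pre-fix behavior rather than quoting it):

# Before the fix: the processor truncated the float directly,
# while the modeling code rounded to 7 decimal places first.
new_height = int(height * current_width / width)

# After the fix: round first, so the processor and the model
# count the same number of unpadded image tokens.
new_height = int(round(height * current_width / width, 7))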

Before submitting

Who can review?

@zucchini-nlp

Edit: updated the title to reflect the latest changes in this PR.

@Happy-Corpse commented

@zucchini-nlp I got the same issue: ValueError: Image features and image tokens do not match: tokens: 4589, features 4588 when using Llava-v1.6-vicuna-7b-hf (a llava-next model) with transformers version 4.47.0. I followed PR #35779 and modified processing_llava_next.py as shown in the following code, but it doesn't work. Is there any difference between llava-next and llava-onevision? I thought they might share the same issue.

original_aspect_ratio = width / height
current_aspect_ratio = current_width / current_height
if original_aspect_ratio > current_aspect_ratio:
    new_height = int(round(height * current_width / width, 7))
    padding = (current_height - new_height) // 2
    current_height -= padding * 2
else:
    new_width = int(round(width * current_height / height, 7))
    padding = (current_width - new_width) // 2
    current_width -= padding * 2

@zucchini-nlp (Member) left a comment


Great, thanks a lot for fixing!

Unpadding is having too many precision-related bugs 🥲 Can you propagate the changes to llava-next and llava-next-video, which use the same pad-unpad method? And then add # Copied from transformers.models.llava_next.processing_llava_next._get_unpadded_features in the other models so it always stays the same, since we already have the model side copied and identical.
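
For reference, this repository convention places the comment directly above the duplicated helper in the other processors. An illustrative sketch only (the parameter list shown here is approximate, not copied from the library):

# processing_llava_next_video.py (and any other processor duplicating the helper)
# Copied from transformers.models.llava_next.processing_llava_next._get_unpadded_features
def _get_unpadded_features(self, height, width, patches_height, patches_width, scale_height, scale_width):
    ...  # body kept byte-identical to the llava-next version so that make fix-copies stays happy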

@sheryc (Contributor, Author) commented Jan 20, 2025

Great, thanks a lot for fixing!

Unpadding is having too many precision-related bugs 🥲 Can you propagate the changes to llava-next and llava-next-video, which use the same pad-unpad method? And then add # Copied from transformers.models.llava_next.processing_llava_next._get_unpadded_features in the other models so it always stays the same, since we already have the model side copied and identical.

processing_llava_onevision has some added processing regarding the ratio for interpolation, so after I add this "Copied from" comment, the makefile prompts me to fix copies. Should I ignore it or change that comment for processing_llava_onevision? @zucchini-nlp

@sheryc (Contributor, Author) commented Jan 20, 2025

@zucchini-nlp I got the same issue: ValueError: Image features and image tokens do not match: tokens: 4589, features 4588 when using Llava-v1.6-vicuna-7b-hf (a llava-next model) with transformers version 4.47.0. I followed PR #35779 and modified processing_llava_next.py as shown in the following code, but it doesn't work. Is there any difference between llava-next and llava-onevision? I thought they might share the same issue.

original_aspect_ratio = width / height
current_aspect_ratio = current_width / current_height
if original_aspect_ratio > current_aspect_ratio:
    new_height = int(round(height * current_width / width, 7))
    padding = (current_height - new_height) // 2
    current_height -= padding * 2
else:
    new_width = int(round(width * current_height / height, 7))
    padding = (current_width - new_width) // 2
    current_width -= padding * 2

There have been other fixes for this problem (e.g. #34779 for #34625) integrated in transformers v4.47.1. Could you try upgrading first? If the issue persists after the upgrade and the new code change, could you please provide a code snippet to reproduce it? @Happy-Corpse

@sheryc changed the title from "Fix Llava OneVision's vision token padding" to "Fix Llava-NeXT / Llava-NeXT Video / Llava-OneVision's token unpadding mismatch" on Jan 20, 2025
@zucchini-nlp (Member) commented

processing_llava_onevision has some added processing regarding the ratio for interpolation,

Right, it cannot be 100% copied from llava-next. OK, you can then leave "Adapted from ..." so that anyone reading the code can refer to llava-next as the source of truth.

@qubvel removed their request for review on January 20, 2025 at 16:01
@sheryc force-pushed the fix-llava-onevision-padding branch from 652bc33 to b78c385 on January 20, 2025 at 16:11
@sheryc requested a review from @zucchini-nlp on January 20, 2025 at 16:13
@zucchini-nlp (Member) left a comment


Thanks, LGTM!

@sheryc (Contributor, Author) commented Jan 24, 2025

Is there anything else I need to do? @zucchini-nlp

@zucchini-nlp (Member) commented

Sorry, yeah, let's merge it. I don't think we need one more review for this.

@zucchini-nlp merged commit 72d1a4c into huggingface:main on Jan 24, 2025 (10 checks passed)
@ArthurZucker (Collaborator) commented

Quick question: why do we have a 7 hardcoded? A comment would be nice, to explain why it's seven and not 12983.

@zucchini-nlp (Member) commented

Yeah, it's hardcoded; it comes from an old issue, AFAIR, where the rounding happened at the wrong decimal and we needed more precision. So 7 was an arbitrarily large number chosen to keep precision. The same rounding is present in the modeling code, and the idea is to keep the model and processor code identical to prevent precision errors.
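
To make the choice of 7 concrete, here is an illustrative example (synthetic numbers, not taken from the original issue): float error can leave the product a hair below an integer, and int() then drops a whole row or column of patches, while rounding to 7 decimal places absorbs noise of that size before truncation:

x = 252 - 1e-13          # roughly what height * current_width / width can evaluate to
print(int(x))            # 251: plain truncation loses one unit
print(int(round(x, 7)))  # 252: rounding to 7 decimals snaps back before truncating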

@sheryc deleted the fix-llava-onevision-padding branch on January 24, 2025 at 14:38
bursteratom pushed a commit to bursteratom/transformers that referenced this pull request Jan 31, 2025
… mismatch (huggingface#35779)

* Fix Llava OneVision's token padding

* Fix Llava next and Llava next video's token unpadding for consistency
@baraujo98 commented Feb 10, 2025

Hey @sheryc @zucchini-nlp, I'm still getting this error even after installing the latest version (4.49.0.dev0) by running pip install git+https://github.com/huggingface/transformers.git. The error is the following:

ValueError: Image features and image tokens do not match: tokens: 6552, features 6556

The mismatch between the number of features and the number of tokens is always very small, which makes me think this is still somehow related to these rounding issues.

I'm trying to finetune either unsloth/llava-1.5-7b-hf-bnb-4bit or unsloth/llava-v1.6-mistral-7b-hf-bnb-4bit through Unsloth, to perform binary classification based on a list of 2 images.

Not sure if this helps, but here is some pseudo-code that describes the prompt/image formatting and tokenization process that I'm performing. Images are PIL objects.

messages = [
  {
    "role": "user",
    "content": [
      {"index": 0, "text": None, "type": "image"},
      {"index": 1, "text": None, "type": "image"},
      {"index": None, "text": instruction, "type": "text"},
    ]
  },
  {
    "role": "assistant",
    "content": [
      {"index": None, "text": label, "type": "text"},
    ]
  }
]

prompt = self.processor.apply_chat_template(
    messages,
    tokenize = False,
    add_generation_prompt = False,
)

inputs = self.processor(
    text = prompt,
    images  = [image1, image2],
    padding = True,
    return_tensors = "pt",
)

@sheryc (Contributor, Author) commented Feb 10, 2025

Hi @baraujo98, the models you used are Llava models. This PR only fixes the problem for Llava-NeXT, Llava-NeXT Video, and Llava-OneVision, whose processing is a bit different from Llava's. I think this is worth a new issue. Also, could you add a minimal code snippet to reproduce the error?

@baraujo98 commented

@sheryc I believe that unsloth/llava-v1.6-mistral-7b-hf-bnb-4bit is a LLaVa-NeXT model and that it uses the transformers' LLaVa-NeXT scripts.

I will try to write a minimal example nonetheless.

@sheryc (Contributor, Author) commented Feb 10, 2025

@sheryc I believe that unsloth/llava-v1.6-mistral-7b-hf-bnb-4bit is a LLaVa-NeXT model and that it uses the transformers' LLaVa-NeXT scripts.

I will try to write a minimal example nonetheless.

Yes you're right, sorry

@zucchini-nlp (Member) commented

@baraujo98 right, I see the config on the hub wasn't updated with the additional_tokens. I opened a PR here (https://huggingface.co/unsloth/llava-1.5-7b-hf-bnb-4bit/discussions/1). Can you check if adding the attribute solves the issue?

@baraujo98 commented

Thank you very much for the tip @zucchini-nlp. Unsloth doesn't allow me to load revisions of the model for some reason, but I was able to fix the issue for now by running tokenizer.num_additional_image_tokens = 1 after loading the pretrained model. It is now able to train without throwing that error.

I opened a similar PR for the second model I mentioned: https://huggingface.co/unsloth/llava-v1.6-mistral-7b-hf-bnb-4bit/discussions/1
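
For anyone hitting the same error, a minimal sketch of that workaround with plain Transformers (not Unsloth's loader; the attribute name and value mirror the hub discussions linked above):

from transformers import LlavaNextProcessor

# The hub config was missing num_additional_image_tokens, so set it after loading;
# 1 is the value proposed in the linked hub PRs.
processor = LlavaNextProcessor.from_pretrained("unsloth/llava-v1.6-mistral-7b-hf-bnb-4bit")
processor.num_additional_image_tokens = 1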


Successfully merging this pull request may close these issues.

LLaVA-OneVision image features and image tokens mismatch
5 participants