
Fix Llava-NeXT / Llava-NeXT Video / Llava-OneVision's token unpadding mismatch #35779

Merged: 2 commits into huggingface:main, Jan 24, 2025

Conversation

@sheryc (Contributor) commented Jan 19, 2025

What does this PR do?

Fixes #35775 and addresses the follow-up discussion in #34625.

This PR fixes the vision feature / vision token mismatch issue for LLaVA OneVision models. Specifically, the unpadded-token calculation in the Processor and in the model differs by a rounding function, so a floating-point precision issue can create a dimension mismatch. This PR adds the rounding function to the Processor so that it matches the LlavaOnevision and LlavaNext modeling code.
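
In short, the processor now rounds before truncating, the same way the modeling code does. A minimal sketch of the change (the actual diff lives in the processors' _get_unpadded_features helper; the "before" line paraphrases the pre-fix behavior rather than quoting it):

# Before the fix: the processor truncated the float directly,
# while the modeling code rounded to 7 decimal places first.
new_height = int(height * current_width / width)

# After the fix: round first, so the processor and the model
# count the same number of unpadded image tokens.
new_height = int(round(height * current_width / width, 7))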

Before submitting

Who can review?

@zucchini-nlp

Edit: updated the title to reflect the latest changes in this PR.

@Happy-Corpse commented

@zucchini-nlp I got the same issue: ValueError: Image features and image tokens do not match: tokens: 4589, features 4588 when using Llava-v1.6-vicuna-7b-hf (a llava-next model) with transformers version 4.47.0. I followed PR #35779 and modified processing_llava_next.py as shown in the following code, but it doesn't work. Is there any difference between llava-next and llava-onevision? I thought they might share the same issue.

original_aspect_ratio = width / height
current_aspect_ratio = current_width / current_height
if original_aspect_ratio > current_aspect_ratio:
    new_height = int(round(height * current_width / width, 7))
    padding = (current_height - new_height) // 2
    current_height -= padding * 2
else:
    new_width = int(round(width * current_height / height, 7))
    padding = (current_width - new_width) // 2
    current_width -= padding * 2

@zucchini-nlp (Member) left a comment


Great, thanks a lot for fixing!

Unpadding is having too many precision-related bugs 🥲 Can you propagate the changes to llava-next and llava-next-video, which use the same pad-unpad method? And then add # Copied from transformers.models.llava_next.processing_llava_next._get_unpadded_features in the other models so it always stays the same, since we already have the model side copied and identical.
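
For reference, this repository convention places the comment directly above the duplicated helper in the other processors. An illustrative sketch only (the parameter list shown here is approximate, not copied from the library):

# processing_llava_next_video.py (and any other processor duplicating the helper)
# Copied from transformers.models.llava_next.processing_llava_next._get_unpadded_features
def _get_unpadded_features(self, height, width, patches_height, patches_width, scale_height, scale_width):
    ...  # body kept byte-identical to the llava-next version so that make fix-copies stays happy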

@sheryc (Contributor, Author) commented Jan 20, 2025

Great, thanks a lot for fixing!

Unpadding is having too many precision-related bugs 🥲 Can you propagate the changes to llava-next and llava-next-video, which use the same pad-unpad method? And then add # Copied from transformers.models.llava_next.processing_llava_next._get_unpadded_features in the other models so it always stays the same, since we already have the model side copied and identical.

processing_llava_onevision has some added processing regarding the ratio for interpolation, so after I add this "Copied from" comment, the makefile prompts me to fix copies. Should I ignore it or change that comment for processing_llava_onevision? @zucchini-nlp

@sheryc (Contributor, Author) commented Jan 20, 2025

@zucchini-nlp I got the same issue: ValueError: Image features and image tokens do not match: tokens: 4589, features 4588 when using Llava-v1.6-vicuna-7b-hf (a llava-next model) with transformers version 4.47.0. I followed PR #35779 and modified processing_llava_next.py as shown in the following code, but it doesn't work. Is there any difference between llava-next and llava-onevision? I thought they might share the same issue.

original_aspect_ratio = width / height
current_aspect_ratio = current_width / current_height
if original_aspect_ratio > current_aspect_ratio:
    new_height = int(round(height * current_width / width, 7))
    padding = (current_height - new_height) // 2
    current_height -= padding * 2
else:
    new_width = int(round(width * current_height / height, 7))
    padding = (current_width - new_width) // 2
    current_width -= padding * 2

There have been other fixes for this problem (e.g. #34779 for #34625) integrated in transformers v4.47.1. Could you try upgrading first? If the issue persists after the upgrade and the new code change, could you please provide a code snippet to reproduce it? @Happy-Corpse

@sheryc changed the title from "Fix Llava OneVision's vision token padding" to "Fix Llava-NeXT / Llava-NeXT Video / Llava-OneVision's token unpadding mismatch" on Jan 20, 2025
@zucchini-nlp (Member) commented

processing_llava_onevision has some added processing regarding the ratio for interpolation,

Right, it cannot be 100% copied from llava-next. OK, you can then leave "Adapted from ..." so that anyone reading the code can refer to llava-next as the source of truth.

@qubvel removed their request for review on January 20, 2025 at 16:01
@sheryc force-pushed the fix-llava-onevision-padding branch from 652bc33 to b78c385 on January 20, 2025 at 16:11
@sheryc requested a review from @zucchini-nlp on January 20, 2025 at 16:13
@zucchini-nlp (Member) left a comment


Thanks, LGTM!

@sheryc (Contributor, Author) commented Jan 24, 2025

Is there anything else I need to do? @zucchini-nlp

@zucchini-nlp (Member) commented

Sorry, yeah, let's merge it. I don't think we need one more review for this.

@zucchini-nlp merged commit 72d1a4c into huggingface:main on Jan 24, 2025 (10 checks passed)
@ArthurZucker (Collaborator) commented

Quick question: why do we have a 7 hardcoded? A comment would be nice, to explain why it's seven and not 12983.

@zucchini-nlp (Member) commented

Yeah, it's hardcoded; it comes from an old issue, AFAIR, where the rounding happened at the wrong decimal and we needed more precision. So 7 was an arbitrarily large number chosen to keep precision. The same rounding is present in the modeling code, and the idea is to keep the model and processor code identical to prevent precision errors.
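
To make the choice of 7 concrete, here is an illustrative example (synthetic numbers, not taken from the original issue): float error can leave the product a hair below an integer, and int() then drops a whole row or column of patches, while rounding to 7 decimal places absorbs noise of that size before truncation:

x = 252 - 1e-13          # roughly what height * current_width / width can evaluate to
print(int(x))            # 251: plain truncation loses one unit
print(int(round(x, 7)))  # 252: rounding to 7 decimals snaps back before truncating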

@sheryc deleted the fix-llava-onevision-padding branch on January 24, 2025 at 14:38
bursteratom pushed a commit to bursteratom/transformers that referenced this pull request Jan 31, 2025
… mismatch (huggingface#35779)

* Fix Llava OneVision's token padding

* Fix Llava next and Llava next video's token unpadding for consistency
@baraujo98 commented Feb 10, 2025

Hey @sheryc @zucchini-nlp, I'm still getting this error even after installing the latest version (4.49.0.dev0) by running pip install git+https://github.com/huggingface/transformers.git. The error is the following:

ValueError: Image features and image tokens do not match: tokens: 6552, features 6556

The mismatch between the number of features and the number of tokens is always very small, which makes me think this is still somehow related to these rounding issues.

I'm trying to finetune either unsloth/llava-1.5-7b-hf-bnb-4bit or unsloth/llava-v1.6-mistral-7b-hf-bnb-4bit through Unsloth, to perform binary classification based on a list of 2 images.

Not sure if this helps, but here is some pseudo-code that describes the prompt/image formatting and tokenization process that I'm performing. Images are PIL objects.

messages = [
  {
    "role": "user",
    "content": [
      {"index": 0, "text": None, "type": "image"},
      {"index": 1, "text": None, "type": "image"},
      {"index": None, "text": instruction, "type": "text"},
    ]
  },
  {
    "role": "assistant",
    "content": [
      {"index": None, "text": label, "type": "text"},
    ]
  }
]

prompt = self.processor.apply_chat_template(
    messages,
    tokenize = False,
    add_generation_prompt = False,
)

inputs = self.processor(
    text = prompt,
    images  = [image1, image2],
    padding = True,
    return_tensors = "pt",
)

@sheryc (Contributor, Author) commented Feb 10, 2025

Hi @baraujo98, the models you used are Llava models. This PR only fixes the problem for Llava-NeXT, Llava-NeXT Video, and Llava-OneVision, whose processing is a bit different from Llava's. I think this is worth a new issue. Also, could you add a minimal code snippet to reproduce the error?

@baraujo98 commented

@sheryc I believe that unsloth/llava-v1.6-mistral-7b-hf-bnb-4bit is a LLaVa-NeXT model and that it uses the transformers' LLaVa-NeXT scripts.

I will try to write a minimal example nonetheless.

@sheryc (Contributor, Author) commented Feb 10, 2025

@sheryc I believe that unsloth/llava-v1.6-mistral-7b-hf-bnb-4bit is a LLaVa-NeXT model and that it uses the transformers' LLaVa-NeXT scripts.

I will try to write a minimal example nonetheless.

Yes you're right, sorry

@zucchini-nlp (Member) commented

@baraujo98 right, I see the config on the hub wasn't updated with the additional_tokens. I opened a PR here (https://huggingface.co/unsloth/llava-1.5-7b-hf-bnb-4bit/discussions/1). Can you check if adding the attribute solves the issue?

@baraujo98 commented

Thank you very much for the tip @zucchini-nlp. Unsloth doesn't allow me to load revisions of the model for some reason, but I was able to fix the issue for now by running tokenizer.num_additional_image_tokens = 1 after loading the pretrained model. It is now able to train without throwing that error.

I opened a similar PR for the second model I mentioned: https://huggingface.co/unsloth/llava-v1.6-mistral-7b-hf-bnb-4bit/discussions/1
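
For anyone hitting the same error, a minimal sketch of that workaround with plain Transformers (not Unsloth's loader; the attribute name and value mirror the hub discussions linked above):

from transformers import LlavaNextProcessor

# The hub config was missing num_additional_image_tokens, so set it after loading;
# 1 is the value proposed in the linked hub PRs.
processor = LlavaNextProcessor.from_pretrained("unsloth/llava-v1.6-mistral-7b-hf-bnb-4bit")
processor.num_additional_image_tokens = 1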


Successfully merging this pull request may close these issues.

LLaVA-OneVision image features and image tokens mismatch
5 participants