LLaVA-OneVision mismatch between image features and image tokens #34625
Comments
@agadetsky, it seems there are differences in how we compute the number of image tokens in the processing code and in the modeling code. It might be related to previous bugs with numerical issues when the image resolution is an edge case of the possible grid resolutions (like 337 here). I'll take a look and see where the precision error is coming from. |
Hi @zucchini-nlp , have you managed to identify the issue? I'm encountering the same error while using |
@chenweize1998 yes, that is most probably the anyres calculations. Unfortunately I didn't have time to look in more detail; I will try to have a look today. EDIT: found the place where the precision error was and opened a PR to fix it. |
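For illustration, here is a minimal sketch (not the actual transformers code) of how this class of precision mismatch can arise: the processing side can compute the unpadded feature-grid size with integer arithmetic while the modeling side forms a float scale factor first, and for edge-case resolutions the two can disagree by one row or column of features.

```python
# Illustration only, not the transformers implementation: two ways of computing
# the unpadded feature-grid height that can disagree for edge-case resolutions.

def unpadded_height_int(orig_h: int, orig_w: int, cur_w: int) -> int:
    # Integer arithmetic (processing-style): exact.
    return (orig_h * cur_w) // orig_w

def unpadded_height_float(orig_h: int, orig_w: int, cur_w: int) -> int:
    # A float scale factor is formed first (modeling-style); the product can
    # land just below an integer boundary and truncate to one fewer row.
    scale = cur_w / orig_w
    return int(orig_h * scale)

# Both usually agree, but when orig_h * cur_w is an exact multiple of orig_w,
# the float path may evaluate to e.g. 26.999999... and return one less than
# the integer path, so the model builds fewer image features than there are
# <image> placeholder tokens in the prompt.
```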
@zucchini-nlp Thanks for looking into this! I've pinpointed the batch of data causing the issue and uploaded it here. The problem specifically originates from the first data point in the batch. Hope it helps with debugging. Additionally, here's a minimal script to reproduce the error (assuming the data point is downloaded as `tmp.bin`):

```python
from transformers import AutoModelForVision2Seq
import torch

# Load the model
model = AutoModelForVision2Seq.from_pretrained(
    "llava-hf/llava-v1.6-mistral-7b-hf",
    torch_dtype=torch.bfloat16
).to("cuda:0")

# Load the problematic input
inputs = torch.load("tmp.bin")
# Note: inputs['input_ids'][0] triggers the error
for k, v in inputs.items():
    inputs[k] = v.to("cuda:0")

# Generate outputs
outputs = model(**inputs)
```

I'm using |
Thank you @zucchini-nlp! |
Hi, |
@atanasmatev hey, Qwen2 has a different processing and has no unpadding like in LLaVA-OV. Can you open a new issue for it pls and provide a small code snippet for reproduction? |
Hi @agadetsky , |
Same issue with Qwen2-VL-7B with image data. |
For Qwen we have an issue here: #33399 (comment). But that issue is about shape errors on |
@agadetsky @zhangboshen @LysandreJik @chenweize1998 @atanasmatev does anyone have a solution for this problem? I use llamafactory to train qwen2-vl. |
For qwen2-vl-7b with llama-factory, I changed one of the parameters in the llama-factory config file from 2 to 1 (I am not near that PC to check which one exactly, but the default was 2). |
hello, |
Hi @zucchini-nlp, I faced this issue again with transformers==4.47.1. The data that caused it is item number 2482 in the Hugging Face dataset "lmms-lab/docvqa", test split. |
@chchch0109 it would help me a lot if you could provide a small runnable code snippet without many external dependencies :) The numerical error bug from padding/unpadding should have been fixed by v4.47.1, so I can look into whether there are any other reasons to error out. |
@zucchini-nlp sure

```python
from transformers import AutoProcessor, LlavaOnevisionForConditionalGeneration
from datasets import load_dataset
import torch

processor = AutoProcessor.from_pretrained("llava-hf/llava-onevision-qwen2-7b-ov-hf")
model = LlavaOnevisionForConditionalGeneration.from_pretrained(
    "llava-hf/llava-onevision-qwen2-7b-ov-hf", torch_dtype=torch.float16
).to("cuda")  # move the model to the same device as the inputs

dataset = load_dataset("lmms-lab/docvqa", "DocVQA")
d = dataset["test"][2482]
question = d["question"]
image = d["image"]

conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": question},
        ],
    },
]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
# cast the floating-point inputs to the model dtype as well
inputs = processor(images=image, text=prompt, return_tensors="pt").to("cuda", torch.float16)

with torch.no_grad():
    outputs = model(**inputs)
```
|
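As a quick check before the forward pass, here is a small diagnostic sketch (an addition, assuming the config exposes `image_token_index` and the processor returns `image_sizes`, as recent LLaVA-OneVision versions do): it compares how many image placeholder tokens the processor inserted with the vision inputs the model receives.

```python
# Diagnostic sketch (assumption: config exposes `image_token_index`).
# Counts the <image> placeholder tokens in the prompt and prints the vision
# input shapes; if the forward pass later raises the mismatch error, the
# number of image features the model computed differs from this count.
image_token_id = model.config.image_token_index
num_placeholders = (inputs["input_ids"] == image_token_id).sum().item()

print("image placeholder tokens:", num_placeholders)
print("pixel_values shape:", tuple(inputs["pixel_values"].shape))
print("image_sizes:", inputs["image_sizes"])
```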
I'm facing the same problem when using
Edit: opened a new issue for this: #35775 |
System Info
transformers version: 4.46.2

Who can help?
@amyeroberts @qubvel @ArthurZucker @ITaz

Information

Tasks
examples folder (such as GLUE/SQuAD, ...)

Reproduction
The error is the following:
Expected behavior
Given that LLaVA-OneVision can work with any resolution, the model is expected to successfully generate the output.