[LLAVA-NEXT] ValueError: The input provided to the model are wrong. The number of image tokens is 1 while the number of image given to the model is 1. This prevents correct indexing and breaks batch generation. #29835
Comments
EDIT: it is always reproducible, even on a single example.
I have never gotten it in the original impl with the exact same set of data.
cc @NielsRogge
Update: I think I found one of the causes (there may be more if @ScottishFold007's data is not the same as mine), but having a
Yeah, it's caused by https://github.com/huggingface/transformers/blob/ba56ed0869eb4bbeb1c04af7f62a04350150e8d4/src/transformers/models/llava_next/modeling_llava_next.py#L407C30-L407C69. I am not sure if
Thanks for investigating; I had a very similar error when trying to add batched generation for llava-next. Regarding
I found a quick fix for this error:

```python
from PIL import Image

def resize_image(image_path, output_image_path, size):
    # Open the input image, resize it to the requested dimensions, and save the result.
    image = Image.open(image_path)
    resized_image = image.resize(size)
    resized_image.save(output_image_path)
```

Code explanation: the LLaVA model expects a specific image size. By resizing input images, LLaVA can generate image tokens and extract textual information from the image. `image_path` specifies the path to the input image, `output_image_path` the path where the resized image is saved, and `size` the target (width, height).
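For example, assuming a local file named `input.jpg` (a hypothetical path) and the 336x336 resolution used by LLaVA-1.5's vision tower, the helper could be called like this:

```python
# Hypothetical paths; 336x336 matches the CLIP vision tower input size of LLaVA-1.5.
resize_image("input.jpg", "input_resized.jpg", size=(336, 336))
```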
This does not solve the issue. As I stated, the issue only exists when something is wrongly identified as the image token when it is not; it is not related to image sizes at all.
That's not the reason here, as using this format is assumed to be known by users already since it is documented. The root issue is with zero embeddings.
I found that the weights for the <unk> token were zeroed out when we converted the checkpoint. Going from bf16 to float16 here clamps the weights to 0. I will leave it to @NielsRogge to decide whether we should convert the weights again, and whether casting to float16 is necessary.
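A minimal sketch of that effect (illustrative values, not the actual checkpoint weights): bfloat16 keeps float32's exponent range, while float16 underflows below roughly 6e-8, so very small weights become exactly zero after the cast.

```python
import torch

# Tiny values representable in bfloat16 underflow to 0.0 when cast to float16,
# because float16 has a much narrower exponent range.
w_bf16 = torch.tensor([1e-10, 1e-9, 1e-4], dtype=torch.bfloat16)
w_fp16 = w_bf16.to(torch.float16)
print(w_bf16)  # small but non-zero
print(w_fp16)  # the first two entries are clamped to 0.0
```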
Having an unknown token in the input also causes it; pretty much anything that is not padding but results in a zero embedding does. I think a better way is to rely solely on the image token index and not mark anything with a zero embedding as an image.
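A rough sketch of that suggestion (the token ids and input_ids below are made up for illustration): derive the positions to fill with image features from input_ids alone, so a token whose embedding happens to be all zeros, such as <unk>, is never counted as an image position.

```python
import torch

# Illustrative ids only: 32000 stands in for the <image> placeholder, 0 for <unk>.
image_token_index = 32000
input_ids = torch.tensor([[1, 32000, 0, 15043, 2]])

# Count image positions from the token ids, not from zero rows in the embeddings.
special_image_token_mask = input_ids == image_token_index
num_image_tokens = special_image_token_mask.sum(dim=-1)
print(num_image_tokens)  # tensor([1]), regardless of the <unk> token at position 2
```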
This makes LLaVA-NeXT unusable in any version of Transformers/Tokenizers, FWIW; the demo code doesn't work. Can this be prioritised to resolve? @NielsRogge @zucchini-nlp
The tutorial code does work though. It works fine for all cases except when your input contains both an image and a token that leads to a zero embedding (namely, <unk>).
Maybe it's helpful to note that it's the 34B model that fails to work.
No zero embedding to speak of:
Opened #30294 to avoid making more noise on this issue, since it appears there are two distinct problems producing this same error message.
Might not be what you want to hear, but LLaVA 1.6 has rather poor support through anything but its own codebase, which essentially opens up a webserver to serve inference from. You might want to switch to that, or to a different VLM. There are a lot of VLMs to select from, and LLaVA 1.6 is possibly not worth the effort considering the frequency of issues and the lack of support.
@zucchini-nlp if that resolves this issue then we should reupload the weights. @Z1zs did you include the special <image> token in the input sequence?
Okay, will make a PR then. I believe @bghira is correct that we should fix the code and not the weights, as loading models in fp16 probably causes the error anyway. I will look into it tomorrow.
@bghira sorry to hear that but we're planning to have best-in-class support for Llava-NeXT. This issue seems related to the unk token (which @zucchini-nlp is going to look into), batched generation should hopefully be part of the next Transformers release. The issue linked at #30294 was found to be non-reproducible on CUDA and CPU and seems related to a bug in PyTorch regarding the MPS device.
I got it; my problem results from the lack of
I think the automated tests need to catch things like this if the goal is best-in-class support; it can't just be left to users and downstream development to discover that these things weren't tested or usable.
@bghira true, batched generation is not enforced in the default tests so far; one writes (slow) integration tests for these, e.g. here for llava. cc @zucchini-nlp @gante - I think it'd be great if we had batched generation tests by default which force contributors (like me) to make sure this is supported and tested from the start; not sure if that's feasible. Regarding the issue with the <unk> token => that's something we could also add default tests for; this is currently not tested. cc @ydshieh do you think we should add a test to
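As a sketch of what such a slow test could look like (the checkpoint id, prompts, and image URL below are illustrative, not an existing test in the repo):

```python
import requests
import torch
from PIL import Image
from transformers import LlavaNextForConditionalGeneration, LlavaNextProcessor


def test_batched_generation():
    # Illustrative slow test: two prompts sharing one image, run through generate() as a batch.
    model_id = "llava-hf/llava-v1.6-mistral-7b-hf"
    processor = LlavaNextProcessor.from_pretrained(model_id)
    model = LlavaNextForConditionalGeneration.from_pretrained(
        model_id, torch_dtype=torch.float16, device_map="auto"
    )

    image = Image.open(requests.get("http://images.cocodataset.org/val2017/000000039769.jpg", stream=True).raw)
    prompts = [
        "[INST] <image>\nWhat is shown in this image? [/INST]",
        "[INST] <image>\nDescribe the picture briefly. [/INST]",
    ]

    inputs = processor(text=prompts, images=[image, image], padding=True, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=20)

    # One generated sequence per batch element, with no indexing error.
    assert output.shape[0] == 2
```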
Thank you. The first impression is ever so important: it has already taken so long to get LLaVA 1.6 support anywhere but their own demo codebase that once you test it and everything falls apart, there is a real risk of losing interest in the model after spending a few hours trying to determine where it went wrong. Also, I am not doing batched generation; on MPS it fails in the normal setup.
(I think the problem happens whenever there is a <unk> token id, no matter where it comes from - but just to be sure).
OK, my above understanding is correct. Regarding testing, it's not very clear to me, as the problem may happen with any token (or at least any special token). Even if we restrict to <unk>, I would suggest focusing on resolving the
Yes, I am working on adding GenerationTesterMixin for multimodal models, and I hope that will help to catch more errors at the stage of adding the model 😄
There are already image token ids in the merge-input-features step. Why not reuse them?
The problem is caused by the loss of and
I may have encountered the same issue with transformers==4.41.0; the reason behind it may be that the data itself contains

Thanks
Hi, I encountered a ValueError when fine-tuning the llava-hf/llava-1.5-7b-hf model using HuggingFace. The error says:

My dataset is in this format: {

Could you please help me understand why this is happening and how to fix it?
@ShobhaRajanna Could you please provide more details about your setup or share a minimal reproducible code snippet? I encountered a similar issue when fine-tuning llava-hf/llava-v1.6-vicuna-7b-hf, and it turned out to be related to the

(However, I haven't figured out what exactly causes this error 😅)
I'm encountering an issue with the format_example function when preparing inputs for fine-tuning the model. Specifically, it seems to be related to how the

Error processing example 1: The input provided to the model are wrong. The number of image tokens is 511 while the number of images given to the model is 1. This prevents correct indexing and breaks batch generation.

This seems to suggest that the model isn't detecting the presence of the image properly without the <image> token. I verified that the image paths are correct and accessible. Thanks again for your insights; any additional suggestions are much appreciated!
@ShobhaRajanna Your problem arises from the wrong <image> token count. Please use
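For illustration (not necessarily the snippet this comment intended to include), the number of <image> placeholders in the prompt has to match the number of images passed to the processor; for llava-hf/llava-1.5-7b-hf that is one placeholder per image:

```python
import requests
from PIL import Image
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("llava-hf/llava-1.5-7b-hf")
image = Image.open(requests.get("http://images.cocodataset.org/val2017/000000039769.jpg", stream=True).raw)

# Exactly one <image> placeholder for the single image passed below.
prompt = "USER: <image>\nWhat is shown in this image? ASSISTANT:"
inputs = processor(text=prompt, images=image, return_tensors="pt")
```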
@Z1zs @ShobhaRajanna
But the error is:

I found the reason: the inputs built by
How should I modify the code to solve this problem? Thank you.
@LiuJinzhe-Keepgoing

```python
from transformers import AutoTokenizer, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"

def preprocess_image(image_path, size=(336, 336)):
    ...

def generate_response(model, tokenizer, image_tensor, prompt, max_length=100, num_beams=5, temperature=0.6, top_p=0.9):
    ...

image_path = "download2.jpeg"
image_tensor = preprocess_image(image_path)

print("Model Response:", response)
```
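For comparison, here is a sketch of the same flow built around AutoProcessor, which prepares the pixel values and the <image> placeholder together instead of using a separate tokenizer and a manual resize; the model id and image path are taken from the snippet above, everything else is illustrative:

```python
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(model_id, torch_dtype=torch.float16, device_map="auto")

image = Image.open("download2.jpeg")  # path reused from the snippet above
prompt = "USER: <image>\nDescribe this image. ASSISTANT:"

inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device, torch.float16)
output_ids = model.generate(**inputs, max_new_tokens=100)
response = processor.decode(output_ids[0], skip_special_tokens=True)
print("Model Response:", response)
```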
@na69tyzy
Error content:
The contents of the prompt printed by my code are as follows:

Excuse me, if I don't want to change the version of transformers, can I solve this problem in version 4.37.1?
@LiuJinzhe-Keepgoing I'm using transformers version 4.46.0 and it works for me.
System Info

transformers version: 4.39.1

Who can help?

@amyeroberts

Information

Tasks

An officially supported task in the examples folder (such as GLUE/SQuAD, ...)

Reproduction
Running model.generate on a batch of inputs with batch size 1 (meaning the input is a list of length 1), I randomly get this error. This does not happen with the original impl, but the original impl has a weird image-encoding error. This does not seem to be reliably reproducible, as running it again/separately does not cause it.
My code:
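A minimal sketch of the kind of call described above (the checkpoint id, prompt, and image are placeholders, not the original snippet):

```python
import requests
import torch
from PIL import Image
from transformers import LlavaNextForConditionalGeneration, LlavaNextProcessor

# Placeholder setup: a batch of size 1 passed to generate(), as described above.
model_id = "llava-hf/llava-v1.6-mistral-7b-hf"
processor = LlavaNextProcessor.from_pretrained(model_id)
model = LlavaNextForConditionalGeneration.from_pretrained(model_id, torch_dtype=torch.float16, device_map="auto")

image = Image.open(requests.get("http://images.cocodataset.org/val2017/000000039769.jpg", stream=True).raw)
prompts = ["[INST] <image>\nDescribe this image. [/INST]"]  # list of length 1

inputs = processor(text=prompts, images=[image], padding=True, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=50)
print(processor.batch_decode(output, skip_special_tokens=True))
```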
Expected behavior
No error