
[LLAVA-NEXT] ValueError: The input provided to the model are wrong. The number of image tokens is 1 while the number of image given to the model is 1. This prevents correct indexing and breaks batch generation. #29835

Closed
aliencaocao opened this issue Mar 24, 2024 · 42 comments · Fixed by #30473


@aliencaocao
Contributor

System Info

  • transformers version: 4.39.1
  • Platform: Linux-5.19.0-051900rc6-generic-x86_64-with-glibc2.35
  • Python version: 3.9.18
  • Huggingface_hub version: 0.21.1
  • Safetensors version: 0.4.2
  • Accelerate version: 0.28.0
  • Accelerate config: not found
  • PyTorch version (GPU?): 2.2.1+cu121 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: yes
  • Using distributed or parallel set-up in script?: no

Who can help?

@amyeroberts

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

Running model.generate on a batch of inputs with batch size 1 (i.e. the input is a list of length 1), I randomly get this error. This does not happen with the original implementation, but the original implementation has a weird image-encoding error. The error does not seem to be reliably reproducible, as running it again or separately does not trigger it.

 File "/home2/*.py", line 159, in infer_batch
    output = model.generate(
  File "/home2/*/venv/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home2/*/venv/lib/python3.9/site-packages/transformers/generation/utils.py", line 1527, in generate
    result = self._greedy_search(
  File "/home2/*/venv/lib/python3.9/site-packages/transformers/generation/utils.py", line 2411, in _greedy_search
    outputs = self(
  File "/home2/*/venv/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home2/*/venv/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home2/*/venv/lib/python3.9/site-packages/accelerate/hooks.py", line 166, in new_forward
    output = module._old_forward(*args, **kwargs)
  File "/home2/*/venv/lib/python3.9/site-packages/transformers/models/llava_next/modeling_llava_next.py", line 555, in forward
    inputs_embeds, attention_mask, labels, position_ids = self._merge_input_ids_with_image_features(
  File "/home2/*/venv/lib/python3.9/site-packages/transformers/models/llava_next/modeling_llava_next.py", line 411, in _merge_input_ids_with_image_features
    raise ValueError(
ValueError: The input provided to the model are wrong. The number of image tokens is 1 while the number of image given to the model is 1. This prevents correct indexing and breaks batch generation.

My code:

model_path = 'panoyo9829/llava-v1.6-mistral-7b-bnb-4bit-hf'
processor = LlavaNextProcessor.from_pretrained(model_path, local_files_only=True)
model = LlavaNextForConditionalGeneration.from_pretrained(model_path, torch_dtype=torch.float16, low_cpu_mem_usage=True, local_files_only=True, device_map='auto')

inputs = processor(text=prompts, images=images, padding=True, return_tensors="pt").to(model.device)
with torch.backends.cuda.sdp_kernel(enable_flash=False, enable_math=False, enable_mem_efficient=True):
    output = model.generate(
        **inputs,
        max_new_tokens=512,
        use_cache=True,
        output_logits=True,
        return_dict_in_generate=True,
        pad_token_id=processor.tokenizer.pad_token_id,
    )
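For context, prompts and images above are lists of length 1. The actual data is private, so the following is only a hypothetical sketch of their shape (the [INST] ... [/INST] prompt format follows the llava-v1.6-mistral model card; the file name is made up):

from PIL import Image

images = [Image.open('example.jpg')]                         # hypothetical image
prompts = ['[INST] <image>\nDescribe this image. [/INST]']   # batch of length 1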

Expected behavior

No error

@aliencaocao
Contributor Author

aliencaocao commented Mar 24, 2024

Actually, I find that it is reproducible if I don't run the samples separately but continuously after the previous data, and reloading the model on the fly does not solve it. I am not sure why that is.
Because the data and the full script here are private, I can email them to whoever from HF is investigating this.

EDIT: it is always reproducible, even on a single sample.

@ScottishFold007
Contributor

The same problem occurs with LLaVA.
[screenshot showing the same error]

@aliencaocao
Contributor Author

I have never gotten this error in the original implementation with the exact same set of data.

@amyeroberts
Collaborator

cc @NielsRogge

@aliencaocao aliencaocao mentioned this issue Mar 27, 2024
@aliencaocao
Contributor Author

Update: I think I found one of the causes (there may be more if @ScottishFold007's data is not the same as mine): having a <unk> in the prompt causes it. I have it as pure text, not as a token, so by right it should be treated as text and not as a single unknown token.
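(For illustration, a minimal sketch of the tokenizer behavior described here, assuming a public llava-hf checkpoint rather than the exact model above: a literal "<unk>" typed in the prompt is kept as the single unk token id instead of being split into pieces.)

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("llava-hf/llava-v1.6-mistral-7b-hf")
ids = tok("hello <unk> world", add_special_tokens=False).input_ids
print(ids)                       # the unk id appears as one token
print(tok.unk_token_id in ids)   # expected: True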

@aliencaocao
Contributor Author

Yeah, it's caused by https://github.com/huggingface/transformers/blob/ba56ed0869eb4bbeb1c04af7f62a04350150e8d4/src/transformers/models/llava_next/modeling_llava_next.py#L407C30-L407C69
When the input contains an unknown token, it gets a zero embedding and is then wrongly masked as an image token there.

I am not sure whether <unk> input leading to a zero embedding is expected, though. It seems a bit odd to me that a user can type special tokens directly as text and have them tokenized as-is, instead of being split into pieces like < and un etc.
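(A toy sketch of the failure mode, with made-up ids and shapes rather than the actual transformers code: once the <unk> row of the embedding matrix is all zeros, it is indistinguishable from an image placeholder to a zero-based mask.)

import torch

image_token_index = 32000                                     # hypothetical <image> id
input_ids = torch.tensor([[1, 5, image_token_index, 0, 7]])   # 0 = <unk> id
inputs_embeds = torch.randn(1, 5, 8)
inputs_embeds[0, 2] = 0.0   # <image> placeholder row, to be overwritten with image features
inputs_embeds[0, 3] = 0.0   # zeroed <unk> row looks identical to a placeholder

num_image_tokens = (input_ids == image_token_index).sum().item()   # 1
num_zero_rows = (inputs_embeds == 0).all(dim=-1).sum().item()      # 2
print(num_image_tokens, num_zero_rows)   # mismatch -> the ValueError above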

@NielsRogge
Contributor

NielsRogge commented Mar 29, 2024

Thanks for investigating, I had a very similar error when trying to add batched generation for llava-next.

Regarding <unk>: that's a special token, just like <pad> and <s>, which means it is kept as-is and not split into multiple tokens.

cc @ArthurZucker

@Ceejay16042

Ceejay16042 commented Mar 30, 2024

I found a quick fix for this error:

from PIL import Image

def resize_image(image_path, output_image_path, image_dimension):
    # Open the image, resize it to the given dimensions, and save it
    image = Image.open(image_path)
    resized_image = image.resize(image_dimension)
    resized_image.save(output_image_path)

Code explanation: the LLaVA model expects a specific image size. By resizing the input images, LLaVA can generate image tokens and extract textual information from the image.

image_path: path to the input image
image_dimension: dimensions to resize the image to so it fits the LLaVA model, e.g. (1000, 667)
output_image_path: path to save the resized image
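(A hypothetical call with the signature above, using the example dimensions just mentioned; the file names are made up:)

resize_image("input.png", "resized.png", (1000, 667))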

@aliencaocao
Contributor Author

aliencaocao commented Mar 30, 2024

This does not solve the issue: as I stated, the issue only exists when something is wrongly identified as the image token when it is not; it is not related to image sizes at all.

@Ceejay16042

Found an alternative way to resolve the LLaVA/LLaVA-NeXT ValueError regarding correct indexing and batch generation breaking. When providing your input prompt to the LLaVA LLM, a specific format must be adhered to.

model_id = "llava-hf/llava-1.5-7b-hf"
pipe = pipeline("image-to-text", model=model_id, model_kwargs={"quantization_config": quantization_config})
prompt_instructions = "USER: <image>\n" + prompt + "\nASSISTANT:"
outputs = pipe(image, prompt=prompt_instructions, generate_kwargs={"max_new_tokens": max_new_tokens})

Take note of the prompt_instructions variable specified in the code above and make sure your prompt is passed to the LLM in that same format.
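(For completeness, a minimal sketch of the setup the snippet above assumes; quantization_config, the image, and the imports are not shown there, so the values here, e.g. 4-bit loading and the file name, are only assumptions:)

from PIL import Image
from transformers import pipeline, BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(load_in_4bit=True)   # assumed quantization setting
image = Image.open("example.jpg")                             # hypothetical image
prompt = "What is shown in this image?"
max_new_tokens = 200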

@aliencaocao
Contributor Author

That's not the reason here, since users are assumed to already know this format as it is documented. The root issue is the zero embeddings.

@NielsRogge
Contributor

cc @zucchini-nlp

@zucchini-nlp
Member

zucchini-nlp commented Apr 16, 2024

I found that the weights for the <unk> token were zeroed out when we converted the checkpoint. Going from bf16 to float16 here clamps the weights to 0.

I will leave it to @NielsRogge to decide whether we should convert the weights again, and whether casting to float16 is necessary.
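(A quick sketch of one way that clamping can show up: bf16 can represent much smaller magnitudes than float16, so tiny bf16 weights underflow to 0 when cast.)

import torch

x = torch.tensor([1e-20], dtype=torch.bfloat16)
print(x.to(torch.float16))  # tensor([0.], dtype=torch.float16)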

@aliencaocao
Contributor Author

Having an unknown token in the input also causes it; pretty much anything non-padding that could result in a zero embedding does. I think a better way is to use the image token index solely and not mark everything zero as an image.

@bghira

bghira commented Apr 17, 2024

This makes LLaVA-NeXT unusable in any version of Transformers/Tokenizers, FWIW.

The demo code doesn't work. Can this be prioritised? @NielsRogge @zucchini-nlp

@aliencaocao
Contributor Author

The tutorial code does work, though. And it works fine for all cases except when your input contains both an image and a token that leads to a zero embedding (namely, <unk>).

@bghira

bghira commented Apr 17, 2024

Maybe it's helpful to note that it's the 34B model that fails to work.

@bghira

bghira commented Apr 17, 2024

no zero embedding to speak of:

INFO:root:Processing image: anime-summerghost-54.png, data: <PIL.PngImagePlugin.PngImageFile image mode=RGB size=1920x1080 at 0x16FC842E0>
INFO:root:Using LLaVA 1.6+ model.
INFO:root:Inputs: {'input_ids': tensor([[59603,  9334,  1397,   562, 13310,  2756,   597,   663, 15874, 10357,
         14135,    98,   707, 14135,  3641,  6901,    97,  7283,    97,   597,
         31081,  8476,   592,   567,  2756, 59610, 59575,  3275,    98,  2134,
          1471, 59601, 59568, 64000,   144,  5697,   620,  2709,   594,   719,
          2728,   100, 39965,  8898,  9129, 59601]], device='mps:0'), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]],
       device='mps:0'), 'pixel_values': tensor([[[[[ 1.3464,  1.3464,  1.3464,  ...,  0.0325,  0.1201,  0.1493],
           [ 1.3464,  1.3464,  1.3464,  ...,  0.2077,  0.2223,  0.4559],
           [ 1.3464,  1.3464,  1.3464,  ...,  0.4559,  0.3391,  0.4851],
           ...,
           [ 0.6457,  0.6311,  0.6457,  ...,  0.0471,  0.0617,  0.0763],
           [ 0.6603,  0.6457,  0.6749,  ...,  0.0471,  0.0617,  0.0617],
           [ 0.6895,  0.6749,  0.7041,  ...,  0.0471,  0.0617,  0.0471]],

          [[ 1.4145,  1.4145,  1.4145,  ..., -0.3864, -0.2663, -0.1763],
           [ 1.4145,  1.4145,  1.4145,  ..., -0.1313, -0.1313,  0.1239],
           [ 1.4145,  1.4145,  1.4145,  ...,  0.1539,  0.0038,  0.1389],
... snip ...

@bghira

bghira commented Apr 17, 2024

Opened #30294 to avoid making more noise on this issue, since it appears there are two distinct problems producing this same error message.

@bghira

bghira commented Apr 24, 2024

Might not be what you want to hear, but LLaVA 1.6 has rather poor support through anything but its own codebase, which essentially opens up a webserver to serve inference from.

You might want to switch to that, or to a different VLM. There are a lot of VLMs to select from, and LLaVA 1.6 is possibly not worth the effort considering the frequency of issues and the lack of support.


@NielsRogge
Contributor

@zucchini-nlp if that resolves this issue then we should reupload the weights.

@Z1zs did you include the special <image> token in the input sequence?

@zucchini-nlp
Member

Okay, I will make a PR then. I believe @bghira is correct that we should fix the code and not the weights, as loading models in fp16 probably causes the error anyway. I will look into it tomorrow.

@NielsRogge
Contributor

NielsRogge commented Apr 24, 2024

@bghira sorry to hear that but we're planning to have best-in-class support for Llava-NeXT. This issue seems related to the unk token (which @zucchini-nlp is going to look into), batched generation should hopefully be part of the next Transformers release. The issue linked at #30294 was found to be non-reproducible on CUDA and CPU and seems related to a bug in PyTorch regarding the MPS device.

@Z1zs

Z1zs commented Apr 24, 2024

I got it, my problem results from the missing <image> token. The <image> token marks the location where the image features should be inserted during text-image feature merging.
Thanks very much for the reminder @NielsRogge @bghira! Thanks for all the efforts and sorry for any trouble caused.

@bghira

bghira commented Apr 24, 2024

@bghira sorry to hear that but we're planning to have best-in-class support for Llava-NeXT. This issue seems related to the unk token (which @zucchini-nlp is going to look into), batched generation should hopefully be part of the next Transformers release. The issue linked at #30294 was found to be non-reproducible on CUDA and CPU and seems related to a bug in PyTorch regarding the MPS device.

I think the automated tests need to catch things like this if the goal is best-in-class support; it can't just be left to users and downstream developers to discover that these things weren't tested or usable.

@NielsRogge
Contributor

@bghira true, batched generation is not enforced in the default tests so far; one writes (slow) integration tests for these, e.g. here for LLaVA. cc @zucchini-nlp @gante - I think it'd be great if we had batched generation tests by default, which would force contributors (like me) to make sure this is supported and tested from the start; not sure if that's feasible.

Regarding the issue with <unk> token => that's something we could also add default tests for, this is currently not tested. cc @ydshieh do you think we should add a test to test_modeling_common.py to make sure models work as expected during a forward pass when passing a sequence that includes an <unk> token?

@bghira

bghira commented Apr 24, 2024

Thank you. The first impression is ever so important: it has already taken so long to get LLaVA 1.6 support anywhere but its own demo codebase that once you test it and everything falls apart, there is a real risk of losing interest in the model after spending a few hours trying to determine where it went wrong.

Also, I am not doing batched generation. On MPS, it fails even in the normal setup.

@ydshieh
Collaborator

ydshieh commented Apr 25, 2024

Regarding the issue with the <unk> token => that's something we could also add default tests for, this is currently not tested. cc @ydshieh do you think we should add a test to test_modeling_common.py to make sure models work as expected during a forward pass when passing a sequence that includes an <unk> token?

A question first: is the problem that there is an <unk> token (id) in the model input (the general case), or that there is an <unk> string in the input text which is kept as the <unk> (id) after being encoded?

(I think the problem happens whenever there is an <unk> token (id), no matter where it comes from - but just to be sure.)


When the input contains an unknown token, it gets a zero embedding and is then wrongly masked as an image token there.

OK, my above understanding is correct.

Regarding testing, it's not very clear to me, as the problem may happen with any token (or at least any special token). Even if we restrict it to the <unk> token only, the issue raised here is likely to happen only for special models like LLaVA-NeXT, as text-only models won't have an issue like that (they just use an Embedding layer to embed tokens).

I would suggest focusing on resolving the zero-embedding issue. But if anyone has a good suggestion regarding how to write such a test, I am happy to think about it.

@zucchini-nlp
Member

I think it'd be great if we had batched generation tests by default which force contributors (like me) to make sure this is supported and tested from the start, not sure if feasible.

Yes, I am working on adding GenerationTesterMixin for multimodal models, and hope that will help to catch more errors in the stage of adding the model 😄

@aliencaocao
Contributor Author

There are already image token ids in the merge-input-features logic. Why not reuse them?
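(A short sketch of that suggestion, with hypothetical ids rather than the actual transformers code: a mask built from the image token id ignores zeroed embedding rows entirely.)

import torch

image_token_index = 32000
input_ids = torch.tensor([[1, 5, image_token_index, 0, 7]])   # 0 = <unk> with zeroed weights
special_image_mask = input_ids == image_token_index
print(special_image_mask.sum().item())   # 1, regardless of any zero embeddings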

@huttersadan

The problem is caused by the loss of <image> and <pad> in the processor; you can solve it by adding the following code:

processor.tokenizer.add_tokens(["<image>", "<pad>"], special_tokens=True)
model.resize_token_embeddings(len(processor.tokenizer))

The solution is from https://huggingface.co/IlyasMoutawwakil/tiny-random-LlavaForConditionalGeneration/discussions/1.
It worked for me.

@hxhcreate

I may have encountered the same issue with transformers==4.41.0:

ValueError: The input provided to the model are wrong. The number of image tokens is 4 while the number of image given to the model is 1. This prevents correct indexing and breaks batch generation.

The reason may be that the data itself contains an <image> token. Can this be addressed internally in LlavaProcessor, or should I change my transformers version?

Thanks

@ShobhaRajanna

Hi,

I encountered a ValueError when fine-tuning the llava-hf/llava-1.5-7b-hf model using HuggingFace. The error says:
"The input provided to the model are wrong. The number of image tokens is 0 while the number of images given to the model is 1. This prevents correct indexing and breaks batch generation."

My dataset is in this format: {
"image": "data/png_images/train/00038.png",
"instruction": "Please provide the attributes of this GNSS image:\n- Class\n- Amplitude\n- Area\n- Subjammer\n- Position\n- Environment.",
"response": "Class: 2, Amplitude: 6, Area: 2, Subjammer: Chirp_LinearFast_BW15_MXG.bin, Position: 1, Environment: 1"
}
I’m using transformers version 4.46.0. The error seems to occur because the model is not processing the image inputs properly, leading to a mismatch in the number of image tokens.

Could you please help me understand why this is happening and how to fix it?

@Z1zs

Z1zs commented Dec 5, 2024

Hi,

I encountered a ValueError when fine-tuning the llava-hf/llava-1.5-7b-hf model using HuggingFace. The error says: "The input provided to the model are wrong. The number of image tokens is 0 while the number of images given to the model is 1. This prevents correct indexing and breaks batch generation."

My dataset is in this format: { "image": "data/png_images/train/00038.png", "instruction": "Please provide the attributes of this GNSS image:\n- Class\n- Amplitude\n- Area\n- Subjammer\n- Position\n- Environment.", "response": "Class: 2, Amplitude: 6, Area: 2, Subjammer: Chirp_LinearFast_BW15_MXG.bin, Position: 1, Environment: 1" } I’m using transformers version 4.46.0. The error seems to occur because the model is not processing the image inputs properly, leading to a mismatch in the number of image tokens.

Could you please help me understand why this is happening and how to fix it?

@ShobhaRajanna Could you please provide more details about your setup or share a minimal reproducible code snippet?

I encountered a similar issue when fine-tuning llava-hf/llava-v1.6-vicuna-7b-hf, and it turned out to be related to the accelerate library and the DDP setting. To resolve it, I launched the script directly with Python, which worked for me. What's more, using a single GPU with batch_size=1 also worked.

(However, I haven't figured out what exactly causes this error 😅)

@ShobhaRajanna

ShobhaRajanna commented Dec 5, 2024

Hi,
I encountered a ValueError when fine-tuning the llava-hf/llava-1.5-7b-hf model using HuggingFace. The error says: "The input provided to the model are wrong. The number of image tokens is 0 while the number of images given to the model is 1. This prevents correct indexing and breaks batch generation."
My dataset is in this format: { "image": "data/png_images/train/00038.png", "instruction": "Please provide the attributes of this GNSS image:\n- Class\n- Amplitude\n- Area\n- Subjammer\n- Position\n- Environment.", "response": "Class: 2, Amplitude: 6, Area: 2, Subjammer: Chirp_LinearFast_BW15_MXG.bin, Position: 1, Environment: 1" } I’m using transformers version 4.46.0. The error seems to occur because the model is not processing the image inputs properly, leading to a mismatch in the number of image tokens.
Could you please help me understand why this is happening and how to fix it?

@ShobhaRajanna Could you plz provide more details about your setup or share a minimal reproducible code snippet?

I encountered a similar issue when fine-tuning llava-hf/llava-v1.6-vicuna-7b-hf, and it turned out to be related to the accelerate library and DDP setting. To resolve it, I try to launch the script directly with Python. It worked successfully for me. What's more, using one single GPU and set batch_size=1 also worked.

(However I haven't figure out what exactly cause this error😅)
Hi there @Z1zs ,


I'm encountering an issue with the format_example function when preparing inputs for fine-tuning the model. Specifically, it seems to be related to how the <image> token is included in the text input and how the model processes image tokens.

def format_example(example):
    image_path = example.get("image", None)
    if not os.path.exists(image_path):
        print(f"Image not found at path: {image_path}.")
        return None
    image = Image.open(image_path)

    image_array = np.array(image).astype(np.float32) / 255.0
    image_array = np.clip(image_array, 0, 1)

    text_input = f"<image>\n{example['instruction']}\nASSISTANT:"
    response = example["response"]
    # print(f"Text input: {text_input}")
    # print(f"Response: {response}")
    processed = processor(
        images=image,
        text=text_input,
        return_tensors="pt",
        padding=True,
        truncation=True,
        do_rescale=False,
        max_length=512,
    )
    input_ids = processed.input_ids
    pixel_values = processed.pixel_values.squeeze(0)
    # Debugging statements
    print(f"Text Input: {text_input}")
    print(f"Tokenized Input IDs: {input_ids}")
    print(f"Decoded Input Text: {tokenizer.decode(input_ids.squeeze(0).tolist(), skip_special_tokens=False)}")
    print(f"Pixel Values Shape: {pixel_values.shape}")

    return {
        "pixel_values": pixel_values,
        "text": text_input,
        "image": image_path,
    }

Scenario 1: in my format_example function, I add the <image> token to the text_input as shown above. This leads to:

Evaluating Model Accuracy...

Error processing example 1: The input provided to the model are wrong. The number of image tokens is 511 while the number of images given to the model is 1. This prevents correct indexing and breaks batch generation.
Scenario 2: if I exclude the <image> token from the text_input (text_input = f"{example['instruction']}\nASSISTANT:"),
I instead get the following error: ValueError: The input provided to the model are wrong. The number of image tokens is 0 while the number of images given to the model is 1. This prevents correct indexing and breaks batch generation.

This seems to suggest that the model isn't detecting the presence of the image properly without the <image> token.

I verified that the image paths are correct and accessible.
The image pixel values are normalized and properly formatted for the processor.
The <image> token has been added to the tokenizer's vocabulary, and the embeddings have been resized:

if "<image>" not in tokenizer.get_vocab():
    tokenizer.add_tokens(["<image>"], special_tokens=True)
    model.resize_token_embeddings(len(tokenizer))

Could this be an issue with how the processor or tokenizer is set up for image inputs in LLaVA? I'd like to clarify that in my setup, the batch size is already set to 1 for both training and evaluation. Additionally, I'm using the SFTTrainer from the trl library for fine-tuning the model with LoRA. I've verified that both the model and the inputs are properly configured for my environment, but I still encounter the issues above.

Thanks again for your insights any additional suggestions are much appreciated!

@Z1zs

Z1zs commented Dec 6, 2024

@ShobhaRajanna Your problem arises from the wrong number of <image> tokens. Please use processor.apply_chat_template(conversation, add_generation_prompt=True) instead of adding the <image> token manually. Refer to the official example for more information.
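(A minimal sketch of that usage, assuming a recent transformers version and that processor and image are already created as in the snippets above; the instruction text is just an example:)

conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Please provide the attributes of this image."},
        ],
    },
]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
inputs = processor(images=image, text=prompt, return_tensors="pt")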

@LiuJinzhe-Keepgoing

LiuJinzhe-Keepgoing commented Dec 25, 2024

@Z1zs @ShobhaRajanna
Hello, I have a similar problem when I build inputs through the processor, with transformers==4.46.3. The code is as follows:

model_id = "/data/liujinzhe/model/hugging_cache/llava-hf/llava-1.5-7b-hf"
processor = AutoProcessor.from_pretrained(
    model_id,
    revision='a272c74'
)
model = LlavaForConditionalGeneration.from_pretrained(model_id, torch_dtype=torch.bfloat16) 
 
model.eval()
model.cuda()

image_url = "/VLM/llava-mechanism-main/000000219578.jpg"
image = Image.open(image_url) 
prompt = "USER: <image> \nWhat is the color of the dog?\nASSISTANT: The color of the dog is" 
inputs = processor(text=prompt, images=image, return_tensors="pt").to(device)  

print(processor.image_token)
print(inputs["input_ids"][0])
decoded_text = processor.decode(inputs["input_ids"][0], skip_special_tokens=False)
print("Decoded Text:", decoded_text)
 
outputs = model(**inputs)

But the error is:
ValueError: The input provided to the model are wrong. The number of image tokens is 576 while the number of image given to the model is 1. This prevents correct indexing and breaks batch generation.

I found the reason: the inputs built by
inputs = processor(text=prompt, images=image, return_tensors="pt").to(device) contain 576 <image> tokens, but I actually only entered one <image>. My code prints the following:

<image>

tensor([    1,  3148,  1001, 29901, 29871, 32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000,  32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000,  32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000,  32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000,
32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000,
...
11203,   338], device='cuda:0')

Decoded Text: <s> USER: <image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image> ...
What is the color of the dog?
ASSISTANT: The color of the dog is

How should I modify the code to solve this problem? Thank you!

@na69tyzy

@LiuJinzhe-Keepgoing

from transformers import AutoTokenizer, LlavaForConditionalGeneration
from PIL import Image
import numpy as np
import torch
import os

model_id = "llava-hf/llava-1.5-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(model_id, torch_dtype=torch.bfloat16)
model.eval().cuda()

def preprocess_image(image_path, size=(336, 336)):
    if not os.path.exists(image_path):
        raise FileNotFoundError(f"Image not found at path: {image_path}")
    image = Image.open(image_path).convert("RGB").resize(size)
    image_array = np.array(image).astype(np.float32) / 255.0
    return torch.tensor(image_array).permute(2, 0, 1).unsqueeze(0).to("cuda")

def generate_response(model, tokenizer, image_tensor, prompt, max_length=100, num_beams=5, temperature=0.6, top_p=0.9):
    text_inputs = tokenizer(prompt, return_tensors="pt", padding=True, truncation=True, max_length=512).to("cuda")
    inputs = {
        "input_ids": text_inputs["input_ids"],
        "attention_mask": text_inputs["attention_mask"],
        "pixel_values": image_tensor
    }
    outputs = model.generate(
        input_ids=inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        pixel_values=inputs["pixel_values"],
        max_length=max_length,
        num_beams=num_beams,
        temperature=temperature,
        top_p=top_p,
        do_sample=True
    )
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

image_path = "download2.jpeg"
prompt = "\nWhat is the colour of dog?\nASSISTANT:"

image_tensor = preprocess_image(image_path)
response = generate_response(model, tokenizer, image_tensor, prompt)

print("Model Response:", response)
I used the tokenizer instead of the processor because it gave me more control over handling the inputs. The processor tries to manage both the image and the text automatically, but this can sometimes cause issues or add unnecessary complexity. By using the tokenizer, I could handle the text input directly, and I manually processed the image to match the model's requirements. This way, I avoided the potential ValueError: the input provided to the model is wrong. The number of image tokens is 576 while the number of images given to the model is 1.

@LiuJinzhe-Keepgoing

LiuJinzhe-Keepgoing commented Dec 26, 2024

@na69tyzy
Hello, thank you for your help.
I modified the code your way, but I encountered the same problem with transformers==4.37.1. My code is as follows:

from transformers import LlavaForConditionalGeneration, BitsAndBytesConfig, AutoTokenizer

model_id = "/data/liujinzhe/model/hugging_cache/liuhaotian/llava-v1.5-7b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(model_id, torch_dtype=torch.bfloat16)
model.eval().cuda()
image_path = "/home/liujinzhe/code/VLM/llava-mechanism-main/000000219578.jpg"
prompt = "USER: <image>\nWhat is the color of the dog? \nASSISTANT: The color of the dog is"
image_tensor = preprocess_image(image_path)

text_inputs = tokenizer(prompt, return_tensors="pt", padding=True, truncation=True, max_length=512).to("cuda")
inputs = {
    "input_ids": text_inputs["input_ids"],
    "attention_mask": text_inputs["attention_mask"],
    "pixel_values": image_tensor
}

print(inputs["input_ids"][0])
decoded_text = tokenizer.decode(inputs["input_ids"][0], skip_special_tokens=False)
print("Decoded Text:", decoded_text)

outputs = model(**inputs)
outputs_probs = get_prob(outputs["logits"][0][-1])
outputs_probs_sort = torch.argsort(outputs_probs, descending=True)
print([tokenizer.decode(x) for x in outputs_probs_sort[:10]])
print(outputs_probs_sort[:10].tolist())

Error content:

"name": "ValueError",
	"message": "The input provided to the model are wrong. The number of image tokens is 0 while the number of image given to the model is 1. This prevents correct indexing and breaks batch generation.",
	

The contents of prompt printed by my code are as follows:

tensor([    1,  3148,  1001, 29901,   529,  3027, 29958,    13,  5618,   338,
          278,  2927,   310,   278, 11203, 29973,    13, 22933,  9047, 13566,
        29901,   450,  2927,   310,   278, 11203,   338], device='cuda:0')
Decoded Text: <s> USER: <image>
What is the color of the dog?
ASSISTANT: The color of the dog is

Excuse me, if I don't want to change the version of transformers, can I solve this problem in version 4.37.1?
Thank you very much for your help!

@ShobhaRajanna

@LiuJinzhe-Keepgoing I'm using transformers version 4.46.0 and it works for me.
