
[LLAVA-NEXT] ValueError: The input provided to the model are wrong. The number of image tokens is 1 while the number of image given to the model is 1. This prevents correct indexing and breaks batch generation. #29835

Closed
aliencaocao opened this issue Mar 24, 2024 · 42 comments · Fixed by #30473


@aliencaocao
Contributor

System Info

  • transformers version: 4.39.1
  • Platform: Linux-5.19.0-051900rc6-generic-x86_64-with-glibc2.35
  • Python version: 3.9.18
  • Huggingface_hub version: 0.21.1
  • Safetensors version: 0.4.2
  • Accelerate version: 0.28.0
  • Accelerate config: not found
  • PyTorch version (GPU?): 2.2.1+cu121 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: yes
  • Using distributed or parallel set-up in script?: no

Who can help?

@amyeroberts

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

Running model.generate on a batch of inputs with batch size 1 (i.e. the input is a list of length 1), I randomly get this error. This does not happen with the original implementation, but the original implementation has a weird image-encoding error. The error does not seem to be reliably reproducible, as running it again or separately does not trigger it.

 File "/home2/*.py", line 159, in infer_batch
    output = model.generate(
  File "/home2/*/venv/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home2/*/venv/lib/python3.9/site-packages/transformers/generation/utils.py", line 1527, in generate
    result = self._greedy_search(
  File "/home2/*/venv/lib/python3.9/site-packages/transformers/generation/utils.py", line 2411, in _greedy_search
    outputs = self(
  File "/home2/*/venv/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home2/*/venv/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home2/*/venv/lib/python3.9/site-packages/accelerate/hooks.py", line 166, in new_forward
    output = module._old_forward(*args, **kwargs)
  File "/home2/*/venv/lib/python3.9/site-packages/transformers/models/llava_next/modeling_llava_next.py", line 555, in forward
    inputs_embeds, attention_mask, labels, position_ids = self._merge_input_ids_with_image_features(
  File "/home2/*/venv/lib/python3.9/site-packages/transformers/models/llava_next/modeling_llava_next.py", line 411, in _merge_input_ids_with_image_features
    raise ValueError(
ValueError: The input provided to the model are wrong. The number of image tokens is 1 while the number of image given to the model is 1. This prevents correct indexing and breaks batch generation.

My code:

model_path = 'panoyo9829/llava-v1.6-mistral-7b-bnb-4bit-hf'
processor = LlavaNextProcessor.from_pretrained(model_path, local_files_only=True)
model = LlavaNextForConditionalGeneration.from_pretrained(model_path, torch_dtype=torch.float16, low_cpu_mem_usage=True, local_files_only=True, device_map='auto')

inputs = processor(text=prompts, images=images, padding=True, return_tensors="pt").to(model.device)
with torch.backends.cuda.sdp_kernel(enable_flash=False, enable_math=False, enable_mem_efficient=True):
    output = model.generate(
        **inputs,
        max_new_tokens=512,
        use_cache=True,
        output_logits=True,
        return_dict_in_generate=True,
        pad_token_id=processor.tokenizer.pad_token_id,
    )
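For context, prompts and images above are lists of length 1. The actual data is private, so the following is only a hypothetical sketch of their shape (the [INST] ... [/INST] prompt format follows the llava-v1.6-mistral model card; the file name is made up):

from PIL import Image

images = [Image.open('example.jpg')]                         # hypothetical image
prompts = ['[INST] <image>\nDescribe this image. [/INST]']   # batch of length 1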

Expected behavior

No error

@aliencaocao
Contributor Author

aliencaocao commented Mar 24, 2024

Actually, I find that it is reproducible if I don't run the samples separately but continuously after the previous data, and reloading the model on the fly does not solve it. I am not sure why that is.
Because the data and the full script here are private, I can email them to whoever from HF is investigating this.

EDIT: it is always reproducible, even on a single sample.

@ScottishFold007
Contributor

The same problem occurs with LLaVA.
[screenshot showing the same error]

@aliencaocao
Contributor Author

I have never gotten this error in the original implementation with the exact same set of data.

@amyeroberts
Collaborator

cc @NielsRogge

@aliencaocao aliencaocao mentioned this issue Mar 27, 2024
@aliencaocao
Contributor Author

Update: I think I found one of the causes (there may be more if @ScottishFold007's data is not the same as mine): having a <unk> in the prompt causes it. I have it as pure text, not as a token, so by right it should be treated as text and not as a single unknown token.
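(For illustration, a minimal sketch of the tokenizer behavior described here, assuming a public llava-hf checkpoint rather than the exact model above: a literal "<unk>" typed in the prompt is kept as the single unk token id instead of being split into pieces.)

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("llava-hf/llava-v1.6-mistral-7b-hf")
ids = tok("hello <unk> world", add_special_tokens=False).input_ids
print(ids)                       # the unk id appears as one token
print(tok.unk_token_id in ids)   # expected: True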

@aliencaocao
Contributor Author

Yeah, it's caused by https://github.com/huggingface/transformers/blob/ba56ed0869eb4bbeb1c04af7f62a04350150e8d4/src/transformers/models/llava_next/modeling_llava_next.py#L407C30-L407C69
When the input contains an unknown token, it gets a zero embedding and is then wrongly masked as an image token there.

I am not sure whether <unk> input leading to a zero embedding is expected, though. It seems a bit odd to me that a user can type special tokens directly as text and have them tokenized as-is, instead of being split into pieces like < and un etc.
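(A toy sketch of the failure mode, with made-up ids and shapes rather than the actual transformers code: once the <unk> row of the embedding matrix is all zeros, it is indistinguishable from an image placeholder to a zero-based mask.)

import torch

image_token_index = 32000                                     # hypothetical <image> id
input_ids = torch.tensor([[1, 5, image_token_index, 0, 7]])   # 0 = <unk> id
inputs_embeds = torch.randn(1, 5, 8)
inputs_embeds[0, 2] = 0.0   # <image> placeholder row, to be overwritten with image features
inputs_embeds[0, 3] = 0.0   # zeroed <unk> row looks identical to a placeholder

num_image_tokens = (input_ids == image_token_index).sum().item()   # 1
num_zero_rows = (inputs_embeds == 0).all(dim=-1).sum().item()      # 2
print(num_image_tokens, num_zero_rows)   # mismatch -> the ValueError above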

@NielsRogge
Contributor

NielsRogge commented Mar 29, 2024

Thanks for investigating, I had a very similar error when trying to add batched generation for llava-next.

Regarding <unk>: that's a special token, just like <pad> and <s>, which means it is kept as-is and not split into multiple tokens.

cc @ArthurZucker

@Ceejay16042

Ceejay16042 commented Mar 30, 2024

I found a quick fix for this error:

from PIL import Image

def resize_image(image_path, output_image_path, image_dimension):
    # Open the image, resize it to the given dimensions, and save it
    image = Image.open(image_path)
    resized_image = image.resize(image_dimension)
    resized_image.save(output_image_path)

Code explanation: the LLaVA model expects a specific image size. By resizing the input images, LLaVA can generate image tokens and extract textual information from the image.

image_path: path to the input image
image_dimension: dimensions to resize the image to so it fits the LLaVA model, e.g. (1000, 667)
output_image_path: path to save the resized image
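(A hypothetical call with the signature above, using the example dimensions just mentioned; the file names are made up:)

resize_image("input.png", "resized.png", (1000, 667))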

@aliencaocao
Contributor Author

aliencaocao commented Mar 30, 2024

This does not solve the issue: as I stated, the issue only exists when something is wrongly identified as the image token when it is not; it is not related to image sizes at all.

@Ceejay16042

Found an alternative way to resolve the LLaVA/LLaVA-NeXT ValueError regarding correct indexing and batch generation breaking. When providing your input prompt to the LLaVA LLM, a specific format must be adhered to.

model_id = "llava-hf/llava-1.5-7b-hf"
pipe = pipeline("image-to-text", model=model_id, model_kwargs={"quantization_config": quantization_config})
prompt_instructions = "USER: <image>\n" + prompt + "\nASSISTANT:"
outputs = pipe(image, prompt=prompt_instructions, generate_kwargs={"max_new_tokens": max_new_tokens})

Take note of the prompt_instructions variable specified in the code above and make sure your prompt is passed to the LLM in that same format.
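(For completeness, a minimal sketch of the setup the snippet above assumes; quantization_config, the image, and the imports are not shown there, so the values here, e.g. 4-bit loading and the file name, are only assumptions:)

from PIL import Image
from transformers import pipeline, BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(load_in_4bit=True)   # assumed quantization setting
image = Image.open("example.jpg")                             # hypothetical image
prompt = "What is shown in this image?"
max_new_tokens = 200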

@aliencaocao
Contributor Author

That's not the reason here, since users are assumed to already know this format as it is documented. The root issue is the zero embeddings.

@NielsRogge
Contributor

cc @zucchini-nlp

@zucchini-nlp
Member

zucchini-nlp commented Apr 16, 2024

I found that the weights for the <unk> token were zeroed out when we converted the checkpoint. Going from bf16 to float16 here clamps the weights to 0.

I will leave it to @NielsRogge to decide whether we should convert the weights again, and whether casting to float16 is necessary.
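(A quick sketch of one way that clamping can show up: bf16 can represent much smaller magnitudes than float16, so tiny bf16 weights underflow to 0 when cast.)

import torch

x = torch.tensor([1e-20], dtype=torch.bfloat16)
print(x.to(torch.float16))  # tensor([0.], dtype=torch.float16)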

@aliencaocao
Contributor Author

Having an unknown token in the input also causes it; pretty much anything non-padding that could result in a zero embedding does. I think a better way is to use the image token index solely and not mark everything zero as an image.

@bghira

bghira commented Apr 17, 2024

This makes LLaVA-NeXT unusable in any version of Transformers/Tokenizers, FWIW.

The demo code doesn't work. Can this be prioritised? @NielsRogge @zucchini-nlp

@aliencaocao
Contributor Author

The tutorial code does work, though. And it works fine for all cases except when your input contains both an image and a token that leads to a zero embedding (namely, <unk>).

@bghira

bghira commented Apr 17, 2024

Maybe it's helpful to note that it's the 34B model that fails to work.

@bghira

bghira commented Apr 17, 2024

no zero embedding to speak of:

INFO:root:Processing image: anime-summerghost-54.png, data: <PIL.PngImagePlugin.PngImageFile image mode=RGB size=1920x1080 at 0x16FC842E0>
INFO:root:Using LLaVA 1.6+ model.
INFO:root:Inputs: {'input_ids': tensor([[59603,  9334,  1397,   562, 13310,  2756,   597,   663, 15874, 10357,
         14135,    98,   707, 14135,  3641,  6901,    97,  7283,    97,   597,
         31081,  8476,   592,   567,  2756, 59610, 59575,  3275,    98,  2134,
          1471, 59601, 59568, 64000,   144,  5697,   620,  2709,   594,   719,
          2728,   100, 39965,  8898,  9129, 59601]], device='mps:0'), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]],
       device='mps:0'), 'pixel_values': tensor([[[[[ 1.3464,  1.3464,  1.3464,  ...,  0.0325,  0.1201,  0.1493],
           [ 1.3464,  1.3464,  1.3464,  ...,  0.2077,  0.2223,  0.4559],
           [ 1.3464,  1.3464,  1.3464,  ...,  0.4559,  0.3391,  0.4851],
           ...,
           [ 0.6457,  0.6311,  0.6457,  ...,  0.0471,  0.0617,  0.0763],
           [ 0.6603,  0.6457,  0.6749,  ...,  0.0471,  0.0617,  0.0617],
           [ 0.6895,  0.6749,  0.7041,  ...,  0.0471,  0.0617,  0.0471]],

          [[ 1.4145,  1.4145,  1.4145,  ..., -0.3864, -0.2663, -0.1763],
           [ 1.4145,  1.4145,  1.4145,  ..., -0.1313, -0.1313,  0.1239],
           [ 1.4145,  1.4145,  1.4145,  ...,  0.1539,  0.0038,  0.1389],
... snip ...

@bghira

bghira commented Apr 17, 2024

Opened #30294 to avoid making more noise on this issue, since it appears there are two distinct problems producing this same error message.

@bghira

bghira commented Apr 24, 2024

Might not be what you want to hear, but LLaVA 1.6 has rather poor support through anything but its own codebase, which essentially opens up a webserver to serve inference from.

You might want to switch to that, or to a different VLM. There are a lot of VLMs to select from, and LLaVA 1.6 is possibly not worth the effort considering the frequency of issues and the lack of support.


@NielsRogge
Contributor

@zucchini-nlp if that resolves this issue then we should reupload the weights.

@Z1zs did you include the special <image> token in the input sequence?

@zucchini-nlp
Member

Okay, I will make a PR then. I believe @bghira is correct that we should fix the code and not the weights, as loading models in fp16 probably causes the error anyway. I will look into it tomorrow.

@NielsRogge
Contributor

NielsRogge commented Apr 24, 2024

@bghira sorry to hear that but we're planning to have best-in-class support for Llava-NeXT. This issue seems related to the unk token (which @zucchini-nlp is going to look into), batched generation should hopefully be part of the next Transformers release. The issue linked at #30294 was found to be non-reproducible on CUDA and CPU and seems related to a bug in PyTorch regarding the MPS device.

@Z1zs

Z1zs commented Apr 24, 2024

I got it, my problem results from the missing <image> token. The <image> token marks the location where the image features should be inserted during text-image feature merging.
Thanks very much for the reminder @NielsRogge @bghira! Thanks for all the efforts and sorry for any trouble caused.

@bghira

bghira commented Apr 24, 2024

@bghira sorry to hear that but we're planning to have best-in-class support for Llava-NeXT. This issue seems related to the unk token (which @zucchini-nlp is going to look into), batched generation should hopefully be part of the next Transformers release. The issue linked at #30294 was found to be non-reproducible on CUDA and CPU and seems related to a bug in PyTorch regarding the MPS device.

I think the automated tests need to catch things like this if the goal is best-in-class support; it can't just be left to users and downstream developers to discover that these things weren't tested or usable.

@NielsRogge
Contributor

@bghira true, batched generation is not enforced in the default tests so far; one writes (slow) integration tests for these, e.g. here for LLaVA. cc @zucchini-nlp @gante - I think it'd be great if we had batched generation tests by default, which would force contributors (like me) to make sure this is supported and tested from the start; not sure if that's feasible.

Regarding the issue with <unk> token => that's something we could also add default tests for, this is currently not tested. cc @ydshieh do you think we should add a test to test_modeling_common.py to make sure models work as expected during a forward pass when passing a sequence that includes an <unk> token?

@bghira

bghira commented Apr 24, 2024

Thank you. The first impression is ever so important: it has already taken so long to get LLaVA 1.6 support anywhere but its own demo codebase that once you test it and everything falls apart, there is a real risk of losing interest in the model after spending a few hours trying to determine where it went wrong.

Also, I am not doing batched generation. On MPS, it fails even in the normal setup.

@ydshieh
Collaborator

ydshieh commented Apr 25, 2024

Regarding the issue with the <unk> token => that's something we could also add default tests for, this is currently not tested. cc @ydshieh do you think we should add a test to test_modeling_common.py to make sure models work as expected during a forward pass when passing a sequence that includes an <unk> token?

A question first: is the problem that there is an <unk> token (id) in the model input (the general case), or that there is an <unk> string in the input text which is kept as the <unk> (id) after being encoded?

(I think the problem happens whenever there is an <unk> token (id), no matter where it comes from - but just to be sure.)


When the input contains an unknown token, it gets a zero embedding and is then wrongly masked as an image token there.

OK, my above understanding is correct.

Regarding testing, it's not very clear to me, as the problem may happen with any token (or at least any special token). Even if we restrict it to the <unk> token only, the issue raised here is likely to happen only for special models like LLaVA-NeXT, as text-only models won't have an issue like that (they just use an Embedding layer to embed tokens).

I would suggest focusing on resolving the zero-embedding issue. But if anyone has a good suggestion regarding how to write such a test, I am happy to think about it.

@zucchini-nlp
Member

I think it'd be great if we had batched generation tests by default which force contributors (like me) to make sure this is supported and tested from the start, not sure if feasible.

Yes, I am working on adding GenerationTesterMixin for multimodal models, and hope that will help to catch more errors in the stage of adding the model 😄

@aliencaocao
Contributor Author

There are already image token ids in the merge-input-features logic. Why not reuse them?
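(A short sketch of that suggestion, with hypothetical ids rather than the actual transformers code: a mask built from the image token id ignores zeroed embedding rows entirely.)

import torch

image_token_index = 32000
input_ids = torch.tensor([[1, 5, image_token_index, 0, 7]])   # 0 = <unk> with zeroed weights
special_image_mask = input_ids == image_token_index
print(special_image_mask.sum().item())   # 1, regardless of any zero embeddings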

@huttersadan

The problem is caused by the loss of <image> and <pad> in the processor; you can solve it by adding the following code:

processor.tokenizer.add_tokens(["<image>", "<pad>"], special_tokens=True)
model.resize_token_embeddings(len(processor.tokenizer))

The solution is from https://huggingface.co/IlyasMoutawwakil/tiny-random-LlavaForConditionalGeneration/discussions/1.
It worked for me.

@hxhcreate

I may have encountered the same issue with transformers==4.41.0:

ValueError: The input provided to the model are wrong. The number of image tokens is 4 while the number of image given to the model is 1. This prevents correct indexing and breaks batch generation.

The reason may be that the data itself contains an <image> token. Can this be addressed internally in LlavaProcessor, or should I change my transformers version?

Thanks

@ShobhaRajanna

Hi,

I encountered a ValueError when fine-tuning the llava-hf/llava-1.5-7b-hf model using HuggingFace. The error says:
"The input provided to the model are wrong. The number of image tokens is 0 while the number of images given to the model is 1. This prevents correct indexing and breaks batch generation."

My dataset is in this format: {
"image": "data/png_images/train/00038.png",
"instruction": "Please provide the attributes of this GNSS image:\n- Class\n- Amplitude\n- Area\n- Subjammer\n- Position\n- Environment.",
"response": "Class: 2, Amplitude: 6, Area: 2, Subjammer: Chirp_LinearFast_BW15_MXG.bin, Position: 1, Environment: 1"
}
I’m using transformers version 4.46.0. The error seems to occur because the model is not processing the image inputs properly, leading to a mismatch in the number of image tokens.

Could you please help me understand why this is happening and how to fix it?

@Z1zs

Z1zs commented Dec 5, 2024

Hi,

I encountered a ValueError when fine-tuning the llava-hf/llava-1.5-7b-hf model using HuggingFace. The error says: "The input provided to the model are wrong. The number of image tokens is 0 while the number of images given to the model is 1. This prevents correct indexing and breaks batch generation."

My dataset is in this format: { "image": "data/png_images/train/00038.png", "instruction": "Please provide the attributes of this GNSS image:\n- Class\n- Amplitude\n- Area\n- Subjammer\n- Position\n- Environment.", "response": "Class: 2, Amplitude: 6, Area: 2, Subjammer: Chirp_LinearFast_BW15_MXG.bin, Position: 1, Environment: 1" } I’m using transformers version 4.46.0. The error seems to occur because the model is not processing the image inputs properly, leading to a mismatch in the number of image tokens.

Could you please help me understand why this is happening and how to fix it?

@ShobhaRajanna Could you please provide more details about your setup or share a minimal reproducible code snippet?

I encountered a similar issue when fine-tuning llava-hf/llava-v1.6-vicuna-7b-hf, and it turned out to be related to the accelerate library and the DDP setting. To resolve it, I launched the script directly with Python, which worked for me. What's more, using a single GPU with batch_size=1 also worked.

(However, I haven't figured out what exactly causes this error 😅)

@ShobhaRajanna

ShobhaRajanna commented Dec 5, 2024

Hi,
I encountered a ValueError when fine-tuning the llava-hf/llava-1.5-7b-hf model using HuggingFace. The error says: "The input provided to the model are wrong. The number of image tokens is 0 while the number of images given to the model is 1. This prevents correct indexing and breaks batch generation."
My dataset is in this format: { "image": "data/png_images/train/00038.png", "instruction": "Please provide the attributes of this GNSS image:\n- Class\n- Amplitude\n- Area\n- Subjammer\n- Position\n- Environment.", "response": "Class: 2, Amplitude: 6, Area: 2, Subjammer: Chirp_LinearFast_BW15_MXG.bin, Position: 1, Environment: 1" } I’m using transformers version 4.46.0. The error seems to occur because the model is not processing the image inputs properly, leading to a mismatch in the number of image tokens.
Could you please help me understand why this is happening and how to fix it?

@ShobhaRajanna Could you plz provide more details about your setup or share a minimal reproducible code snippet?

I encountered a similar issue when fine-tuning llava-hf/llava-v1.6-vicuna-7b-hf, and it turned out to be related to the accelerate library and DDP setting. To resolve it, I try to launch the script directly with Python. It worked successfully for me. What's more, using one single GPU and set batch_size=1 also worked.

(However I haven't figure out what exactly cause this error😅)
Hi there @Z1zs ,


I'm encountering an issue with the format_example function when preparing inputs for fine-tuning the model. Specifically, it seems to be related to how the <image> token is included in the text input and how the model processes image tokens.

def format_example(example):
    image_path = example.get("image", None)
    if not os.path.exists(image_path):
        print(f"Image not found at path: {image_path}.")
        return None
    image = Image.open(image_path)

    image_array = np.array(image).astype(np.float32) / 255.0
    image_array = np.clip(image_array, 0, 1)

    text_input = f"<image>\n{example['instruction']}\nASSISTANT:"
    response = example["response"]
    # print(f"Text input: {text_input}")
    # print(f"Response: {response}")
    processed = processor(
        images=image,
        text=text_input,
        return_tensors="pt",
        padding=True,
        truncation=True,
        do_rescale=False,
        max_length=512,
    )
    input_ids = processed.input_ids
    pixel_values = processed.pixel_values.squeeze(0)
    # Debugging statements
    print(f"Text Input: {text_input}")
    print(f"Tokenized Input IDs: {input_ids}")
    print(f"Decoded Input Text: {tokenizer.decode(input_ids.squeeze(0).tolist(), skip_special_tokens=False)}")
    print(f"Pixel Values Shape: {pixel_values.shape}")

    return {
        "pixel_values": pixel_values,
        "text": text_input,
        "image": image_path,
    }

Scenario 1: in my format_example function, I add the <image> token to the text_input as shown above. This leads to:

Evaluating Model Accuracy...

Error processing example 1: The input provided to the model are wrong. The number of image tokens is 511 while the number of images given to the model is 1. This prevents correct indexing and breaks batch generation.
Scenario 2: if I exclude the <image> token from the text_input (text_input = f"{example['instruction']}\nASSISTANT:"),
I instead get the following error: ValueError: The input provided to the model are wrong. The number of image tokens is 0 while the number of images given to the model is 1. This prevents correct indexing and breaks batch generation.

This seems to suggest that the model isn't detecting the presence of the image properly without the <image> token.

I verified that the image paths are correct and accessible.
The image pixel values are normalized and properly formatted for the processor.
The <image> token has been added to the tokenizer's vocabulary, and the embeddings have been resized:

if "<image>" not in tokenizer.get_vocab():
    tokenizer.add_tokens(["<image>"], special_tokens=True)
    model.resize_token_embeddings(len(tokenizer))

Could this be an issue with how the processor or tokenizer is set up for image inputs in LLaVA? I'd like to clarify that in my setup, the batch size is already set to 1 for both training and evaluation. Additionally, I'm using the SFTTrainer from the trl library for fine-tuning the model with LoRA. I've verified that both the model and the inputs are properly configured for my environment, but I still encounter the issues above.

Thanks again for your insights any additional suggestions are much appreciated!

@Z1zs

Z1zs commented Dec 6, 2024

@ShobhaRajanna Your problem arises from the wrong number of <image> tokens. Please use processor.apply_chat_template(conversation, add_generation_prompt=True) instead of adding the <image> token manually. Refer to the official example for more information.
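(A minimal sketch of that usage, assuming a recent transformers version and that processor and image are already created as in the snippets above; the instruction text is just an example:)

conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Please provide the attributes of this image."},
        ],
    },
]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
inputs = processor(images=image, text=prompt, return_tensors="pt")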

@LiuJinzhe-Keepgoing

LiuJinzhe-Keepgoing commented Dec 25, 2024

@Z1zs @ShobhaRajanna
Hello, I have a similar problem when I build inputs through the processor, with transformers==4.46.3. The code is as follows:

model_id = "/data/liujinzhe/model/hugging_cache/llava-hf/llava-1.5-7b-hf"
processor = AutoProcessor.from_pretrained(
    model_id,
    revision='a272c74'
)
model = LlavaForConditionalGeneration.from_pretrained(model_id, torch_dtype=torch.bfloat16) 
 
model.eval()
model.cuda()

image_url = "/VLM/llava-mechanism-main/000000219578.jpg"
image = Image.open(image_url) 
prompt = "USER: <image> \nWhat is the color of the dog?\nASSISTANT: The color of the dog is" 
inputs = processor(text=prompt, images=image, return_tensors="pt").to(device)  

print(processor.image_token)
print(inputs["input_ids"][0])
decoded_text = processor.decode(inputs["input_ids"][0], skip_special_tokens=False)
print("Decoded Text:", decoded_text)
 
outputs = model(**inputs)

But the error is:
ValueError: The input provided to the model are wrong. The number of image tokens is 576 while the number of image given to the model is 1. This prevents correct indexing and breaks batch generation.

I found the reason: the inputs built by
inputs = processor(text=prompt, images=image, return_tensors="pt").to(device) contain 576 <image> tokens, but I actually only entered one <image>. My code prints the following:

<image>

tensor([    1,  3148,  1001, 29901, 29871, 32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000,  32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000,  32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000,  32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000,
32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000,
...
11203,   338], device='cuda:0')

Decoded Text: <s> USER: <image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image> ...
What is the color of the dog?
ASSISTANT: The color of the dog is

How should I modify the code to solve this problem? Thank you!

@na69tyzy

@LiuJinzhe-Keepgoing

from transformers import AutoTokenizer, LlavaForConditionalGeneration
from PIL import Image
import numpy as np
import torch
import os

model_id = "llava-hf/llava-1.5-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(model_id, torch_dtype=torch.bfloat16)
model.eval().cuda()

def preprocess_image(image_path, size=(336, 336)):
    if not os.path.exists(image_path):
        raise FileNotFoundError(f"Image not found at path: {image_path}")
    image = Image.open(image_path).convert("RGB").resize(size)
    image_array = np.array(image).astype(np.float32) / 255.0
    return torch.tensor(image_array).permute(2, 0, 1).unsqueeze(0).to("cuda")

def generate_response(model, tokenizer, image_tensor, prompt, max_length=100, num_beams=5, temperature=0.6, top_p=0.9):
    text_inputs = tokenizer(prompt, return_tensors="pt", padding=True, truncation=True, max_length=512).to("cuda")
    inputs = {
        "input_ids": text_inputs["input_ids"],
        "attention_mask": text_inputs["attention_mask"],
        "pixel_values": image_tensor
    }
    outputs = model.generate(
        input_ids=inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        pixel_values=inputs["pixel_values"],
        max_length=max_length,
        num_beams=num_beams,
        temperature=temperature,
        top_p=top_p,
        do_sample=True
    )
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

image_path = "download2.jpeg"
prompt = "\nWhat is the colour of dog?\nASSISTANT:"

image_tensor = preprocess_image(image_path)
response = generate_response(model, tokenizer, image_tensor, prompt)

print("Model Response:", response)
I used the tokenizer instead of the processor because it gave me more control over handling the inputs. The processor tries to manage both the image and the text automatically, but this can sometimes cause issues or add unnecessary complexity. By using the tokenizer, I could handle the text input directly, and I manually processed the image to match the model's requirements. This way, I avoided the potential ValueError: the input provided to the model is wrong. The number of image tokens is 576 while the number of images given to the model is 1.

@LiuJinzhe-Keepgoing

LiuJinzhe-Keepgoing commented Dec 26, 2024

@na69tyzy
Hello, thank you for your help.
I modified the code your way, but I encountered the same problem with transformers==4.37.1. My code is as follows:

from transformers import LlavaForConditionalGeneration, BitsAndBytesConfig, AutoTokenizer

model_id = "/data/liujinzhe/model/hugging_cache/liuhaotian/llava-v1.5-7b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(model_id, torch_dtype=torch.bfloat16)
model.eval().cuda()
image_path = "/home/liujinzhe/code/VLM/llava-mechanism-main/000000219578.jpg"
prompt = "USER: <image>\nWhat is the color of the dog? \nASSISTANT: The color of the dog is"
image_tensor = preprocess_image(image_path)

text_inputs = tokenizer(prompt, return_tensors="pt", padding=True, truncation=True, max_length=512).to("cuda")
inputs = {
    "input_ids": text_inputs["input_ids"],
    "attention_mask": text_inputs["attention_mask"],
    "pixel_values": image_tensor
}

print(inputs["input_ids"][0])
decoded_text = tokenizer.decode(inputs["input_ids"][0], skip_special_tokens=False)
print("Decoded Text:", decoded_text)

outputs = model(**inputs)
outputs_probs = get_prob(outputs["logits"][0][-1])
outputs_probs_sort = torch.argsort(outputs_probs, descending=True)
print([tokenizer.decode(x) for x in outputs_probs_sort[:10]])
print(outputs_probs_sort[:10].tolist())

Error content:

"name": "ValueError",
	"message": "The input provided to the model are wrong. The number of image tokens is 0 while the number of image given to the model is 1. This prevents correct indexing and breaks batch generation.",
	

The contents of prompt printed by my code are as follows:

tensor([    1,  3148,  1001, 29901,   529,  3027, 29958,    13,  5618,   338,
          278,  2927,   310,   278, 11203, 29973,    13, 22933,  9047, 13566,
        29901,   450,  2927,   310,   278, 11203,   338], device='cuda:0')
Decoded Text: <s> USER: <image>
What is the color of the dog?
ASSISTANT: The color of the dog is

Excuse me, if I don't want to change the version of transformers, can I solve this problem in version 4.37.1?
Thank you very much for your help!

@ShobhaRajanna

@LiuJinzhe-Keepgoing I'm using transformers version 4.46.0 and it works for me.
