
Tensor Size Mismatch in language decoder on evaluation of BLIP with COCO caption task #241

Open
Luodian opened this issue Apr 9, 2023 · 4 comments

Comments

@Luodian

Luodian commented Apr 9, 2023

Hi, sorry to bother you. It would be much appreciated if you could take a look at this error with BLIP (v1) on the COCO caption task.
I was running the command

python -m torch.distributed.run --nproc_per_node=1 LAVIS/evaluate.py --cfg-path=LAVIS/lavis/projects/blip/eval/caption_coco_eval.yaml

and it was:

  • without any modifications
  • just a simple default run
  • (I don't have any errors with other tasks)

The error I encountered is

Exception has occurred: RuntimeError
The size of tensor a (192) must match the size of tensor b (576) at non-singleton dimension 0

The error was raised at:

...
generate_from_encoder (/home/LAVIS/lavis/models/med.py:1360)
generate (/home/LAVIS/lavis/models/blip_models/blip_caption.py:188)
...

The dimensions of the relevant tensors are shown below:

[screenshot of the relevant tensor shapes]

@dxli94 (Contributor)

dxli94 commented Apr 10, 2023

You may want to downgrade your transformers version to >=4.25.0,<4.27.
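
For example, with pip:

    pip install "transformers>=4.25.0,<4.27"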

@Luodian (Author)

Luodian commented Apr 10, 2023

Thanks! Is there a known cause for this error? If so, maybe I could fix it while keeping a higher transformers version; I may need 4.28.dev0 for my own project.

@dxli94 (Contributor)

dxli94 commented Apr 10, 2023

@Luodian, we haven't looked into the issue yet. If you can help investigate and possibly open a PR, that'd be very helpful.

@Luodian (Author)

Luodian commented Apr 12, 2023

Hi, I think the reason is that the prompt is not repeated when num_beams > 1.

In lavis/models/med.py, line 1331, visual_embeds is repeated num_beams times, but tokenized_prompt.input_ids is not:

        if not use_nucleus_sampling:
            num_beams = num_beams
            # visual_embeds grows by a factor of num_beams along dim 0,
            # while tokenized_prompt.input_ids keeps the original batch size
            visual_embeds = visual_embeds.repeat_interleave(num_beams, dim=0)
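
This reproduces the mismatch in isolation (a minimal sketch: the batch size 192 comes from the error message above; num_beams=3 and the embedding shape are my assumptions, since 576 / 192 = 3):

    import torch

    batch, num_beams = 192, 3                              # 576 / 192 = 3 (assumed beam width)
    visual_embeds = torch.zeros(batch, 577, 768)           # hypothetical visual encoder output
    prompt_ids = torch.zeros(batch, 12, dtype=torch.long)  # un-repeated prompt tokens

    visual_embeds = visual_embeds.repeat_interleave(num_beams, dim=0)
    print(visual_embeds.shape[0], prompt_ids.shape[0])     # 576 vs. 192 -> mismatch at dim 0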

Please verify whether this is indeed the root cause.

I have submitted two pull requests (PRs) to address this issue. You may choose either one for review and potential merging.

The first PR (Luodian:fix-blip_caption/coco_caption_eval) directly repeats the tokenized_prompt in blip_caption.py.

The second PR (Luodian:fix-med/coco_caption_eval) adds the repeat logic inside med.py itself and ensures the dimensions stay aligned, which may also cover other call sites.
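
For reference, the first fix amounts to something like the following (a minimal sketch, not the exact PR diff; variable names follow the code quoted above, and I assume tokenized_prompt carries input_ids and attention_mask tensors):

        if not use_nucleus_sampling:
            visual_embeds = visual_embeds.repeat_interleave(num_beams, dim=0)
            # repeat the prompt tensors by the same factor so dim 0 stays aligned
            input_ids = tokenized_prompt.input_ids.repeat_interleave(num_beams, dim=0)
            attention_mask = tokenized_prompt.attention_mask.repeat_interleave(num_beams, dim=0)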
