
Fix: Qwen2-VL training on video datasets #33307

Open · wants to merge 4 commits into main
Conversation

@hiyouga (Contributor) commented Sep 4, 2024

What does this PR do?

We should clone the leaf tensor before doing the in-place operation; otherwise it raises an exception during training:

File "/site-packages/transformers/models/qwen2_vl/modeling_qwen2_vl.py", line 1589, in forward
    inputs_embeds[video_mask] = video_embeds
RuntimeError: a leaf Variable that requires grad is being used in an in-place operation.
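
For reference, a minimal standalone repro of this failure (illustrative tensors, not the model code):

import torch

# A leaf tensor that requires grad, standing in for inputs_embeds during training
inputs_embeds = torch.randn(4, 8, requires_grad=True)
video_mask = torch.zeros(4, dtype=torch.bool)
video_mask[1] = True
video_embeds = torch.randn(1, 8)

try:
    inputs_embeds[video_mask] = video_embeds  # in-place write on a leaf tensor
except RuntimeError as e:
    print(e)  # "a leaf Variable that requires grad is being used in an in-place operation."

# clone() yields a non-leaf tensor, so the in-place write is allowed and
# gradients still flow back to the original leaf through the clone.
inputs_embeds = inputs_embeds.clone()
inputs_embeds[video_mask] = video_embeds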

Similar to the existing image branch:

if pixel_values is not None:
    pixel_values = pixel_values.type(self.visual.get_dtype())
    image_embeds = self.visual(pixel_values, grid_thw=image_grid_thw).to(inputs_embeds.device)
    image_mask = input_ids == self.config.image_token_id
    if self.training:
        inputs_embeds = inputs_embeds.clone()
    inputs_embeds[image_mask] = image_embeds
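
The video branch this PR fixes would presumably mirror that image branch with the same clone-before-write pattern; a sketch adapted from the snippet above (names like pixel_values_videos and video_grid_thw follow the Qwen2-VL signature, but the exact diff may differ):

if pixel_values_videos is not None:
    pixel_values_videos = pixel_values_videos.type(self.visual.get_dtype())
    video_embeds = self.visual(pixel_values_videos, grid_thw=video_grid_thw).to(inputs_embeds.device)
    video_mask = input_ids == self.config.video_token_id
    if self.training:
        inputs_embeds = inputs_embeds.clone()
    inputs_embeds[video_mask] = video_embeds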

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline, Pull Request section?
  • Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

@ArthurZucker @zucchini-nlp @simonJJJ

@zucchini-nlp (Member)

Oh yeah, and this also prevents torch.compile. Can we use torch.masked_scatter instead, for consistency among VLMs and readability (without any if/else)?
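
For illustration, a minimal sketch of the masked_scatter approach (shapes and tensors are illustrative, not the model's actual code):

import torch

inputs_embeds = torch.randn(2, 10, 16, requires_grad=True)  # (batch, seq, hidden), leaf tensor
video_mask = torch.zeros(2, 10, dtype=torch.bool)
video_mask[0, 3:5] = True                                    # positions of video tokens
video_embeds = torch.randn(2, 16)                            # one row per masked position

# masked_scatter is out-of-place, so it works on a leaf tensor without clone()
# and avoids the in-place boolean-mask write; no `if self.training` branch needed.
mask = video_mask.unsqueeze(-1).expand_as(inputs_embeds)
inputs_embeds = inputs_embeds.masked_scatter(mask, video_embeds)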

@hiyouga (Contributor, Author) commented Sep 4, 2024

@zucchini-nlp We have updated the implementation, how about the new one?

@zucchini-nlp (Member)

@hiyouga sorry, what do you mean by "the new one"?

@hiyouga (Contributor, Author) commented Sep 4, 2024

@zucchini-nlp sorry, I mean the latest commit in this PR: 96286c3

@zucchini-nlp (Member) left a comment


Thanks, looks good to me!

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@ArthurZucker (Collaborator) left a comment


Nice! I would want to add a test to make sure this works, but it's good to merge as well.

@hiyouga (Contributor, Author) commented Sep 6, 2024

@ArthurZucker Hi, I think this slow test case already covers the patch; implementing a dedicated test case for training may require non-trivial effort.

@slow
def test_small_model_integration_test(self):
    model = Qwen2VLForConditionalGeneration.from_pretrained(
        "Qwen/Qwen2-VL-7B-Instruct", torch_dtype="auto", device_map="auto"
    )
    text = self.processor.apply_chat_template(self.messages, tokenize=False, add_generation_prompt=True)
    inputs = self.processor(text=[text], images=[self.image], return_tensors="pt")
    expected_input_ids = [151644, 8948, 198, 2610, 525, 264, 10950, 17847, 13, 151645, 198, 151644, 872, 198, 151652, 151655, 151655]  # fmt: skip
    assert expected_input_ids == inputs.input_ids[0].tolist()[:17]
    expected_pixel_slice = torch.tensor(
        [
            [0.8792, 0.8792, 0.9084],
            [1.1858, 1.1858, 1.2296],
            [1.2004, 1.2004, 1.2150],
            [1.4340, 1.4340, 1.4194],
            [1.3902, 1.4048, 1.4194],
            [1.5216, 1.5362, 1.5362],
        ],
        dtype=torch.float32,
        device="cpu",
    )
    assert torch.allclose(expected_pixel_slice, inputs.pixel_values[:6, :3], atol=3e-3)

    # verify generation
    inputs = inputs.to(torch_device)
    output = model.generate(**inputs, max_new_tokens=30)
    EXPECTED_DECODED_TEXT = "system\nYou are a helpful assistant.\nuser\nWhat kind of dog is this?\nassistant\nThe dog in the picture appears to be a Labrador Retriever. Labradors are known for their friendly and intelligent nature, making them popular pets"
    self.assertEqual(
        self.processor.decode(output[0], skip_special_tokens=True),
        EXPECTED_DECODED_TEXT,
    )

@zucchini-nlp (Member)

IMO the fullgraph compile tests should capture this, but those are part of GenerationTester and Qwen2VL doesn't have them yet. For that we need generation tests with image + text + (optional video) inputs; I will add those soon. Tracker here: #33374
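
Until those land, a minimal standalone check that the out-of-place merge traces under a fullgraph compile (a sketch; it assumes the mask-and-scatter merge is the compile-sensitive part, and all names and shapes are illustrative):

import torch

def merge(inputs_embeds, token_mask, media_embeds):
    # masked_scatter is out-of-place and traceable, unlike the in-place boolean-mask write
    mask = token_mask.unsqueeze(-1).expand_as(inputs_embeds)
    return inputs_embeds.masked_scatter(mask, media_embeds)

compiled_merge = torch.compile(merge, fullgraph=True)  # fullgraph=True errors on any graph break

embeds = torch.randn(2, 10, 16)
mask = torch.zeros(2, 10, dtype=torch.bool)
mask[:, 2:4] = True
media = torch.randn(4, 16)  # one row per masked position
out = compiled_merge(embeds, mask, media)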
