
Fix: Qwen2-VL training on video datasets #33307

Open · wants to merge 4 commits into main
Conversation

@hiyouga (Contributor) commented Sep 4, 2024

What does this PR do?

We should clone the leaf tensor before doing the in-place operation; otherwise it raises an exception during training:

File "/site-packages/transformers/models/qwen2_vl/modeling_qwen2_vl.py", line 1589, in forward
    inputs_embeds[video_mask] = video_embeds
RuntimeError: a leaf Variable that requires grad is being used in an in-place operation.
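
For reference, a minimal standalone repro of this failure (illustrative tensors, not the model code):

import torch

# A leaf tensor that requires grad, standing in for inputs_embeds during training
inputs_embeds = torch.randn(4, 8, requires_grad=True)
video_mask = torch.zeros(4, dtype=torch.bool)
video_mask[1] = True
video_embeds = torch.randn(1, 8)

try:
    inputs_embeds[video_mask] = video_embeds  # in-place write on a leaf tensor
except RuntimeError as e:
    print(e)  # "a leaf Variable that requires grad is being used in an in-place operation."

# clone() yields a non-leaf tensor, so the in-place write is allowed and
# gradients still flow back to the original leaf through the clone.
inputs_embeds = inputs_embeds.clone()
inputs_embeds[video_mask] = video_embeds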

Similar to the existing image branch:

if pixel_values is not None:
    pixel_values = pixel_values.type(self.visual.get_dtype())
    image_embeds = self.visual(pixel_values, grid_thw=image_grid_thw).to(inputs_embeds.device)
    image_mask = input_ids == self.config.image_token_id
    if self.training:
        inputs_embeds = inputs_embeds.clone()
    inputs_embeds[image_mask] = image_embeds
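
The video branch this PR fixes would presumably mirror that image branch with the same clone-before-write pattern; a sketch adapted from the snippet above (names like pixel_values_videos and video_grid_thw follow the Qwen2-VL signature, but the exact diff may differ):

if pixel_values_videos is not None:
    pixel_values_videos = pixel_values_videos.type(self.visual.get_dtype())
    video_embeds = self.visual(pixel_values_videos, grid_thw=video_grid_thw).to(inputs_embeds.device)
    video_mask = input_ids == self.config.video_token_id
    if self.training:
        inputs_embeds = inputs_embeds.clone()
    inputs_embeds[video_mask] = video_embeds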

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline, Pull Request section?
  • Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

@ArthurZucker @zucchini-nlp @simonJJJ

@zucchini-nlp (Member)

Oh yeah, and this also prevents torch.compile. Can we use torch.masked_scatter instead, for consistency among VLMs and readability (without any if/else)?
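
For illustration, a minimal sketch of the masked_scatter approach (shapes and tensors are illustrative, not the model's actual code):

import torch

inputs_embeds = torch.randn(2, 10, 16, requires_grad=True)  # (batch, seq, hidden), leaf tensor
video_mask = torch.zeros(2, 10, dtype=torch.bool)
video_mask[0, 3:5] = True                                    # positions of video tokens
video_embeds = torch.randn(2, 16)                            # one row per masked position

# masked_scatter is out-of-place, so it works on a leaf tensor without clone()
# and avoids the in-place boolean-mask write; no `if self.training` branch needed.
mask = video_mask.unsqueeze(-1).expand_as(inputs_embeds)
inputs_embeds = inputs_embeds.masked_scatter(mask, video_embeds)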

@hiyouga (Contributor, Author) commented Sep 4, 2024

@zucchini-nlp We have updated the implementation, how about the new one?

@zucchini-nlp (Member)

@hiyouga sorry, what do you mean by "the new one"?

@hiyouga (Contributor, Author) commented Sep 4, 2024

@zucchini-nlp sorry, I mean the latest commit in this PR: 96286c3

@zucchini-nlp (Member) left a comment


Thanks, looks good to me!

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@ArthurZucker (Collaborator) left a comment


Nice! I would want to add a test to make sure this works, but it's good to merge as well.

@hiyouga (Contributor, Author) commented Sep 6, 2024

@ArthurZucker Hi, I think this slow test case already covers the patch; implementing a dedicated test case for training may require non-trivial effort.

@slow
def test_small_model_integration_test(self):
    model = Qwen2VLForConditionalGeneration.from_pretrained(
        "Qwen/Qwen2-VL-7B-Instruct", torch_dtype="auto", device_map="auto"
    )
    text = self.processor.apply_chat_template(self.messages, tokenize=False, add_generation_prompt=True)
    inputs = self.processor(text=[text], images=[self.image], return_tensors="pt")
    expected_input_ids = [151644, 8948, 198, 2610, 525, 264, 10950, 17847, 13, 151645, 198, 151644, 872, 198, 151652, 151655, 151655]  # fmt: skip
    assert expected_input_ids == inputs.input_ids[0].tolist()[:17]
    expected_pixel_slice = torch.tensor(
        [
            [0.8792, 0.8792, 0.9084],
            [1.1858, 1.1858, 1.2296],
            [1.2004, 1.2004, 1.2150],
            [1.4340, 1.4340, 1.4194],
            [1.3902, 1.4048, 1.4194],
            [1.5216, 1.5362, 1.5362],
        ],
        dtype=torch.float32,
        device="cpu",
    )
    assert torch.allclose(expected_pixel_slice, inputs.pixel_values[:6, :3], atol=3e-3)

    # verify generation
    inputs = inputs.to(torch_device)
    output = model.generate(**inputs, max_new_tokens=30)
    EXPECTED_DECODED_TEXT = "system\nYou are a helpful assistant.\nuser\nWhat kind of dog is this?\nassistant\nThe dog in the picture appears to be a Labrador Retriever. Labradors are known for their friendly and intelligent nature, making them popular pets"
    self.assertEqual(
        self.processor.decode(output[0], skip_special_tokens=True),
        EXPECTED_DECODED_TEXT,
    )

@zucchini-nlp (Member)

IMO the fullgraph compile tests should capture this, but those are part of GenerationTester and Qwen2VL doesn't have them yet. For that we need generation tests with image + text + (optional video) inputs; I will add those soon. Tracker here: #33374
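
Until those land, a minimal standalone check that the out-of-place merge traces under a fullgraph compile (a sketch; it assumes the mask-and-scatter merge is the compile-sensitive part, and all names and shapes are illustrative):

import torch

def merge(inputs_embeds, token_mask, media_embeds):
    # masked_scatter is out-of-place and traceable, unlike the in-place boolean-mask write
    mask = token_mask.unsqueeze(-1).expand_as(inputs_embeds)
    return inputs_embeds.masked_scatter(mask, media_embeds)

compiled_merge = torch.compile(merge, fullgraph=True)  # fullgraph=True errors on any graph break

embeds = torch.randn(2, 10, 16)
mask = torch.zeros(2, 10, dtype=torch.bool)
mask[:, 2:4] = True
media = torch.randn(4, 16)  # one row per masked position
out = compiled_merge(embeds, mask, media)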
