Optimize ForCausalLMLoss by removing unnecessary contiguous() call to reduce memory overhead #35646
Conversation
Sounds good but before merging could you add some performance data? 🤗
Condition: Without clearing the cache, the …
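For context on what such performance data might look like to collect, here is a hypothetical micro-benchmark sketch; the function name, tensor sizes, and setup are illustrative assumptions, not the script or numbers used in this PR.

```python
import torch

def peak_memory_mb(loss_fn, batch=2, seq_len=2048, vocab=32000, device="cuda"):
    """Run one forward/backward pass of loss_fn and report peak GPU memory in MiB.

    Hypothetical harness: loss_fn is any callable taking (logits, labels).
    """
    torch.cuda.empty_cache()
    torch.cuda.reset_peak_memory_stats(device)
    logits = torch.randn(batch, seq_len, vocab, device=device, requires_grad=True)
    labels = torch.randint(0, vocab, (batch, seq_len), device=device)
    loss_fn(logits, labels).backward()
    torch.cuda.synchronize(device)
    return torch.cuda.max_memory_allocated(device) / 2**20
```

Running the old and new loss implementations through such a harness would make the memory saving concrete.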
Thank you very much. I have some questions: can you explain your test case? I don't understand the meaning of `~create_mask(max_seqlen, half_prompts_lens)`, `mask`, `start_pos_id`, and `end_pos_id`. Can you give me your personal email? @efsotr
I like this PR!

Explaining it (for myself and reviewers): `logits` is (batch, seq_len, vocab_size) and `labels` is just (batch, seq_len). Also, `logits` is a float tensor that carries gradient, and `labels` is not. Therefore, you want to avoid small manipulations of `logits` as much as possible, because they are slow and because they pollute the graph with extra temporary tensors that have to be saved for backprop.

Before this PR, we shifted `logits` and `labels` one position each, in opposite directions. After this PR, we keep `logits` static and pad+shift `labels`. This is not exactly equivalent, so the code compensates by using the `ignore_index` of -100 as the pad value, which restores equivalence.

This is a great change. Some models compute loss internally and don't use this function, so if you want to do a follow-up PR to search the codebase for `shift_logits` and update the logic there too, that would be a nice speed boost for them as well. Thank you!

(Merging without @ArthurZucker approval because the code has the same output, it's just faster + less memory)
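For illustration, here is a minimal sketch of the before/after behaviour described in that comment. It is not the exact transformers implementation; the function names are made up, and it assumes the standard `ignore_index` of -100 used by PyTorch's cross-entropy.

```python
import torch.nn.functional as F

def causal_lm_loss_before(logits, labels, ignore_index=-100):
    # Old approach: shift logits and labels one position in opposite
    # directions; slicing logits forces a contiguous() copy of the large
    # (batch, seq_len, vocab_size) float tensor.
    shift_logits = logits[..., :-1, :].contiguous()
    shift_labels = labels[..., 1:].contiguous()
    return F.cross_entropy(
        shift_logits.view(-1, shift_logits.size(-1)),
        shift_labels.view(-1),
        ignore_index=ignore_index,
    )

def causal_lm_loss_after(logits, labels, ignore_index=-100):
    # New approach: leave logits untouched; pad the small integer labels
    # tensor with ignore_index on the right and shift it instead, so the
    # final position is ignored by the loss and no logits copy is made.
    shift_labels = F.pad(labels, (0, 1), value=ignore_index)[..., 1:].contiguous()
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        shift_labels.view(-1),
        ignore_index=ignore_index,
    )
```

Both variants yield the same loss value; the second simply avoids materialising an extra copy of the logits for the forward and backward pass.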
LGTM anyways, thanks all 🤗
Optimize ForCausalLMLoss by removing unnecessary contiguous() calls to reduce memory overhead (huggingface#35646)
What does this PR do?
This PR removes an unnecessary `contiguous()` call from `ForCausalLMLoss`, reducing memory overhead during loss computation.
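As a usage note, a sketch assuming a recent transformers version in which causal LMs route their loss through `ForCausalLMLoss` (the model name is just an example): the optimized path is exercised whenever labels are passed to a causal LM.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tok("Hello world", return_tensors="pt")
# Passing labels triggers the internal loss computation touched by this PR.
outputs = model(**inputs, labels=inputs["input_ids"])
print(outputs.loss)
```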
Before submitting
- Did you read the contributor guideline, Pull Request section?
- Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
- Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.