Fix RNN-T loss memory usage #11144

artbataev · 2024-11-04T15:52:25Z

What does this PR do ?

Fixes memory usage for Numba-based implementation of RNN-T and Multi-blank Transducer losses.
The current implementation needs 3x the memory of the logits tensor: the logits, the gradient, and an extra buffer the same size as the logits. This PR reduces that to the minimum possible 2x (logits and gradient).

It looks like assigning tensors directly to ctx, instead of saving them through ctx.save_for_backward, breaks PyTorch's logic and causes it to copy the gradient tensor, which results in the extra memory usage.
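The pattern can be illustrated with a toy example. This is a minimal sketch, not the NeMo implementation: `SquaredSumLoss` is a hypothetical function that, like the transducer losses, computes its gradient during the forward pass and hands it to autograd through `ctx.save_for_backward` rather than `ctx.grads = grads`. It runs on CPU and only demonstrates the API usage, not the memory saving.

```python
import torch


class SquaredSumLoss(torch.autograd.Function):
    """Hypothetical toy loss that precomputes its gradient in forward."""

    @staticmethod
    def forward(ctx, logits):
        # Per-item loss sum(x^2); its gradient w.r.t. x is 2 * x, computed eagerly.
        grads = 2.0 * logits.detach()
        loss = (logits.detach() ** 2).sum(dim=-1)
        # Register the tensor with autograd instead of writing `ctx.grads = grads`.
        ctx.save_for_backward(grads)
        return loss

    @staticmethod
    def backward(ctx, grad_output):
        (grads,) = ctx.saved_tensors
        # Broadcast the per-item upstream gradient over the feature dimension.
        return grads * grad_output.unsqueeze(-1)


x = torch.randn(3, 4, requires_grad=True)
SquaredSumLoss.apply(x).sum().backward()
```

After the backward call, `x.grad` holds `2 * x`, which matches the analytic gradient of the toy loss.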

The TDT loss was not affected by this issue (I'm unsure why, but it does require a contiguous tensor for the label-related logits).

Before (main):

Peak memory before loss: 2.09 GB
Peak memory after loss: 6.27 GB

After (this PR):

Peak memory before loss: 2.09 GB
Peak memory after loss: 4.18 GB
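These figures line up with the size of the logits tensor in the benchmark. A quick check, assuming the shape and dtype used in the script (`[32, 188, 91, 1024]`, float32); the variable names are only illustrative:

```python
# Shape and dtype taken from the benchmark script: [batch, time, labels + 1, vocab], float32.
batch_size, num_frames, num_labels_plus_one, vocab_size = 32, 188, 90 + 1, 1024
bytes_per_element = 4  # torch.float32

num_elements = batch_size * num_frames * num_labels_plus_one * vocab_size
logits_gib = num_elements * bytes_per_element / 2**30

print(f"logits only:             {logits_gib:.2f}")      # 2.09
print(f"logits + gradient (2x):  {2 * logits_gib:.2f}")  # 4.18
print(f"with an extra copy (3x): {3 * logits_gib:.2f}")  # 6.27
```

So 2.09 GB is the logits alone, 4.18 GB is logits plus gradient, and 6.27 GB corresponds to one additional logits-sized allocation.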

Code to check memory usage (size of tensors besides logits is negligible compared to logits):

import torch
from nemo.collections.asr.parts.numba.rnnt_loss import RNNTLossNumba

device = torch.device("cuda")

# Blank is the last index of the 1024-entry vocabulary; keep per-utterance losses.
loss = RNNTLossNumba(blank=1023, reduction='none')
batch_size = 32
# Logits shape: [batch, time, labels + 1, vocab]; this tensor dominates memory usage.
logits = torch.rand([batch_size, 188, 90 + 1, 1024], device=device, dtype=torch.float32, requires_grad=True)
encoder_lengths = torch.full([batch_size], fill_value=188, device=device, dtype=torch.long)
label_lengths = torch.full([batch_size], fill_value=90, device=device, dtype=torch.long)
labels = torch.randint(0, 1022, size=[batch_size, 90], dtype=torch.long, device=device)
print(f"Peak memory before loss: {torch.cuda.max_memory_allocated() / (2 ** 30):.2f} GB")

# Forward and backward pass; the peak now includes the gradient and any extra buffers.
loss_value = loss(acts=logits, act_lens=encoder_lengths, labels=labels, label_lens=label_lengths)
loss_value.sum().backward()
print(f"Peak memory after loss: {torch.cuda.max_memory_allocated() / (2 ** 30):.2f} GB")
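The script above reports the peak since the start of the process. If the measurement needs to be repeated within one process, the peak counter can be reset first. The helper below is a sketch under that assumption; `peak_cuda_memory_gib` is a hypothetical name, not part of NeMo or PyTorch, and it returns `None` on machines without CUDA.

```python
import torch


def peak_cuda_memory_gib(fn):
    """Run fn() and return the peak CUDA memory allocated during the call, in GiB.

    The peak includes tensors that were already alive before the call.
    Returns None when CUDA is not available.
    """
    if not torch.cuda.is_available():
        return None
    torch.cuda.synchronize()
    # Reset the peak counter so earlier allocations peaks do not leak into this measurement.
    torch.cuda.reset_peak_memory_stats()
    fn()
    torch.cuda.synchronize()
    return torch.cuda.max_memory_allocated() / 2**30


result = peak_cuda_memory_gib(lambda: None)
```

With the benchmark above, `fn` could be a closure that runs the loss and calls `backward()`.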

Collection: [ASR]

Changelog

In transducer-related functions that extend autograd, save tensors using ctx.save_for_backward(...) instead of assigning them directly to ctx, as recommended by the PyTorch documentation.

Usage

You can potentially add a usage example below

# Add a code snippet demonstrating how to use this

GitHub Actions CI

The Jenkins CI system has been replaced by GitHub Actions self-hosted runners.

The GitHub Actions CI will run automatically when the "Run CICD" label is added to the PR.
To re-run CI remove and add the label again.
To run CI on an untrusted fork, a NeMo user with write access must first click "Approve and run".

Before your PR is "Ready for review"

Pre checks:

Make sure you read and followed Contributor guidelines
Did you write any new necessary tests?
Did you add or update any necessary documentation?
Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
- Reviewer: Does the PR have correct import guards for all optional libraries?

PR Type:

New Feature
Bugfix
Documentation

If you haven't finished some of the above items you can still open "Draft" PR.

Who can review?

Anyone in the community is free to review the PR once the checks have passed.
Contributor guidelines contains specific people who can review PRs to various areas.

Additional Information

Related to # (issue)

Signed-off-by: Vladimir Bataev <vbataev@nvidia.com>

Signed-off-by: artbataev <artbataev@users.noreply.github.com>

hainan-xv

Great discovery of the memory usage issue and a very clean fix! Approved and thanks.

github-actions · 2024-11-04T18:10:49Z

[🤖]: Hi @artbataev 👋,

We wanted to let you know that a CICD pipeline for this PR just finished successfully.

So it might be time to merge this PR or get some approvals.

I'm just a bot, so I'll leave it to you what to do next.

//cc @pablo-garay @ko3n1g

titu1994

Thanks for the fix !

* Fix RNN-T memory usage Signed-off-by: artbataev <artbataev@users.noreply.github.com> --------- Signed-off-by: Vladimir Bataev <vbataev@nvidia.com> Signed-off-by: Hainan Xu <hainanx@nvidia.com>

artbataev added 2 commits November 4, 2024 19:23

Fix RNN-T memory usage

e448571

Signed-off-by: Vladimir Bataev <vbataev@nvidia.com>

Fix TDT

c089014

Signed-off-by: Vladimir Bataev <vbataev@nvidia.com>

artbataev requested review from titu1994 and hainan-xv November 4, 2024 15:52

github-actions bot added the ASR label Nov 4, 2024

artbataev and others added 2 commits November 4, 2024 15:53

Apply isort and black reformatting

26ca72d

Signed-off-by: artbataev <artbataev@users.noreply.github.com>

Merge branch 'main' into fix_rnnt_memory_usage

eb7ad19

artbataev added the Run CICD label Nov 4, 2024

hainan-xv approved these changes Nov 4, 2024

View reviewed changes

titu1994 approved these changes Nov 4, 2024

View reviewed changes

artbataev merged commit d19e9d3 into main Nov 4, 2024
162 of 163 checks passed

artbataev deleted the fix_rnnt_memory_usage branch November 4, 2024 19:09

lilyw97 pushed a commit to lilyw97/NeMo that referenced this pull request Nov 13, 2024

Fix RNN-T loss memory usage (NVIDIA#11144)

6ac83cf

* Fix RNN-T memory usage Signed-off-by: artbataev <artbataev@users.noreply.github.com> --------- Signed-off-by: Vladimir Bataev <vbataev@nvidia.com>

HuiyingLi pushed a commit to HuiyingLi/NeMo that referenced this pull request Nov 15, 2024

Fix RNN-T loss memory usage (NVIDIA#11144)

4476313

* Fix RNN-T memory usage Signed-off-by: artbataev <artbataev@users.noreply.github.com> --------- Signed-off-by: Vladimir Bataev <vbataev@nvidia.com>

yashaswikarnati pushed a commit that referenced this pull request Nov 21, 2024

Fix RNN-T loss memory usage (#11144)

1be5825

* Fix RNN-T memory usage Signed-off-by: artbataev <artbataev@users.noreply.github.com> --------- Signed-off-by: Vladimir Bataev <vbataev@nvidia.com>
