RNN-T and TDT inference: use CUDA graphs by default #8972

artbataev · 2024-04-18T15:03:05Z

What does this PR do ?

enable CUDA graphs by default for inference (transcribe_speech.py, speech_to_text_eval.py)
add fallback behavior with partial CUDA graphs for Label-Looping algorithm implementation (TDT, RNN-T)
- Use Cuda graphs with while loops when all requirements are met (driver 545+, Cuda 12.3+, cuda-python package)
- Use Cuda graphs without while loops (pure PyTorch functionality) in most other cases. The idea: use several separate Cuda graphs for parts of the decoding algorithm (<graph before outer loop> -> while loop (python) -> <graph before inner loop> -> inner while loop (python) -> <graph for inner loop code> etc.)

On my local machine, FastConformer L, LibriSpeech test-other decoding time, bfloat16, bs=16

	no graphs	partial graphs	full graph
unsorted batch	34s	21s	19s
sorted batch	23s	15s	14s

Collection: [ASR]

Changelog

Add specific line by line info of high level changes in this PR.

Usage

python examples/asr/speech_to_text_eval.py <with rnnt/tdt model>

Jenkins CI

To run Jenkins, a NeMo User with write access must comment jenkins on the PR.

Before your PR is "Ready for review"

Pre checks:

Make sure you read and followed Contributor guidelines
Did you write any new necessary tests?
Did you add or update any necessary documentation?
Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
- Reviewer: Does the PR have correct import guards for all optional libraries?

PR Type:

New Feature
Bugfix
Documentation

If you haven't finished some of the above items you can still open "Draft" PR.

Who can review?

Anyone in the community is free to review the PR once the checks have passed.
Contributor guidelines contains specific people who can review PRs to various areas.

Additional Information

Related to # (issue)

Signed-off-by: Vladimir Bataev <vbataev@nvidia.com>

artbataev · 2024-04-18T16:11:51Z

jenkins

Signed-off-by: Vladimir Bataev <vbataev@nvidia.com>

artbataev · 2024-04-25T18:53:52Z

jenkins

Signed-off-by: Vladimir Bataev <vbataev@nvidia.com>

artbataev · 2024-04-25T19:12:41Z

@galv, @titu1994 Please, review the PR

The only issue I see for now is that the Frame-Looping algorithm does not have fallback behavior, and setting loop_labels=false without use_cuda_graph_decoder=false (default - true) can cause failures (if not all requirements are met for Frame-Looping+CUDA graphs).

On the other hand, this case is not important. Most users will use the Label-Looping algorithm since it is default and produces the same result as the Frame-Looping algorithm.

galv · 2024-04-26T21:32:30Z

There is something I'm not quite understanding. Is there a reason why you did not set the default value of these values to true?

NeMo/nemo/collections/asr/parts/submodules/rnnt_greedy_decoding.py

Line 592 in 7e805bc

use_cuda_graph_decoder: bool = False,

and

NeMo/nemo/collections/asr/parts/submodules/rnnt_greedy_decoding.py

Line 2305 in 7e805bc

use_cuda_graph_decoder: bool = False

and

NeMo/nemo/collections/asr/parts/submodules/rnnt_greedy_decoding.py

Line 2655 in 7e805bc

use_cuda_graph_decoder: bool = False,

It looks like you are turning cuda graphs on by default only when someone runs transcribe_speech.py or transcribe_speech_parallel.py right now.

galv · 2024-04-26T21:37:57Z

nemo/collections/asr/parts/submodules/rnnt_loop_labels_computer.py

+        if self.cuda_graphs_mode is self.CudaGraphsMode.FULL_GRAPH:
+            self.full_graph.replay()
+        elif self.cuda_graphs_mode is self.CudaGraphsMode.NO_WHILE_LOOPS:
+            self.separate_graphs.before_outer_loop.replay()


This is cool. I would like to speak to you about a way we could possibly do this more easily using torch.cond and torch.while_loop. They're not in the newest versions of pytorch yet.

As we improve our implementations of beam search, I don't think it is realistic to keep doing special case code like this. I'm thinking we can get the right level of abstraction with torch.cond and torch.while_loop, such that people can write things more naturally.

galv · 2024-04-26T21:50:52Z

nemo/collections/asr/parts/submodules/rnnt_loop_labels_computer.py

+            # cuda graphs are allowed
+            # check basic requirements for cuda graphs
+            if self.max_symbols is None:
+                logging.warning("Max symbols is None, which is not allowed with Cuda graphs.")


People don't check warnings often enough. I recommend you throw an exception here.

Max symbols is none on some older models if i recall, that means out of the box they will crash to infer due to error for unrelated reason of cuda graphs not supporting None. It should remain a warning, but i agree that instead of crashing, set it to large default value for cuda graphs..

Ie warn that its not supported so instead a default of 10 timesteps or higher is being used for cuda graphs optimization

I agree with Som's suggestion, actually. Using a large value like 10 when someone passes in None seems like the right move.

Fully agree, thanks! I fixed the behavior.

galv · 2024-04-26T21:52:24Z

nemo/collections/asr/parts/submodules/rnnt_loop_labels_computer.py


-        self.state: Optional[LoopLabelsState] = None
+    def force_cuda_graphs_mode(self, mode: Optional[Union[str, CudaGraphsMode]]):


Do we need this? I don't see any usages of it.

Thanks! I forgot to add an explicit test. Fixed.
This is useful for debugging to set no_graphs mode (since it's impossible to debug CUDA graphs directly).

galv · 2024-04-26T21:56:17Z

On the other hand, this case is not important. Most users will use the Label-Looping algorithm since it is default and produces the same result as the Frame-Looping algorithm.

I agree. That is was my first attempt. It was very educational, but I don't believe that we need to do the work to add a feature to it that won't be used very much.

galv · 2024-04-26T21:56:56Z

I didn't hit approve yet. The changes seem good to me. But I want to give some time to defer to @titu1994

titu1994

Some important comments for logging, otherwise looks good

titu1994 · 2024-04-26T23:00:59Z

examples/asr/transcribe_speech.py

@@ -161,7 +162,9 @@ class TranscriptionConfig:
    ctc_decoding: CTCDecodingConfig = CTCDecodingConfig()

    # Decoding strategy for RNNT models
-    rnnt_decoding: RNNTDecodingConfig = RNNTDecodingConfig(fused_batch_size=-1)
+    rnnt_decoding: RNNTDecodingConfig = RNNTDecodingConfig(
+        fused_batch_size=-1, greedy=GreedyBatchedRNNTInferConfig(use_cuda_graph_decoder=True)


You can set the default config inside GreedyBatchedRNNTInferConfig to have this set to True by default rather than do it with this explicit override

Same issue as with setting use_cuda_graph_decoder=True as default in the classes.
I added comments above this line.
RNNTDecodingConfig is used, e.g., in change_vocabulary. After this operation (useful for finetuning), the model will have the field in the config use_cuda_graph_decoder=True, and further training will use the CUDA decoder. Since it is not compatible with bucketing (pre-allocated memory for maximum batch_size*sequence_length can be too large in this case), I prefer to conservatively enable it only for transcription.

The alternative is to enable it everywhere by default, but in the training loop, explicitly use the decoder without CUDA graphs. However, this can make the code too complicated.

If you see that there is a more straightforward solution for this issue, please let's discuss it!

The alternative is to enable it everywhere by default, but in the training loop, explicitly use the decoder without CUDA graphs. However, this can make the code too complicated.

Moved to this solution. I think it is much cleaner. So, set everywhere use_cuda_graph_decoder=True by default

titu1994 · 2024-04-26T23:04:25Z

nemo/collections/asr/parts/submodules/rnnt_loop_labels_computer.py

+            # cuda graphs are allowed
+            # check basic requirements for cuda graphs
+            if self.max_symbols is None:
+                logging.warning("Max symbols is None, which is not allowed with Cuda graphs.")


Max symbols is none on some older models if i recall, that means out of the box they will crash to infer due to error for unrelated reason of cuda graphs not supporting None. It should remain a warning, but i agree that instead of crashing, set it to large default value for cuda graphs..

Ie warn that its not supported so instead a default of 10 timesteps or higher is being used for cuda graphs optimization

titu1994 · 2024-04-26T23:06:48Z

nemo/collections/asr/parts/submodules/rnnt_loop_labels_computer.py

+                    check_cuda_python_cuda_graphs_conditional_nodes_supported()
+                    self.cuda_graphs_mode = self.CudaGraphsMode.FULL_GRAPH
+                except (ImportError, ModuleNotFoundError) as e:
+                    logging.warning(


Im wondering if this should be visible to users. The problem is the vast majority of users will NOT be on latest driver and cudapython install - ie the vast majority wont be using Cuda graphs and will log this warning - repeatedly polluting the inference in a loop call to transcribe().

We need to make this log just once - see loggermode and how to pass it inside of logging messages.

As I see it, the transcribe does not change the decoding strategy. Since this is logged only when instantiating the class (when instantiating the model or changing the decoding strategy), this should be fine (no repeated logs).

titu1994 · 2024-04-26T23:09:46Z

tests/collections/asr/decoding/test_cuda_graph_rnnt_greedy_decoding.py


 from nemo.collections.asr.models import ASRModel
 from nemo.core.utils.cuda_python_utils import skip_cuda_python_test_if_cuda_graphs_conditional_nodes_not_supported


+@pytest.fixture(scope="module")


Shouldnt this be with a marker for with_download

Pytest marks (with_downloads) are useful only for tests, not fixtures. All the tests that use these fixtures are marked with with_downloads tag.

Signed-off-by: Vladimir Bataev <vbataev@nvidia.com>

titu1994 · 2024-04-30T20:38:51Z

nemo/collections/asr/models/asr_model.py

@@ -171,6 +173,51 @@ def on_after_backward(self):
                logging.warning(f'detected inf or nan values in gradients! Setting gradients to zero.')
                self.zero_grad()

+    @classmethod


These two methods should be class methods of WithOptionalCYDAGraphs

I think, generally, this is not a good idea to introduce a 2-way dependency WithOptionalCudaGraphs <-> ASRModel (actually, EncDecRNNTModel, since decoding is only in this model).
I made the method more abstract to separate the logic, separating the path in the model and the lookup logic.

nemo/collections/asr/models/asr_model.py

…UDA graphs in `ASRModel` Signed-off-by: Vladimir Bataev <vbataev@nvidia.com>

nemo/collections/asr/models/asr_model.py

Signed-off-by: Vladimir Bataev <vbataev@nvidia.com>

titu1994

Looks good to me, @galv for final approval

galv

The issue you raised about memory usage when doing inference during training is a very good one. I believe we can figure out a solution that is less instrusive in the future if we point it out to the right people (follow up with me!).

galv · 2024-05-02T20:37:15Z

nemo/collections/asr/models/asr_model.py

+        EncDecRNNTModel.decoding.decoding is the inference class with CUDA graphs.
+        """
+        WithOptionalCudaGraphs.disable_cuda_graphs_recursive(self, attribute_path="decoding.decoding")
+        return super().on_validation_epoch_end()


Why do you call the superclass's implementation only for this method, but not for the others?

These hooks return None in the PyTorch-Lightning interface, and basically, there is no code in such hooks. But ModelPT defines the on_validation_epoch_end hook for all models with the customized return type, so I need to call it.

galv · 2024-05-02T21:04:32Z

nemo/collections/asr/parts/submodules/rnnt_greedy_decoding.py

+                if not self.use_cuda_graph_decoder:
+                    self._greedy_decode = self._greedy_decode_blank_as_pad_loop_frames
+                else:
+                    if self.preserve_alignments:


I'm surprsied that you would silently change the behavior on lines 630 to 639 rather than throw an exception in these cases, to be honest.

Meanwhile, the situation where we set the symbols_per_step to 10 if it is None seems okay because it is unlikely to change the results, since 10 is such a large number.

I'm not going to hold up merging this because of this concern, anyway, since it is a code path most people won't see.

I made this fallback behavior to prevent crashes when the user wants to change some parameters since use_cuda_graph_decoder is True by default now. Since it's only about speed (not quality), it is acceptable to switch silently between implementations instead of requiring the user to understand all the nuances of the available parameter combinations.
LoopLabelsComputer(s) are designed to handle all situations without explicit errors (e.g., when cuda is unavailable, etc.).

galv · 2024-05-02T21:21:22Z

nemo/collections/asr/parts/submodules/rnnt_greedy_decoding.py

+                                RNNTGreedyDecodeCudaGraph,
+                            )
+
+                            self._greedy_decode = RNNTGreedyDecodeCudaGraph(max_symbols_per_step, self)


I am not certain, but it currently looks like we will throw an exception if max_symbols_per_step is None, rather than overriding it to 10 for the frame-loop decoder right now.

Yep, thanks for catching this. I will address this in a follow-up PR

galv · 2024-05-02T21:23:13Z

nemo/collections/asr/parts/submodules/rnnt_loop_labels_computer.py

            self.state.alignments.add_results_masked_no_checks_(
                active_mask=self.state.active_mask,
                time_indices=self.state.time_indices_current_labels,
                logits=logits if self.preserve_alignments else None,
                labels=self.state.labels if self.preserve_alignments else None,
-                confidence=self._get_confidence_tensor(F.log_softmax(logits, dim=-1))
+                confidence=self._get_confidence_tensor(F.log_softmax(logits, dim=-1)).to(dtype=float_dtype)


As a point of clarification, why did you need to add this to() call? It doesn't seem to be related to the rest of the changes.

I would be happy to avoid this, but without casting, the code will fail with mixed bf16 precision:

log_softmax returns the value of float32 type in bfloat16 mixed precision (amp)

alignments storage is initialized with bf16 type, and adding confidence values inside add_results_masked_no_checks_ will fail

This is the same issue I observed when reviewing Sasha's PR related to TDT confidence #8982
Since I enabled computing confidence in tests for RNN-T, I caught this bug and fixed it here https://github.com/NVIDIA/NeMo/pull/8972/files#diff-d8ba9ce8e77769e06174cf0d16842d130debb4a289e92fea5296b081f5a4deabR133

Okay, I understand! I am planning to redo #9086 so that we will do inference in pure bfloat16 or float16, rather than using AMP. Basically, running in AMP can actually slow you down compared to running in float32 in inference mode, because it caches the down-casted versions of parameters only "requires_grad=False" for those parameters.

It should be safe to do softmax with fp16 inputs and outputs at inference time. The accumulations are done in fp32, which is the important part.

After we move away from AMP for inference (which might take a while since NeMo was written with that assumption for a long time), we can get rid of the need for the cast.

* Use Cuda graphs by default for RNN-T and TDT Signed-off-by: Vladimir Bataev <vbataev@nvidia.com> --------- Signed-off-by: Vladimir Bataev <vbataev@nvidia.com>

artbataev added 3 commits April 18, 2024 18:18

Use Cuda graphs by default for transcription

7441a2f

Signed-off-by: Vladimir Bataev <vbataev@nvidia.com>

RNN-T Loop Labels + Cuda graphs user-friendly

4761e68

Signed-off-by: Vladimir Bataev <vbataev@nvidia.com>

Fix Cuda graphs mode

2b13f31

Signed-off-by: Vladimir Bataev <vbataev@nvidia.com>

github-actions bot added the ASR label Apr 18, 2024

Fuse graphs

d391e98

Signed-off-by: Vladimir Bataev <vbataev@nvidia.com>

artbataev changed the title ~~RNN-T and TDT decoding: use CUDA graphs by default~~ RNN-T and TDT inference: use CUDA graphs by default Apr 18, 2024

Enable by default Cuda graphs for TDT

0205101

Signed-off-by: Vladimir Bataev <vbataev@nvidia.com>

artbataev added 9 commits April 19, 2024 16:37

Merge branch 'main' into rnnt_cuda_graphs_default

0cac599

Merge branch 'main' into rnnt_cuda_graphs_default

cb68701

Merge branch 'main' into rnnt_cuda_graphs_default

b38ff6f

Add test

25235bb

Signed-off-by: Vladimir Bataev <vbataev@nvidia.com>

Speedup init state

f175c8a

Signed-off-by: Vladimir Bataev <vbataev@nvidia.com>

Add comments

1672739

Signed-off-by: Vladimir Bataev <vbataev@nvidia.com>

Speedup tests

9e18150

Signed-off-by: Vladimir Bataev <vbataev@nvidia.com>

Add comments

94ba6af

Signed-off-by: Vladimir Bataev <vbataev@nvidia.com>

Fix tests for alignments

7b9d619

Signed-off-by: Vladimir Bataev <vbataev@nvidia.com>

Fix test

2d3b083

Signed-off-by: Vladimir Bataev <vbataev@nvidia.com>

artbataev marked this pull request as ready for review April 25, 2024 19:06

artbataev requested review from galv and titu1994 April 25, 2024 19:06

Merge branch 'main' into rnnt_cuda_graphs_default

7e805bc

galv reviewed Apr 26, 2024

View reviewed changes

titu1994 reviewed Apr 26, 2024

View reviewed changes

Clean up

35564df

Signed-off-by: Vladimir Bataev <vbataev@nvidia.com>

artbataev added Run CICD and removed Run CICD labels Apr 30, 2024

artbataev requested a review from titu1994 April 30, 2024 18:15

artbataev added 2 commits April 30, 2024 22:16

Clean up

7192acc

Signed-off-by: Vladimir Bataev <vbataev@nvidia.com>

Clean up

d4a27f6

Signed-off-by: Vladimir Bataev <vbataev@nvidia.com>

artbataev added Run CICD and removed Run CICD labels Apr 30, 2024

titu1994 reviewed Apr 30, 2024

View reviewed changes

Extract toggling CUDA graphs logic to WithOptionalCudaGraphs. Fix C…

c0877f2

…UDA graphs in `ASRModel` Signed-off-by: Vladimir Bataev <vbataev@nvidia.com>

artbataev added Run CICD and removed Run CICD labels May 2, 2024

artbataev requested a review from titu1994 May 2, 2024 14:40

github-advanced-security bot found potential problems May 2, 2024

View reviewed changes

nemo/collections/asr/models/asr_model.py Fixed Show fixed Hide fixed

Fix unused imports

b880510

Signed-off-by: Vladimir Bataev <vbataev@nvidia.com>

artbataev added Run CICD and removed Run CICD labels May 2, 2024

Merge branch 'main' into rnnt_cuda_graphs_default

82f83dc

artbataev added Run CICD and removed Run CICD labels May 2, 2024

Fix hook (failing tests)

4e47010

Signed-off-by: Vladimir Bataev <vbataev@nvidia.com>

artbataev added Run CICD and removed Run CICD labels May 2, 2024

titu1994 approved these changes May 2, 2024

View reviewed changes

galv approved these changes May 2, 2024

View reviewed changes

artbataev merged commit 894e502 into main May 3, 2024
148 of 254 checks passed

artbataev deleted the rnnt_cuda_graphs_default branch May 3, 2024 11:10


		self.state: Optional[LoopLabelsState] = None
		def force_cuda_graphs_mode(self, mode: Optional[Union[str, CudaGraphsMode]]):

RNN-T and TDT inference: use CUDA graphs by default #8972

RNN-T and TDT inference: use CUDA graphs by default #8972

Conversation

artbataev commented Apr 18, 2024 • edited Loading

What does this PR do ?

Changelog

Usage

Jenkins CI

Before your PR is "Ready for review"

Who can review?

Additional Information

artbataev commented Apr 18, 2024

artbataev commented Apr 25, 2024

artbataev commented Apr 25, 2024 • edited Loading

galv commented Apr 26, 2024

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

galv commented Apr 26, 2024

galv commented Apr 26, 2024

titu1994 left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

artbataev Apr 29, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

titu1994 left a comment

Choose a reason for hiding this comment

galv left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

artbataev commented Apr 18, 2024 •

edited

Loading

artbataev commented Apr 25, 2024 •

edited

Loading

artbataev Apr 29, 2024 •

edited

Loading