RNN-T and TDT inference: use CUDA graphs by default #8972
Conversation
jenkins
jenkins
@galv, @titu1994 Please review the PR. The only issue I see for now is that the Frame-Looping algorithm does not have fallback behavior. On the other hand, this case is not important: most users will use the Label-Looping algorithm, since it is the default and produces the same result as the Frame-Looping algorithm.
There is something I'm not quite understanding. Is there a reason why you did not set the default value of these to true? It looks like you are turning CUDA graphs on by default only when someone runs transcribe_speech.py or transcribe_speech_parallel.py right now.
if self.cuda_graphs_mode is self.CudaGraphsMode.FULL_GRAPH:
    self.full_graph.replay()
elif self.cuda_graphs_mode is self.CudaGraphsMode.NO_WHILE_LOOPS:
    self.separate_graphs.before_outer_loop.replay()
This is cool. I would like to speak to you about a way we could possibly do this more easily using torch.cond and torch.while_loop. They're not in the newest versions of pytorch yet.
As we improve our implementations of beam search, I don't think it is realistic to keep doing special case code like this. I'm thinking we can get the right level of abstraction with torch.cond and torch.while_loop, such that people can write things more naturally.
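For context, here is roughly the shape I have in mind (purely illustrative; torch.cond is still an experimental higher-order op, and every name below except torch.cond itself is a placeholder, not NeMo or final PyTorch API):

```python
import torch


def decode_step(active_mask: torch.Tensor, hidden: torch.Tensor) -> torch.Tensor:
    # Placeholder branch bodies: "advance" stands in for one more decoding step,
    # "finalize" for the case where nothing in the batch is active anymore.
    def advance(hidden):
        return hidden + 1

    def finalize(hidden):
        return hidden

    # torch.cond expresses a data-dependent branch that the compiler can capture,
    # which is the role conditional CUDA graph nodes play in this PR.
    return torch.cond(active_mask.any(), advance, finalize, (hidden,))
```

torch.while_loop would play the analogous role for the outer and inner decoding loops.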
# cuda graphs are allowed
# check basic requirements for cuda graphs
if self.max_symbols is None:
    logging.warning("Max symbols is None, which is not allowed with Cuda graphs.")
People don't check warnings often enough. I recommend you throw an exception here.
Max symbols is None on some older models, if I recall. That means out of the box they will crash during inference, due to an error for the unrelated reason that CUDA graphs do not support None. It should remain a warning, but I agree that instead of crashing, we should set it to a large default value for CUDA graphs.
I.e., warn that None is not supported, so a default of 10 symbols per timestep (or higher) is being used for the CUDA graphs optimization.
I agree with Som's suggestion, actually. Using a large value like 10 when someone passes in None seems like the right move.
Fully agree, thanks! I fixed the behavior.
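For reference, the agreed-upon behavior looks conceptually like this (a minimal sketch of the decoder's initialization logic, not the exact code from the PR; the value 10 follows the suggestion above):

```python
if self.max_symbols is None:
    logging.warning(
        "max_symbols_per_step=None is not supported with CUDA graphs; "
        "using a default value of 10 for the CUDA graphs decoder."
    )
    self.max_symbols = 10
```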
self.state: Optional[LoopLabelsState] = None

def force_cuda_graphs_mode(self, mode: Optional[Union[str, CudaGraphsMode]]):
Do we need this? I don't see any usages of it.
Thanks! I forgot to add an explicit test. Fixed.
This is useful for debugging: one can set the no_graphs mode (since it's impossible to debug CUDA graphs directly).
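For example, a debugging session might look like this (hypothetical usage: the decoding_computer name is made up, and the mode values follow the CudaGraphsMode enum above):

```python
# Temporarily run decoding as plain Python so breakpoints and prints work,
# then hand control back to automatic mode selection.
decoding_computer.force_cuda_graphs_mode(mode="no_graphs")
hyps = decoding_computer(encoder_output, encoder_output_length)
decoding_computer.force_cuda_graphs_mode(mode=None)  # assumption: None restores automatic selection
```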
I agree. That was my first attempt. It was very educational, but I don't believe we need to do the work to add a feature to it that won't be used very much.
I didn't hit approve yet. The changes seem good to me. But I want to give some time to defer to @titu1994.
Some important comments for logging, otherwise looks good
examples/asr/transcribe_speech.py (outdated)

@@ -161,7 +162,9 @@ class TranscriptionConfig:
    ctc_decoding: CTCDecodingConfig = CTCDecodingConfig()

    # Decoding strategy for RNNT models
    rnnt_decoding: RNNTDecodingConfig = RNNTDecodingConfig(fused_batch_size=-1)
    rnnt_decoding: RNNTDecodingConfig = RNNTDecodingConfig(
        fused_batch_size=-1, greedy=GreedyBatchedRNNTInferConfig(use_cuda_graph_decoder=True)
You can set the default config inside GreedyBatchedRNNTInferConfig to have this set to True by default rather than do it with this explicit override.
Same issue as with setting use_cuda_graph_decoder=True as the default in the classes; I added comments above this line. RNNTDecodingConfig is used, e.g., in change_vocabulary. After this operation (useful for finetuning), the model will have use_cuda_graph_decoder=True in its config, and further training will use the CUDA graphs decoder. Since it is not compatible with bucketing (the memory pre-allocated for the maximum batch_size * sequence_length can be too large in this case), I prefer to conservatively enable it only for transcription.

The alternative is to enable it everywhere by default, but in the training loop explicitly use the decoder without CUDA graphs. However, this can make the code too complicated.

If you see a more straightforward solution for this issue, let's discuss it!
The alternative is to enable it everywhere by default, but in the training loop explicitly use the decoder without CUDA graphs. However, this can make the code too complicated.

Moved to this solution. I think it is much cleaner. So use_cuda_graph_decoder=True is now set everywhere by default.
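In other words, roughly this shape in the config dataclass (sketch only; other fields omitted):

```python
from dataclasses import dataclass


@dataclass
class GreedyBatchedRNNTInferConfig:
    # ... other greedy decoding options elided ...
    use_cuda_graph_decoder: bool = True  # on by default; training hooks disable graphs explicitly
```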
    check_cuda_python_cuda_graphs_conditional_nodes_supported()
    self.cuda_graphs_mode = self.CudaGraphsMode.FULL_GRAPH
except (ImportError, ModuleNotFoundError) as e:
    logging.warning(
I'm wondering if this should be visible to users. The problem is that the vast majority of users will NOT be on the latest driver and cuda-python install, i.e. the vast majority won't be using CUDA graphs and will log this warning repeatedly, polluting the inference in a loop call to transcribe().
We need to make this log just once - see logger mode and how to pass it inside of logging messages.
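If I remember NeMo's logging helper correctly, that would look something like the snippet below; treat the exact import location and the mode argument as assumptions to double-check:

```python
from nemo.utils import logging, logging_mode

logging.warning(
    "Conditional-node support for CUDA graphs is unavailable; "
    "falling back to decoding without CUDA graphs.",
    mode=logging_mode.ONCE,  # emit the warning only once per process instead of per transcribe() call
)
```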
As I see it, transcribe does not change the decoding strategy. Since this is logged only when instantiating the class (i.e., when instantiating the model or changing the decoding strategy), this should be fine (no repeated logs).
from nemo.collections.asr.models import ASRModel
from nemo.core.utils.cuda_python_utils import skip_cuda_python_test_if_cuda_graphs_conditional_nodes_not_supported


@pytest.fixture(scope="module")
Shouldn't this be with a marker for with_downloads?
Pytest marks (with_downloads) are useful only for tests, not fixtures. All the tests that use these fixtures are marked with the with_downloads tag.
@@ -171,6 +173,51 @@ def on_after_backward(self):
        logging.warning(f'detected inf or nan values in gradients! Setting gradients to zero.')
        self.zero_grad()

    @classmethod
These two methods should be class methods of WithOptionalCudaGraphs.
I think, generally, it is not a good idea to introduce a two-way dependency WithOptionalCudaGraphs <-> ASRModel (actually EncDecRNNTModel, since decoding exists only in this model).
I made the method more abstract to separate the logic: the model specifies the attribute path, and the lookup logic stays in the mixin.
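To illustrate the separation, a minimal sketch under my own assumptions (not the actual NeMo implementation; the recursion over submodules that the method name suggests is omitted): the model only supplies a dotted attribute path, while the mixin owns the lookup and the disabling.

```python
class WithOptionalCudaGraphs:
    def disable_cuda_graphs(self):
        raise NotImplementedError

    @classmethod
    def disable_cuda_graphs_recursive(cls, module, attribute_path: str):
        # Follow the dotted path from the model, e.g. "decoding.decoding".
        for name in attribute_path.split("."):
            module = getattr(module, name, None)
            if module is None:
                return  # the path does not exist on this model; nothing to disable
        if isinstance(module, cls):
            module.disable_cuda_graphs()
```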
Looks good to me, @galv for final approval
The issue you raised about memory usage when doing inference during training is a very good one. I believe we can figure out a solution that is less intrusive in the future if we point it out to the right people (follow up with me!).
    EncDecRNNTModel.decoding.decoding is the inference class with CUDA graphs.
    """
    WithOptionalCudaGraphs.disable_cuda_graphs_recursive(self, attribute_path="decoding.decoding")
    return super().on_validation_epoch_end()
Why do you call the superclass's implementation only for this method, but not for the others?
These hooks return None in the PyTorch-Lightning interface, and basically there is no code in such hooks. But ModelPT defines the on_validation_epoch_end hook for all models with a customized return type, so I need to call it.
if not self.use_cuda_graph_decoder:
    self._greedy_decode = self._greedy_decode_blank_as_pad_loop_frames
else:
    if self.preserve_alignments:
I'm surprised that you would silently change the behavior on lines 630 to 639 rather than throw an exception in these cases, to be honest.
Meanwhile, the situation where we set symbols_per_step to 10 if it is None seems okay, because it is unlikely to change the results, since 10 is such a large number.
I'm not going to hold up merging this because of this concern anyway, since it is a code path most people won't see.
I made this fallback behavior to prevent crashes when the user wants to change some parameters, since use_cuda_graph_decoder is True by default now. Since it's only about speed (not quality), it is acceptable to switch silently between implementations instead of requiring the user to understand all the nuances of the available parameter combinations.
The LoopLabelsComputer(s) are designed to handle all situations without explicit errors (e.g., when CUDA is unavailable, etc.).
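For the record, the selection behaves roughly like this (simplified sketch of the fallback logic; the RNNTGreedyDecodeCudaGraph constructor and the loop_frames fallback come from the snippets in this thread, everything else is paraphrased):

```python
use_graphs = (
    self.use_cuda_graph_decoder
    and not self.preserve_alignments   # per-frame outputs are not supported by the graph decoder
    and self.max_symbols is not None   # graphs need a bounded number of symbols per step
)
if use_graphs:
    self._greedy_decode = RNNTGreedyDecodeCudaGraph(max_symbols_per_step, self)
else:
    # silent fallback: only decoding speed is affected, never the results
    self._greedy_decode = self._greedy_decode_blank_as_pad_loop_frames
```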
    RNNTGreedyDecodeCudaGraph,
)

self._greedy_decode = RNNTGreedyDecodeCudaGraph(max_symbols_per_step, self)
I am not certain, but it currently looks like we will throw an exception if max_symbols_per_step is None, rather than overriding it to 10 for the frame-loop decoder right now.
Yep, thanks for catching this. I will address this in a follow-up PR
self.state.alignments.add_results_masked_no_checks_(
    active_mask=self.state.active_mask,
    time_indices=self.state.time_indices_current_labels,
    logits=logits if self.preserve_alignments else None,
    labels=self.state.labels if self.preserve_alignments else None,
    confidence=self._get_confidence_tensor(F.log_softmax(logits, dim=-1))
    confidence=self._get_confidence_tensor(F.log_softmax(logits, dim=-1)).to(dtype=float_dtype)
As a point of clarification, why did you need to add this to() call? It doesn't seem to be related to the rest of the changes.
I would be happy to avoid this, but without casting, the code will fail with mixed bf16 precision:

- log_softmax returns a float32 value in bfloat16 mixed precision (amp)
- the alignments storage is initialized with bf16 type, and adding confidence values inside add_results_masked_no_checks_ will fail

This is the same issue I observed when reviewing Sasha's PR related to TDT confidence #8982.
Since I enabled computing confidence in tests for RNN-T, I caught this bug and fixed it here: https://github.com/NVIDIA/NeMo/pull/8972/files#diff-d8ba9ce8e77769e06174cf0d16842d130debb4a289e92fea5296b081f5a4deabR133
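A minimal standalone repro of the dtype mismatch (plain PyTorch, not NeMo code; needs a CUDA device):

```python
import torch
import torch.nn.functional as F

logits = torch.randn(4, 8, device="cuda", dtype=torch.bfloat16)

with torch.autocast("cuda", dtype=torch.bfloat16):
    confidence = F.log_softmax(logits, dim=-1)

# log_softmax is one of the ops autocast promotes to float32, so the result is fp32
# even though the inputs and the pre-allocated bf16 alignments storage are bfloat16.
print(confidence.dtype)  # torch.float32
# Writing these fp32 values into the bf16 storage is what fails inside
# add_results_masked_no_checks_, hence the explicit .to(dtype=float_dtype) cast in the diff.
```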
Okay, I understand! I am planning to redo #9086 so that we will do inference in pure bfloat16 or float16, rather than using AMP. Basically, running in AMP can actually slow you down compared to running in float32 in inference mode, because it caches the down-casted versions of parameters only when they require gradients, and in inference requires_grad=False for those parameters.
It should be safe to do softmax with fp16 inputs and outputs at inference time. The accumulations are done in fp32, which is the important part.
After we move away from AMP for inference (which might take a while, since NeMo was written with that assumption for a long time), we can get rid of the need for the cast.
What does this PR do ?
- Enables CUDA graphs by default for RNN-T and TDT greedy decoding in the transcription scripts (transcribe_speech.py, speech_to_text_eval.py).
- Adds fallback behavior when full-graph capture with conditional nodes is unavailable: <graph before outer loop> -> while loop (python) -> <graph before inner loop> -> inner while loop (python) -> <graph for inner loop code>, etc.
- Benchmark on my local machine: FastConformer L, LibriSpeech test-other decoding time, bfloat16, bs=16.
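The intermediate fallback mode can be pictured like this (schematic pseudocode in Python; graphs, batch_is_active, and hyp_needs_more_labels are placeholders, only before_outer_loop appears in the code snippets above):

```python
# CudaGraphsMode.NO_WHILE_LOOPS: straight-line sections are replayed from captured
# CUDA graphs, while the data-dependent loops stay as ordinary Python loops,
# so no conditional graph nodes (and no recent driver) are required.
graphs.before_outer_loop.replay()        # <graph before outer loop>
while batch_is_active():                 # while loop (python)
    graphs.before_inner_loop.replay()    # <graph before inner loop>
    while hyp_needs_more_labels():       # inner while loop (python)
        graphs.inner_loop_code.replay()  # <graph for inner loop code>
    # remaining per-frame bookkeeping is captured and replayed similarly (etc.)
```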
Collection: [ASR]
Changelog
Usage
Jenkins CI
To run Jenkins, a NeMo User with write access must comment jenkins on the PR.

Before your PR is "Ready for review"
Pre checks:
PR Type:
If you haven't finished some of the above items, you can still open a "Draft" PR.
Who can review?
Anyone in the community is free to review the PR once the checks have passed.
The Contributor guidelines contain specific people who can review PRs in various areas.
Additional Information