Adding cache-aware streaming Conformer with look-ahead support #3888

Merged: 247 commits merged into NVIDIA:main on Aug 3, 2022

VahidooX
Collaborator

@VahidooX VahidooX commented Mar 26, 2022

What does this PR do ?

Adds cache-aware streaming Conformer training and inference with look-ahead support. This is achieved by training a model with a limited effective right context and then performing streaming inference with activation caching. Limiting the right context reduces accuracy compared to an offline model, but it gives better accuracy and significantly higher throughput than buffered streaming, because it drops the duplicated computations that buffered streaming incurs. A larger right context decreases the WER while increasing the latency.
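
The latency side of this trade-off can be made concrete with a little arithmetic. The sketch below is not taken from the PR. It assumes a 10 ms feature stride and 4x subsampling (one encoder frame per 40 ms), and it assumes that in regular look-ahead the right context accumulates across self-attention layers while in chunk-aware look-ahead it does not. The function names and the 17-layer example are hypothetical.

```python
def effective_lookahead_frames(right_context, num_layers, chunk_aware=False):
    """Encoder frames of future audio that can influence one output frame.

    Assumption: with regular look-ahead each layer reaches `right_context`
    frames further ahead, so the reach grows with depth; with chunk-aware
    look-ahead no layer sees past the end of the current chunk.
    """
    if chunk_aware:
        return right_context
    return right_context * num_layers


def lookahead_latency_ms(lookahead_frames, stride_ms=10.0, subsampling=4):
    """Convert encoder frames to milliseconds (assumed stride and subsampling)."""
    return lookahead_frames * stride_ms * subsampling


# Hypothetical configuration: look-ahead of 2 frames, 17 encoder layers.
regular = effective_lookahead_frames(right_context=2, num_layers=17)
chunked = effective_lookahead_frames(right_context=2, num_layers=17, chunk_aware=True)

print("regular look-ahead latency (ms):", lookahead_latency_ms(regular))
print("chunk-aware look-ahead latency (ms):", lookahead_latency_ms(chunked))
```

Under these assumptions the same per-layer look-ahead costs far more latency in regular mode than in chunk-aware mode, which is one reason to pick the mode deliberately.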

It supports the following three modes:
1- Fully causal model with zero look-ahead and zero latency
2- Regular look-ahead
3- Chunk-aware look-ahead with a small amount of duplicated computation
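
One way to picture the three modes is as self-attention masks. The following is an illustrative sketch only, not the masking code from this PR: the helper names `lookahead_mask`, `chunked_mask`, and `show` are hypothetical, and the context sizes are arbitrary small numbers chosen for readability.

```python
def lookahead_mask(num_frames, left, right):
    """Regular look-ahead: frame t may attend to frames t-left .. t+right.

    Setting right=0 gives the fully causal mode.
    """
    return [
        [t - left <= s <= t + right for s in range(num_frames)]
        for t in range(num_frames)
    ]


def chunked_mask(num_frames, chunk_size, left_chunks):
    """Chunk-aware look-ahead: frame t sees its whole chunk plus
    `left_chunks` preceding chunks, and nothing beyond its chunk."""
    mask = []
    for t in range(num_frames):
        chunk = t // chunk_size
        lo = max(0, (chunk - left_chunks) * chunk_size)
        hi = min(num_frames, (chunk + 1) * chunk_size)
        mask.append([lo <= s < hi for s in range(num_frames)])
    return mask


def show(mask):
    """Render a mask as text: 'x' = visible, '.' = masked."""
    return "\n".join("".join("x" if v else "." for v in row) for row in mask)


causal = lookahead_mask(6, left=2, right=0)
regular = lookahead_mask(6, left=2, right=1)
chunked = chunked_mask(6, chunk_size=2, left_chunks=1)

for name, mask in [("causal", causal), ("regular", regular), ("chunk-aware", chunked)]:
    print(name)
    print(show(mask))
```

In the chunk-aware mask the first frame of a chunk looks one frame ahead and the last frame looks zero frames ahead, so the look-ahead varies inside a chunk but never crosses the chunk boundary.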

It supports both Conformer-CTC and Conformer-Transducer. Both can be trained with the regular scripts, using the config files in the following folder:
NeMo/examples/asr/conf/conformer/streaming/

A model trained in streaming mode can be evaluated with the following script:
NeMo/examples/asr/asr_streaming/speech_to_text_streaming_infer.py

This script simulates streaming inference for a single audio file or for a manifest of audio files. For a manifest, streaming can run in multi-stream mode (batched inference) to speed up evaluation. The script can also compare the results with offline evaluation and report the differences in both the WER and the models' outputs.
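
As a rough illustration of the multi-stream idea, the sketch below parses a JSON-lines manifest and groups its entries into batches. It is not the script's actual code: `read_manifest` and `batches` are hypothetical helpers, the file names are made up, and the only field taken from this PR is the `audio_filepath` key.

```python
import json


def read_manifest(lines):
    """Parse JSON-lines manifest entries, skipping blank lines."""
    return [json.loads(line) for line in lines if line.strip()]


def batches(samples, batch_size):
    """Yield consecutive groups of at most `batch_size` samples."""
    for start in range(0, len(samples), batch_size):
        yield samples[start:start + batch_size]


# Hypothetical manifest content; a real one would be read from a file.
manifest_text = """
{"audio_filepath": "a.wav", "duration": 1.0, "text": "hello"}
{"audio_filepath": "b.wav", "duration": 2.0, "text": "world"}
{"audio_filepath": "c.wav", "duration": 1.5, "text": "again"}
"""

samples = read_manifest(manifest_text.splitlines())
groups = [[s["audio_filepath"] for s in group] for group in batches(samples, 2)]
print(groups)
```

Each group would then be streamed together, with the last group allowed to be smaller than the batch size.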

The accuracy of the model is exactly the same in offline evaluation and in streaming. In offline mode the whole audio is passed through the model at once, while in streaming mode it is passed chunk by chunk.
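
Why caching makes the two modes agree can be seen on a toy causal convolution. This is a minimal sketch of the general idea, not the Conformer implementation: `causal_conv` is a hypothetical helper that carries the last inputs of the previous chunk as a cache, standing in for the cached activations of a real layer.

```python
def causal_conv(signal, kernel, cache=None):
    """Causal 1-D convolution over one chunk.

    `cache` holds the last len(kernel)-1 inputs seen before this chunk
    (zeros at the start of the stream). Returns (outputs, new_cache).
    """
    k = len(kernel)
    if cache is None:
        cache = [0] * (k - 1)
    padded = list(cache) + list(signal)
    out = [
        sum(kernel[j] * padded[i + j] for j in range(k))
        for i in range(len(signal))
    ]
    new_cache = padded[len(padded) - (k - 1):]
    return out, new_cache


signal = [1, 2, 3, 4, 5, 6, 7]
kernel = [1, 2, 3]

# Offline: the whole signal in one call.
offline, _ = causal_conv(signal, kernel)

# Streaming: chunks of 3 samples, carrying the cache between calls.
streamed, cache = [], None
for start in range(0, len(signal), 3):
    out, cache = causal_conv(signal[start:start + 3], kernel, cache)
    streamed.extend(out)

print(streamed == offline)
```

Because every output depends only on inputs that are either in the current chunk or in the cache, no input is processed twice and the streamed outputs match the offline ones.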

Changelog

  • Added frame-wise streaming Conformer models with look-ahead support and caching mechanism for streaming inference.

PR Type:

  • [x] New Feature
  • [ ] Bugfix
  • [ ] Documentation

Signed-off-by: Vahid <vnoroozi@nvidia.com>

lgtm-com bot commented Aug 2, 2022

This pull request introduces 7 alerts and fixes 4 when merging c2cfe4e into 8e1436b - view on LGTM.com

new alerts:

  • 6 for Unused local variable
  • 1 for Non-callable called

fixed alerts:

  • 4 for Unused import

Signed-off-by: Vahid <vnoroozi@nvidia.com>

lgtm-com bot commented Aug 2, 2022

This pull request introduces 7 alerts and fixes 4 when merging 7589f88 into aaeac3c - view on LGTM.com

new alerts:

  • 6 for Unused local variable
  • 1 for Non-callable called

fixed alerts:

  • 4 for Unused import


lgtm-com bot commented Aug 2, 2022

This pull request introduces 7 alerts and fixes 4 when merging 0bde720 into 5c8fe3a - view on LGTM.com

new alerts:

  • 6 for Unused local variable
  • 1 for Non-callable called

fixed alerts:

  • 4 for Unused import

Signed-off-by: Vahid <vnoroozi@nvidia.com>

lgtm-com bot commented Aug 2, 2022

This pull request introduces 7 alerts and fixes 4 when merging 090f838 into 5c8fe3a - view on LGTM.com

new alerts:

  • 6 for Unused local variable
  • 1 for Non-callable called

fixed alerts:

  • 4 for Unused import

Signed-off-by: Vahid <vnoroozi@nvidia.com>

lgtm-com bot commented Aug 3, 2022

This pull request introduces 9 alerts and fixes 4 when merging 463aed6 into 5c8fe3a - view on LGTM.com

new alerts:

  • 8 for Unused local variable
  • 1 for Non-callable called

fixed alerts:

  • 4 for Unused import

titu1994
titu1994 previously approved these changes Aug 3, 2022
Collaborator

@titu1994 titu1994 left a comment

Approving for now since we're out of time. But before merging, rename the function to cache_aware_stream_step: the basic stream_step is too generic, does not convey what is being used, and is not future-proof.

start_time = time.time()
for sample_idx, sample in enumerate(samples):
    processed_signal, processed_signal_length, stream_id = streaming_buffer.append_audio_file(
        sample['audio_filepath'], stream_id=-1

We need to document this script a lot more in the branch cut for 1.11. For now it's fine.

if (sample_idx + 1) % args.batch_size == 0 or sample_idx == len(samples) - 1:
    logging.info(f"Starting to stream samples {sample_idx - len(streaming_buffer) + 1} to {sample_idx}...")
    streaming_tran, offline_tran = perform_streaming(
        asr_model=asr_model,

^ comment

if hasattr(self.input_module, 'forward_for_export'):
encoder_output = self.input_module.forward_for_export(input, length)
if cache_last_channel is None and cache_last_time is None:

OK, leaving this comment unresolved for a later check then.

nemo/collections/asr/parts/mixins/streaming.py (outdated, resolved)
Signed-off-by: Vahid <vnoroozi@nvidia.com>

lgtm-com bot commented Aug 3, 2022

This pull request introduces 9 alerts and fixes 4 when merging 194581f into 498ff20 - view on LGTM.com

new alerts:

  • 8 for Unused local variable
  • 1 for Non-callable called

fixed alerts:

  • 4 for Unused import

@VahidooX VahidooX merged commit eae1684 into NVIDIA:main Aug 3, 2022
@effendijohanes

Hi @VahidooX, thanks for the examples you made. I tried it on a 2-minute WAV file using the stt_en_conformer_transducer_small.nemo model:

python examples/asr/asr_streaming/speech_to_text_streaming_infer.py --asr_model stt_en_conformer_transducer_small.nemo --audio_file test.wav

but I get this error during online mode:

Traceback (most recent call last):
  File ".../nemo/sandbox/../examples/asr/asr_streaming/speech_to_text_streaming_infer.py", line 333, in <module>
    main()
  File ".../nemo/sandbox/../examples/asr/asr_streaming/speech_to_text_streaming_infer.py", line 261, in main
    perform_streaming(
  File ".../nemo/sandbox/../examples/asr/asr_streaming/speech_to_text_streaming_infer.py", line 128, in perform_streaming
    ) = asr_model.conformer_stream_step(
  File ".../nemo/nemo/collections/asr/parts/mixins/mixins.py", line 441, in conformer_stream_step
    (encoded, encoded_len, cache_last_channel_next, cache_last_time_next) = self.encoder.cache_aware_stream_step(
  File ".../nemo/nemo/collections/asr/parts/mixins/streaming.py", line 61, in cache_aware_stream_step
    encoder_output = self(
  File ".../nemo/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File ".../nemo/nemo/core/classes/common.py", line 1084, in __call__
    outputs = wrapped(*args, **kwargs)
  File ".../nemo/nemo/collections/asr/modules/conformer_encoder.py", line 382, in forward
    cache_last_channel_next = torch.zeros(
RuntimeError: Trying to create tensor with negative dimension -7204: [16, 1, -7204, 176]

Do you have any idea what might have happened? Thanks!

Collaborator Author

VahidooX commented Aug 3, 2022

This approach requires a model trained in streaming mode to get the best results, which means limited right and left context and no normalization in feature extraction. While it is possible to try offline models with this approach, the accuracy would not be great. I have not added support for offline models in this PR; I will look into it and add it soon.
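
A small sketch may help show why normalization in feature extraction is a problem for streaming. It assumes the normalization in question uses statistics of the whole utterance (for example, subtracting the per-utterance mean), which is an assumption here and not something the comment above spells out. The `normalize` helper and the numbers are hypothetical.

```python
from statistics import mean


def normalize(frames):
    """Subtract the mean of the frames that are available right now."""
    m = mean(frames)
    return [x - m for x in frames]


# Hypothetical one-dimensional "features" for a single utterance.
utterance = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]

# Offline: statistics come from the whole utterance.
offline = normalize(utterance)

# Streaming: each chunk only knows its own statistics.
streamed = normalize(utterance[:3]) + normalize(utterance[3:])

print(offline)
print(streamed)
```

The two results differ, so a model trained on whole-utterance statistics would see mismatched inputs when fed chunk by chunk. Training without such normalization avoids the mismatch.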

Contributor

itzsimpl commented Aug 3, 2022

@VahidooX are there perhaps any pre-trained streaming models already available?

Collaborator Author

VahidooX commented Aug 3, 2022

Not yet, I am still working on training them on nemo asrset. Hopefully there will be some uploaded on NGC by the end of this month.

@effendijohanes

Hi @VahidooX , looking forward to the support of offline models, thank you very much!

Collaborator Author

VahidooX commented Aug 5, 2022

Here is the draft PR to add support for models trained with full context to be used with cache-aware streaming in chunk-aware look-ahead style:

#4687

Just note that the results would be significantly worse than when you train the model in streaming mode. I will share some numbers in the PR when they are ready. The main advantage of using this approach on an offline model, compared to buffered streaming, is simply that it uses less computation. The cache-aware approach is unlikely to give better accuracy for such models, as they don't use overlapping chunks in chunk-aware mode. I will try to add support for regular look-ahead, which uses overlapping chunks.
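
The computation saving can be estimated with a back-of-the-envelope count of how many frames pass through the encoder. The sketch below uses frame counts as a rough proxy for compute and is not a benchmark. The function names and the chunk and buffer sizes are hypothetical.

```python
import math


def buffered_frames(total_frames, chunk_frames, buffer_frames):
    """Buffered streaming: each step re-encodes a whole buffer
    (chunk plus surrounding context) to emit one chunk of output."""
    steps = math.ceil(total_frames / chunk_frames)
    return steps * buffer_frames


def cache_aware_frames(total_frames):
    """Cache-aware streaming: each frame is encoded once; context
    comes from cached activations instead of recomputation."""
    return total_frames


# Hypothetical sizes: 3000 frames of audio, 40-frame chunks, 160-frame buffers.
total = 3000
buffered = buffered_frames(total, chunk_frames=40, buffer_frames=160)
cached = cache_aware_frames(total)

print("buffered:", buffered, "cache-aware:", cached, "ratio:", buffered / cached)
```

With these made-up sizes the buffered approach encodes four times as many frames, and the ratio is roughly the buffer size divided by the chunk size.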

@effendijohanes

Thanks for the PR @VahidooX , let me study your code.

Davood-M pushed a commit to Davood-M/NeMo that referenced this pull request Aug 9, 2022
…A#3888)

Signed-off-by: David Mosallanezhad <dmosallanezh@nvidia.com>
piraka9011 pushed a commit to piraka9011/NeMo that referenced this pull request Aug 25, 2022
…A#3888)

Signed-off-by: Anas Abou Allaban <aabouallaban@pm.me>
@shahin-trunk

@VahidooX are there perhaps any pre-trained streaming models already available?

Not yet, I am still working on training them on nemo asrset. Hopefully there will be some uploaded on NGC by the end of this month.

@VahidooX any update on the pre-trained models? I am not able to get the models to converge without initializing the weights.

@Higher08

@VahidooX Did you manage to train these models on NeMo ASRSET? If so, could you share the files?

hainan-xv pushed a commit to hainan-xv/NeMo that referenced this pull request Nov 29, 2022