Unable to get NeMo working with canary-1b #8389

cschmittiey · 2024-02-09T19:00:27Z

Hi all, I'm hoping for some pointers on an issue I'm running into while trying to get the canary model running locally on some radio recordings I want to transcribe.

When using NeMo/examples/asr/transcribe_speech.py with this manifest:

{"audio_filepath": "/home/csmith/projects/simple-tr-transcription/temp_1399-1707503502_852412500.1-call_11866.wav", "duration": 20.48, "taskname": "asr", "source_lang": "en", "target_lang": "en", "pnc": "yes"}

I get the following output and no transcription:

python ~/projects/NeMo/examples/asr/transcribe_speech.py pretrained_name="nvidia/canary-1b" dataset_manifest="temp_1399-1707503502_852412500.1-call_11866.json"
[NeMo W 2024-02-09 13:50:24 nemo_logging:349] /home/csmith/.conda/envs/nemo1/lib/python3.10/site-packages/hydra/_internal/hydra.py:119: UserWarning: Future Hydra versions will no longer change working directory at job runtime by default.
    See https://hydra.cc/docs/1.2/upgrades/1.1_to_1.2/changes_to_job_working_dir/ for more information.
      ret = run_job(

[NeMo I 2024-02-09 13:50:24 transcribe_speech:193] Hydra config: model_path: null
    pretrained_name: nvidia/canary-1b
    audio_dir: null
    dataset_manifest: temp_1399-1707503502_852412500.1-call_11866.json
    channel_selector: null
    audio_key: audio_filepath
    eval_config_yaml: null
    output_filename: null
    batch_size: 32
    num_workers: 0
    append_pred: false
    pred_name_postfix: null
    random_seed: null
    compute_timestamps: false
    preserve_alignment: false
    compute_langs: false
    cuda: null
    allow_mps: false
    amp: false
    amp_dtype: float16
    audio_type: wav
    overwrite_transcripts: true
    ctc_decoding:
      strategy: greedy
      preserve_alignments: null
      compute_timestamps: null
      word_seperator: ' '
      ctc_timestamp_type: all
      batch_dim_index: 0
      greedy:
        preserve_alignments: false
        compute_timestamps: false
        preserve_frame_confidence: false
        confidence_method_cfg:
          name: entropy
          entropy_type: tsallis
          alpha: 0.33
          entropy_norm: exp
          temperature: DEPRECATED
      beam:
        beam_size: 4
        search_type: default
        preserve_alignments: false
        compute_timestamps: false
        return_best_hypothesis: true
        beam_alpha: 1.0
        beam_beta: 0.0
        kenlm_path: null
        flashlight_cfg:
          lexicon_path: null
          boost_path: null
          beam_size_token: 16
          beam_threshold: 20.0
          unk_weight: -.inf
          sil_weight: 0.0
        pyctcdecode_cfg:
          beam_prune_logp: -10.0
          token_min_logp: -5.0
          prune_history: false
          hotwords: null
          hotword_weight: 10.0
      confidence_cfg:
        preserve_frame_confidence: false
        preserve_token_confidence: false
        preserve_word_confidence: false
        exclude_blank: true
        aggregation: min
        method_cfg:
          name: entropy
          entropy_type: tsallis
          alpha: 0.33
          entropy_norm: exp
          temperature: DEPRECATED
      temperature: 1.0
    rnnt_decoding:
      model_type: rnnt
      strategy: greedy_batch
      compute_hypothesis_token_set: false
      preserve_alignments: null
      confidence_cfg:
        preserve_frame_confidence: false
        preserve_token_confidence: false
        preserve_word_confidence: false
        exclude_blank: true
        aggregation: min
        method_cfg:
          name: entropy
          entropy_type: tsallis
          alpha: 0.33
          entropy_norm: exp
          temperature: DEPRECATED
      fused_batch_size: -1
      compute_timestamps: null
      compute_langs: false
      word_seperator: ' '
      rnnt_timestamp_type: all
      greedy:
        max_symbols_per_step: 10
        preserve_alignments: false
        preserve_frame_confidence: false
        confidence_method_cfg:
          name: entropy
          entropy_type: tsallis
          alpha: 0.33
          entropy_norm: exp
          temperature: DEPRECATED
        loop_labels: false
      beam:
        beam_size: 4
        search_type: default
        score_norm: true
        return_best_hypothesis: true
        tsd_max_sym_exp_per_step: 50
        alsd_max_target_len: 1.0
        nsc_max_timesteps_expansion: 1
        nsc_prefix_alpha: 1
        maes_num_steps: 2
        maes_prefix_alpha: 1
        maes_expansion_gamma: 2.3
        maes_expansion_beta: 2
        language_model: null
        softmax_temperature: 1.0
        preserve_alignments: false
        ngram_lm_model: null
        ngram_lm_alpha: 0.0
        hat_subtract_ilm: false
        hat_ilm_weight: 0.0
      temperature: 1.0
      durations: []
      big_blank_durations: []
    multitask_decoding:
      strategy: beam
      compute_hypothesis_token_set: false
      preserve_alignments: null
      compute_langs: false
      beam:
        beam_size: 1
        search_type: default
        len_pen: 1.0
        max_generation_delta: 20
        return_best_hypothesis: true
        preserve_alignments: false
      temperature: 1.0
    decoder_type: null
    att_context_size: null
    model_change:
      conformer:
        self_attention_model: null
        att_context_size: null
    calculate_wer: true
    clean_groundtruth_text: false
    langid: en
    use_cer: false
    return_transcriptions: false
    return_hypotheses: true
    gt_text_attr_name: text
    allow_partial_transcribe: false

[NeMo I 2024-02-09 13:50:24 transcribe_speech:239] Inference will be done on device: cpu
[NeMo I 2024-02-09 13:50:26 mixins:196] _setup_tokenizer: detected an aggregate tokenizer
[NeMo I 2024-02-09 13:50:26 mixins:330] Tokenizer SentencePieceTokenizer initialized with 32 tokens
[NeMo I 2024-02-09 13:50:26 mixins:330] Tokenizer SentencePieceTokenizer initialized with 1024 tokens
[NeMo I 2024-02-09 13:50:26 mixins:330] Tokenizer SentencePieceTokenizer initialized with 1024 tokens
[NeMo I 2024-02-09 13:50:26 mixins:330] Tokenizer SentencePieceTokenizer initialized with 1024 tokens
[NeMo I 2024-02-09 13:50:26 mixins:330] Tokenizer SentencePieceTokenizer initialized with 1024 tokens
[NeMo I 2024-02-09 13:50:26 aggregate_tokenizer:72] Aggregate vocab size: 4128
[NeMo W 2024-02-09 13:50:27 modelPT:165] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config :
    tarred_audio_filepaths: null
    manifest_filepath: null
    sample_rate: 16000
    shuffle: true
    batch_size: null
    num_workers: 8
    use_lhotse: true
    max_duration: 40
    pin_memory: true
    text_field: answer
    lang_field: target_lang
    use_bucketing: false
    batch_duration: 360
    quadratic_duration: 15
    num_buckets: 1
    bucket_duration_bins: null
    bucket_buffer_size: 20000
    shuffle_buffer_size: 10000

[NeMo W 2024-02-09 13:50:27 modelPT:172] If you intend to do validation, please call the ModelPT.setup_validation_data() or ModelPT.setup_multiple_validation_data() method and provide a valid configuration file to setup the validation data loader(s).
    Validation config :
    manifest_filepath: null
    sample_rate: 16000
    batch_size: 8
    shuffle: false
    num_workers: 0
    pin_memory: true
    tarred_audio_filepaths: null
    use_lhotse: true
    text_field: answer
    lang_field: target_lang
    use_bucketing: false

[NeMo W 2024-02-09 13:50:27 modelPT:178] Please call the ModelPT.setup_test_data() or ModelPT.setup_multiple_test_data() method and provide a valid configuration file to setup the test data loader(s).
    Test config :
    manifest_filepath: null
    sample_rate: 16000
    batch_size: 32
    shuffle: false
    num_workers: 0
    pin_memory: true
    tarred_audio_filepaths: null
    use_lhotse: true
    text_field: answer
    lang_field: target_lang
    use_bucketing: false

[NeMo I 2024-02-09 13:50:27 features:289] PADDING: 0
[NeMo I 2024-02-09 13:50:34 save_restore_connector:249] Model EncDecMultiTaskModel was successfully restored from /home/csmith/.cache/huggingface/hub/models--nvidia--canary-1b/snapshots/ff5eb5e26fb215bfe35496638f2ec74b2fe4d1a1/canary-1b.nemo.
GPU available: False, used: False
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
[NeMo I 2024-02-09 13:50:34 aed_multitask_models:214] Changed decoding strategy to
    strategy: beam
    compute_hypothesis_token_set: false
    preserve_alignments: false
    compute_langs: false
    beam:
      beam_size: 1
      search_type: default
      len_pen: 1.0
      max_generation_delta: 20
      return_best_hypothesis: true
      preserve_alignments: false
    temperature: 1.0

[NeMo W 2024-02-09 13:50:34 aed_multitask_models:388] return_hypotheses=True is currently not supported, returning text instead.
We will be using a Lhotse DataLoader.
Initializing Lhotse CutSet from a single NeMo manifest (non-tarred): '/tmp/tmp3y38tq75/manifest.json'
Creating a Lhotse DynamicCutSampler (bucketing is disabled, (max_batch_duration=None max_batch_size=1)
Transcribing: 0it [00:00, ?it/s]Error executing job with overrides: ['pretrained_name=nvidia/canary-1b', 'dataset_manifest=temp_1399-1707503502_852412500.1-call_11866.json']
Traceback (most recent call last):
  File "/home/csmith/projects/NeMo/examples/asr/transcribe_speech.py", line 414, in <module>
    main()  # noqa pylint: disable=no-value-for-parameter
  File "/home/csmith/.conda/envs/nemo1/lib/python3.10/site-packages/nemo/core/config/hydra_runner.py", line 129, in wrapper
    _run_hydra(
  File "/home/csmith/.conda/envs/nemo1/lib/python3.10/site-packages/hydra/_internal/utils.py", line 394, in _run_hydra
    _run_app(
  File "/home/csmith/.conda/envs/nemo1/lib/python3.10/site-packages/hydra/_internal/utils.py", line 457, in _run_app
    run_and_report(
  File "/home/csmith/.conda/envs/nemo1/lib/python3.10/site-packages/hydra/_internal/utils.py", line 223, in run_and_report
    raise ex
  File "/home/csmith/.conda/envs/nemo1/lib/python3.10/site-packages/hydra/_internal/utils.py", line 220, in run_and_report
    return func()
  File "/home/csmith/.conda/envs/nemo1/lib/python3.10/site-packages/hydra/_internal/utils.py", line 458, in <lambda>
    lambda: hydra.run(
  File "/home/csmith/.conda/envs/nemo1/lib/python3.10/site-packages/hydra/_internal/hydra.py", line 132, in run
    _ = ret.return_value
  File "/home/csmith/.conda/envs/nemo1/lib/python3.10/site-packages/hydra/core/utils.py", line 260, in return_value
    raise self._return_value
  File "/home/csmith/.conda/envs/nemo1/lib/python3.10/site-packages/hydra/core/utils.py", line 186, in run_job
    ret.return_value = task_function(task_cfg)
  File "/home/csmith/projects/NeMo/examples/asr/transcribe_speech.py", line 366, in main
    transcriptions = asr_model.transcribe(
  File "/home/csmith/.conda/envs/nemo1/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/csmith/.conda/envs/nemo1/lib/python3.10/site-packages/nemo/collections/asr/models/aed_multitask_models.py", line 472, in transcribe
    for test_batch in tqdm(temporary_datalayer, desc="Transcribing", disable=not verbose):
  File "/home/csmith/.conda/envs/nemo1/lib/python3.10/site-packages/tqdm/asyncio.py", line 33, in __init__
    self.iterable_iterator = iter(iterable)
  File "/home/csmith/.conda/envs/nemo1/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 439, in __iter__
    return self._get_iterator()
  File "/home/csmith/.conda/envs/nemo1/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 387, in _get_iterator
    return _MultiProcessingDataLoaderIter(self)
  File "/home/csmith/.conda/envs/nemo1/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1085, in __init__
    self._reset(loader, first_iter=True)
  File "/home/csmith/.conda/envs/nemo1/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1118, in _reset
    self._try_put_index()
  File "/home/csmith/.conda/envs/nemo1/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1352, in _try_put_index
    index = self._next_index()
  File "/home/csmith/.conda/envs/nemo1/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 621, in _next_index
    return next(self._sampler_iter)  # may raise StopIteration
  File "/home/csmith/.conda/envs/nemo1/lib/python3.10/site-packages/lhotse/dataset/sampling/base.py", line 295, in __next__
    batch = self._next_batch()
  File "/home/csmith/.conda/envs/nemo1/lib/python3.10/site-packages/lhotse/dataset/sampling/dynamic.py", line 223, in _next_batch
    batch = next(self.cuts_iter)
  File "/home/csmith/.conda/envs/nemo1/lib/python3.10/site-packages/lhotse/dataset/sampling/dynamic.py", line 272, in __iter__
    yield self._collect_batch()
  File "/home/csmith/.conda/envs/nemo1/lib/python3.10/site-packages/lhotse/dataset/sampling/dynamic.py", line 298, in _collect_batch
    next_cut_or_tpl = next(self.cuts_iter)
  File "/home/csmith/.conda/envs/nemo1/lib/python3.10/site-packages/lhotse/dataset/sampling/dynamic.py", line 364, in __iter__
    for item in self.iterator:
  File "/home/csmith/.conda/envs/nemo1/lib/python3.10/site-packages/lhotse/cut/set.py", line 2525, in __iter__
    yield from self.cuts
  File "/home/csmith/.conda/envs/nemo1/lib/python3.10/site-packages/lhotse/cut/set.py", line 2525, in __iter__
    yield from self.cuts
  File "/home/csmith/.conda/envs/nemo1/lib/python3.10/site-packages/nemo/collections/common/data/lhotse/nemo_adapters.py", line 84, in __iter__
    text=data[self.text_field],
KeyError: 'answer'
Transcribing: 0it [00:00, ?it/s]

Please let me know if there's any other information that would be helpful in diagnosing what I'm missing - would really love to give this model a shot!

cschmittiey · 2024-02-09T19:35:01Z
Author

I got it working by adding an "answer": "random text" field to the manifest. Any idea what it should be set to, and is there a reason it's required?

stevehuang52 Feb 9, 2024
Collaborator

Sorry about the confusion. In the current NeMo design, the ground-truth field ("answer") is required even for inference, so it has to be set explicitly to a dummy string (any value works). We plan to make the field optional later so that users won't need to set a dummy ground truth manually.
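
For a manifest with many entries, the workaround described above can be scripted. The following is a minimal sketch, not part of NeMo: the helper name `add_dummy_answer` and the placeholder value `"na"` are assumptions made for illustration, and the only thing taken from this thread is that the loader looks up an "answer" key in each manifest line.

```python
import json


def add_dummy_answer(src: str, dst: str, placeholder: str = "na") -> int:
    """Copy a JSON-lines manifest, adding an "answer" field where it is missing.

    Entries that already have an "answer" value are left unchanged.
    Returns the number of entries written.
    """
    count = 0
    with open(src, encoding="utf-8") as fin, open(dst, "w", encoding="utf-8") as fout:
        for line in fin:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines in the manifest
            entry = json.loads(line)
            # Hypothetical placeholder: per the reply above, any string works.
            entry.setdefault("answer", placeholder)
            fout.write(json.dumps(entry, ensure_ascii=False) + "\n")
            count += 1
    return count
```

Calling `add_dummy_answer("manifest.json", "manifest_with_answer.json")` (both file names are placeholders) writes a patched copy, which can then be passed as `dataset_manifest=` to `transcribe_speech.py`.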

Answer selected by cschmittiey

cschmittiey Feb 9, 2024
Author

Thank you for the quick response! I am up and running now and appreciate the help :)

Kurumindla-Kranthivardhan Apr 25, 2024

Transcribing:
0/? [00:00<?, ?it/s]

AssertionError Traceback (most recent call last)
in <cell line: 1>()
----> 1 transcriptions = asr_model.transcribe(paths2audio_files="/content/test.jsonl",batch_size=4)

7 frames
/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py in decorate_context(*args, **kwargs)
113 def decorate_context(*args, **kwargs):
114 with ctx_factory():
--> 115 return func(*args, **kwargs)
116
117 return decorate_context

/usr/local/lib/python3.10/dist-packages/nemo/collections/asr/models/aed_multitask_models.py in transcribe(self, paths2audio_files, batch_size, logprobs, return_hypotheses, num_workers, channel_selector, augmentor, verbose)
470
471 temporary_datalayer = self._setup_transcribe_dataloader(config)
--> 472 for test_batch in tqdm(temporary_datalayer, desc="Transcribing", disable=not verbose):
473 log_probs, encoded_len, enc_states, enc_mask = self.forward(
474 input_signal=test_batch[0].to(device), input_signal_length=test_batch[1].to(device)

/usr/local/lib/python3.10/dist-packages/tqdm/notebook.py in __iter__(self)
248 try:
249 it = super(tqdm_notebook, self).__iter__()
--> 250 for obj in it:
251 # return super(tqdm...) will not catch exception
252 yield obj

/usr/local/lib/python3.10/dist-packages/tqdm/std.py in __iter__(self)
1179
1180 try:
-> 1181 for obj in iterable:
1182 yield obj
1183 # Update and possibly print the progressbar.

/usr/local/lib/python3.10/dist-packages/torch/utils/data/dataloader.py in __next__(self)
629 # TODO(pytorch/pytorch#76750)
630 self._reset() # type: ignore[call-arg]
--> 631 data = self._next_data()
632 self._num_yielded += 1
633 if self._dataset_kind == _DatasetKind.Iterable and \

/usr/local/lib/python3.10/dist-packages/torch/utils/data/dataloader.py in _next_data(self)
1344 else:
1345 del self._task_info[idx]
-> 1346 return self._process_data(data)
1347
1348 def _try_put_index(self):

/usr/local/lib/python3.10/dist-packages/torch/utils/data/dataloader.py in _process_data(self, data)
1370 self._try_put_index()
1371 if isinstance(data, ExceptionWrapper):
-> 1372 data.reraise()
1373 return data
1374

/usr/local/lib/python3.10/dist-packages/torch/_utils.py in reraise(self)
720 # instantiate since we don't know how to
721 raise RuntimeError(msg) from None
--> 722 raise exception
723
724

AssertionError: Caught AssertionError in DataLoader worker process 0.
Original Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/_utils/worker.py", line 308, in _worker_loop
data = fetcher.fetch(index)
File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/_utils/fetch.py", line 53, in fetch
data = self.dataset[possibly_batched_index]
File "/usr/local/lib/python3.10/dist-packages/nemo/collections/asr/data/audio_to_text_lhotse_prompted.py", line 55, in __getitem__
tokens = self.prompt_format_fn(cuts, self.tokenizer)
File "/usr/local/lib/python3.10/dist-packages/nemo/collections/asr/data/audio_to_text_lhotse_prompted.py", line 127, in canary
assert isinstance(cut, MonoCut), "Expected MonoCut."
AssertionError: Expected MonoCut.

Kurumindla-Kranthivardhan Apr 25, 2024

Can I know why I am getting this error after setting "answer" to some random text?

stevehuang52 · 2024-05-06T03:36:05Z
Collaborator

Can I know why I am getting this error after setting "answer" to some random text?

Hi @Kurumindla-Kranthivardhan, we currently support only single-channel audio input, and your input may have more than one channel. We will add support for multi-channel input. cc @pzelasko
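
If the input does turn out to have more than one channel, one possible workaround until multi-channel support lands is to downmix the file to mono before transcribing. The sketch below uses only the Python standard library and assumes 16-bit PCM WAV input; the function names `channel_count` and `downmix_to_mono` are illustrative and not part of NeMo or Lhotse.

```python
import array
import sys
import wave


def channel_count(path: str) -> int:
    """Return the number of channels in a WAV file."""
    with wave.open(path, "rb") as wf:
        return wf.getnchannels()


def downmix_to_mono(src: str, dst: str) -> None:
    """Average all channels of a 16-bit PCM WAV file into a single channel."""
    with wave.open(src, "rb") as wf:
        n_channels = wf.getnchannels()
        sample_width = wf.getsampwidth()
        frame_rate = wf.getframerate()
        raw = wf.readframes(wf.getnframes())
    if sample_width != 2:
        # Assumption of this sketch: other sample widths are not handled.
        raise ValueError("only 16-bit PCM WAV files are supported here")

    samples = array.array("h")  # signed 16-bit integers
    samples.frombytes(raw)
    if sys.byteorder == "big":
        samples.byteswap()  # WAV sample data is little-endian

    # Samples are interleaved per frame: ch0, ch1, ..., ch0, ch1, ...
    mono = array.array(
        "h",
        (
            sum(samples[i:i + n_channels]) // n_channels
            for i in range(0, len(samples), n_channels)
        ),
    )
    if sys.byteorder == "big":
        mono.byteswap()

    with wave.open(dst, "wb") as out:
        out.setnchannels(1)
        out.setsampwidth(2)
        out.setframerate(frame_rate)
        out.writeframes(mono.tobytes())
```

Checking `channel_count(path)` for each `audio_filepath` in the manifest first shows whether the files are in fact multi-channel; only files that report more than one channel need to be converted and have their manifest paths updated.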
