Canary training with punctuated/capitalized ASR transcripts is normalizing transcripts to lower case and no special characters #9398

DeveshS1209 · 2024-06-06T13:43:02Z

I am trying to train a Canary model on punctuated and capitalized data. My training and validation manifest files contain punctuated text, for example:

{"audio_filepath": "audio.wav", "duration": 5.39, "taskname": "asr", "source_lang": "en", "target_lang": "en", "pnc": "yes", "answer": "na", "text": "She is not available that is fine, ok sir can I update this address now ok na."}

However, when I start training, all punctuation and capitalization are gone from the transcripts.

I am fairly sure I have removed every normalization step that runs before training, and I have also trained my tokenizer on punctuated data.
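One way to double-check that the manifest itself still carries punctuation and capitalization is a quick scan of its `text` fields. The snippet below is only a sketch: `summarize_manifest_text` is a hypothetical helper (not part of NeMo), and it assumes one JSON object per line with a `text` key, as in the sample entry above.

```python
import json
import string


def summarize_manifest_text(lines):
    """Count manifest entries whose "text" has uppercase letters or punctuation.

    Hypothetical helper for a sanity check; not a NeMo API.
    """
    total = with_upper = with_punct = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        text = json.loads(line)["text"]
        total += 1
        with_upper += any(ch.isupper() for ch in text)
        with_punct += any(ch in string.punctuation for ch in text)
    return {"total": total, "with_upper": with_upper, "with_punct": with_punct}


# The sample entry from this post; for a real manifest, pass an open file instead.
sample = [
    '{"audio_filepath": "audio.wav", "duration": 5.39, "taskname": "asr", '
    '"source_lang": "en", "target_lang": "en", "pnc": "yes", "answer": "na", '
    '"text": "She is not available that is fine, ok sir can I update this address now ok na."}'
]
print(summarize_manifest_text(sample))
```

If `with_upper` and `with_punct` are close to `total`, the manifest is intact and the normalization is happening somewhere later in the pipeline.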

Is there anything specific in the Canary implementation that normalizes my text again?

@titu1994

Answered by MedAymenF

MedAymenF · 2024-06-08T22:47:11Z

Which command did you use to train the tokenizer?
"${NEMO_ROOT}/scripts/tokenizers/process_asr_text_tokenizer.py" applies nmt_nfkc_cf normalization by default.
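To see what that kind of normalization does to a transcript, here is a rough standard-library approximation. It is an assumption-laden sketch, not SentencePiece's actual `nmt_nfkc_cf` rule set (which also includes NMT-specific character rules): `approx_nfkc_cf` is a hypothetical name that just applies Unicode NFKC followed by case folding. As far as I recall, the script also has a `--no_lower_case` flag that turns the case folding off, but please confirm with `--help` on your NeMo version.

```python
import unicodedata


def approx_nfkc_cf(text):
    """Rough stand-in for NFKC normalization followed by case folding.

    Hypothetical helper; the real SentencePiece rule has additional mappings.
    """
    return unicodedata.normalize("NFKC", text).casefold()


original = "She is not available that is fine, ok sir can I update this address now ok na."
folded = approx_nfkc_cf(original)

print(original)
print(folded)
```

In this approximation the capital letters are folded to lower case while the comma and full stop stay in place.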
