🇺🇦 Speech Recognition & Synthesis for Ukrainian

🎤 Speech-to-Text

📊 Benchmarks

`wav2vec2`

Model	WER	CER	Accuracy, %	WER^+LM	CER^+LM	Accuracy^+LM, %
Yehor/wav2vec2-xls-r-1b-uk-with-lm	0.1807	0.0317	81.93%	0.1193	0.0218	88.07%
Yehor/wav2vec2-xls-r-1b-uk-with-binary-news-lm	0.1807	0.0317	81.93%	0.0997	0.0191	90.03%
Yehor/wav2vec2-xls-r-300m-uk-with-lm	0.2906	0.0548	70.94%	0.172	0.0355	82.8%
Yehor/wav2vec2-xls-r-300m-uk-with-news-lm	0.2027	0.0365	79.73%	0.0929	0.019	90.71%
Yehor/wav2vec2-xls-r-300m-uk-with-wiki-lm	0.2027	0.0365	79.73%	0.1045	0.0208	89.55%
Yehor/wav2vec2-xls-r-base-uk-with-small-lm	0.4441	0.0975	55.59%	0.2878	0.0711	71.22%
robinhad/wav2vec2-xls-r-300m-uk	0.2736	0.0537	72.64%	-	-	-
arampacha/wav2vec2-xls-r-1b-uk	0.1652	0.0293	83.48%	0.0945	0.0175	90.55%
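
The columns in these tables can be reproduced in principle with a few lines of Python. Below is a minimal sketch, assuming WER and CER are Levenshtein edit distances over words and characters divided by the reference length, and that Accuracy is (1 - WER) × 100, which matches the numbers in the tables (for example, 0.1807 gives 81.93%). The function names are illustrative, and the text normalization used for the actual benchmarks is not shown here, so results on real data may differ.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / number of reference words."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: character-level edit distance / reference length."""
    return edit_distance(list(reference), list(hypothesis)) / len(reference)


def accuracy_percent(wer_value: float) -> float:
    """Accuracy as reported in the tables, assumed to be (1 - WER) * 100."""
    return round((1 - wer_value) * 100, 2)


if __name__ == "__main__":
    reference = "місто в області"
    hypothesis = "місто області"  # one word dropped
    print(wer(reference, hypothesis), cer(reference, hypothesis))
    print(accuracy_percent(0.1807))
```

Dropping one of three reference words gives a WER of 1/3, so a single missed short word can move the score considerably on short utterances.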

`Citrinet`

lm-4gram-500k is used as the LM

Model	WER	CER	Accuracy, %	WER^+LM	CER^+LM	Accuracy^+LM, %
nvidia/stt_uk_citrinet_1024_gamma_0_25	0.0432	0.0094	95.68%	0.0352	0.0079	96.48%
neongeckocom/stt_uk_citrinet_512_gamma_0_25	0.0746	0.016	92.54%	0.0563	0.0128	94.37%

`ContextNet`

Model	WER	CER	Accuracy, %
theodotus/stt_uk_contextnet_512	0.0669	0.0145	93.31%

`FastConformer P&C`

This model outputs text with punctuation and capitalization (P&C).

Model	WER	CER	Accuracy, %	WER^+P&C	CER^+P&C	Accuracy^+P&C, %
theodotus/stt_ua_fastconformer_hybrid_large_pc	0.0400	0.0102	96.00%	0.0710	0.0167	92.90%

`Squeezeformer`

lm-4gram-500k is used as the LM

Model	WER	CER	Accuracy, %	WER^+LM	CER^+LM	Accuracy^+LM, %
theodotus/stt_uk_squeezeformer_ctc_xs	0.1078	0.0229	89.22%	0.0777	0.0174	92.23%
theodotus/stt_uk_squeezeformer_ctc_sm	0.082	0.0175	91.8%	0.0605	0.0142	93.95%
theodotus/stt_uk_squeezeformer_ctc_ml	0.0591	0.0126	94.09%	0.0451	0.0105	95.49%

`Flashlight`

lm-4gram-500k is used as the LM

Model	WER	CER	Accuracy, %	WER^+LM	CER^+LM	Accuracy^+LM, %
Flashlight Conformer	0.1915	0.0244	80.85%	0.0907	0.0198	90.93%

`data2vec`

Model	WER	CER	Accuracy, %
robinhad/data2vec-large-uk	0.3117	0.0731	68.83%

`VOSK`

Model	WER	CER	Accuracy, %
v3	0.5325	0.3878	46.75%

`m-ctc-t`

Model	WER	CER	Accuracy, %
speechbrain/m-ctc-t-large	0.57	0.1094	43%

`whisper`

Model	WER	CER	Accuracy, %
tiny	0.6308	0.1859	36.92%
base	0.521	0.1408	47.9%
small	0.3057	0.0764	69.43%
medium	0.1873	0.044	81.27%
large (v1)	0.1642	0.0393	83.58%
large (v2)	0.1372	0.0318	86.28%

Fine-tuned versions for Ukrainian:

Model	WER	CER	Accuracy, %
small	0.2704	0.0565	72.96%
large	0.2482	0.055	75.18%

If you want to fine-tune a Whisper model on your own data, use this repository: https://github.com/egorsmkv/whisper-ukrainian

`DeepSpeech`

Model	WER	CER	Accuracy, %
v0.5	0.7025	0.2009	29.75%

📖 Development

How to train your own model using Kaldi (instructions in Russian): https://github.com/egorsmkv/speech-recognition-uk/blob/master/vosk-model-creation/INSTRUCTION.md
How to train a KenLM model based on Ukrainian Wikipedia data: https://github.com/egorsmkv/ukwiki-kenlm
Export a traced JIT version of wav2vec2 models: https://github.com/egorsmkv/wav2vec2-jit

📚 Datasets

Compiled dataset from different open sources + Companies + Community = 188.31 GB / ~1200 hours 💪

Storage Share powered by Nextcloud: https://nx16725.your-storageshare.de/s/cAbcBeXtdz7znDN (use Wget to download; downloading in a browser is speed-limited)
Torrent file: https://academictorrents.com/details/fcf8bb60c59e9eb583df003d54ed61776650beb8 (188.31 GB)

Further dataset sections in the repository: Voice of America (398 hours), FLEURS, YODAS2, Companies, Ukrainian podcasts, Cleaned Common Voice 10 (test set), Noised Common Voice 10, Community, Other.

⭐ Related works

Language models

Ukrainian LMs: https://huggingface.co/Yehor/kenlm-ukrainian

Inverse Text Normalization

WFST for Ukrainian Inverse Text Normalization: https://github.com/lociko/ukraine_itn_wfst

Text Enhancement

Punctuation and capitalization model: https://huggingface.co/dchaplinsky/punctuation_uk_bert (demo: https://huggingface.co/spaces/Yehor/punctuation-uk)

Aligners

Aligner for wav2vec2-bert models: https://github.com/egorsmkv/w2v2-bert-aligner
Aligner based on FasterWhisper (mostly for TTS): https://github.com/patriotyk/narizaka
Aligner based on Kaldi: https://github.com/proger/uk

📢 Text-to-Speech

Test sentence with stresses:

К+ам'ян+ець-Под+ільський - м+істо в Хмельн+ицькій +області Укра+їни, ц+ентр Кам'ян+ець-Под+ільської міськ+ої об'+єднаної територі+альної гром+ади +і Кам'ян+ець-Под+ільського рай+ону.

Without stresses:

Кам'янець-Подільський - місто в Хмельницькій області України, центр Кам'янець-Подільської міської об'єднаної територіальної громади і Кам'янець-Подільського району.
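
Comparing the two sentences above, the stressed variant places a "+" immediately before each stressed vowel. The sketch below shows two simple conversions under that assumption: removing the marks to recover plain text, and replacing each "+" with a combining acute accent (U+0301) after the vowel for display. The function names are illustrative, and individual TTS models may expect a different stress notation, so check each model's documentation.

```python
STRESS_MARK = "+"
COMBINING_ACUTE = "\u0301"


def strip_stress(text: str) -> str:
    """Remove all '+' stress marks, leaving the plain sentence."""
    return text.replace(STRESS_MARK, "")


def plus_to_acute(text: str) -> str:
    """Turn '+<vowel>' into '<vowel>' followed by a combining acute accent."""
    out = []
    chars = iter(text)
    for ch in chars:
        if ch == STRESS_MARK:
            # The mark applies to the character that follows it.
            vowel = next(chars, "")
            # A trailing '+' with nothing after it is dropped.
            out.append(vowel + COMBINING_ACUTE if vowel else "")
        else:
            out.append(ch)
    return "".join(out)


if __name__ == "__main__":
    stressed = "м+істо в Хмельн+ицькій +області Укра+їни"
    print(strip_stress(stressed))
    print(plus_to_acute(stressed))
```

Applying `strip_stress` to the stressed test sentence should yield the unstressed one, which makes it a convenient consistency check when preparing paired inputs.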

📦 Implementations

StyleTTS2

StyleTTS2 demo & the code

P-Flow TTS

P-Flow TTS

RAD-TTS

RAD-TTS, the voice "Lada"
RAD-TTS with three voices: Lada, Tetiana, and Mykyta

Coqui TTS

v1.0.0 using M-AILABS dataset: https://github.com/robinhad/ukrainian-tts/releases/tag/v1.0.0 (200,000 steps)
v2.0.0 using Mykyta/Olena dataset: https://github.com/robinhad/ukrainian-tts/releases/tag/v2.0.0 (140,000 steps)

Neon TTS

Coqui TTS model implemented in the Neon Coqui TTS Python Plugin. An interactive demo is available on Hugging Face. This model and others can be downloaded from Hugging Face, and more information can be found at neon.ai.

FastPitch

NVIDIA FastPitch: https://huggingface.co/theodotus/tts_uk_fastpitch

Balacoon TTS

Balacoon TTS, voices of Lada, Tetiana and Mykyta. Blog post on model release.

📚 Datasets

Open Text-to-Speech voices for 🇺🇦 Ukrainian: https://huggingface.co/datasets/Yehor/opentts-uk
- Voice "LADA", female
- Voice "TETIANA", female
- Voice "KATERYNA", female
- Voice "MYKYTA", male
- Voice "OLEKSA", male

⭐ Related works

Accentors

Misc

Tool for building a high-quality text-to-speech (TTS) corpus from audio and text of books: https://github.com/patriotyk/narizaka
Text normalization model: https://huggingface.co/skypro1111/mbart-large-50-verbalization
