SOVA Dataset

SOVA Dataset is a free public STT/ASR dataset.

Key facts:

Dataset composition

| Name | | Lang | Hours | Size | Source | Equipment | Annotation | Speech type | Augmentation | Quality |
|---|---|---|---|---|---|---|---|---|---|---|
| EngAudiobooksOriginal | Download | EN | 7 130 | 743 GB | audiobook | professional | forced alignment | reading | none | 95% |
| EngAudiobooksNoisy | Download | EN | 3 873 | 310 GB | audiobook | professional | forced alignment | reading | phone calls | 95% |
| RuAudiobooksDevices | Download | RU | 298 | 30.24 GB | audiobook | unprofessional | manual | reading | none | 99% |
| RuDevices | Download | RU | 101 | 10.42 GB | audio records | unprofessional | manual | live speech | none | 98% |
| RuYoutube | Download | RU | 17 451 | 1 873 GB | audio records | unprofessional | ASR | live speech | none | 95% |
| ZhYoutube | Download | CN | 3 475.1 | 321 GB | audio records | unprofessional | ASR | live speech | none | 97.83% |
| TOTAL | - | - | 32 328.1 | 3 287.66 GB (3.21 TB) | - | - | - | - | - | - |
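
The TOTAL row can be cross-checked from the per-subset figures. The snippet below is a minimal sketch, not part of the dataset tooling: the `SUBSETS` dictionary and the `totals` helper are hypothetical names, the numbers are copied from the table, and the conversion 1 TB = 1024 GB is an assumption inferred from the table's own 3.21 TB figure.

```python
# Hours and sizes (in GB) copied from the composition table.
# SUBSETS and totals() are illustrative names, not part of any SOVA tooling.
SUBSETS = {
    "EngAudiobooksOriginal": (7130.0, 743.0),
    "EngAudiobooksNoisy": (3873.0, 310.0),
    "RuAudiobooksDevices": (298.0, 30.24),
    "RuDevices": (101.0, 10.42),
    "RuYoutube": (17451.0, 1873.0),
    "ZhYoutube": (3475.1, 321.0),
}


def totals(subsets):
    """Return (total hours, total GB, total TB), assuming 1 TB = 1024 GB."""
    hours = sum(h for h, _ in subsets.values())
    size_gb = sum(s for _, s in subsets.values())
    return round(hours, 1), round(size_gb, 2), round(size_gb / 1024, 2)


if __name__ == "__main__":
    print(totals(SUBSETS))
```

With a decimal conversion (1 TB = 1000 GB) the same size would come to about 3.29 TB, which is why the binary factor is assumed here.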

For any questions, please feel free to contact us at support@sova.ai.

SOVA Dataset is licensed under the Creative Commons BY 4.0 license by Virtual Assistant, LLC.
