A repository of speech datasets and models for Indo-Aryan languages, prepared by Dr. Bhimrao Ambedkar University and the Council for Strategic and Defense Research under various projects, in collaboration with Karya Inc. and UnReaL-TecE LLP.
This repository currently contains the transcriptions of the speech data collected through the Karya App for the pilot of the SpeeD-IA project in four languages: Awadhi, Bhojpuri, Braj and Magahi.
The audio can be downloaded here. SpeeD-IA Audio and Transcription is licensed under CC BY-NC-SA 4.0. For commercial licensing of the dataset, contact UnReaL-TecE LLP.
If you use the data, please cite the following paper:
@inproceedings{interspeech2022,
author = {Kumar, Ritesh and Singh, Siddharth and Ratan, Shyam and Raj, Mohit and Sinha, Sonal and Lahiri, Bornini and Seshadri, Vivek and Bali, Kalika and Ojha, Atul Kr.},
title = {Annotated Speech Corpus for Low Resource Indian Languages: Awadhi, Bhojpuri, Braj and Magahi},
booktitle = {Proceedings of Speech for Social Good Workshop, Interspeech 2022},
year = {2022}
}
For any queries, please feel free to contact riteshkr[dot]kmi (the address is at the most popular email domain starting with 'g').