IndoWordnet is a linked structure of wordnets of major Indian languages from the Indo-Aryan, Dravidian and Sino-Tibetan language families. Synsets are linked across many languages. Every synset in every language contains a gloss and an example usage sentence or phrase. In a large number of cases, the example and gloss sentences across languages are translations of one another. Hence, IndoWordnet is a source of parallel corpora across multiple Indian languages.
The corpus contains about 6.3 million parallel segments across 18 Indian languages from 3 language families.
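The idea of pairing glosses and examples through linked synsets can be illustrated with a minimal sketch. This is not the extraction code used to build the corpus; the synset IDs, language codes, placeholder strings, and the `parallel_segments` helper are all hypothetical and chosen only for illustration. It also assumes that every linked gloss and example is a translation, whereas in the real data this holds only in a large number of cases.

```python
from itertools import combinations

# Hypothetical toy data: synset ID -> language code -> (gloss, example).
# IDs, language codes, and placeholder strings are invented for this sketch.
synsets = {
    101: {
        "hi": ("gloss-101-hi", "example-101-hi"),
        "mr": ("gloss-101-mr", "example-101-mr"),
        "ta": ("gloss-101-ta", "example-101-ta"),
    },
    102: {
        "hi": ("gloss-102-hi", "example-102-hi"),
        "mr": ("gloss-102-mr", "example-102-mr"),
    },
}


def parallel_segments(synsets):
    """Yield (lang_a, lang_b, text_a, text_b) for each language pair sharing a synset.

    Each shared synset contributes two candidate segments per language pair:
    one from the glosses and one from the example sentences.
    """
    for synset_id, entries in sorted(synsets.items()):
        for lang_a, lang_b in combinations(sorted(entries), 2):
            gloss_a, example_a = entries[lang_a]
            gloss_b, example_b = entries[lang_b]
            yield (lang_a, lang_b, gloss_a, gloss_b)
            yield (lang_a, lang_b, example_a, example_b)


segments = list(parallel_segments(synsets))
for segment in segments:
    print(segment)
```

A synset present in n languages yields n(n-1)/2 language pairs, which is why linking synsets across many languages produces a large number of candidate segments.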
NEWS! WMT 2020 is using this corpus for its shared task on similar language translation.
You can read more about the corpus in this document: pdf
You can download the corpus HERE
- v0.2 (14 May 2020): Bug fixes to address problems with extraction in v0.1.
- v0.1 (25 March 2020): Initial release (BUGGY: do not use this version; use v0.2 instead).
This dataset is released under the Creative Commons Attribution-ShareAlike 4.0 International license.
If you use this dataset, please include the following citation:
@misc{kunchukuttan2020iwnparallel,
author = "Anoop Kunchukuttan",
title = "IndoWordnet Parallel Corpus",
year = "2020",
howpublished={\url{https://github.com/anoopkunchukuttan/indowordnet_parallel}}
}
We would like to hear from you if:
- You are using our resources. Please let us know how you are putting these resources to use.
- You have any feedback on these resources.