Parallel corpus mined from IndoWordnet synset gloss and examples

anoopkunchukuttan/indowordnet_parallel

The IndoWordnet Parallel Corpus

IndoWordnet is a linked structure of wordnets of major Indian languages from the Indo-Aryan, Dravidian and Sino-Tibetan families. Synsets are linked across languages, and every synset in every language contains a gloss and an example usage sentence or phrase. In a large number of cases, the gloss and example sentences for a synset are translations of each other across languages. Hence, IndoWordnet is a source of parallel corpora across multiple Indian languages.

The corpus contains about 6.3 million parallel segments across 18 Indian languages from 3 language families.
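
To make the mining idea concrete, here is a minimal Python sketch of how linked synsets can yield parallel segments. It is an illustration only, not the extraction code used to build this corpus: the `synsets` dictionary, its layout, the language codes, the placeholder strings, and the function name `mine_parallel_segments` are all hypothetical. It also assumes every gloss and example is a translation, whereas in practice this holds only in a large number of cases, so a real pipeline would need filtering.

```python
from itertools import combinations

# Hypothetical toy data: synset ID -> language code -> (gloss, example).
# IDs, language codes, and placeholder strings are illustrative assumptions,
# not the actual IndoWordnet storage format.
synsets = {
    101: {
        "hi": ("gloss-101-hi", "example-101-hi"),
        "mr": ("gloss-101-mr", "example-101-mr"),
        "ta": ("gloss-101-ta", "example-101-ta"),
    },
    102: {
        "hi": ("gloss-102-hi", "example-102-hi"),
        "ta": ("gloss-102-ta", "example-102-ta"),
    },
}


def mine_parallel_segments(synsets):
    """Yield (src_lang, tgt_lang, src_text, tgt_text) for each language pair
    that shares a synset, pairing gloss with gloss and example with example."""
    for synset_id in sorted(synsets):
        entries = synsets[synset_id]
        # Every unordered pair of languages present in this synset.
        for src, tgt in combinations(sorted(entries), 2):
            # zip pairs the two glosses first, then the two examples.
            for src_text, tgt_text in zip(entries[src], entries[tgt]):
                yield (src, tgt, src_text, tgt_text)


segments = list(mine_parallel_segments(synsets))
for segment in segments:
    print("\t".join(segment))
```

In this toy setup a synset present in n languages contributes n(n-1)/2 language pairs, each with one gloss pair and one example pair.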

NEWS! WMT 2020 is using this corpus for the shared task on similar language translation.

Documentation

You can read more about the corpus in this document: pdf

Download the corpus

You can download the corpus HERE

Version History

  • v0.2 (14 May 2020): Bug fixes to address problems with extraction in v0.1.
  • v0.1 (25 March 2020): Initial release (BUGGY: don't use this version, use v0.2)

License

This dataset is released under the Creative Commons Attribution-ShareAlike 4.0 International license.

Citing this dataset

If you use this dataset, please include the following citation:

@misc{kunchukuttan2020iwnparallel,
author = "Anoop Kunchukuttan",
title = "IndoWordnet Parallel Corpus",
year = "2020",
howpublished={\url{https://github.com/anoopkunchukuttan/indowordnet_parallel}}
}

We would like to hear from you if:

  • You are using our resources. Please let us know how you are putting these resources to use.
  • You have any feedback on these resources.
