L3Cube-HingCorpus is the first large-scale, real Hindi-English code-mixed corpus in Roman script. It consists of 52.93M sentences and 1.04B tokens scraped from Twitter. We also present HingBERT, HingMBERT, HingRoBERTa, and HingGPT. The BERT models have been pre-trained on the code-mixed HingCorpus. The evaluation details are reported in our paper (link).
The full HingCorpus (roman) is shared here.
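As a rough illustration of how the shared corpus can be consumed for BERT-style masked-language-model pretraining, here is a minimal Python sketch. It assumes the download is a plain text file with one Roman-script sentence per line; the file name `hing_corpus_roman.txt` and the 15% masking rate are placeholders, not the exact recipe used for HingBERT.

```python
# Minimal sketch (not the official pretraining recipe): stream the corpus and
# prepare masked-language-model batches with Hugging Face datasets/transformers.
# "hing_corpus_roman.txt" is a placeholder file name for the downloaded corpus.
from datasets import load_dataset
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

corpus = load_dataset("text", data_files={"train": "hing_corpus_roman.txt"},
                      streaming=True)["train"]

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
tokenized = corpus.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=128),
    batched=True,
    remove_columns=["text"],
)

# Dynamic masking of 15% of tokens, as in standard BERT-style MLM pretraining.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)
```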
| Model | Description | Link |
|---|---|---|
| HingBERT | Base BERT | roman |
| HingRoBERTa | RoBERTa | roman, roman + devanagari |
| HingMBERT | mBERT | roman, roman + devanagari |
| HingGPT | GPT-2 | roman, devanagari |
| HingFT | FastText | link |
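Below is a minimal loading sketch for the Hub-hosted models, assuming they are published under the `l3cube-pune` organization on the Hugging Face Hub; the model IDs are assumptions for illustration, so use the links in the table above for the actual checkpoints.

```python
# Minimal usage sketch; model IDs below are assumed, check the table links.
from transformers import pipeline

# Masked-token prediction with the Roman-script HingBERT checkpoint.
fill_mask = pipeline("fill-mask", model="l3cube-pune/hing-bert")
print(fill_mask("mujhe yeh movie bahut [MASK] lagi"))

# Free-form generation with the Roman-script HingGPT checkpoint.
generate = pipeline("text-generation", model="l3cube-pune/hing-gpt")
print(generate("aaj ka din", max_length=30, num_return_sequences=1))
```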
L3Cube-HingLID is a Hindi-English code-mixed language identification dataset. It consists of 31,756 train, 6,420 test, and 6,279 validation samples. The dataset is shared in the L3Cube-HingLID/ folder. The HingBERT-LID model is shared here.
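A minimal inference sketch for word-level language identification with HingBERT-LID follows. The model ID `l3cube-pune/hing-bert-lid` and the tag names it returns are assumptions; refer to the linked model card for the actual identifier and label set.

```python
# Minimal sketch: word-level language identification with a token-classification
# pipeline. The model ID is an assumption; see the HingBERT-LID link above.
from transformers import pipeline

lid = pipeline("token-classification",
               model="l3cube-pune/hing-bert-lid",
               aggregation_strategy="simple")

# Each returned entry carries the predicted language tag for a span of the input.
for entry in lid("yeh weekend I am going to Pune with my friends"):
    print(entry["word"], entry["entity_group"], round(entry["score"], 3))
```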
L3Cube-MeCorpus is a first-of-its-kind large code-mixed Marathi-English (Mr-En) corpus with 10 million social media sentences, released in this paper. The dataset details and the code-mixed MeBERT models are shared in the MarathiNLP repo.
MeSent, MeHate, and MeLID are the first code-mixed Marathi-English sentiment analysis, hate speech identification, and language identification datasets, respectively, released in this paper. The datasets are shared here.
L3Cube-HingCorpus and L3Cube-HingLID are licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
@article{nayak2022l3cube,
title={L3Cube-HingCorpus and HingBERT: A Code Mixed Hindi-English Dataset and BERT Language Models},
author={Nayak, Ravindra and Joshi, Raviraj},
journal={arXiv preprint arXiv:2204.08398},
year={2022}
}
This project is coordinated and mentored by Raviraj Joshi under L3Cube Pune. For any queries, contact ravirajoshi@gmail.com.