L3Cube-HingCorpus is the first large-scale, real Hindi-English code-mixed corpus in Roman script. It consists of 52.93M sentences and 1.04B tokens scraped from Twitter. We also present HingBERT, HingMBERT, HingRoBERTa, and HingGPT. The BERT models have been pre-trained on the code-mixed HingCorpus. The evaluation details are reported in our paper (link).
The full HingCorpus (roman) is shared here.
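As a rough illustration of how the shared corpus can be consumed for BERT-style masked-language-model pretraining, here is a minimal Python sketch. It assumes the download is a plain text file with one Roman-script sentence per line; the file name `hing_corpus_roman.txt` and the 15% masking rate are placeholders, not the exact recipe used for HingBERT.

```python
# Minimal sketch (not the official pretraining recipe): stream the corpus and
# prepare masked-language-model batches with Hugging Face datasets/transformers.
# "hing_corpus_roman.txt" is a placeholder file name for the downloaded corpus.
from datasets import load_dataset
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

corpus = load_dataset("text", data_files={"train": "hing_corpus_roman.txt"},
                      streaming=True)["train"]

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
tokenized = corpus.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=128),
    batched=True,
    remove_columns=["text"],
)

# Dynamic masking of 15% of tokens, as in standard BERT-style MLM pretraining.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)
```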
| Model | Description | Link |
|---|---|---|
| HingBERT | Base BERT | roman |
| HingRoBERTa | RoBERTa | roman, roman + devanagari |
| HingMBERT | mBERT | roman, roman + devanagari |
| HingGPT | GPT-2 | roman, devanagari |
| HingFT | FastText | link |
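Below is a minimal loading sketch for the Hub-hosted models, assuming they are published under the `l3cube-pune` organization on the Hugging Face Hub; the model IDs are assumptions for illustration, so use the links in the table above for the actual checkpoints.

```python
# Minimal usage sketch; model IDs below are assumed, check the table links.
from transformers import pipeline

# Masked-token prediction with the Roman-script HingBERT checkpoint.
fill_mask = pipeline("fill-mask", model="l3cube-pune/hing-bert")
print(fill_mask("mujhe yeh movie bahut [MASK] lagi"))

# Free-form generation with the Roman-script HingGPT checkpoint.
generate = pipeline("text-generation", model="l3cube-pune/hing-gpt")
print(generate("aaj ka din", max_length=30, num_return_sequences=1))
```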
L3Cube-HingLID is a Hindi-English code-mixed language identification dataset. It consists of 31,756 train, 6,420 test, and 6,279 validation samples. The dataset is shared in the L3Cube-HingLID/ folder. The HingBERT-LID model is shared here.
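A minimal inference sketch for word-level language identification with HingBERT-LID follows. The model ID `l3cube-pune/hing-bert-lid` and the tag names it returns are assumptions; refer to the linked model card for the actual identifier and label set.

```python
# Minimal sketch: word-level language identification with a token-classification
# pipeline. The model ID is an assumption; see the HingBERT-LID link above.
from transformers import pipeline

lid = pipeline("token-classification",
               model="l3cube-pune/hing-bert-lid",
               aggregation_strategy="simple")

# Each returned entry carries the predicted language tag for a span of the input.
for entry in lid("yeh weekend I am going to Pune with my friends"):
    print(entry["word"], entry["entity_group"], round(entry["score"], 3))
```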
L3Cube-MeCorpus is a first-of-its-kind large code-mixed Marathi-English (Mr-En) corpus with 10 million social media sentences, released in this paper. The dataset details and the code-mixed MeBERT models are shared in the MarathiNLP repo.
MeSent, MeHate, and MeLID are the first code-mixed Marathi-English sentiment analysis, hate speech identification, and language identification datasets, respectively, released in this paper. The datasets are shared here.
L3Cube-HingCorpus and L3Cube-HingLID are licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
@article{nayak2022l3cube,
title={L3Cube-HingCorpus and HingBERT: A Code Mixed Hindi-English Dataset and BERT Language Models},
author={Nayak, Ravindra and Joshi, Raviraj},
journal={arXiv preprint arXiv:2204.08398},
year={2022}
}
This project is coordinated and mentored by Raviraj Joshi under L3Cube Pune. For any queries, contact ravirajoshi@gmail.com.