This repository contains a state-of-the-art tokenizer, language model, and classifier for Nepali, the official language of Nepal and one of the officially recognized languages of India.
- Download the Nepali Wikipedia articles dataset (38,757 articles), which I scraped, cleaned, and used to train the language model
- Download the Nepali news classification dataset, which I scraped and used to train the classifier
Results on a 30% validation set:
- Perplexity of language model: ~32
- Accuracy of classification model: ~97%
- Kappa score of classification model: ~0.96
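
For readers who want to reproduce these numbers on their own validation split, here is a minimal sketch of how the three metrics are commonly computed. It assumes perplexity is the exponential of the mean per-token cross-entropy loss (in nats) and that the kappa score is Cohen's kappa; the repository does not state either explicitly, and the function names and toy labels are hypothetical.

```python
import math
from collections import Counter


def perplexity(mean_cross_entropy_nats):
    """Perplexity as exp of the mean per-token cross-entropy (natural log)."""
    return math.exp(mean_cross_entropy_nats)


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def cohen_kappa(y_true, y_pred):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(y_true)
    p_observed = accuracy(y_true, y_pred)
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    # Chance agreement: probability that both label sources pick the same class
    p_expected = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    if p_expected == 1.0:
        # Degenerate case: only one class appears in both label lists
        return 1.0
    return (p_observed - p_expected) / (1.0 - p_expected)


# Toy labels (hypothetical, not taken from the news dataset)
y_true = ["sports", "sports", "politics", "politics"]
y_pred = ["sports", "sports", "politics", "sports"]

toy_accuracy = accuracy(y_true, y_pred)
toy_kappa = cohen_kappa(y_true, y_pred)
```

Kappa is worth reporting alongside accuracy because it discounts the agreement a classifier would get by chance, which matters when the news categories are imbalanced.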
Download the pretrained language model from here
Download the classifier from here
The tokenizer was trained using Google's SentencePiece
Download the trained tokenizer model and vocabulary from here
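
Once the tokenizer model is downloaded, it can presumably be loaded with the `sentencepiece` Python package. The sketch below assumes the download is a standard SentencePiece `.model` file; the file name `nepali_sp.model` is a placeholder, not the actual name in this repository, and the helper names are hypothetical.

```python
def load_tokenizer(model_path):
    """Load a trained SentencePiece model from disk."""
    # Imported lazily so this module can be imported without the package installed
    import sentencepiece as spm

    return spm.SentencePieceProcessor(model_file=model_path)


def tokenize(processor, text):
    """Split text into subword pieces, returned as strings rather than ids."""
    return processor.encode(text, out_type=str)


# Usage (assumes the model file has been downloaded; the name is a placeholder):
# sp = load_tokenizer("nepali_sp.model")
# print(tokenize(sp, "नेपाल एक सुन्दर देश हो।"))
```

Passing `out_type=int` instead returns vocabulary ids, which is the form a language model typically consumes.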