NLP for Nepali

This repository contains a state-of-the-art tokenizer, language model, and classifier for Nepali, the official language of Nepal and one of the officially recognized languages of India.

Dataset

Download the Nepali Wikipedia Articles Dataset (38,757 articles), which I scraped, cleaned, and used to train the language model.
Download the Nepali News Classification Dataset, which I scraped and used to train the classifier.

Results

Language Model

Evaluated on a 30% validation set.

Perplexity of the language model: ~32
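
For readers unfamiliar with the metric, here is a minimal sketch of how perplexity is conventionally computed: the exponential of the mean negative log-likelihood per token. The function name and the toy input are illustrative assumptions, not code from this repository.

```python
import math


def perplexity(token_log_probs):
    """Return exp(mean negative log-likelihood) for natural-log token probabilities."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


# Toy input (hypothetical): a model that gives every token probability 1/32
# is as uncertain as a uniform choice among 32 tokens.
uniform_log_probs = [math.log(1 / 32)] * 10
print(perplexity(uniform_log_probs))
```

Read this way, a perplexity of about 32 means the model is, on average, roughly as uncertain as if it were choosing uniformly among 32 tokens at each step.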

Classifier

Accuracy of classification model: ~97%
Kappa score of classification model: ~0.96
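
The kappa score complements accuracy by correcting for agreement expected by chance. Below is a minimal sketch, assuming the score refers to Cohen's kappa (the README does not say which variant was used). The function name and the toy labels are hypothetical.

```python
from collections import Counter


def cohen_kappa(y_true, y_pred):
    """Cohen's kappa: (observed - expected agreement) / (1 - expected agreement)."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(y_true)
    observed = sum(t == p for t, p in zip(y_true, y_pred)) / n
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    # Chance agreement: probability both label sources pick the same class.
    expected = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    if expected == 1.0:
        return 1.0  # degenerate case: only one class on both sides
    return (observed - expected) / (1.0 - expected)


# Toy labels (hypothetical), not taken from the news dataset.
truth = ["sports", "sports", "politics", "politics"]
preds = ["sports", "sports", "politics", "sports"]
print(cohen_kappa(truth, preds))
```

Kappa is 1 for perfect agreement and 0 when the classifier does no better than chance, so it is a useful check when classes are imbalanced.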

Pretrained Language Model

Download the pretrained language model from here

Classifier

Download the classifier from here

Tokenizer

The tokenizer was trained with Google's SentencePiece.

Download the trained model and vocabulary from here
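
As a rough illustration of the workflow, the sketch below trains a SentencePiece model on a plain-text corpus and then uses it to tokenize text. The file names, the vocabulary size, and the helper function names are assumptions for illustration; the settings actually used in this repository are not stated here.

```python
def train_tokenizer(corpus_path, model_prefix="nepali_sp", vocab_size=8000):
    """Train a SentencePiece model; writes <model_prefix>.model and <model_prefix>.vocab."""
    import sentencepiece as spm  # requires: pip install sentencepiece

    spm.SentencePieceTrainer.train(
        input=corpus_path,          # one sentence per line, UTF-8 text
        model_prefix=model_prefix,  # hypothetical output name
        vocab_size=vocab_size,      # assumed value, not the repository's setting
        character_coverage=1.0,     # keep every character seen in the corpus
    )
    return model_prefix + ".model"


def tokenize(model_file, text):
    """Split text into subword pieces with a trained SentencePiece model."""
    import sentencepiece as spm

    processor = spm.SentencePieceProcessor(model_file=model_file)
    return processor.encode(text, out_type=str)


# Example usage (paths are placeholders):
# model_file = train_tokenizer("nepali_wiki.txt")
# print(tokenize(model_file, "नेपाल एक सुन्दर देश हो"))
```

To use the downloaded model instead of training a new one, pass the path of the downloaded .model file to tokenize.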

BijayOCT25/nlp-for-nepali
