Code, datasets, and explanations for some basic natural language processing tasks and models.
This repository is a set of 5 tutorials that provide a basic working knowledge of NLP and show how to apply it to text to get results, without needing deep mathematical knowledge of how the models work. NLTK, Keras, and sklearn are the main libraries used in the tutorials. Each folder contains the datasets and a Jupyter Notebook for that tutorial. I've also written detailed Medium articles explaining the code in the Notebooks, which are linked below.
NLP Preprocessing
Explains the basic preprocessing tasks to be performed before training almost any model. Covers stemming and lemmatization and their differences.
Medium article -
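As a taste of what the tutorial covers, here is a minimal sketch of stemming with NLTK's PorterStemmer; the notebook's own code and word choices may differ.

```python
# A minimal sketch of stemming with NLTK's PorterStemmer; the tutorial's
# own notebook may use different words or stemmers.
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

for word in ["running", "studies", "better"]:
    print(word, "->", stemmer.stem(word))

# Stemming chops suffixes by rule, so "studies" becomes the non-word
# "studi"; a lemmatizer (e.g. NLTK's WordNetLemmatizer, which needs the
# wordnet corpus downloaded) would return the dictionary form "study".
```

The key difference the tutorial explains: a stemmer applies suffix-stripping rules and can produce non-words, while a lemmatizer uses a vocabulary to map each word to its dictionary form.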
Language Modeling
Building and studying statistical language models from a corpus dataset. Unigram, bigram, trigram, and four-gram models are covered.
Medium article -
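The core idea behind these statistical models can be sketched with a toy bigram model using maximum-likelihood estimates; the tutorial builds similar (and higher-order) models from a real corpus with NLTK.

```python
# A toy bigram language model: P(w2 | w1) is estimated from bigram
# counts. This is an illustrative sketch, not the tutorial's exact code.
from collections import Counter, defaultdict

corpus = "the cat sat on the mat the cat ran".split()

# Count how often each word follows each other word.
bigram_counts = defaultdict(Counter)
for w1, w2 in zip(corpus, corpus[1:]):
    bigram_counts[w1][w2] += 1

def bigram_prob(w1, w2):
    """Maximum-likelihood estimate: count(w1, w2) / count(w1, *)."""
    total = sum(bigram_counts[w1].values())
    return bigram_counts[w1][w2] / total if total else 0.0

print(bigram_prob("the", "cat"))  # "the" is followed by "cat" 2 out of 3 times
```

Trigram and four-gram models follow the same pattern, conditioning each word on the previous two or three words instead of one.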
Classifier Models
Building and comparing the accuracy of Naive Bayes and LSTM models on a given dataset using NLTK and Keras.
Medium article -
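The Naive Bayes side of this comparison can be sketched in a few lines of scikit-learn; the dataset below is a tiny made-up example, not the tutorial's actual data, and the LSTM counterpart is built separately with Keras.

```python
# A minimal bag-of-words Naive Bayes text classifier with scikit-learn.
# The texts and labels here are illustrative, not the tutorial's dataset.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

texts = ["great movie loved it", "wonderful acting great plot",
         "terrible movie hated it", "awful plot boring acting"]
labels = ["pos", "pos", "neg", "neg"]

# CountVectorizer turns each text into word counts; MultinomialNB
# learns per-class word probabilities from those counts.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(texts, labels)

print(model.predict(["loved the wonderful acting"])[0])
```

Accuracy for both models is then measured on a held-out portion of the dataset, which is how the tutorial compares the two approaches.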
Conditional Random Fields
Experimenting with POS tagging, a standard sequence labeling task, using CRFs with the help of the sklearn-crfsuite wrapper.
Medium article -
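The heart of a CRF tagger is the per-token feature extraction; the sketch below shows the typical shape of such a function (the feature names are illustrative), producing the dicts that sklearn-crfsuite consumes.

```python
# A sketch of per-token feature extraction for CRF-based POS tagging.
# Feature names are illustrative; the tutorial's notebook may differ.
def word2features(sentence, i):
    word = sentence[i]
    features = {
        "word.lower()": word.lower(),
        "word.isupper()": word.isupper(),
        "word.istitle()": word.istitle(),
        "word.isdigit()": word.isdigit(),
        "suffix3": word[-3:],  # suffixes are strong POS cues in English
    }
    if i > 0:
        features["prev_word.lower()"] = sentence[i - 1].lower()
    else:
        features["BOS"] = True  # beginning of sentence
    if i < len(sentence) - 1:
        features["next_word.lower()"] = sentence[i + 1].lower()
    else:
        features["EOS"] = True  # end of sentence
    return features

sentence = ["The", "cat", "sat"]
print(word2features(sentence, 0))
```

Unlike a per-word classifier, the CRF then scores whole tag sequences, letting neighboring tags influence each other.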
Word and Character Based LSTM Models
Building and analyzing word-based and character-based LSTM models. Two different character-based models are also trained and compared.
Medium article
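The data preparation step these models share can be sketched without Keras: the text is windowed into fixed-length input sequences, each paired with the next token to predict. The snippet below shows the character-based version on a toy string; the tutorial trains the actual Keras LSTMs on arrays like these.

```python
# Windowing text into (input sequence, next character) pairs for a
# character-based LSTM. Toy text; the tutorial uses a real corpus.
text = "hello world"
seq_len = 4

chars = sorted(set(text))
char2idx = {c: i for i, c in enumerate(chars)}

inputs, targets = [], []
for i in range(len(text) - seq_len):
    inputs.append([char2idx[c] for c in text[i:i + seq_len]])
    targets.append(char2idx[text[i + seq_len]])  # next character to predict

print(len(inputs), "sequences; first input:", inputs[0], "target:", targets[0])
```

A word-based model is prepared the same way, except the vocabulary is built over words instead of characters, which makes it much larger but gives the model longer-range context per timestep.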