A web application that predicts the flair (tag) of any post on the India subreddit using machine learning algorithms.
- Download Git Large File Storage (LFS) from https://git-lfs.github.com/ if you don't have it already.
- Open the Terminal.
- Use `git lfs install` to set up Git LFS for your user account.
- Clone the repository by typing `git clone https://github.com/Gunnika/Reddit-Flair-Detector.git`.
- Ensure that Python3 and pip are installed on the device.
- Change to the cloned directory by entering `cd Reddit-Flair-Detector`.
- Run `pip install -r requirements.txt`.
- Enter the Python shell and `import nltk`.
- Execute `nltk.download('stopwords')` and `nltk.download('punkt')`. Exit the shell.
- Run `python app.py` to start the application on a local host.
- Go to http://0.0.0.0:5000/ on the web browser to use the application.
- Data Acquisition
- Exploratory Data Analysis
- Data Pre-Processing
- Building a Flair Detector
- Building a Flask Application
- Deploying as a Web Service
- Automated Testing
The whole process is nicely explained with code in this Jupyter Notebook.
PRAW (the Python Reddit API Wrapper) was used for extracting data. A number of Reddit datasets are also available on BigQuery and Kaggle, but to create my own dataset instead of relying on those readily available alternatives, I went ahead with PRAW.
The following attributes were found to be the most indicative of a post's flair:
- title
- url
- text
- comments
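As a rough sketch of how these four attributes might be read from a post, the helper below works on any object that behaves like a PRAW `Submission`. The function name `extract_features` and the `max_comments` cap are illustrative assumptions, not the project's actual code; `title`, `url`, `selftext`, and `comments` are the attribute names PRAW exposes on a submission.

```python
def extract_features(submission, max_comments=10):
    """Collect the four fields listed above from a PRAW-style submission.

    `max_comments` is a hypothetical cap chosen only for this illustration.
    """
    # Remove "load more comments" placeholders so only real comments remain.
    submission.comments.replace_more(limit=0)
    # Flatten the comment tree and keep the first few comment bodies.
    bodies = [c.body for c in submission.comments.list()[:max_comments]]
    return {
        "title": submission.title,
        "url": submission.url,
        "text": submission.selftext,
        "comments": " ".join(bodies),
    }
```

Given an authenticated `praw.Reddit` instance named `reddit`, this could be called as `extract_features(reddit.submission(url=post_url))`.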
Initial investigation of the data included analysing its distribution across classes, which turned out to be imbalanced. The [R]eddiquette class had far less data than the other classes, which can result in the minority class being treated as an outlier and ignored.
The reason for this imbalance was found to be the discontinuation of the [R]eddiquette flair 7 months ago. The class was therefore dropped from the dataset.
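A check like this can be sketched in a few lines. The helper names `class_distribution` and `drop_class` are hypothetical, and the rows below are made-up toy data (the "Politics" and "Sports" labels are invented for the example), not figures from the project's dataset.

```python
from collections import Counter

def class_distribution(flairs):
    """Return (flair, share of dataset) pairs, most frequent first."""
    counts = Counter(flairs)
    total = sum(counts.values())
    return [(flair, n / total) for flair, n in counts.most_common()]

def drop_class(rows, flair_to_drop):
    """Return the (text, flair) rows that do not carry `flair_to_drop`."""
    return [(text, flair) for text, flair in rows if flair != flair_to_drop]

# Toy data for illustration only.
rows = [("a", "Politics")] * 6 + [("b", "Sports")] * 3 + [("c", "[R]eddiquette")]
shares = class_distribution(flair for _, flair in rows)
balanced_rows = drop_class(rows, "[R]eddiquette")
```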
The data pre-processing step involved cleaning the data for better representation and usability. In this step:
- stop words were removed
- text was tokenized into words
- words were converted to lowercase
- the remaining useful words were concatenated into a sentence
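The steps above can be sketched without any dependencies as follows. This is a simplified stand-in, not the project's code: the regex tokenizer and the tiny `STOP_WORDS` set are assumptions made so the example runs on its own, whereas the project relies on NLTK's `stopwords` and `punkt` resources.

```python
import re

# A tiny illustrative stop-word list; NLTK's English list is much longer.
STOP_WORDS = {"a", "an", "the", "is", "are", "in", "on", "of", "to", "and"}

def clean_text(text, stop_words=STOP_WORDS):
    """Lowercase, tokenize, drop stop words, and rejoin into one string."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    kept = [token for token in tokens if token not in stop_words]
    return " ".join(kept)

cleaned = clean_text("The Monsoon is late in Mumbai!")
```

With NLTK installed, `nltk.corpus.stopwords.words('english')` and `nltk.tokenize.word_tokenize` could be swapped in for the stand-ins.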
Different models analysed:
- Logistic Regression
- Linear Support Vector Machine
- Naive Bayes Classifier
- Decision Trees
- Random Forest
The best results were obtained using Random Forest (62.67% accuracy). To improve the accuracy further, deep learning techniques could be incorporated. For example, BERT (Bidirectional Encoder Representations from Transformers) could be used to generate text embeddings, which may yield better accuracy.
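A minimal sketch of a Random Forest text classifier with scikit-learn is shown here for orientation. The TF-IDF features, the hyperparameters, the function name `build_flair_model`, and the four toy posts are all assumptions made for the example; the notebook's actual feature pipeline and settings may differ.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline

def build_flair_model(random_state=0):
    """Build an untrained text-to-flair pipeline (assumed design)."""
    return Pipeline([
        ("tfidf", TfidfVectorizer()),  # turn cleaned text into numeric features
        ("forest", RandomForestClassifier(n_estimators=100, random_state=random_state)),
    ])

# Toy training data for illustration only.
texts = [
    "election results announced today",
    "parliament passes new bill",
    "cricket team wins series",
    "football league final tonight",
]
labels = ["Politics", "Politics", "Sports", "Sports"]

model = build_flair_model().fit(texts, labels)
predictions = model.predict(["new bill debated in parliament"])
```

Wrapping the vectorizer and classifier in one `Pipeline` keeps the text transformation and the model together, so the same object can be saved and loaded by a web application.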
A Flask application was developed and the trained model was integrated into it. An automated_testing endpoint was added so that predictions can be retrieved automatically by providing a text file of URLs.
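An endpoint of this kind might look roughly like the sketch below. It assumes the uploaded file holds one URL per line, and `predict_flair` is a hypothetical placeholder that stands in for fetching the post and running the trained model; the repository's `app.py` may be organised differently.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

def predict_flair(url):
    """Hypothetical placeholder for the trained model; returns a fixed label."""
    return "Unknown"

@app.route("/automated_testing", methods=["POST"])
def automated_testing():
    uploaded = request.files.get("upload_file")
    if uploaded is None:
        return jsonify({"error": "upload_file is required"}), 400
    # Assumed file format: one URL per line, blank lines ignored.
    lines = uploaded.read().decode("utf-8").splitlines()
    urls = [line.strip() for line in lines if line.strip()]
    return jsonify({url: predict_flair(url) for url in urls})
```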
The application was then deployed to Heroku.
A POST request with the key upload_file and a text file of URLs as the value can be sent to https://redditflair-detector.herokuapp.com/automated_testing.
It returns a JSON object with each URL as a key and its predicted flair as the value.
Please note that due to the limitations of PRAW, only around 50 URLs can be processed at a time; otherwise Heroku may return a timeout error.
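On the client side, a longer list of URLs could be split into batches of 50 before being sent. The sketch below is an illustration under assumptions: the helper names `chunk_urls` and `send_batch`, the one-URL-per-line file format, the `urls.txt` file name, and the 60-second timeout are not taken from the project.

```python
def chunk_urls(urls, size=50):
    """Split a list of URLs into batches of at most `size` items."""
    return [urls[i:i + size] for i in range(0, len(urls), size)]

def send_batch(batch, endpoint="https://redditflair-detector.herokuapp.com/automated_testing"):
    """POST one batch as a text file under the upload_file key and return the JSON reply."""
    import requests  # third-party; imported here so chunk_urls works without it

    payload = "\n".join(batch).encode("utf-8")  # assumed format: one URL per line
    response = requests.post(
        endpoint,
        files={"upload_file": ("urls.txt", payload)},
        timeout=60,  # assumed value
    )
    response.raise_for_status()
    return response.json()

batches = chunk_urls([f"https://example.com/post/{i}" for i in range(120)])
```

Each batch can then be passed to `send_batch` in turn and the returned dictionaries merged.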