redditNLP

What is this?

The redditNLP project implements a Telegram bot that, given the URL of a Reddit post, classifies the post into the correct or most suitable subreddit based on its text content. The dataset used to train the classifier is this. Before the classifier and the bot were developed, the dataset was preprocessed with a series of functions to reduce its dimensionality and to normalize the text of its examples. In particular, only a subset of subreddits was chosen for training, reducing the original 1000 subreddits to 300, and the text of each post was cut to 550 characters to speed up execution.
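As an illustration of the normalization and truncation described above, here is a minimal sketch. The function name `normalize_post`, the specific cleaning steps (lowercasing, URL removal, dropping non-letters) and their order are assumptions made for this example; only the 550-character limit comes from the description, and the real functions live in preprocessing.py.

```python
import re

MAX_CHARS = 550  # post length limit mentioned in the description

URL_RE = re.compile(r"https?://\S+")
NON_LETTER_RE = re.compile(r"[^a-z\s]")
WHITESPACE_RE = re.compile(r"\s+")


def normalize_post(text: str, max_chars: int = MAX_CHARS) -> str:
    """Hypothetical cleaner: lowercase, strip URLs and non-letters, then truncate."""
    text = text.lower()
    text = URL_RE.sub(" ", text)          # drop links
    text = NON_LETTER_RE.sub(" ", text)   # keep only a-z and whitespace
    text = WHITESPACE_RE.sub(" ", text).strip()
    return text[:max_chars].rstrip()      # cut to the length limit


print(normalize_post("Check THIS out: https://example.com/x?y=1 !!  It's   great."))
```

Truncating after cleaning, as done here, keeps more useful text within the limit; truncating first would be cheaper but is equally plausible.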

This GitHub project contains the version of the classifier implemented with TF-IDF and Naive Bayes, which performed best out of all the implementations. The other implementations can be found in the Google Colab notebook I used to implement and test the whole system. It also includes a brief evaluation table with the accuracy, precision and recall values for all the developed implementations.

Implemented classifiers

TF-IDF + Naive Bayes
TF-IDF + Logistic Regression
Word2Vec + Naive Bayes
Word2Vec + Logistic Regression
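To give an idea of what the TF-IDF + Naive Bayes combination looks like, here is a minimal sketch using scikit-learn. The toy posts and labels are invented for the example, and the choice of `MultinomialNB` with default parameters is an assumption; the actual training code and settings are in the Colab notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy training data, invented for illustration only.
train_texts = [
    "python list comprehension syntax error",
    "how to install python packages with pip",
    "best recipe for sourdough bread",
    "how long to bake chicken in the oven",
]
train_labels = ["learnpython", "learnpython", "cooking", "cooking"]

# TF-IDF turns each post into a weighted term vector;
# Naive Bayes then picks the most likely subreddit for that vector.
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)

predicted = model.predict(["bake sourdough bread in the oven"])[0]
print(predicted)
```

Swapping `MultinomialNB()` for `LogisticRegression()` in the same pipeline would give the second variant in the list.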

Files

This project is composed of 4 main files:

bot.py: contains all the code needed to connect to the Telegram API and to analyze the messages the user sends, after applying the cleaning functions imported from the preprocessing file. It is also responsible for replying with the predicted subreddit;
preprocessing.py: contains all the preprocessing and normalization functions needed to prepare the message for the classification step;
classification.py: loads the TF-IDF vectorizer and the Naive Bayes classifier in order to predict the subreddit. The vectorizer and the classifier were implemented and trained in the Google Colab notebook. I saved the results and uploaded them to the files directory;
reddit.py: contains the tokens for the Reddit API.
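Since the bot receives a post URL in a chat message, one step it needs is recognizing that URL. The sketch below shows one possible way to do this with a regular expression; the function name `find_reddit_post` and the pattern are assumptions for illustration and are not taken from bot.py. Fetching the post text itself would then go through PRAW, for example with `reddit.submission(url=...)`.

```python
import re

# Matches links such as https://www.reddit.com/r/<subreddit>/comments/<post_id>/...
REDDIT_POST_RE = re.compile(
    r"https?://(?:www\.|old\.)?reddit\.com"
    r"/r/(?P<subreddit>\w+)/comments/(?P<post_id>[a-z0-9]+)",
    re.IGNORECASE,
)


def find_reddit_post(message: str):
    """Return (subreddit, post_id) for the first Reddit post link, or None."""
    match = REDDIT_POST_RE.search(message)
    if match is None:
        return None
    return match.group("subreddit"), match.group("post_id")


print(find_reddit_post(
    "look at https://www.reddit.com/r/learnpython/comments/abc123/some_title/"
))
```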

How to run redditNLP

Packages needed for the bot:

python-telegram-bot
PRAW

Packages needed for the classification task:

pandas
nltk
scikit-learn (imported as sklearn)
gensim

To start the bot, run python bot.py. Remember to use your own API tokens.
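If you prefer not to write your tokens directly into the source files, one option is to read them from environment variables. This is only a sketch of that alternative: the variable names and the `load_tokens` helper are hypothetical, and the project as published keeps the Reddit tokens in reddit.py.

```python
import os

# Hypothetical variable names; pick whatever fits your setup.
REQUIRED_TOKENS = ("TELEGRAM_BOT_TOKEN", "REDDIT_CLIENT_ID", "REDDIT_CLIENT_SECRET")


def load_tokens(env=None):
    """Return the required tokens as a dict, or raise if any is missing."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_TOKENS if not env.get(name)]
    if missing:
        raise RuntimeError("Missing API tokens: " + ", ".join(missing))
    return {name: env[name] for name in REQUIRED_TOKENS}
```

Calling `load_tokens()` at the top of bot.py would then fail early, with a clear message, when a token has not been set.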

fbacci/redditNLP
