Telegram-Topic-Classifier

Why are we doing it?

In many Telegram groups we found it necessary to classify images by subject, for example math, physics, chemistry and many others. Once we know the subject, we can automatically tag the people who are experts in that topic, helping more people in a more efficient manner.

How we are doing it

The project is composed of 3 main parts:

1) Collect the necessary data
2) Preprocess the data and write a model that can predict the classes
3) Write the Telegram bot that automatically classifies images and then, depending on the subject, tags all the experts

I will now briefly explain all the parts.

1)

For the dataset we are using all the images sent over more than 2 years in the Telegram group Best Group ever, which amounts to more than 2000 images. We used the Telegram user APIs to fetch all of this data. We then performed OCR on all the images and extracted the text using Google OCR for Python.

The hard task was labeling the whole dataset. For that we leveraged the large number of people in our group network. We wrote a simple but effective Telegram bot that sends an image and then asks the user to press a button corresponding to the subject of that particular image. We used a kind of mutex lock/unlock so that everyone received different images and the number of images labeled multiple times was minimized. You can learn more by looking at the code.
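The lock/unlock idea for handing out images can be sketched roughly as follows. This is a minimal in-memory illustration, not the code in bot-trainer.py: the class `LabelQueue`, its methods, and the image and user ids are all hypothetical names chosen for the example.

```python
import threading


class LabelQueue:
    """Hands out unlabeled images so two labelers do not get the same one."""

    def __init__(self, image_ids):
        self._lock = threading.Lock()
        self._pending = list(image_ids)  # images not yet handed out
        self._assigned = {}              # image_id -> user_id currently labeling it
        self.labels = {}                 # image_id -> subject chosen by a labeler

    def next_image(self, user_id):
        """Reserve the next free image for user_id, or return None if none is left."""
        with self._lock:
            if not self._pending:
                return None
            image_id = self._pending.pop(0)
            self._assigned[image_id] = user_id
            return image_id

    def submit(self, user_id, image_id, subject):
        """Store the label only if this user holds the reservation."""
        with self._lock:
            if self._assigned.get(image_id) != user_id:
                return False
            del self._assigned[image_id]
            self.labels[image_id] = subject
            return True

    def release(self, user_id, image_id):
        """Put a reserved image back in the queue, e.g. when a labeler skips it."""
        with self._lock:
            if self._assigned.get(image_id) == user_id:
                del self._assigned[image_id]
                self._pending.append(image_id)


queue = LabelQueue(["img1", "img2", "img3"])
first = queue.next_image(user_id=101)   # reserved for user 101
second = queue.next_image(user_id=202)  # a different image for user 202
queue.submit(101, first, "math")
```

In a real bot the reservations would probably live in the database rather than in memory, so that they survive a restart.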

2)

Preprocessing: We exported the DB as a pickle object so we could work with it easily. First we performed some exploratory data analysis to learn more about the distribution of the dataset. We discarded 5 of the 8 classes because we didn't have enough data for them. We preprocessed the data by removing all punctuation and stopwords, and then we tokenized the sentences.

Model: We encoded the classes using label encoding; another possible approach is one-hot vector encoding. The sentences are transformed into numbers using the bag-of-words technique, but it is also possible, and maybe recommended, to use word2vec. We tried different models such as neural networks, logistic regression, support vector machines and Naive Bayes. The best result for now is 86% accuracy, but there is room for improvement.
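The steps above (punctuation and stopword removal, tokenization, label encoding, bag-of-words, and one of the models tried) can be illustrated with a small standard-library sketch. It is not the code from SubjectClassifier.ipynb: the stopword list, the toy sentences, and the hand-written multinomial Naive Bayes are assumptions made for the example, and a library such as scikit-learn would be the more usual choice in practice.

```python
import math
import string
from collections import Counter

# Tiny illustrative stopword list (assumption; a real list is much longer).
STOPWORDS = {"the", "a", "an", "of", "is", "in", "and", "to", "for"}


def preprocess(text):
    """Lowercase, strip punctuation, split on whitespace, drop stopwords."""
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    return [tok for tok in cleaned.split() if tok not in STOPWORDS]


def encode_labels(labels):
    """Label encoding: map each class name to an integer id."""
    classes = sorted(set(labels))
    mapping = {name: idx for idx, name in enumerate(classes)}
    return [mapping[name] for name in labels], classes


def bag_of_words(tokens, vocab):
    """Count vector over a fixed vocabulary; unknown words are ignored."""
    counts = Counter(tokens)
    return [counts[word] for word in vocab]


def train_naive_bayes(vectors, y, n_classes):
    """Multinomial Naive Bayes with add-one smoothing."""
    log_prior, log_like = [], []
    for c in range(n_classes):
        rows = [v for v, label in zip(vectors, y) if label == c]
        log_prior.append(math.log(len(rows) / len(vectors)))
        totals = [sum(col) + 1 for col in zip(*rows)]  # smoothed word counts
        denom = sum(totals)
        log_like.append([math.log(t / denom) for t in totals])
    return log_prior, log_like


def predict(vector, log_prior, log_like):
    """Return the id of the class with the highest log posterior."""
    scores = [
        prior + sum(x * ll for x, ll in zip(vector, likes))
        for prior, likes in zip(log_prior, log_like)
    ]
    return scores.index(max(scores))


# Toy training data standing in for the OCR text of labeled images.
texts = [
    "Solve the integral of x squared",
    "Compute the derivative of the function",
    "Force is mass times acceleration",
    "The velocity of a falling mass",
    "Balance the chemical reaction",
    "Moles of acid in the solution",
]
labels = ["math", "math", "physics", "physics", "chemistry", "chemistry"]

docs = [preprocess(t) for t in texts]
vocab = sorted({tok for doc in docs for tok in doc})
vectors = [bag_of_words(doc, vocab) for doc in docs]
y, classes = encode_labels(labels)
log_prior, log_like = train_naive_bayes(vectors, y, len(classes))

query = bag_of_words(preprocess("Derivative of the integral"), vocab)
predicted = classes[predict(query, log_prior, log_like)]
```

With label encoding each subject becomes a single integer, which is enough for the classifiers listed above; one-hot encoding would instead give one column per subject.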

3)

We modified a version of Emanuele's beautiful poke bot so that the bot performs the actions in the following order. When a photo is received, it extracts the text using Google OCR, we then perform inference using the preferred model, and finally the bot tags all of the scientists corresponding to that subject.
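The order of operations described above could look roughly like this. It is only a sketch: `handle_photo`, the `EXPERTS` mapping and the usernames are hypothetical, and the OCR step and the trained model are passed in as plain functions so that no specific Telegram or OCR API is assumed.

```python
# Hypothetical mapping from subject to the usernames of its experts.
EXPERTS = {
    "math": ["@alice", "@bob"],
    "physics": ["@carol"],
}


def handle_photo(photo_bytes, ocr, classify, experts=EXPERTS):
    """Run OCR, then classification, then tagging; return the reply text or None."""
    text = ocr(photo_bytes)
    if not text.strip():
        return None  # nothing readable in the image, so the bot stays silent
    subject = classify(text)
    tagged = experts.get(subject, [])
    if not tagged:
        return None  # no expert registered for this subject
    return f"{subject}: " + " ".join(tagged)


# Stand-ins for the real OCR call and the trained model.
def fake_ocr(photo_bytes):
    return "integral of x squared"


def fake_classify(text):
    return "math" if "integral" in text else "physics"


reply = handle_photo(b"raw image bytes", fake_ocr, fake_classify)
```

Keeping the pipeline independent of the bot framework makes it easy to swap in a different model or OCR backend later.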

If you have any suggestions or questions, feel free to write to me! <3
