Text classification project to detect suicide ideation and depression. The dataset was obtained from Kaggle and collects posts from the "SuicideWatch" and "depression" subreddits of the Reddit platform. Thanks to NIKHILESWAR KOMATI for creating the dataset.
I have used Naive Bayes classification to classify the texts into suicide and non-suicide.
I first came across probabilistic classifiers during my degree, but despite their simplicity I didn't pay much attention to them. It wasn't until I watched a StatQuest video, which explained how Naive Bayes works in such a simple way, that I felt eager to try it out.
Regarding the chosen dataset, I feel it is a really interesting topic with a huge impact nowadays. Around 4,000 people die by suicide in Spain every year, according to the Spanish Ministry of Health. Moreover, it has become the leading cause of death among people aged between 14 and 29. Loneliness, depression, bullying, fear, and stigma are the main issues that hide beneath this silent pandemic.
Being able to detect depression and suicidal thoughts could help identify people at risk and prevent suicides.
The spaCy library has been used to process the documents. To extract useful information from every document, the following techniques have been applied:
- Words with non-alphabetical characters have been removed.
- Stop words have been removed, since they are considered to provide no useful information.
- Nouns, verbs, adjectives, and adverbs have been lemmatized.
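The three steps above can be sketched in plain Python. The project itself relies on spaCy's tokenizer, stop-word list, and lemmatizer; the tiny `STOP_WORDS` set and `LEMMAS` dictionary below are illustrative stand-ins, not spaCy's actual data.

```python
# Minimal sketch of the preprocessing pipeline; spaCy does all of this
# properly. STOP_WORDS and LEMMAS here are made-up toy examples.
STOP_WORDS = {"i", "the", "a", "so", "and", "am", "these"}
LEMMAS = {"feeling": "feel", "felt": "feel", "days": "day", "worse": "bad"}

def preprocess(text: str) -> list[str]:
    tokens = text.lower().split()
    # 1. Keep only tokens made of alphabetical characters.
    tokens = [t for t in tokens if t.isalpha()]
    # 2. Drop stop words.
    tokens = [t for t in tokens if t not in STOP_WORDS]
    # 3. Map each remaining token to its lemma (identity if unknown).
    return [LEMMAS.get(t, t) for t in tokens]

print(preprocess("I am feeling worse and worse days"))
# → ['feel', 'bad', 'bad', 'day']
```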
Lemmatization is a technique to reduce inflected and derived forms of words. It is similar to stemming; however, lemmatization takes into account the context in which the word is used to extract the proper lemma. Lemmatization is considered a more accurate and refined way of obtaining the base form of a word, whereas stemming is usually faster. In this case, lemmatization was chosen to try to achieve better results.
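The difference can be shown with a toy contrast. Real stemmers (e.g. Porter) and lemmatizers (e.g. spaCy's) are far more complete; the suffix rules and lemma table below are illustrative assumptions only.

```python
# Toy contrast between stemming and lemmatization.
def crude_stem(word: str) -> str:
    # A stemmer chops suffixes without knowing the word's meaning.
    for suffix in ("ies", "ing", "ed", "s"):
        if word.endswith(suffix):
            return word[: -len(suffix)]
    return word

# A lemmatizer maps each word form to its dictionary headword,
# so irregular forms like "better" resolve correctly.
LEMMA_TABLE = {"studies": "study", "better": "good", "was": "be"}

def lemmatize(word: str) -> str:
    return LEMMA_TABLE.get(word, word)

print(crude_stem("studies"), lemmatize("studies"))  # → stud study
print(crude_stem("better"), lemmatize("better"))    # → better good
```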
Naive Bayes is a supervised classification algorithm based on Bayes' Theorem. It makes the assumption that the features used to classify an object are independent of each other given the class. The algorithm computes the probability of an object belonging to each class given the object's features, and the class with the highest probability is selected.
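A small numeric sketch of that computation: the class priors and per-word likelihoods below are made-up toy values (in practice they are estimated from the training data), and log-probabilities are summed to avoid numerical underflow.

```python
import math

# Toy numbers only: made-up priors and per-word likelihoods to show how
# Naive Bayes combines them into a class score.
prior = {"suicide": 0.5, "non-suicide": 0.5}
likelihood = {
    "suicide":     {"hopeless": 0.08, "today": 0.02},
    "non-suicide": {"hopeless": 0.01, "today": 0.05},
}

def log_score(cls: str, words: list[str]) -> float:
    # log P(class) + sum of log P(word | class), by the independence assumption.
    return math.log(prior[cls]) + sum(math.log(likelihood[cls][w]) for w in words)

words = ["hopeless", "today"]
scores = {cls: log_score(cls, words) for cls in prior}
print(max(scores, key=scores.get))  # → suicide
```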
Multinomial Naive Bayes is a variant for multinomially distributed data, where the features are discrete counts such as word frequencies. This makes it well suited to this specific case.
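As a minimal sketch of the multinomial variant, here is a from-scratch classifier with Laplace (add-one) smoothing. The two-document "corpus" is invented purely for illustration; in the project the counts come from the preprocessed Reddit posts.

```python
import math
from collections import Counter

# Made-up two-document training set, one per class.
train = [
    (["feel", "hopeless", "end", "pain"], "suicide"),
    (["great", "day", "happy", "walk"], "non-suicide"),
]

vocab = {w for doc, _ in train for w in doc}
class_docs = Counter(label for _, label in train)
word_counts = {c: Counter() for c in class_docs}
for doc, label in train:
    word_counts[label].update(doc)

def predict(doc: list[str]) -> str:
    best, best_lp = None, -math.inf
    for c in class_docs:
        total = sum(word_counts[c].values())
        lp = math.log(class_docs[c] / len(train))  # log prior
        for w in doc:
            # Laplace smoothing so unseen words don't zero out the score.
            lp += math.log((word_counts[c][w] + 1) / (total + len(vocab)))
        if lp > best_lp:
            best, best_lp = c, lp
    return best

print(predict(["hopeless", "pain"]))  # → suicide
```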
There are several ways to improve the classifier:
- Typo correction
- Hyper-parameter tuning
- Multiprocessing the text preprocessing step to reduce runtime
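As a sketch of the hyper-parameter tuning idea, the Laplace smoothing strength `alpha` of a Multinomial Naive Bayes model can be selected by scoring candidates on a held-out split. The tiny data, the candidate grid, and the inlined classifier below are illustrative assumptions, not the project's actual setup.

```python
import math
from collections import Counter

# Made-up train/validation split, one class per document; class priors
# are equal here and therefore omitted from the score.
train = [
    (["hopeless", "pain"], "suicide"),
    (["happy", "walk"], "non-suicide"),
]
valid = [
    (["pain", "pain"], "suicide"),
    (["walk", "happy"], "non-suicide"),
]
vocab = {w for doc, _ in train for w in doc}

def predict(doc: list[str], alpha: float) -> str:
    scores = {}
    for cls in {label for _, label in train}:
        counts = Counter(w for d, l in train if l == cls for w in d)
        total = sum(counts.values())
        # Smoothed log-likelihood with strength `alpha`.
        scores[cls] = sum(
            math.log((counts[w] + alpha) / (total + alpha * len(vocab)))
            for w in doc
        )
    return max(scores, key=scores.get)

def accuracy(alpha: float) -> float:
    return sum(predict(d, alpha) == l for d, l in valid) / len(valid)

# Pick the alpha that scores best on the held-out split.
best_alpha = max([0.1, 0.5, 1.0, 2.0], key=accuracy)
```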