RiddhiRex/SpamMails_Filter

SpamFilter

Introduction:

A spam filter is a program that identifies unsolicited and unwanted emails and keeps them out of the user's inbox. It decides whether a mail is spam or ham (legitimate) based on patterns it has learnt beforehand. Here, the model learns to tell spam from ham from features extracted from the mails, which are passed to a Naive Bayes model.

Tools:

• Python 2.7
• NumPy
• Pandas
• NLTK

Feature Extraction:

Feature extraction is an important step that converts the datasets, which are in text format, into a format supported by machine learning algorithms. Here we parse the dataset to get each mail's email id, its ham/spam label, and its contents. The contents are then processed to find the total count of words and the counts of spam words and ham words.
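The counting step can be sketched as follows. This is a minimal illustration, not the repository's code: the helper names (`tokenize`, `build_vocabularies`, `extract_features`) are hypothetical, and it assumes that a "spam word" is a word seen more often in spam training mails than in ham training mails (and the reverse for a "ham word"), which the text above does not spell out.

```python
from collections import Counter

def tokenize(text):
    """Lowercase, split on whitespace, and strip surrounding punctuation."""
    words = (w.strip(".,!?;:\"'()") for w in text.lower().split())
    return [w for w in words if w]

def build_vocabularies(training_mails):
    """Count word occurrences per class.

    training_mails: iterable of (label, text) pairs, label is 'spam' or 'ham'.
    """
    counts = {"spam": Counter(), "ham": Counter()}
    for label, text in training_mails:
        counts[label].update(tokenize(text))
    return counts

def extract_features(text, counts):
    """Return the three counts described above for one mail.

    Assumption: a word counts as a spam word if it occurred more often
    in spam training mails than in ham training mails, and vice versa.
    """
    words = tokenize(text)
    spam, ham = counts["spam"], counts["ham"]
    return {
        "total_words": len(words),
        "spam_words": sum(1 for w in words if spam[w] > ham[w]),
        "ham_words": sum(1 for w in words if ham[w] > spam[w]),
    }

# Tiny made-up training set, for illustration only.
training = [("spam", "Earn millions now"), ("ham", "Meeting notes attached")]
counts = build_vocabularies(training)
features = extract_features("Earn millions at the meeting", counts)
```

Words that never appeared in training (here "at" and "the") add to the total count only.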

Pre-processing:

The contents of each mail are scanned, and stop words are excluded from the calculation using NLTK’s stop words corpus. NLTK part-of-speech (POS) tagging is also applied to omit words such as ‘in’, ‘to’, ‘for’, and ‘the’: words tagged 'PRP', 'IN', 'DT', 'WDT', 'WP', 'WRB', 'TO', 'MD', or 'EX' are not considered.
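A minimal sketch of this filter is shown below. To keep it runnable without downloading NLTK data, it takes already-tagged tokens and a stop-word set as inputs; in practice these would presumably come from `nltk.pos_tag(...)` and `nltk.corpus.stopwords.words("english")`. The function name `filter_tokens` and the sample data are hypothetical.

```python
# POS tags listed above as excluded from the calculation.
EXCLUDED_POS = {"PRP", "IN", "DT", "WDT", "WP", "WRB", "TO", "MD", "EX"}

def filter_tokens(tagged_tokens, stop_words):
    """Keep words that are neither stop words nor tagged with an excluded POS.

    tagged_tokens: list of (word, tag) pairs, e.g. the output of nltk.pos_tag.
    stop_words: set of lowercase words, e.g. built from NLTK's stop words corpus.
    """
    return [
        word
        for word, tag in tagged_tokens
        if word.lower() not in stop_words and tag not in EXCLUDED_POS
    ]

# Hand-written sample standing in for real NLTK output (an assumption).
sample_tagged = [
    ("Claim", "VB"), ("the", "DT"), ("prize", "NN"), ("in", "IN"),
    ("minutes", "NNS"), ("you", "PRP"), ("could", "MD"),
]
sample_stop_words = {"the", "in", "you"}

kept = filter_tokens(sample_tagged, sample_stop_words)
```

The two checks overlap on purpose: "could" is not in the sample stop-word set but is still dropped because of its 'MD' tag.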

Classifier:

Naive Bayes models are a commonly used technique for spam filtering and belong to the supervised learning methods. Certain words have a higher probability of occurring in spam email than in legitimate email. For example, phrases such as “Limited time”, “Earn Millions”, “Limited Offer”, and “Deal” are seen more frequently in spam email than in other email. This knowledge is applied to classify whether an email is spam or not. The Multinomial Naive Bayes model is based on Bayes' theorem, which computes the probability of an event from prior knowledge that might be relevant to the event. So we keep track of the words occurring in all the mails, in ham mails, and in spam mails, and we calculate the probability of each word occurring in spam and in ham mail. This learning is done on the contents of the training file. The knowledge is then applied to the mails in the test files to predict the target variable (spam/ham).
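The idea can be sketched in plain Python as below. This is an illustrative multinomial Naive Bayes, not the repository's implementation: the function names and the toy training mails are hypothetical, add-one (Laplace) smoothing is an assumption the text does not mention, and words never seen in training are simply skipped.

```python
import math
from collections import Counter

def train(mails):
    """Count mails per class and word occurrences per class.

    mails: list of (label, words) pairs, where words is a list of tokens.
    """
    doc_counts = Counter()
    word_counts = {}
    vocab = set()
    for label, words in mails:
        doc_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(words)
        vocab.update(words)
    return doc_counts, word_counts, vocab

def predict(words, model):
    """Return the label with the highest log posterior score."""
    doc_counts, word_counts, vocab = model
    total_docs = float(sum(doc_counts.values()))
    best_label, best_score = None, float("-inf")
    for label in sorted(doc_counts):
        # Log prior: fraction of training mails with this label.
        score = math.log(doc_counts[label] / total_docs)
        class_total = sum(word_counts[label].values())
        for w in words:
            if w not in vocab:
                continue  # assumption: ignore words unseen in training
            # Add-one smoothing (an assumption) avoids log(0).
            likelihood = (word_counts[label][w] + 1.0) / (class_total + len(vocab))
            score += math.log(likelihood)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Toy training data, for illustration only.
training = [
    ("spam", ["earn", "millions", "deal"]),
    ("spam", ["limited", "offer", "deal"]),
    ("ham", ["meeting", "notes"]),
    ("ham", ["project", "deal", "notes"]),
]
model = train(training)
label_a = predict(["limited", "deal"], model)
label_b = predict(["meeting", "notes"], model)
```

Working in log space keeps the product of many small word probabilities from underflowing on long mails.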

Metrics:

The metrics used to evaluate the performance of the model are listed below:

• Accuracy
• Precision
• Recall
• Root mean squared error
• Mean absolute error

Accuracy:

• Accuracy score: 72.472472472472475
• R^2: -0.13046251337338477
• Mean squared error: 0.27527527527527529
• Root mean squared error: 0.524666823113
• Mean absolute error: 0.27527527527527529
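For reference, these metrics can be computed from the true and predicted labels as sketched below, assuming spam is encoded as 1 and ham as 0 (the encoding is an assumption; the `evaluate` helper and the sample labels are hypothetical). With 0/1 labels every error is exactly 1 in magnitude, so the mean squared error and the mean absolute error both equal 1 minus the accuracy, and the root mean squared error is the square root of that value. The figures reported above are consistent with this: 1 - 0.72472... = 0.27527..., whose square root is about 0.52467.

```python
import math

def evaluate(y_true, y_pred):
    """Compute the listed metrics for 0/1 labels (assumed: 1 = spam, 0 = ham)."""
    n = float(len(y_true))
    errors = [t - p for t, p in zip(y_true, y_pred)]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    mse = sum(e * e for e in errors) / n
    return {
        "accuracy": sum(1 for e in errors if e == 0) / n,
        # Guard against division by zero when no spam is predicted or present.
        "precision": tp / float(tp + fp) if tp + fp else 0.0,
        "recall": tp / float(tp + fn) if tp + fn else 0.0,
        "mse": mse,
        "rmse": math.sqrt(mse),
        "mae": sum(abs(e) for e in errors) / n,
    }

# Made-up labels, for illustration only.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
metrics = evaluate(y_true, y_pred)
```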

About

Detect if a mail is Spam or not
