This repository contains an R implementation of a Naive Bayes classifier for detecting SMS spam messages. The classifier uses text preprocessing techniques and a Document-Term Matrix (DTM) to train and evaluate the model. The dataset used is the publicly available SMS Spam Collection dataset from the UCI Machine Learning Repository.
The goal of this project is to create a machine learning model that classifies SMS messages as either "ham" (non-spam) or "spam." The implementation uses the Naive Bayes algorithm, which is well-suited for text classification tasks.
The dataset used for this project is the SMS Spam Collection, which can be downloaded from the UCI Machine Learning Repository. It contains 5,574 labeled SMS messages with the following columns:
- Label: `ham` for non-spam messages and `spam` for spam messages.
- Message: The SMS text message.
Save the dataset as `data/sms_spam.txt` in the repository folder.
- R (version 4.0 or later)
- RStudio (optional but recommended)
Install the following R libraries:
```r
install.packages(c("dplyr", "readr", "caret", "e1071", "tm"))
```
- Clone this repository:

  ```
  git clone https://github.com/alexanderk001/sms-spam-filter.git
  cd sms-spam-filter
  ```

- Place the dataset in the `data/` directory as `sms_spam.txt`.
- Open the R script in RStudio and run it step by step.
- Results, including accuracy and confusion matrix, will be printed in the console.
The dataset is loaded using the `readr` package, and the labels (`ham`/`spam`) are converted into factors for modeling.
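As a rough sketch, this step could look as follows. The tab-separated, header-less layout matches the UCI distribution of the file, and the column names `label` and `message` are illustrative rather than the script's exact identifiers:

```r
library(readr)
library(dplyr)

# Read the tab-separated file; the UCI version ships without a header row
sms_raw <- read_tsv("data/sms_spam.txt",
                    col_names = c("label", "message"),
                    col_types = "cc")

# Turn the label into a factor so the modeling functions treat it as a class
sms_raw <- sms_raw %>%
  mutate(label = factor(label, levels = c("ham", "spam")))

table(sms_raw$label)
```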
The text data is cleaned using the following steps:
- Convert to lowercase.
- Remove punctuation and numbers.
- Remove stop words.
- Strip extra whitespace.
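With the `tm` package, these steps can be applied roughly like this (continuing from the loading sketch above; object names are illustrative):

```r
library(tm)

# Build a corpus from the raw message texts
corpus <- VCorpus(VectorSource(sms_raw$message))

# Apply the cleaning steps in the order listed above
corpus <- tm_map(corpus, content_transformer(tolower))       # lowercase
corpus <- tm_map(corpus, removePunctuation)                  # drop punctuation
corpus <- tm_map(corpus, removeNumbers)                      # drop digits
corpus <- tm_map(corpus, removeWords, stopwords("english"))  # drop stop words
corpus <- tm_map(corpus, stripWhitespace)                    # collapse extra spaces
```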
The cleaned messages are converted into a DTM, which represents the frequency of terms in each document (message). Sparsity is reduced by keeping only frequently used terms.
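A possible version of this step, continuing the sketch; the 0.99 sparsity threshold (keep terms appearing in at least roughly 1% of messages) is an illustrative value, not necessarily the one used in the script:

```r
# Build the document-term matrix from the cleaned corpus
dtm <- DocumentTermMatrix(corpus)

# Drop very rare terms to reduce sparsity (threshold is an assumption)
dtm <- removeSparseTerms(dtm, 0.99)
```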
The DTM is converted to binary values:
- `1` if a term is present.
- `0` if a term is absent.
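One way to express this conversion (a sketch; since `naiveBayes()` from `e1071` treats numeric columns as Gaussian, the 0/1 indicators are also turned into factors here):

```r
# Recode term frequencies as binary indicators: 1 if the term occurs, 0 otherwise
sms_binary <- as.matrix(dtm)
sms_binary[sms_binary > 0] <- 1

# Convert each 0/1 column to a factor so Naive Bayes estimates
# categorical (rather than Gaussian) conditional probabilities
sms_df <- as.data.frame(sms_binary)
sms_df[] <- lapply(sms_df, factor, levels = c(0, 1))
```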
The dataset is split into training (80%) and testing (20%) sets. A Naive Bayes classifier is trained on the training set, and predictions are made on the test set.
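A sketch of the split and training step, assuming the objects from the previous snippets; the seed and variable names are illustrative:

```r
library(caret)
library(e1071)

set.seed(123)  # illustrative seed, not necessarily the one used in the script

# Stratified 80/20 split on the class label
train_idx <- createDataPartition(sms_raw$label, p = 0.8, list = FALSE)

train_x <- sms_df[train_idx, ]
test_x  <- sms_df[-train_idx, ]
train_y <- sms_raw$label[train_idx]
test_y  <- sms_raw$label[-train_idx]

# Train the Naive Bayes classifier and predict on the held-out messages
nb_model    <- naiveBayes(train_x, train_y)
predictions <- predict(nb_model, test_x)
```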
The performance of the classifier is evaluated using:
- Confusion Matrix: Shows the number of true positives, true negatives, false positives, and false negatives.
- Accuracy: Proportion of correctly classified messages.
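With `caret`, both metrics come from a single call (a sketch, treating `ham` as the positive class, which matches the sensitivity and specificity figures below):

```r
# Confusion matrix, accuracy, sensitivity, and specificity in one report
conf_mat <- confusionMatrix(predictions, test_y, positive = "ham")
print(conf_mat)
```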
Example output:
```
Confusion Matrix and Statistics

          Reference
Prediction  ham spam
      ham   963   15
      spam    2  134

Accuracy : 98.47%
```
Key Metrics:
- Accuracy: 98.47%
- Sensitivity: 99.79%
- Specificity: 89.93%