SMS Spam Detection Analysis using Python (NLP) | Machine Learning Project Tutorial

Hackers Realm
Apr 30, 2022

Updated: Jun 2, 2023

Combat SMS spam using Python! This tutorial covers NLP techniques and machine learning algorithms for spam detection. You will learn to preprocess text data, extract features from it, and build models that distinguish legitimate messages from spam.


In this project tutorial, we analyze the text messages in the dataset and classify them as spam or ham using classification models wrapped in scikit-learn pipelines.


Dataset Information

The concept of "spam" is diverse: advertisements for products and websites, get-rich-quick schemes, chain letters, pornography, etc.

The SMS Spam Collection is a set of tagged SMS messages collected for SMS spam research. It contains 5,574 SMS messages in English, each tagged as either ham (legitimate) or spam.

Attributes

SMS Messages
Label (spam/ham)

Download the dataset here

Import modules

import pandas as pd
import numpy as np
import nltk
import re
from nltk.corpus import stopwords

pandas - used to perform data manipulation and analysis
numpy - used to perform a wide variety of mathematical operations on arrays
nltk - the Natural Language Toolkit, a natural language processing library included with Anaconda
re - Python's regular expression module, used to find particular patterns in text and process them
stopwords - NLTK's stop word corpus, used to remove stop words from the text data
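
The stop word corpus is distributed separately from the nltk package, so it has to be downloaded once before stopwords.words() can be called. A minimal sketch of that step is shown here; the helper name ensure_stopwords is hypothetical and not part of the original tutorial.

```python
import nltk


def ensure_stopwords():
    """Download the NLTK stop word corpus only if it is not installed yet."""
    try:
        # Raises LookupError when the corpus cannot be found locally.
        nltk.data.find("corpora/stopwords")
    except LookupError:
        nltk.download("stopwords", quiet=True)
```

Calling ensure_stopwords() once, before building the STOPWORDS set, should be enough on a fresh environment.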

Loading the dataset

Now we load the dataset for preprocessing

df = pd.read_csv('spam.csv')
df.head()

The relevant columns are v1 (the label) and v2 (the message text)
The other columns are null and unnecessary for processing
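
Some copies of spam.csv are not UTF-8 encoded, in which case pd.read_csv('spam.csv') stops with a UnicodeDecodeError. The sketch below is one possible way to handle that, assuming the file is Latin-1 encoded when UTF-8 fails; the helper name load_spam_csv and the demo rows are made up for illustration.

```python
import os
import tempfile

import pandas as pd


def load_spam_csv(path):
    """Read only the v1/v2 columns, trying UTF-8 first and Latin-1 second."""
    try:
        return pd.read_csv(path, encoding="utf-8", usecols=["v1", "v2"])
    except UnicodeDecodeError:
        # Assumption: a file that is not valid UTF-8 is Latin-1 encoded.
        return pd.read_csv(path, encoding="latin-1", usecols=["v1", "v2"])


# Demo on a tiny Latin-1 encoded file (made-up rows, not the real dataset).
raw = "v1,v2,extra\nham,Lunch costs £5,\nspam,Win cash now,\n".encode("latin-1")
with tempfile.NamedTemporaryFile(suffix=".csv", delete=False) as tmp:
    tmp.write(raw)
demo = load_spam_csv(tmp.name)
os.remove(tmp.name)
print(demo)
```

Passing usecols also leaves out the unused columns at load time, so they never need to be dropped afterwards.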

Let us extract the relevant data for preprocessing

# get necessary columns for processing
df = df[['v2', 'v1']]
# df.rename(columns={'v2': 'messages', 'v1': 'label'}, inplace=True)
df = df.rename(columns={'v2': 'messages', 'v1': 'label'})
df.head()

Columns are renamed so their names are more meaningful in the code
Two ways to rename the columns are listed (one is commented out); either one works
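
As a quick check that the two renaming styles give the same result, here is a small sketch on a made-up two-row frame (the rows are illustrative, not taken from the dataset).

```python
import pandas as pd

raw = pd.DataFrame({"v1": ["ham", "spam"],
                    "v2": ["See you at 5", "Win cash now"]})

# Option 1: rename returns a new DataFrame, which is assigned back.
a = raw[["v2", "v1"]].rename(columns={"v2": "messages", "v1": "label"})

# Option 2: rename modifies the DataFrame in place and returns None.
b = raw[["v2", "v1"]].copy()
b.rename(columns={"v2": "messages", "v1": "label"}, inplace=True)

print(list(a.columns))  # ['messages', 'label']
print(a.equals(b))      # True
```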

Preprocessing the dataset

# check for null values
df.isnull().sum()

messages    0
label       0
dtype: int64

Checks and displays the number of null values in each of the two columns
If there are null values, you must filter those rows out before further processing
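
This dataset has no null values, but if it did, the rows could be filtered out along these lines. This is a sketch on a made-up frame with one missing message, not output from the real data.

```python
import pandas as pd

toy = pd.DataFrame({"messages": ["Free entry now", None, "See you at 5"],
                    "label": ["spam", "ham", "ham"]})

# Count the missing values per column.
null_counts = toy.isnull().sum()

# Drop rows where either column is missing, then renumber the index.
cleaned = toy.dropna(subset=["messages", "label"]).reset_index(drop=True)

print(null_counts)
print(cleaned)
```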

STOPWORDS = set(stopwords.words('english'))

def clean_text(text):
    # convert to lowercase
    text = text.lower()
    # remove special characters
    text = re.sub(r'[^0-9a-zA-Z]', ' ', text)
    # remove extra spaces
    text = re.sub(r'\s+', ' ', text)
    # remove stopwords
    text = " ".join(word for word in text.split() if word not in STOPWORDS)
    return text

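The cleaning steps are wrapped in a function so they can be reused, and extended if further cleaning is needed, without repeating them line by line
set(stopwords.words('...')) - loads the list of common stop words for the specified language and stores it as a set of unique words
Stop words are very common words (such as 'the', 'is', 'and') that carry little meaning on their own, so removing them usually has little effect on the results
Text is converted to lower case so that the same word in different cases is not treated as different words
Special characters are replaced with spaces, and runs of whitespace are collapsed into a single space
Stop words are removed by splitting the text into words and dropping those found in the STOPWORDS set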
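
To see what each step does, here is the same function applied to a made-up message. So that the sketch runs without NLTK, it uses a tiny hypothetical stop word set (DEMO_STOPWORDS) in place of the full English list.

```python
import re

# Hypothetical mini stop word set standing in for NLTK's English list.
DEMO_STOPWORDS = {"you", "have", "a", "to", "now"}


def clean_text(text, stop_words=DEMO_STOPWORDS):
    # convert to lowercase
    text = text.lower()
    # replace every character that is not a letter or digit with a space
    text = re.sub(r'[^0-9a-zA-Z]', ' ', text)
    # collapse runs of whitespace into a single space
    text = re.sub(r'\s+', ' ', text)
    # drop stop words
    return " ".join(word for word in text.split() if word not in stop_words)


sample = "WINNER!! You have won a £1,000 prize. Call 09061701461 to claim NOW!"
print(clean_text(sample))
# winner won 1 000 prize call 09061701461 claim
```

Note that "£1,000" ends up as the two tokens "1" and "000", because the comma is treated as a special character like any other.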

Now let us clean the text messages

# clean the messages
df['clean_text'] = df['messages'].apply(clean_text)
df.head()

New column created to visualize the results from the text cleaning

Input Split

Let us separate the input text from the target labels

X = df['clean_text']
y = df['label']

X - input attribute
y - output attribute

Model Training

from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import classification_report
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer, TfidfTransformer

def classify(model, X, y):
    # train test split
    x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42, shuffle=True, stratify=y)
    # model training
    pipeline_model = Pipeline([('vect', CountVectorizer()),
                               ('tfidf',TfidfTransformer()),
                               ('clf', model)])
    pipeline_model.fit(x_train, y_train)
    
    print('Accuracy:', pipeline_model.score(x_test, y_test)*100)
    
#     cv_score = cross_val_score(pipeline_model, X, y, cv=5)  # pass the pipeline, not the bare model, since X is raw text
#     print("CV Score:", np.mean(cv_score)*100)
    y_pred = pipeline_model.predict(x_test)
    print(classification_report(y_test, y_pred))

Pipeline - chains a linear sequence of data transforms with a final estimator, so the whole sequence can be fitted and evaluated as a single model.
train_test_split() - splits the data into training and test subsets, so the model's performance can be estimated on data that was not used to train it. Here 25% of the data is held out, and stratify=y keeps the spam/ham proportions the same in both subsets.
cross_val_score() - splits the data into (cv) folds, trains the model on all but one fold, evaluates it on the remaining fold, repeats this for each fold, and returns the score from each round.
CountVectorizer - converts a collection of texts into a matrix of token counts, with one row per text and one column per word in the vocabulary.
TfidfVectorizer - computes the counts and the TF-IDF weighting in a single step; equivalent to CountVectorizer followed by TfidfTransformer. It is imported here but not used in the pipeline.
TfidfTransformer - rescales a count matrix into TF-IDF (term frequency, inverse document frequency) values, which down-weights words that appear in many messages.
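
The relationship between the three vectorizer classes can be checked on a toy corpus. The sketch below uses three made-up messages and default settings for every class; it only illustrates that the two-step route used in the pipeline and the one-step TfidfVectorizer produce the same matrix.

```python
import numpy as np
from sklearn.feature_extraction.text import (CountVectorizer,
                                             TfidfTransformer,
                                             TfidfVectorizer)

docs = ["free prize call now", "call me later", "free free entry"]

# Two steps, as in the pipeline: raw counts first, then TF-IDF weighting.
counts = CountVectorizer().fit_transform(docs)
two_step = TfidfTransformer().fit_transform(counts)

# One step: TfidfVectorizer does both at once.
one_step = TfidfVectorizer().fit_transform(docs)

same = np.allclose(two_step.toarray(), one_step.toarray())
print(counts.shape)  # (3, 7): 3 messages, 7 distinct words
print(same)          # True
```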

from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
classify(model, X, y)

Accuracy: 96.8413496051687

Classification Report for Logistic Regression

Results using the Logistic Regression model

from sklearn.naive_bayes import MultinomialNB
model = MultinomialNB()
classify(model, X, y)

Accuracy: 96.69777458722182

The accuracy decreased slightly compared to the logistic regression model

from sklearn.svm import SVC
model = SVC(C=3)
classify(model, X, y)

Accuracy: 98.27709978463747

The SVC model gives better results than the two models above

from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier()
classify(model, X, y)

Accuracy: 97.4156496769562

Accuracy decreased slightly compared to the SVC model
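
Instead of calling classify() once per model and reading the printed numbers, the comparison can also be done in a loop. The sketch below is one way to do that under some assumptions: evaluate is a hypothetical variant of classify() that returns the accuracy instead of printing it, and X_demo / y_demo are a handful of made-up messages, so the scores it prints say nothing about the real dataset.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC


def evaluate(model, X, y):
    """Same split and pipeline as classify(), but returns the test accuracy."""
    x_train, x_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=42, shuffle=True, stratify=y)
    pipeline_model = Pipeline([('vect', CountVectorizer()),
                               ('tfidf', TfidfTransformer()),
                               ('clf', model)])
    pipeline_model.fit(x_train, y_train)
    return pipeline_model.score(x_test, y_test)


# Made-up, already cleaned messages for illustration only.
X_demo = [
    "win free cash prize now", "claim free prize today",
    "urgent call now win cash", "free entry weekly prize draw",
    "won free holiday call now", "text win claim cash reward",
    "cheap loans call now", "free ringtone reply now",
    "meeting lunch today", "see home tonight",
    "call me later", "thanks birthday wishes",
    "running late soon", "finish report today",
    "watch movie tonight", "happy help move",
]
y_demo = ["spam"] * 8 + ["ham"] * 8

models = {
    "LogisticRegression": LogisticRegression(),
    "MultinomialNB": MultinomialNB(),
    "SVC": SVC(C=3),
    "RandomForest": RandomForestClassifier(random_state=42),
}

results = {name: evaluate(model, X_demo, y_demo)
           for name, model in models.items()}

# Print the models from best to worst test accuracy.
for name, acc in sorted(results.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {acc * 100:.2f}")
```

With the real data, X_demo and y_demo would be replaced by the X and y defined earlier.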

Final Thoughts

The SVC model has the best accuracy, at 98.28%
You may use different machine learning models of your preference for comparison
Pipeline is used to chain multiple estimators into one and automate the machine learning process. This is extremely useful as there is often a fixed sequence of steps in processing the data.
Simplifying and filtering the text gives cleaner data to process, which can lead to better results
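
Because the pipeline bundles vectorizing and classification into one estimator, it can be cross-validated as a whole and applied directly to new text. The following is a minimal sketch of both uses, assuming a dozen made-up messages in place of the real dataset; the variable names and the example message are illustrative only.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Made-up, already cleaned messages for illustration only.
X_demo = [
    "win free cash prize now", "claim free prize today",
    "urgent call now win cash", "free entry weekly prize draw",
    "won free holiday call now", "text win claim cash reward",
    "meeting lunch today", "see home tonight",
    "call me later", "thanks birthday wishes",
    "running late soon", "finish report today",
]
y_demo = ["spam"] * 6 + ["ham"] * 6

pipeline_model = Pipeline([('vect', CountVectorizer()),
                           ('tfidf', TfidfTransformer()),
                           ('clf', SVC(C=3))])

# Cross-validate the whole pipeline, so the vectorizer is refitted on the
# training part of every fold and never sees that fold's held-out messages.
cv_scores = cross_val_score(pipeline_model, X_demo, y_demo, cv=3)
print("CV Score:", cv_scores.mean() * 100)

# Fit on all the demo data, then classify a new cleaned message.
pipeline_model.fit(X_demo, y_demo)
prediction = pipeline_model.predict(["free cash prize call now"])
print(prediction[0])
```

New messages should go through the same clean_text() step as the training data before being passed to predict().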

In this project tutorial, we have explored the SMS Spam Detection Analysis dataset as a classification machine learning project in NLP. The data has been preprocessed with custom cleaning functions and processed using pipelines.

Get the project notebook from here

Thanks for reading the article! Check out more project videos on the Hackers Realm YouTube channel.