
SMS Spam Detection Analysis using Python (NLP) | Machine Learning Project Tutorial

SMS Spam Detection Analysis is a classification project that comes under Natural Language Processing. The objective of the project is to analyze the text messages and classify whether the message is ham (legitimate) or spam.


In this project tutorial, we analyze and classify the text messages from the dataset using classification models wrapped in scikit-learn pipelines.





Dataset Information


The "spam" concept is diverse: advertisements for products/web sites, make money fast schemes, chain letters, pornography...


The SMS Spam Collection is a set of tagged SMS messages collected for SMS spam research. It contains one set of 5,574 SMS messages in English, each tagged as either ham (legitimate) or spam.


Attributes

  • SMS Messages

  • Label (spam/ham)


Download the dataset here



Import modules


import pandas as pd
import numpy as np
import nltk
import re
from nltk.corpus import stopwords
  • pandas - used for data manipulation and analysis

  • numpy - used for a wide variety of mathematical operations on arrays

  • nltk - the Natural Language Toolkit, a natural language processing library included with Anaconda

  • re - the regular expression module, used to find and process particular patterns in text

  • stopwords - the NLTK stop word corpus, used to remove stop words from the text data


Loading the dataset


Now we load the dataset for preprocessing

df = pd.read_csv('spam.csv')
df.head()
  • The relevant columns are v1 (the label) and v2 (the message text)

  • The other columns are null and are not needed for processing



Let us extract the relevant data for preprocessing

# get necessary columns for processing
df = df[['v2', 'v1']]
# df.rename(columns={'v2': 'messages', 'v1': 'label'}, inplace=True)
df = df.rename(columns={'v2': 'messages', 'v1': 'label'})
df.head()
  • The columns are renamed so that the code is easier to read

  • Two ways to rename the columns are listed (one is commented out); either one works



Preprocessing the dataset


# check for null values
df.isnull().sum()

messages    0
label       0
dtype: int64

  • Counts and displays the number of null values in each of the two columns.

  • If there are any null values, filter out or fill those rows before further processing
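The dataset used here has no null values, so no filtering is needed. As a minimal sketch of what that filtering could look like, the example below uses a small made-up DataFrame (the rows are invented for illustration and are not from the SMS Spam Collection) and drops incomplete rows with dropna().

```python
import pandas as pd

# Toy data, invented for illustration; not taken from the SMS Spam Collection
toy = pd.DataFrame({
    'messages': ['Free entry in a prize draw', None, 'See you at lunch'],
    'label': ['spam', 'ham', 'ham'],
})

# Count missing values per column, the same check as above
null_counts = toy.isnull().sum()
print(null_counts)

# Drop rows where either column is missing, then renumber the index
toy_clean = toy.dropna(subset=['messages', 'label']).reset_index(drop=True)
print(toy_clean)
```

Dropping rows is only one option; whether to drop or fill missing values depends on how many rows are affected.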


STOPWORDS = set(stopwords.words('english'))

def clean_text(text):
    # convert to lowercase
    text = text.lower()
    # remove special characters
    text = re.sub(r'[^0-9a-zA-Z]', ' ', text)
    # remove extra spaces
    text = re.sub(r'\s+', ' ', text)
    # remove stopwords
    text = " ".join(word for word in text.split() if word not in STOPWORDS)
    return text
  • The cleaning steps are wrapped in a function so they can be reused instead of being repeated line by line whenever further cleaning is needed

  • set(stopwords.words('...')) - loads the list of common stop words for the specified language as a set of unique tokens; the corpus must be downloaded once with nltk.download('stopwords')

  • Stop words carry little meaning on their own, so removing them usually has little effect on the results

  • Text is converted to lower case to avoid mismatches between differently cased forms of the same word

  • Special characters are replaced with spaces and extra spaces are collapsed

  • Stop words are removed by splitting the text into words and dropping every word found in the STOPWORDS set
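To see what each cleaning step does, here is a small self-contained sketch of the same logic. It assumes a tiny hand-written stop word set (DEMO_STOPWORDS, a hypothetical stand-in for the NLTK list) so that it runs without downloading the corpus, and the sample message is made up.

```python
import re

# Hypothetical stand-in for stopwords.words('english'),
# kept tiny so the example runs without the NLTK corpus
DEMO_STOPWORDS = {'you', 'have', 'a', 'to', 'the', 'now'}

def clean_text_demo(text, stop_words=DEMO_STOPWORDS):
    # convert to lowercase
    text = text.lower()
    # replace every non-alphanumeric character with a space
    text = re.sub(r'[^0-9a-z]', ' ', text)
    # collapse runs of whitespace into a single space
    text = re.sub(r'\s+', ' ', text)
    # drop stop words
    return " ".join(word for word in text.split() if word not in stop_words)

# Made-up sample message
sample = "WINNER!! You have won a $1000 prize. Call 09061701461 to claim NOW!"
cleaned = clean_text_demo(sample)
print(cleaned)  # winner won 1000 prize call 09061701461 claim
```

With the full NLTK stop word list the output may differ, since that list is much longer than the demo set.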



Now let us clean the text messages

# clean the messages
df['clean_text'] = df['messages'].apply(clean_text)
df.head()
  • New column created to visualize the results from the text cleaning


Input Split


Let us separate the input and output attributes; the train/test split itself happens later, inside the classify() function

X = df['clean_text']
y = df['label']
  • X - input attribute

  • y - output attribute



Model Training


from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import classification_report
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer, TfidfTransformer

def classify(model, X, y):
    # train test split
    x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42, shuffle=True, stratify=y)
    # model training
    pipeline_model = Pipeline([('vect', CountVectorizer()),
                               ('tfidf',TfidfTransformer()),
                               ('clf', model)])
    pipeline_model.fit(x_train, y_train)
    
    print('Accuracy:', pipeline_model.score(x_test, y_test)*100)
    
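    # optional: 5-fold cross-validation; pass the whole pipeline,
    # because the bare model cannot be fitted on raw text
    # cv_score = cross_val_score(pipeline_model, X, y, cv=5)
    # print("CV Score:", np.mean(cv_score)*100)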
    y_pred = pipeline_model.predict(x_test)
    print(classification_report(y_test, y_pred))
  • Pipeline - chains a linear sequence of data transforms together with a final estimator, so that the whole sequence can be fitted and evaluated as a single model.

  • train_test_split() - splits the data into training and test subsets, so that performance can be estimated on data that was not used to train the model.

  • cross_val_score() - splits the data into cv equal folds, trains on all but one fold and evaluates on the held-out fold, rotating through every fold, and returns one score per fold.

  • CountVectorizer - converts a collection of text documents into a matrix of token counts, one row per document and one column per word in the vocabulary.

  • TfidfVectorizer - computes term frequency and inverse document frequency in one step; it is equivalent to CountVectorizer followed by TfidfTransformer and is imported here but not used.

  • TfidfTransformer - transforms a count matrix into a TF-IDF weighted representation, which the machine learning algorithm is then fitted on.
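The relationship between the three text feature classes can be checked on a toy corpus. The sketch below, which uses three invented documents and default parameters, builds the count matrix, applies TF-IDF weighting in a second step, and compares the result with TfidfVectorizer.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer, TfidfTransformer, TfidfVectorizer)

# Toy corpus, invented for illustration
docs = ['free prize call now', 'call me later', 'free free entry']

# Two-step route: raw token counts, then TF-IDF weighting
counts = CountVectorizer().fit_transform(docs)
two_step = TfidfTransformer().fit_transform(counts)

# One-step route
one_step = TfidfVectorizer().fit_transform(docs)

print(counts.toarray())   # one row per document, one column per word
same = np.allclose(two_step.toarray(), one_step.toarray())
print(same)
```

The two-step form used in classify() keeps the count and weighting stages as separate pipeline steps, which makes it easy to swap or remove the weighting stage.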



from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
classify(model, X, y)

Accuracy: 96.8413496051687


  • Results using the Logistic Regression model



from sklearn.naive_bayes import MultinomialNB
model = MultinomialNB()
classify(model, X, y)

Accuracy: 96.69777458722182


  • Accuracy is slightly lower than with the logistic regression model



from sklearn.svm import SVC
model = SVC(C=3)
classify(model, X, y)

Accuracy: 98.27709978463747



  • The SVC model gives better results than the models above



from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier()
classify(model, X, y)

Accuracy: 97.4156496769562



  • Accuracy is slightly lower than with the SVC model



Final Thoughts


  • The SVC model has the best accuracy, at 98.28%

  • You may use different machine learning models of your preference for comparison

  • Pipeline chains multiple estimators into one and automates the machine learning process. This is extremely useful because there is often a fixed sequence of steps in processing the data.

  • Simplifying and filtering the text gives cleaner data to process, which can lead to better results
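Because the pipeline bundles vectorization and the classifier, it can be cross-validated and applied directly to raw text. The following is a minimal sketch under stated assumptions: the eight messages and their labels are invented for illustration, MultinomialNB is picked arbitrarily from the models tried above, and cv=2 is used only because the toy set is so small.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import cross_val_score

# Toy messages and labels, invented for illustration
texts = [
    'free prize call now', 'win cash prize today',
    'claim free entry now', 'urgent call to win cash',
    'see you at lunch', 'are you home yet',
    'call me when you are free', 'lunch at home today',
]
labels = ['spam', 'spam', 'spam', 'spam', 'ham', 'ham', 'ham', 'ham']

pipe = Pipeline([('vect', CountVectorizer()),
                 ('tfidf', TfidfTransformer()),
                 ('clf', MultinomialNB())])

# Cross-validate the whole pipeline so that the vectorizer is
# refitted on the training folds only
scores = cross_val_score(pipe, texts, labels, cv=2)
print('CV Score:', np.mean(scores) * 100)

# Fit on all toy data and classify a new raw message
pipe.fit(texts, labels)
prediction = pipe.predict(['free prize waiting call now'])
print(prediction)
```

On the real dataset, X and y from the Input Split section would take the place of texts and labels, and a larger cv value such as 5 would be more typical.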


In this project tutorial, we explored the SMS Spam Collection dataset as a classification machine learning project in NLP. The data was preprocessed with a custom cleaning function and modeled using pipelines.
