Hackers Realm
Fake News Detection Analysis using Python | LSTM Classification | Deep Learning Project Tutorial
Equip yourself with the tools to combat fake news using Python! This tutorial explores LSTM classification, a powerful deep learning technique, for detecting and analyzing fake news. Learn to build a robust model that can identify misinformation and enhance your skills in natural language processing. Dive into the world of deep learning and gain insights into the fascinating field of fake news detection. Arm yourself with knowledge and contribute to a more informed society. #FakeNewsDetection #Python #LSTM #DeepLearning #NaturalLanguageProcessing #Misinformation

In this project tutorial we are going to analyze a dataset of articles, classify each one as reliable or unreliable, and visualize the most frequent words through word cloud plots.
You can watch the step-by-step explanation video tutorial down below
Dataset Information
Develop a deep learning program to identify whether an article might be fake news or not.
Attributes
id: unique id for a news article
title: the title of a news article
author: author of the news article
text: the text of the article; could be incomplete
label: a label that marks the article as potentially unreliable
1: unreliable
0: reliable
Download the dataset here
Import Modules
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from wordcloud import WordCloud
import re
import nltk
import warnings
%matplotlib inline
warnings.filterwarnings('ignore')
pandas - used to perform data manipulation and analysis
numpy - used to perform a wide variety of mathematical operations on arrays
matplotlib - used for data visualization and graphical plotting
seaborn - built on top of matplotlib with similar functionalities
wordcloud - used to visualize the most frequent words as a word cloud
re - regular expression module, used to find particular patterns in text and process them
nltk - a natural language processing toolkit
warnings - to manipulate warnings details
%matplotlib inline - to enable inline plotting
filterwarnings('ignore') - ignores the warnings thrown by the modules (gives cleaner output)
Loading the Dataset
df = pd.read_csv('train.csv')
df.head()

We can see the top 5 samples from the data
The important information is in the 'text' and 'label' columns, so the other columns are irrelevant for this process
Let us visualize the title and the text of the first article.
df['title'][0]
'House Dem Aide: We Didn’t Even See Comey’s Letter Until Jason Chaffetz Tweeted It'
df['text'][0]
'House Dem Aide: We Didn’t Even See Comey’s Letter Until Jason Chaffetz Tweeted It By Darrell Lucus on October 30, 2016 Subscribe Jason Chaffetz on the stump in American Fork, Utah ( image courtesy Michael Jolley, available under a Creative Commons-BY license) \nWith apologies to Keith Olbermann, there is no doubt who the Worst Person in The World is this week–FBI Director James Comey. But according to a House Democratic aide, it looks like we also know who the second-worst person is as well. It turns out that when Comey sent his now-infamous letter announcing that the FBI was looking into emails that may be related to Hillary Clinton’s email server, the ranking Democrats on the relevant committees didn’t hear about it from Comey. They found out via a tweet from one of the Republican committee chairmen. \nAs we now know, Comey notified the Republican chairmen and Democratic ranking members of the House Intelligence, Judiciary, and Oversight committees that his agency was reviewing emails it had recently discovered in order to see if they contained classified information. Not long after this letter went out, Oversight Committee Chairman Jason Chaffetz set the political world ablaze with this tweet. FBI Dir just informed me, "The FBI has learned of the existence of emails that appear to be pertinent to the investigation." Case reopened \n— Jason Chaffetz (@jasoninthehouse) October 28, 2016 \nOf course, we now know that this was not the case . Comey was actually saying that it was reviewing the emails in light of “an unrelated case”–which we now know to be Anthony Weiner’s sexting with a teenager. But apparently such little things as facts didn’t matter to Chaffetz. The Utah Republican had already vowed to initiate a raft of investigations if Hillary wins–at least two years’ worth, and possibly an entire term’s worth of them. Apparently Chaffetz thought the FBI was already doing his work for him–resulting in a tweet that briefly roiled the nation before cooler heads realized it was a dud. \nBut according to a senior House Democratic aide, misreading that letter may have been the least of Chaffetz’ sins. That aide told Shareblue that his boss and other Democrats didn’t even know about Comey’s letter at the time–and only found out when they checked Twitter. “Democratic Ranking Members on the relevant committees didn’t receive Comey’s letter until after the Republican Chairmen. In fact, the Democratic Ranking Members didn’ receive it until after the Chairman of the Oversight and Government Reform Committee, Jason Chaffetz, tweeted it out and made it public.” \nSo let’s see if we’ve got this right. The FBI director tells Chaffetz and other GOP committee chairmen about a major development in a potentially politically explosive investigation, and neither Chaffetz nor his other colleagues had the courtesy to let their Democratic counterparts know about it. Instead, according to this aide, he made them find out about it on Twitter. \nThere has already been talk on Daily Kos that Comey himself provided advance notice of this letter to Chaffetz and other Republicans, giving them time to turn on the spin machine. That may make for good theater, but there is nothing so far that even suggests this is the case. After all, there is nothing so far that suggests that Comey was anything other than grossly incompetent and tone-deaf. \nWhat it does suggest, however, is that Chaffetz is acting in a way that makes Dan Burton and Darrell Issa look like models of responsibility and bipartisanship. 
He didn’t even have the decency to notify ranking member Elijah Cummings about something this explosive. If that doesn’t trample on basic standards of fairness, I don’t know what does. \nGranted, it’s not likely that Chaffetz will have to answer for this. He sits in a ridiculously Republican district anchored in Provo and Orem; it has a Cook Partisan Voting Index of R+25, and gave Mitt Romney a punishing 78 percent of the vote in 2012. Moreover, the Republican House leadership has given its full support to Chaffetz’ planned fishing expedition. But that doesn’t mean we can’t turn the hot lights on him. After all, he is a textbook example of what the House has become under Republican control. And he is also the Second Worst Person in the World. About Darrell Lucus \nDarrell is a 30-something graduate of the University of North Carolina who considers himself a journalist of the old school. An attempt to turn him into a member of the religious right in college only succeeded in turning him into the religious right\'s worst nightmare--a charismatic Christian who is an unapologetic liberal. His desire to stand up for those who have been scared into silence only increased when he survived an abusive three-year marriage. You may know him on Daily Kos as Christian Dem in NC . Follow him on Twitter @DarrellLucus or connect with him on Facebook . Click here to buy Darrell a Mello Yello. Connect'
Punctuation and escape characters are present in the text; they can be filtered out to keep only the meaningful information
Let us see the data types and the number of samples in the dataframe
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20800 entries, 0 to 20799
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   id      20800 non-null  int64 
 1   title   20242 non-null  object
 2   author  18843 non-null  object
 3   text    20761 non-null  object
 4   label   20800 non-null  int64 
dtypes: int64(2), object(3)
memory usage: 812.6+ KB
Total of 20800 articles in the dataset
The title, author, and text columns have fewer non-null entries than the total number of rows, meaning those columns contain null values.
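As a quick sanity check (this snippet is not part of the original walkthrough), you can confirm which columns hold the missing values and how balanced the two classes are:
# count missing values per column (title, author and text have nulls)
print(df.isnull().sum())
# distribution of reliable (0) vs unreliable (1) articles
print(df['label'].value_counts())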
Data Preprocessing
Now we filter the data for processing
# drop unnecessary columns
df = df.drop(columns=['id', 'title', 'author'])
# drop null values
df = df.dropna(axis=0)
len(df)
20761
dropna(axis=0) drops an entire row if it contains a null value
# remove special characters and punctuations
df['clean_news'] = df['text'].str.lower()
df['clean_news']
0        house dem aide: we didn’t even see comey’s let...
1        ever get the feeling your life circles the rou...
2        why the truth might get you fired october 29, ...
3        videos 15 civilians killed in single us airstr...
4        print \nan iranian woman has been sentenced to...
                               ...
20795    rapper t. i. unloaded on black celebrities who...
20796    when the green bay packers lost to the washing...
20797    the macy’s of today grew from the union of sev...
20798    nato, russia to hold parallel exercises in bal...
20799      david swanson is an author, activist, journa...
Name: clean_news, Length: 20761, dtype: object
str.lower() - converts all characters to lower case
Now we proceed in removing the punctuations and special characters
df['clean_news'] = df['clean_news'].str.replace(r'[^A-Za-z0-9\s]', '', regex=True)
df['clean_news'] = df['clean_news'].str.replace('\n', '')
df['clean_news'] = df['clean_news'].str.replace(r'\s+', ' ', regex=True)
df['clean_news']
0        house dem aide we didnt even see comeys letter...
1        ever get the feeling your life circles the rou...
2        why the truth might get you fired october 29 2...
3        videos 15 civilians killed in single us airstr...
4        print an iranian woman has been sentenced to s...
                               ...
20795    rapper t i unloaded on black celebrities who m...
20796    when the green bay packers lost to the washing...
20797    the macys of today grew from the union of seve...
20798    nato russia to hold parallel exercises in balk...
20799     david swanson is an author activist journalis...
Name: clean_news, Length: 20761, dtype: object
All special characters and punctuations are removed
Escape characters are removed
Extra spaces are removed
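To make the effect of these replacements concrete, here is a minimal standalone sketch applying the same three patterns to a made-up string with plain re:
import re
sample = "House Dem Aide: We didn't even \nsee   Comey's letter!"
cleaned = sample.lower()
cleaned = re.sub(r'[^A-Za-z0-9\s]', '', cleaned)  # drop punctuation and special characters
cleaned = cleaned.replace('\n', '')               # drop escape characters
cleaned = re.sub(r'\s+', ' ', cleaned)            # collapse repeated whitespace
print(cleaned)  # house dem aide we didnt even see comeys letter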
# remove stopwords
from nltk.corpus import stopwords
nltk.download('stopwords')   # download the stopword list if it is not already available
stop = stopwords.words('english')
df['clean_news'] = df['clean_news'].apply(lambda x: " ".join([word for word in x.split() if word not in stop]))
df.head()

Stop words carry little meaningful information; removing them simplifies the text data for better feature extraction
Stop words are removed by splitting each cleaned text into words and keeping only the words that are not in the stopword list, as illustrated below
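As a made-up, self-contained illustration of the same split-and-filter pattern on a single sentence:
from nltk.corpus import stopwords
sample = "the fbi was looking into emails that may be related to the server"
stop = stopwords.words('english')
print(" ".join([word for word in sample.split() if word not in stop]))
# words such as 'the', 'was', 'into', 'that' and 'to' are dropped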
Exploratory Data Analysis
# visualize the frequent words
all_words = " ".join([sentence for sentence in df['clean_news']])
wordcloud = WordCloud(width=800, height=500, random_state=42, max_font_size=100).generate(all_words)
# plot the graph
plt.figure(figsize=(15, 9))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.show()

Concatenation of all the sentences from clean_news column
The most frequent words are larger and less frequent words are smaller
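If you want exact counts rather than relative sizes, a small optional sketch with collections.Counter lists the top words directly (all_words is the concatenated string built above):
from collections import Counter
word_counts = Counter(all_words.split())
print(word_counts.most_common(10))  # the ten most frequent words and their counts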
Next, let us visualize the frequent words separately for genuine and fake news.
# visualize the frequent words for genuine news
all_words = " ".join([sentence for sentence in df['clean_news'][df['label']==0]])
wordcloud = WordCloud(width=800, height=500, random_state=42, max_font_size=100).generate(all_words)
# plot the graph
plt.figure(figsize=(15, 9))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.show()

Concatenation of sentences of genuine news only
Visualization of most frequent words of genuine news
# visualize the frequent words for fake news
all_words = " ".join([sentence for sentence in df['clean_news'][df['label']==1]])
wordcloud = WordCloud(width=800, height=500, random_state=42, max_font_size=100).generate(all_words)
# plot the graph
plt.figure(figsize=(15, 9))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.show()

Concatenation of sentences of fake news only
Visualization of most frequent words of fake news
Compared with the plot for genuine news, the word frequencies differ noticeably, and some frequent words appear in only one of the two classes
Create Word Embeddings
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
Tokenizer - builds a vocabulary from the text and converts each document into a sequence of integer tokens
pad_sequences - pads (or truncates) the sequences to a uniform length, filling shorter sequences with zeros
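Here is a minimal sketch, on two made-up sentences, of how these two utilities behave (the exact integer ids depend on the fitted vocabulary):
toy_texts = ["fake news spreads fast", "real news spreads slowly"]
toy_tokenizer = Tokenizer()
toy_tokenizer.fit_on_texts(toy_texts)                    # build the word -> integer index
toy_seqs = toy_tokenizer.texts_to_sequences(toy_texts)   # e.g. [[3, 1, 2, 4], [5, 1, 2, 6]]
print(pad_sequences(toy_seqs, maxlen=6, padding='post')) # zero-padded on the right to length 6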
# tokenize text
tokenizer = Tokenizer()
tokenizer.fit_on_texts(df['clean_news'])
word_index = tokenizer.word_index
vocab_size = len(word_index)
vocab_size
199536
fit_on_texts assigns an integer index to every unique word (token) in the corpus
vocab_size is the total number of unique words in the data
# padding data
sequences = tokenizer.texts_to_sequences(df['clean_news'])
padded_seq = pad_sequences(sequences, maxlen=500, padding='post', truncating='post')
Padding the data equalizes the length of all sequences
For this project we fix the maximum length at 500 words for faster processing; normally you would inspect the length distribution (or the maximum length) of the articles in the whole dataset to choose this value, as sketched below
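As an optional check, the article length distribution can guide the choice of maxlen:
# number of words per cleaned article
lengths = df['clean_news'].str.split().apply(len)
print(lengths.max())            # length of the longest article
print(lengths.quantile(0.95))   # 95% of the articles are at most this long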
# create embedding index
embedding_index = {}
with open('glove.6B.100d.txt', encoding='utf-8') as f:
    for line in f:
        values = line.split()
        # first token is the word, the remaining tokens are its 100-dimensional GloVe vector
        embedding_index[values[0]] = np.asarray(values[1:], dtype='float32')
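The embedding index is typically turned into an embedding matrix aligned with the tokenizer's word_index; a minimal sketch of that step, assuming 100-dimensional GloVe vectors and the word_index and vocab_size computed above, looks like this:
# build the embedding matrix: row i holds the GloVe vector of the word with index i
embedding_matrix = np.zeros((vocab_size + 1, 100))
for word, i in word_index.items():
    vector = embedding_index.get(word)
    if vector is not None:
        embedding_matrix[i] = vector
Words that do not appear in the GloVe file keep an all-zero row.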