In this project tutorial, we build a content-based recommendation engine in Python on the IMDB Movies dataset. It compares movies by attributes such as plot, genre, directors and actors, and lists the most similar titles as recommendations for the user.
The engine loads and processes the content of the IMDB Movies dataset and, for a given title, returns a ranked list of the most similar movies.
Dataset Information
Data on Movies from IMDB (Includes Some Television as Well). Movie IDs to help gather much of this data come from one or two Kaggle projects. There is a workflow from original cobbled together spreadsheets to the final product with 27 variables and over 5000 observations.
The file used in this tutorial contains the top 250 movies from IMDB, each with 27 attributes such as Title, Director, Genre, Plot and Ratings.
Import Modules
import pandas as pd
import numpy as np
import re
import nltk
pd.set_option('display.max_columns', None)
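pandas - used for data manipulation and analysis
numpy - used for a wide variety of mathematical operations on arrays
re - regular expression module, used to find and replace text patterns
nltk - natural language toolkit, used here for tokenization and the stop word list
If NLTK reports a missing resource when tokenizing or loading the stop words, it can be fetched once with nltk.download(), for example nltk.download('stopwords').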
df = pd.read_csv("IMDB_Top250Engmovies2_OMDB_Detailed.csv")
df.head()
Display of the first 5 samples from the dataset
The data has many columns. You may drop the ones you do not need for quicker processing.
The more columns the engine uses, the more information it has to compare movies, which generally improves the filtering.
len(df)
250
Length of the dataset
df['Plot'][0]
'Two imprisoned men bond over a number of years, finding solace and eventual redemption through acts of common decency.'
Viewing one entry of the Plot column shows how the text is stored and helps decide how to preprocess it.
Data Preprocessing
# convert to lowercase, replace non-letter characters (digits, punctuation) with spaces, and collapse repeated whitespace
df['clean_plot'] = df['Plot'].str.lower()
df['clean_plot'] = df['clean_plot'].apply(lambda x: re.sub(r'[^a-zA-Z]', ' ', x))
df['clean_plot'] = df['clean_plot'].apply(lambda x: re.sub(r'\s+', ' ', x))
df['clean_plot']
0 two imprisoned men bond over a number of years...
1 the aging patriarch of an organized crime dyna...
2 the early life and career of vito corleone in ...
3 when the menace known as the joker emerges fro...
4 a jury holdout attempts to prevent a miscarria...
...
245 the desperate life of a chronic alcoholic is f...
246 a something supervising staff member of a resi...
247 a newspaper editor uses every trick in the boo...
248 an old man makes a long journey by lawn mover ...
249 a mumbai teen reflects on his upbringing in th...
This step cleans the text: it converts all characters to lower case, replaces digits and special characters with spaces, and collapses extra whitespace.
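To see what the two regular expressions do, the same steps can be tried on a single string. This is a minimal sketch: the helper name clean_text and the sample sentence are made up for illustration, and the final strip() is an addition that the tutorial code does not perform.

```python
import re

def clean_text(text):
    """Lowercase, keep letters only, and collapse repeated whitespace."""
    text = text.lower()
    # every character that is not a letter (digits, punctuation) becomes a space
    text = re.sub(r'[^a-zA-Z]', ' ', text)
    # runs of whitespace collapse into a single space
    text = re.sub(r'\s+', ' ', text)
    # assumption: also trim both ends, which the tutorial code does not do
    return text.strip()

print(clean_text("Two men, 20 years... and ONE chance!"))
# two men years and one chance
```

Note that digits are dropped entirely, so a plot mentioning a year or an age loses that information.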
# tokenize the sentence
df['clean_plot'] = df['clean_plot'].apply(lambda x: nltk.word_tokenize(x))
df['clean_plot']
0 [two, imprisoned, men, bond, over, a, number, ...
1 [the, aging, patriarch, of, an, organized, cri...
2 [the, early, life, and, career, of, vito, corl...
3 [when, the, menace, known, as, the, joker, eme...
4 [a, jury, holdout, attempts, to, prevent, a, m...
...
245 [the, desperate, life, of, a, chronic, alcohol...
246 [a, something, supervising, staff, member, of,...
247 [a, newspaper, editor, uses, every, trick, in,...
248 [an, old, man, makes, a, long, journey, by, la...
249 [a, mumbai, teen, reflects, on, his, upbringin...
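Each plot is now split into a list of individual words (tokens) for further processing.
Next we remove the stop words, common words such as "the" or "of" that carry little meaning for the comparison, along with words shorter than three characters.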
# remove stopwords
stop_words = nltk.corpus.stopwords.words('english')
plot = []
for sentence in df['clean_plot']:
    temp = []
    for word in sentence:
        if word not in stop_words and len(word) >= 3:
            temp.append(word)
    plot.append(temp)
plot
[['two', 'imprisoned', 'men', 'bond', 'number', 'years', 'finding', 'solace', 'eventual', 'redemption', 'acts', 'common', 'decency'],
['aging', 'patriarch', 'organized', 'crime', 'dynasty', 'transfers', 'control', 'clandestine', 'empire', 'reluctant', 'son'],
['early', 'life', 'career', 'vito', 'corleone', 'new', 'york', 'portrayed', 'son', 'michael', 'expands', 'tightens', 'grip', 'family', 'crime', 'syndicate'],
['menace', 'known', 'joker', 'emerges', 'mysterious', 'past', 'wreaks', 'havoc', 'chaos', 'people', 'gotham', 'dark', 'knight', 'must', 'accept', 'one', 'greatest', 'psychological', 'physical', 'tests', 'ability', 'fight', 'injustice'],
['jury', 'holdout', 'attempts', 'prevent', 'miscarriage', 'justice', 'forcing', 'colleagues', 'reconsider', 'evidence'],
['german', 'occupied', 'poland', 'world', 'war', 'oskar', 'schindler', 'gradually', 'becomes', 'concerned', 'jewish', 'workforce', 'witnessing', 'persecution', 'nazi', 'germans'],
['gandalf', 'aragorn', 'lead', 'world', 'men', 'sauron', 'army', 'draw', 'gaze', 'frodo', 'sam', 'approach', 'mount', 'doom', 'one', 'ring'],
['lives', 'two', 'mob', 'hit', 'men', 'boxer', 'gangster', 'wife', 'pair', 'diner', 'bandits', 'intertwine', 'four', 'tales', 'violence', 'redemption'],
['insomniac', 'office', 'worker', 'looking', 'way', 'change', 'life', 'crosses', 'paths', 'devil', 'may', 'care', 'soap', 'maker', 'forming', 'underground', 'fight', 'club', 'evolves', 'something', 'much', 'much'],
...
['meek', 'hobbit', 'shire', 'eight', 'companions', 'set', 'journey', 'destroy', 'powerful', 'one', 'ring', 'save', 'middle', 'earth', 'dark', 'lord', 'sauron'],
df['clean_plot'] = plot
This saves the filtered tokens back into the dataframe.
df['clean_plot']
0 [two, imprisoned, men, bond, number, years, fi...
1 [aging, patriarch, organized, crime, dynasty, ...
2 [early, life, career, vito, corleone, new, yor...
3 [menace, known, joker, emerges, mysterious, pa...
4 [jury, holdout, attempts, prevent, miscarriage...
...
245 [desperate, life, chronic, alcoholic, followed...
246 [something, supervising, staff, member, reside...
247 [newspaper, editor, uses, every, trick, book, ...
248 [old, man, makes, long, journey, lawn, mover, ...
249 [mumbai, teen, reflects, upbringing, slums, ac...
The plot column now holds only the meaningful words.
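The stop word loop above can also be written as a list comprehension. The sketch below is only illustrative: the function name remove_stop_words is hypothetical, and a tiny hand-written set stands in for NLTK's English stop word list so that the snippet runs without any downloads.

```python
# assumption: a small hand-written set replaces nltk.corpus.stopwords.words('english')
STOP_WORDS = {"over", "a", "of", "the", "and", "through"}

def remove_stop_words(tokens, stop_words=STOP_WORDS, min_len=3):
    """Keep tokens that are not stop words and have at least min_len characters."""
    return [t for t in tokens if t not in stop_words and len(t) >= min_len]

tokens = ["two", "imprisoned", "men", "bond", "over", "a", "number", "of", "years"]
print(remove_stop_words(tokens))
# ['two', 'imprisoned', 'men', 'bond', 'number', 'years']
```

With the real NLTK list, wrapping it in set(...) once makes each membership check a constant-time lookup instead of a scan through the list.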
df.head()
Next we split the Genre, Actors and Director columns into lists, keeping only the first four actors of each movie.
df['Genre'] = df['Genre'].apply(lambda x: x.split(','))
df['Actors'] = df['Actors'].apply(lambda x: x.split(',')[:4])
df['Director'] = df['Director'].apply(lambda x: x.split(','))
df['Actors'][0]
['Tim Robbins', ' Morgan Freeman', ' Bob Gunton', ' William Sadler']
The actor names still contain leading spaces and mixed case, so they also need to be cleaned for better processing.
def clean(sentence):
    # lowercase each entry and remove spaces so that it becomes a single token
    temp = []
    for word in sentence:
        temp.append(word.lower().replace(' ', ''))
    return temp
Now we apply the cleaning function to the data
df['Genre'] = [clean(x) for x in df['Genre']]
df['Actors'] = [clean(x) for x in df['Actors']]
df['Director'] = [clean(x) for x in df['Director']]
df['Actors'][0]
['timrobbins', 'morganfreeman', 'bobgunton', 'williamsadler']
Now we have preprocessed text
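One likely reason for removing the spaces inside names is that a full name then counts as a single token. The sketch below illustrates the idea with two made-up inputs; normalize_names is a hypothetical helper that mirrors the split and clean steps above.

```python
def normalize_names(raw):
    """Split a comma-separated string and turn each name into one lowercase token."""
    return [name.lower().replace(' ', '') for name in raw.split(',')]

print(normalize_names("Tim Robbins, Morgan Freeman"))
# ['timrobbins', 'morganfreeman']

# with spaces kept, two different people share the token 'tim'
shared_with_spaces = set("tim robbins".split()) & set("tim burton".split())
# with spaces removed, the two names have nothing in common
shared_joined = set(normalize_names("Tim Robbins")) & set(normalize_names("Tim Burton"))

print(shared_with_spaces)  # {'tim'}
print(shared_joined)       # set()
```

Without this step, movies could appear similar only because unrelated people share a first name.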
# combining all the columns data
columns = ['clean_plot', 'Genre', 'Actors', 'Director']
l = []
for i in range(len(df)):
    words = ''
    for col in columns:
        words += ' '.join(df[col][i]) + ' '
    l.append(words)
l
['two imprisoned men bond number years finding solace eventual redemption acts common decency crime drama timrobbins morganfreeman bobgunton williamsadler frankdarabont ', 'aging patriarch organized crime dynasty transfers control clandestine empire reluctant son crime drama marlonbrando alpacino jamescaan richards.castellano francisfordcoppola ', 'early life career vito corleone new york portrayed son michael expands tightens grip family crime syndicate crime drama alpacino robertduvall dianekeaton robertdeniro francisfordcoppola ', 'menace known joker emerges mysterious past wreaks havoc chaos people gotham dark knight must accept one greatest psychological physical tests ability fight injustice action crime drama christianbale heathledger aaroneckhart michaelcaine christophernolan ', 'jury holdout attempts prevent miscarriage justice forcing colleagues reconsider evidence crime drama martinbalsam johnfiedler leej.cobb e.g.marshall sidneylumet ', 'german occupied poland world war oskar schindler gradually becomes concerned jewish workforce witnessing persecution nazi germans biography drama history liamneeson benkingsley ralphfiennes carolinegoodall stevenspielberg ', 'gandalf aragorn lead world men sauron army draw gaze frodo sam approach mount doom one ring adventure drama fantasy noelappleby aliastin seanastin davidaston peterjackson ', 'lives two mob hit men boxer gangster wife pair diner bandits intertwine four tales violence redemption crime drama timroth amandaplummer lauralovelace johntravolta quentintarantino ',
'insomniac office worker looking way change life crosses paths devil may care soap maker forming underground fight club evolves something much much drama edwardnorton bradpitt meatloaf zachgrenier davidfincher ', 'meek hobbit shire eight companions set journey destroy powerful one ring save middle earth dark lord sauron adventure drama fantasy alanhoward noelappleby seanastin salabaker peterjackson ', 'intelligent forrest gump accidentally present many historic moments true love jenny curran eludes comedy drama romance tomhanks rebeccawilliams sallyfield michaelconnerhumphreys robertzemeckis ', 'rebels overpowered empire newly established base luke skywalker begins jedi training master yoda friends accept shelter questionable ally darth vader hunts plan capture luke action adventure fantasy markhamill harrisonford carriefisher billydeewilliams irvinkershner ', 'thief steals corporate secrets use dream sharing technology given inverse task planting idea mind ceo action adventure sci-fi leonardodicaprio josephgordon-levitt ellenpage tomhardy christophernolan ', 'frodo sam edge closer mordor help shifty gollum divided fellowship makes stand sauron new ally saruman hordes isengard adventure drama fantasy bruceallpress seanastin johnbach salabaker peterjackson ', 'criminal pleads insanity getting trouble mental institution rebels oppressive nurse rallies scared patients drama michaelberryman peterbrocco deanr.brooks alonzobrown milosforman ', 'story henry hill life teen years years mafia covering relationship wife karen hill mob partners jimmy conway tommy devitto italian american crime syndicate crime drama robertdeniro rayliotta joepesci lorrainebracco martinscorsese ', 'computer hacker learns mysterious rebels true nature reality role war controllers action sci-fi keanureeves laurencefishburne carrie-annemoss hugoweaving lanawachowski lillywachowski ',
'luke skywalker joins forces jedi knight cocky pilot wookiee two droids save galaxy empire world destroying battle station also attempting rescue princess leia evil darth vader action adventure fantasy markhamill harrisonford carriefisher petercushing georgelucas ', 'two detectives rookie veteran hunt serial killer uses seven deadly sins motives crime drama mystery morganfreeman andrewkevinwalker kevinspacey danielzacapa davidfincher ', 'angel sent heaven help desperately frustrated businessman showing life would like never existed drama family fantasy jamesstewart donnareed lionelbarrymore thomasmitchell frankcapra ', 'young cadet must confide incarcerated manipulative killer receive help catching another serial killer skins victims crime drama thriller jodiefoster lawrencea.bonney kasilemmons lawrencet.wrentz jonathandemme ', 'sole survivor tells twisty events leading horrific gun battle boat began five criminals met seemingly random police lineup crime drama mystery stephenbaldwin gabrielbyrne beniciodeltoro kevinpollak bryansinger ', 'mathilda year old girl reluctantly taken professional assassin family murdered mathilda form unusual relationship becomes prot learns assassin trade crime drama thriller jeanreno garyoldman natalieportman dannyaiello lucbesson ',
...
'sly business manager two wacky friends two opera singers help achieve success humiliating stuffy snobbish enemies comedy music musical grouchomarx chicomarx harpomarx kittycarlisle samwood edmundgoulding ', 'crooks plan execute daring race track robbery crime drama film-noir sterlinghayden coleengray vinceedwards jayc.flippen stanleykubrick ', 'earth mightiest heroes must come together learn fight team stop mischievous loki alien army enslaving humanity action sci-fi robertdowneyjr. chrisevans markruffalo chrishemsworth josswhedon ', 'woman asked spy group nazi friends south america far ingratiate drama film-noir romance carygrant ingridbergman clauderains louiscalhern alfredhitchcock ', 'due insistence invisible six foot tall rabbit best friend whimsical middle aged man thought family insane may wiser anyone knows comedy drama fantasy wallaceford williamh.lynn victoriahorne jessewhite henrykoster ', 'astronaut becomes stranded mars team assume dead must rely ingenuity find way signal earth alive adventure drama sci-fi mattdamon jessicachastain kristenwiig jeffdaniels ridleyscott ', 'teenage girl possessed mysterious entity mother seeks help two priests save daughter horror ellenburstyn maxvonsydow leej.cobb kittywinn williamfriedkin ', 'small town sheriff american west enlists help cripple drunk young gunfighter efforts hold jail brother local bad guy action drama western johnwayne deanmartin rickynelson angiedickinson howardhawks ', 'rich woman husband tabloid type reporter turn planned remarriage begins learn truth comedy romance carygrant katharinehepburn jamesstewart ruthhussey georgecukor ', 'two young men strangle inferior classmate hide body apartment invite friends family dinner party means challenge perfection crime crime drama thriller johndall farleygranger edithevanson douglasdick alfredhitchcock ', 'private detective philip marlowe hired rich family complex case seen murder blackmail might love crime film-noir mystery humphreybogart laurenbacall 
johnridgely marthavickers howardhawks ',
'confined troubled rock star descends madness midst physical social isolation everyone animation drama fantasy bobgeldof christinehargreaves jameslaurenson eleanordavid alanparker ', 'story king george united kingdom great britain northern ireland impromptu ascension throne speech therapist helped unsure monarch become worthy biography drama colinfirth helenabonhamcarter derekjacobi robertportal tomhooper ', 'young boy named ralphie attempts convince parents teacher santa red ryder gun really perfect christmas gift comedy family melindadillon darrenmcgavin peterbillingsley scottschwartz bobclark ', 'disillusioned college graduate finds torn older lover daughter comedy drama annebancroft dustinhoffman katharineross williamdaniels mikenichols ', 'new orleans discovers kennedy assassination official story drama history thriller sallykirkland anthonyramirez raylepere stevereed oliverstone ', 'karl childers simple man hospitalized since childhood murder mother lover released start new life small town drama billybobthornton dwightyoakam j.t.walsh johnritter billybobthornton ', 'fisherman smuggler syndicate businessmen match wits possession priceless diamond adventure drama thriller leonardodicaprio djimonhounsou jenniferconnelly kagisokuypers edwardzwick ', 'epic mosaic interrelated characters search love forgiveness meaning san fernando valley drama pathealy genevievezweig markflanagan neilflynn paulthomasanderson ', 'selfish yuppie charlie babbitt father left fortune savant brother raymond pittance charlie travel cross country drama dustinhoffman tomcruise valeriagolino geraldr.molen barrylevinson ',
'frontiersman fur trading expedition fights survival mauled bear left dead members hunting team adventure drama thriller leonardodicaprio tomhardy domhnallgleeson willpoulter alejandrogonzáleziñárritu ', "jack skellington king halloween town discovers christmas town attempts bring christmas home cause confusion animation family fantasy dannyelfman chrissarandon catherineo'hara williamhickey henryselick ", 'former prisoner war brainwashed unwitting assassin international communist conspiracy drama thriller franksinatra laurenceharvey janetleigh angelalansbury johnfrankenheimer ', 'fast talking mercenary morbid sense humor subjected rogue experiment leaves accelerated healing powers quest revenge action adventure comedy ryanreynolds karansoni edskrein michaelbenyaer timmiller ', "aging group outlaws look one last big score traditional american west disappearing around action adventure western williamholden ernestborgnine robertryan edmondo'brien sampeckinpah ", 'street urchin vies love beautiful princess uses genie magic power make prince order marry animation adventure comedy scottweinger robinwilliams lindalarkin jonathanfreeman ronclements johnmusker ', 'frustrated son tries determine fact fiction dying father life adventure drama fantasy ewanmcgregor albertfinney billycrudup jessicalange timburton ', 'world war phase career controversial american general george patton biography drama war georgec.scott karlmalden stephenyoung michaelstrong franklinj.schaffner ', 'desperate life chronic alcoholic followed four day drinking bout drama film-noir raymilland janewyman phillipterry howarddasilva billywilder ', 'something supervising staff member residential treatment facility navigates troubled waters world alongside worker longtime boyfriend drama brielarson johngallagherjr. 
stephaniebeatriz ramimalek destindanielcretton ', 'newspaper editor uses every trick book keep ace reporter wife remarrying comedy drama romance carygrant rosalindrussell ralphbellamy genelockhart howardhawks ', 'old man makes long journey lawn mover tractor mend relationship ill brother biography drama sissyspacek janegallowayheitz josepha.carpenter donaldwiegert davidlynch ', 'mumbai teen reflects upbringing slums accused cheating indian version wants millionaire drama devpatel saurabhshukla anilkapoor rajzutshi dannyboyle loveleentandan ']
List of the combined data (abridged for better viewing)
df['clean_input'] = l
df = df[['Title', 'clean_input']]
df.head()
The dataframe now holds only the title and the combined text that serves as input for feature extraction.
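The combining step can be checked on a single record without loading the dataset. In this sketch the dictionary movie is a hypothetical stand-in for one preprocessed row, and combine is an illustrative helper, not part of the tutorial code. Unlike the loop above, it does not leave a trailing space.

```python
# hypothetical record with the same structure as one preprocessed row
movie = {
    "clean_plot": ["jury", "holdout", "attempts"],
    "Genre": ["crime", "drama"],
    "Actors": ["martinbalsam", "johnfiedler"],
    "Director": ["sidneylumet"],
}
columns = ["clean_plot", "Genre", "Actors", "Director"]

def combine(row, columns):
    """Flatten the token lists of the chosen columns into one string."""
    return " ".join(token for col in columns for token in row[col])

print(combine(movie, columns))
# jury holdout attempts crime drama martinbalsam johnfiedler sidneylumet
```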
Feature Extraction
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer()
features = tfidf.fit_transform(df['clean_input'])
# create cosine similarity matrix
from sklearn.metrics.pairwise import cosine_similarity
cosine_sim = cosine_similarity(features, features)
print(cosine_sim)
[[1. 0.02693012 0.02567883 ... 0.00369409 0.00358765 0.00349224]
 [0.02693012 1. 0.18053579 ... 0.00386089 0.00374964 0.00364992]
 [0.02567883 0.18053579 1. ... 0.00368149 0.00357542 0.00348033]
 ...
 [0.00369409 0.00386089 0.00368149 ... 1. 0.00373351 0.00363421]
 [0.00358765 0.00374964 0.00357542 ... 0.00373351 1. 0.0035295 ]
 [0.00349224 0.00364992 0.00348033 ... 0.00363421 0.0035295 1. ]]
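Pairwise similarity scores between movies: the entry in row i and column j is the cosine similarity between movie i and movie j, and the diagonal is 1 because every movie is identical to itself.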
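To make the feature extraction step less of a black box, here is a small pure-Python sketch of TF-IDF weighting followed by cosine similarity. It uses the smoothed IDF formula and L2 normalisation that, as far as I know, match the TfidfVectorizer defaults, but it splits on whitespace only, so its numbers should not be expected to match scikit-learn exactly. The three documents and all function names are made up for illustration.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one L2-normalised TF-IDF dict per document (whitespace tokens)."""
    tokenized = [doc.split() for doc in docs]
    n = len(tokenized)
    # document frequency: in how many documents each term occurs
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # term count times smoothed idf: ln((1 + n) / (1 + df)) + 1
        vec = {t: c * (math.log((1 + n) / (1 + df[t])) + 1) for t, c in tf.items()}
        # scale to unit length; "or 1.0" avoids dividing by zero for an empty document
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({t: w / norm for t, w in vec.items()})
    return vectors

def cosine(a, b):
    """For unit-length vectors the cosine similarity is just the dot product."""
    return sum(w * b.get(t, 0.0) for t, w in a.items())

docs = [
    "crime drama mafia family",       # hypothetical movie 0
    "crime drama mafia syndicate",    # hypothetical movie 1
    "adventure fantasy ring hobbit",  # hypothetical movie 2
]
vecs = tfidf_vectors(docs)
sim = [[round(cosine(a, b), 3) for b in vecs] for a in vecs]
for row in sim:
    print(row)
```

Movies 0 and 1 share three of their four terms and get a score between 0 and 1, while movie 2 shares no terms with the others and scores 0 against both.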
Movie Recommendation
index = pd.Series(df['Title'])
index.head()
0 The Shawshank Redemption
1 The Godfather
2 The Godfather: Part II
3 The Dark Knight
4 12 Angry Men
Now we will create the function for the recommendation engine
def recommend_movies(title):
    movies = []
    # row number of the requested title
    idx = index[index == title].index[0]
    # similarity of this movie to every movie, highest first
    score = pd.Series(cosine_sim[idx]).sort_values(ascending=False)
    # skip the first entry (the movie itself) and keep the next 10
    top10 = list(score.iloc[1:11].index)
    for i in top10:
        movies.append(df['Title'][i])
    return movies
recommend_movies('The Dark Knight Rises')
['The Dark Knight', 'Inception', 'Batman Begins', 'The Lord of the Rings: The Fellowship of the Ring', 'Die Hard', 'Sin City', 'The Prestige', 'Star Wars: Episode IV - A New Hope', 'Mad Max: Fury Road', 'Django Unchained']
The top 10 recommendations for the given title
index[index == 'The Dark Knight Rises'].index[0]
51
Index number of the movie in the data
pd.Series(cosine_sim[3]).sort_values(ascending=False)
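3 1.000000
51 0.193658
89 0.192066
40 0.126955
187 0.088099
...
217 0.000000
150 0.000000
30 0.000000
145 0.000000
70 0.000000
This lists every movie index with its similarity score for row 3 (The Dark Knight), sorted from highest to lowest. The first entry is the movie itself with a score of 1, followed by index 51, The Dark Knight Rises. The recommendation function skips the first entry and returns the titles of the next 10.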
recommend_movies('The Shawshank Redemption')
['Pulp Fiction', 'Se7en', 'Rope', 'Goodfellas', "Hachi: A Dog's Tale", 'The Green Mile', 'The Great Escape', 'Million Dollar Baby', 'Beauty and the Beast', 'Unforgiven']
recommend_movies('The Avengers')
['Guardians of the Galaxy Vol. 2', 'Aliens', 'Guardians of the Galaxy', 'The Martian', 'Interstellar', 'Blade Runner', 'Kill Bill: Vol. 1', 'The Thing', 'Spider-Man: Homecoming', 'The Terminator']
len(df)
250
Length of the dataframe, still equal to the initial length, meaning no data was left out during the process
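The recommend_movies function above assumes that the requested title exists in the data and that the movie itself is always ranked first. A more defensive variant is sketched below under those observations. It uses plain Python lists so that it runs without the dataset; the function name recommend, the parameter k and the 3x3 matrix are all hypothetical.

```python
def recommend(title, titles, sim, k=10):
    """Return up to k titles most similar to `title`, or [] if it is unknown."""
    if title not in titles:
        return []
    idx = titles.index(title)
    # pair every other movie's score with its index, excluding the movie itself
    scored = [(score, i) for i, score in enumerate(sim[idx]) if i != idx]
    # highest score first; the sort is stable, so ties keep their original order
    scored.sort(key=lambda pair: -pair[0])
    return [titles[i] for _, i in scored[:k]]

# hypothetical similarity matrix for three movies
titles = ["The Dark Knight", "The Dark Knight Rises", "12 Angry Men"]
sim = [
    [1.00, 0.19, 0.02],
    [0.19, 1.00, 0.01],
    [0.02, 0.01, 1.00],
]
print(recommend("The Dark Knight", titles, sim, k=2))
# ['The Dark Knight Rises', '12 Angry Men']
print(recommend("Unknown Title", titles, sim))
# []
```

Excluding the movie by index rather than by position keeps the result correct even if another movie happens to have an identical feature vector.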
Final Thoughts
This approach can be adapted to other datasets and parameters. Besides content-based filtering, recommendation engines can also be collaborative or popularity-based.
You can change which columns of the dataframe are used as input to suit your needs.
Larger datasets can take a long time to process. For quicker results, you can work with a smaller sample of the data.
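One reason larger datasets become expensive is that the similarity matrix holds one score per pair of movies, so it grows with the square of the number of rows. The rough estimate below assumes a dense matrix of 64-bit floats; the function name and the row counts are only illustrative.

```python
def dense_similarity_bytes(n_movies, bytes_per_value=8):
    """Size in bytes of an n x n matrix, assuming 8 bytes per score."""
    return n_movies * n_movies * bytes_per_value

for n in (250, 5_000, 50_000):
    print(n, "movies ->", dense_similarity_bytes(n) / 1e6, "MB")
# 250 movies -> 0.5 MB
# 5000 movies -> 200.0 MB
# 50000 movies -> 20000.0 MB
```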
In this project tutorial, we explored the IMDB Movies dataset as a recommendation engine project and generated recommendations based on plot, genre, actors and directors.
Thanks for reading the article!!!
Check out more project videos from the YouTube channel Hackers Realm