In today's data-driven world, extracting information from websites has become an essential skill for data analysts, researchers, and developers. Web scraping, the process of automating the extraction of data from websites, provides a valuable means to gather structured information for various purposes, including data analysis, research, and business intelligence. Beautiful Soup, a Python library, is a powerful tool that simplifies the process of web scraping by allowing users to parse and navigate HTML and XML documents effortlessly.
In this article, we will use Beautiful Soup to scrape IMDb's Top 250 Movies list. We will walk through fetching the IMDb page, parsing the HTML content, and extracting movie details such as titles and ratings.
You can watch the video tutorial with a step-by-step explanation below.
Import Modules
from bs4 import BeautifulSoup
import requests
import pandas as pd
BeautifulSoup - used for web scraping and parsing HTML and XML documents.
requests - used for making HTTP requests to web servers.
pandas - used for data manipulation and analysis.
Request page source from URL
We will request the page content. First, define the URL from which to request the content.
url = "https://www.imdb.com/chart/top/"
It creates a variable called url and stores the web address "https://www.imdb.com/chart/top/" as its value.
This URL points to IMDb's Top 250 Movies page.
Next let us define the headers.
HEADERS = {'User-Agent': 'Mozilla/5.0 (iPad; CPU OS 12_2 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Mobile/15E148'}
The HEADERS dictionary contains an HTTP header known as the "User-Agent" header. In web scraping, setting a User-Agent header can be important because it informs the web server about the client making the request.
'User-Agent': This is the name of the HTTP header.
'Mozilla/5.0': This part is a common identifier and indicates that the user agent is compatible with Mozilla browsers. It's often used to prevent websites from blocking requests from bots or non-browser clients.
'(iPad; CPU OS 12_2 like Mac OS X)': This simulates the user agent of an iPad running iOS 12.2.
'AppleWebKit/605.1.15': This part refers to the browser rendering engine, which is WebKit in this case.
'(KHTML, like Gecko)': It indicates compatibility with KHTML (an open-source layout engine) and Gecko (the layout engine used by Mozilla Firefox).
'Mobile/15E148': This suggests that the user agent is for a mobile device; "15E148" is an iOS build identifier.
Next, send an HTTP GET request to the page.
page = requests.get(url, headers=HEADERS)
page
<Response [200]>
page = requests.get(url, headers=HEADERS): This line sends an HTTP GET request to the URL stored in the url variable ("https://www.imdb.com/chart/top/") while also including the custom User-Agent header from the HEADERS dictionary. The response from the server is stored in the page variable.
page: The page variable now contains the HTTP response object returned by the requests.get() function. You can use this object to access various properties of the response, such as the content, status code, headers, and more.
A response code of 200 means the request was executed without errors.
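Before parsing, it is good practice to verify the status code so that a failed request does not silently produce an empty result. The helper below is a hypothetical sketch (not part of the requests library) that mirrors this check:

```python
# Hypothetical helper that mirrors the status check described above
def check_status(status_code):
    # Codes in the 2xx range indicate success; anything else signals a problem
    if not (200 <= status_code < 300):
        raise RuntimeError(f"Request failed with status {status_code}")
    return True

check_status(200)  # passes silently for a successful response
```

In practice, `requests` offers the same behavior via `page.raise_for_status()`.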
Next parse the content of the HTTP response obtained from the requests.get() request.
soup = BeautifulSoup(page.content, "html.parser")
page.content: This accesses the content of the HTTP response object stored in the page variable. The content attribute contains the raw bytes of the response body, which includes the HTML content of the web page retrieved from the URL specified in the previous requests.get() request.
"html.parser": This argument specifies the parser to be used by BeautifulSoup when parsing the HTML content. In this case, you're using the built-in HTML parser provided by Python's standard library. This parser is a good choice for parsing well-formed HTML documents.
BeautifulSoup(page.content, "html.parser"): This line creates a BeautifulSoup object named soup by passing in the content of the HTTP response (page.content) and specifying the HTML parser. The soup object represents the parsed HTML document, and you can use it to navigate, search, and extract data from the web page.
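To see how parsing works without hitting the network, here is a minimal sketch that parses a small inline HTML fragment (invented markup, not real IMDb HTML) the same way:

```python
from bs4 import BeautifulSoup

# A tiny stand-in fragment (not real IMDb markup) to illustrate parsing
html = """
<table>
  <tr><td class="titleColumn">1. The Example Movie (1999)</td></tr>
  <tr><td class="titleColumn">2. Another Example (2005)</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
cells = soup.find_all('td', class_='titleColumn')
```

The same `find_all()` call used later on the real page works identically on this fragment.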
Next scrape movie names from the IMDb web page.
# scrape movie names
scraped_movies = soup.find_all('td', class_='titleColumn')
scraped_movies
soup.find_all('td', class_='titleColumn'): This line of code uses BeautifulSoup's find_all() method to search for all HTML <td> elements with the class attribute set to 'titleColumn'. In the IMDb page's HTML structure, these elements contain information about each movie's title and additional details.
scraped_movies: The result of the find_all() method is a list of BeautifulSoup Tag objects that match the specified criteria. Each element in the list represents a movie entry on the IMDb page.
Next parse and clean the movie names extracted from the IMDb web page using Beautiful Soup.
# parse movie names
movies = []
for movie in scraped_movies:
    movie = movie.get_text().replace('\n', "")
    movie = movie.strip(" ")
    movies.append(movie)
movies
movies = []: This line initializes an empty list named movies to store the parsed and cleaned movie names.
for movie in scraped_movies: This loop iterates through the list of BeautifulSoup Tag objects stored in the scraped_movies variable, which represent the movie entries on the IMDb page.
movie = movie.get_text().replace('\n', ""): For each movie entry, you use the get_text() method to extract the text content, which includes the movie name. The replace('\n', "") part is used to remove newline characters from the text.
movie = movie.strip(" "): You strip any leading and trailing spaces from the movie name using the strip() method.
movies.append(movie): Finally, the cleaned movie name is added to the movies list.
The movies list will contain the parsed and cleaned movie names from the IMDb page.
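The same cleaning can be written as a one-line list comprehension. The raw strings below are illustrative stand-ins for what `get_text()` returns; note that removing newlines joins the rank, title, and year without spaces, matching the loop's behavior:

```python
# Illustrative raw strings resembling what get_text() returns for each entry
raw = ["\n1.\nThe Shawshank Redemption\n(1994)\n", "\n2.\nThe Godfather\n(1972)\n"]

# One-line equivalent of the cleaning loop above
movies = [m.replace('\n', '').strip(' ') for m in raw]
```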
Next scrape movie ratings from the IMDb web page.
# scrape ratings for movies
scraped_ratings = soup.find_all('td', class_='ratingColumn imdbRating')
scraped_ratings
soup.find_all('td', class_='ratingColumn imdbRating'): This line of code uses BeautifulSoup's find_all() method to search for all HTML <td> elements with the class attribute set to 'ratingColumn imdbRating'. In the IMDb page's HTML structure, these elements typically contain information about each movie's rating.
scraped_ratings: The result of the find_all() method is a list of BeautifulSoup Tag objects that match the specified criteria. Each element in the list represents a movie's rating on the IMDb page.
Next parse and clean the movie ratings extracted from the IMDb web page using Beautiful Soup.
# parse ratings
ratings = []
for rating in scraped_ratings:
    rating = rating.get_text().replace('\n', '')
    ratings.append(rating)
ratings
ratings = []: This line initializes an empty list named ratings to store the parsed and cleaned movie ratings.
for rating in scraped_ratings: This loop iterates through the list of BeautifulSoup Tag objects stored in the scraped_ratings variable, which represent the movie ratings on the IMDb page.
rating = rating.get_text().replace('\n', ''): For each movie rating, you use the get_text() method to extract the text content, which includes the rating. The replace('\n', '') part is used to remove newline characters from the text.
ratings.append(rating): Finally, the cleaned rating is added to the ratings list.
The ratings list will contain the parsed and cleaned movie ratings from the IMDb page.
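Keep in mind the scraped ratings are strings, not numbers. If you plan to sort or aggregate them, convert them to floats first; a minimal sketch with illustrative values:

```python
# Illustrative rating strings as they come out of the cleaning loop
ratings = ['9.2', '9.2', '9.0']

# Convert to floats for numeric analysis
numeric_ratings = [float(r) for r in ratings]
```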
Store the Scraped Data
Next create a DataFrame and store movie names and ratings in it.
data = pd.DataFrame()
data['Movie Names'] = movies
data['Ratings'] = ratings
data.head()
data = pd.DataFrame(): This line initializes an empty DataFrame named data.
data['Movie Names'] = movies: You create a new column in the DataFrame called "Movie Names" and assign the list of movie names (movies) to this column. Each element in the list will correspond to a row in this column.
data['Ratings'] = ratings: Similarly, you create another column in the DataFrame called "Ratings" and assign the list of movie ratings (ratings) to this column. Each element in the list will correspond to a row in this column.
data.head(): Finally, you display the first few rows of the DataFrame using the head() method. This gives you a preview of the data contained in the DataFrame.
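The same DataFrame can be built in a single step by passing a dictionary to the constructor. The values below are illustrative stand-ins for the scraped lists:

```python
import pandas as pd

# Illustrative values standing in for the scraped lists
movies = ['1.The Shawshank Redemption(1994)', '2.The Godfather(1972)']
ratings = ['9.2', '9.2']

# One-step construction equivalent to the column-by-column approach above
data = pd.DataFrame({'Movie Names': movies, 'Ratings': ratings})
```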
Next save the Pandas DataFrame data to a CSV (Comma-Separated Values) file.
data.to_csv('IMDB Top Movies.csv', index=False)
data: This is the Pandas DataFrame you want to export to a CSV file.
.to_csv(): This is a Pandas method used to write the DataFrame to a CSV file.
'IMDB Top Movies.csv': This is the filename for the CSV file you want to create. You can specify a different filename if needed.
index=False: This parameter specifies that you do not want to include the index (row numbers) of the DataFrame as a separate column in the CSV file. Setting it to False ensures that the index is not included in the output.
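To confirm the export worked, you can read the CSV back with `pd.read_csv()`. A small round-trip sketch, using a temporary file and an illustrative one-row DataFrame:

```python
import os
import tempfile
import pandas as pd

# Illustrative DataFrame standing in for the scraped data
data = pd.DataFrame({'Movie Names': ['Example Movie'], 'Ratings': ['9.0']})

# Write the CSV to a temporary location, then read it back to verify
path = os.path.join(tempfile.gettempdir(), 'imdb_roundtrip_check.csv')
data.to_csv(path, index=False)
restored = pd.read_csv(path)
```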
Final Thoughts
Practice ethical scraping by avoiding aggressive scraping, excessive requests, and any actions that may disrupt the website's normal operation.
Websites may change their HTML structure over time. Ensure that your scraping code is robust and can handle variations in the structure gracefully.
Web-scraped data may require cleaning and validation. IMDb data may contain inconsistencies or missing values, so be prepared to handle data preprocessing.
Consider how you want to store the scraped data. Saving it in a structured format like CSV, JSON, or a database allows for easy retrieval and analysis.
In summary, web scraping with Beautiful Soup can be a powerful tool for extracting data from websites like IMDb, but it should be done responsibly, ethically, and in compliance with relevant policies and laws. Here we saw how to scrape data from a website using Beautiful Soup; in other articles we will explore other scraping techniques.
Get the project notebook from here
Thanks for reading the article!
Check out more project videos from the YouTube channel Hackers Realm