Hackers Realm

Apr 28, 2022 · 4 min

Million Songs Dataset Analysis using Python | Recommendation Engine | Machine Learning Tutorial

Updated: Jun 2, 2023

This tutorial explores the Million Songs Dataset with Python, covering the recommendation engine techniques and machine learning algorithms used to uncover music insights and build personalized song recommendations. It is a hands-on project for practicing data analysis and machine learning on music data.

Song Recommendation Engine

In this project tutorial, we are going to build a song recommendation engine from the Million Songs Dataset. It recommends songs in two ways: by overall popularity and by similarity to the songs in a user's listening history.

Dataset Information

The Million Songs Dataset consists of two files: triplet_file and metadata_file. The triplet_file contains user_id, song_id and listen_count. The metadata_file contains song_id, title, release, year and artist_name. The dataset is a mixture of songs from various websites, together with the ratings that users gave after listening to them.

There are three types of recommendation systems: content-based, collaborative and popularity-based.

Download the dataset here

Import modules

import pandas as pd
import numpy as np
import Recommenders as Recommenders

pandas - used to perform data manipulation and analysis
numpy - used to perform a wide variety of mathematical operations on arrays
Recommenders - custom python file for recommendation system

Loading the dataset

Let us load the first file for processing

song_df_1 = pd.read_csv('triplets_file.csv')
song_df_1.head()

user_id - unique identifier of the user in the dataset
song_id - unique identifier of the song in the dataset
listen_count - number of times the user listened to that song

Let us load the second file for processing

song_df_2 = pd.read_csv('song_data.csv')
song_df_2.head()

Song Dataset

title - full name of the song
release - name of the album on which the song was released
artist_name - name of the artist or band
year - year in which the song was released
If the release year is not given, it appears as 0
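Because a missing year is stored as 0, any statistic computed on the year column would be distorted unless those zeros are treated as missing. The snippet below is a minimal sketch on made-up rows (toy_meta and year_clean are illustrative names, not part of the original notebook) showing one way to mask them.

```python
import pandas as pd

# Toy metadata (made-up rows) using the same column names as the song file
toy_meta = pd.DataFrame({
    "song_id": ["S1", "S2", "S3"],
    "title": ["Song A", "Song B", "Song C"],
    "year": [2003, 0, 1999],
})

# Replace the placeholder 0 with NaN so it is ignored by mean(), min(), etc.
toy_meta["year_clean"] = toy_meta["year"].mask(toy_meta["year"] == 0)

known_years = toy_meta["year_clean"].dropna()
print(len(known_years), "of", len(toy_meta), "songs have a known year")
```

The original year column is left untouched, so the placeholder values remain available if they are needed later.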

Now we combine the two DataFrames into one

# combine both data
song_df = pd.merge(song_df_1, song_df_2.drop_duplicates(['song_id']), on='song_id', how='left')
song_df.head()

Combined Dataset

Both DataFrames are joined on song_id with a left merge, so every listening record is kept and gains the song's metadata
song_df_2.drop_duplicates(['song_id']) - Keeps one metadata row per song_id, so the merge does not duplicate listening records
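To see why the duplicates are dropped before merging, here is a small sketch on made-up data (the toy_ frames and their values are assumptions for illustration only). When a song_id appears twice in the metadata, a left merge repeats every matching listening record.

```python
import pandas as pd

# Toy listening records (made-up values)
toy_triplets = pd.DataFrame({
    "user_id": ["u1", "u2"],
    "song_id": ["S1", "S1"],
    "listen_count": [3, 1],
})

# Toy metadata in which S1 appears twice
toy_meta = pd.DataFrame({
    "song_id": ["S1", "S1"],
    "title": ["Song A", "Song A"],
    "artist_name": ["Artist X", "Artist X"],
})

# Without de-duplication each listening record matches two metadata rows
merged_raw = pd.merge(toy_triplets, toy_meta, on="song_id", how="left")

# With de-duplication the row count of the listening records is preserved
merged_dedup = pd.merge(
    toy_triplets, toy_meta.drop_duplicates(["song_id"]), on="song_id", how="left"
)

print(len(toy_triplets), len(merged_raw), len(merged_dedup))  # 2 4 2
```

Comparing the length of the merged frame with the length of the listening records is a quick sanity check after any left merge.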

Let us see the length of the data

print(len(song_df_1), len(song_df_2))

2000000 1000000

The first DataFrame has two million user listening records
The second DataFrame has one million song metadata records

len(song_df)

2000000

Length of the combined data; the left merge keeps exactly one row per listening record

Data Preprocessing

We are going to combine two columns into a new feature

# creating new feature combining title and artist name
song_df['song'] = song_df['title']+' - '+song_df['artist_name']
song_df.head()

The title and artist_name columns are combined into a new feature called song
The title and artist_name columns can then be dropped for cleaner results

# take the first 10k rows for quick results
song_df = song_df.head(10000)

Shortening the dataset for quicker processing
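Note that head takes the first rows in file order. An alternative, shown here only as a hedged sketch on made-up data, is to pick a random subset of users and keep all of their rows, so that each selected user's history stays complete. The frame toy and the seed value are assumptions for illustration.

```python
import pandas as pd

# Toy listening records (made-up values)
toy = pd.DataFrame({
    "user_id": ["u1", "u1", "u2", "u3", "u3", "u3"],
    "song": ["A", "B", "A", "B", "C", "D"],
})

# Pick 2 distinct users at random; a fixed seed makes the choice reproducible
chosen_users = toy["user_id"].drop_duplicates().sample(n=2, random_state=0)

# Keep every row that belongs to the chosen users
subset = toy[toy["user_id"].isin(chosen_users)]
print(subset)
```

Whether this is preferable to head depends on the goal: head is simpler and matches the rest of this tutorial, while sampling whole users avoids cutting a listening history in half.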

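# number of listening records per song (one row per user-song pair)
song_grouped = song_df.groupby(['song']).agg({'listen_count':'count'}).reset_index()
song_grouped.head()

Songs listed with the number of listening records for each; because the aggregation is 'count', the listen_count column now holds row counts, not summed plays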

grouped_sum = song_grouped['listen_count'].sum()
song_grouped['percentage'] = (song_grouped['listen_count'] / grouped_sum ) * 100
song_grouped.sort_values(['listen_count', 'song'], ascending=[False, True])

Songs listed from most to least listened (descending listen_count, ties broken alphabetically by song)
Percentage is each song's share of all listening records in the sample
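The grouping above uses 'count', which counts rows per song rather than adding up the listen_count values. The toy example below (made-up data, illustrative variable names) shows that the two aggregations can rank songs differently, so it is worth choosing deliberately.

```python
import pandas as pd

# Toy data: song A has three listeners with one play each,
# song B has a single listener with ten plays
toy = pd.DataFrame({
    "user_id": ["u1", "u2", "u3", "u1"],
    "song": ["A", "A", "A", "B"],
    "listen_count": [1, 1, 1, 10],
})

# Number of listening records (rows) per song
by_rows = toy.groupby("song")["listen_count"].count()

# Total number of plays per song
by_plays = toy.groupby("song")["listen_count"].sum()

print(by_rows.idxmax(), by_plays.idxmax())  # A B
```

Counting rows favors songs with many listeners, while summing plays favors songs that a few users replay heavily.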

Popularity Recommendation Engine

pr = Recommenders.popularity_recommender_py()

Creates an instance of the popularity recommender class

pr.create(song_df, 'user_id', 'song')

Builds the popularity ranking from the listening records in song_df

Now we will display the top 10 popular songs

# display the top 10 popular songs
pr.recommend(song_df['user_id'][5])

Display of top 10 recommended songs for a user

pr.recommend(song_df['user_id'][100])

Display of top 10 recommended songs for another user
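The Recommenders file itself is not shown in this article, so the following is only a hypothetical sketch of what a popularity recommender might do, not the actual implementation. The function name popularity_recommend, the score definition (number of distinct users per song) and the toy data are all assumptions.

```python
import pandas as pd

def popularity_recommend(df, user_col, item_col, top_n=10):
    """Rank items by the number of distinct users who listened to them."""
    scores = (
        df.groupby(item_col)[user_col]
        .nunique()
        .reset_index(name="score")
        # Highest score first; ties broken alphabetically by item
        .sort_values(["score", item_col], ascending=[False, True])
        .head(top_n)
        .reset_index(drop=True)
    )
    scores["rank"] = scores.index + 1
    return scores

# Toy listening records (made-up values)
toy = pd.DataFrame({
    "user_id": ["u1", "u2", "u3", "u1", "u2", "u3"],
    "song": ["A", "A", "A", "B", "B", "C"],
})

top = popularity_recommend(toy, "user_id", "song", top_n=2)
print(top)
```

A ranking built purely on popularity does not look at any individual user's history, so in this sketch every user would receive the same list.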

Item Similarity Recommendation

ir = Recommenders.item_similarity_recommender_py()
ir.create(song_df, 'user_id', 'song')

Initialization and creation of the class for processing

user_items = ir.get_user_items(song_df['user_id'][5])

Loading the song history of a particular user

Now we will display the song history of the user

# display user songs history
for user_item in user_items:
    print(user_item)

The Cove - Jack Johnson
Entre Dos Aguas - Paco De Lucia
Stronger - Kanye West
Constellations - Jack Johnson
Learn To Fly - Foo Fighters
Apuesta Por El Rock 'N' Roll - Héroes del Silencio
Paper Gangsta - Lady GaGa
Stacked Actors - Foo Fighters
Sehr kosmisch - Harmonia
Heaven's gonna burn your eyes - Thievery Corporation feat. Emiliana Torrini
Let It Be Sung - Jack Johnson / Matt Costa / Zach Gill / Dan Lebowitz / Steve Adams
I'll Be Missing You (Featuring Faith Evans & 112)(Album Version) - Puff Daddy
Love Shack - The B-52's
Clarity - John Mayer
I?'m A Steady Rollin? Man - Robert Johnson
The Old Saloon - The Lonely Island
Behind The Sea [Live In Chicago] - Panic At The Disco
Champion - Kanye West
Breakout - Foo Fighters
Ragged Wood - Fleet Foxes
Mykonos - Fleet Foxes
Country Road - Jack Johnson / Paula Fuga
Oh No - Andrew Bird
Love Song For No One - John Mayer
Jewels And Gold - Angus & Julia Stone
Warning - Incubus
83 - John Mayer
Neon - John Mayer
The Middle - Jimmy Eat World
High and dry - Jorge Drexler
All That We Perceive - Thievery Corporation
The Christmas Song (LP Version) - King Curtis
Our Swords (Soundtrack Version) - Band Of Horses
Are You In? - Incubus
Drive - Incubus
Generator - Foo Fighters
Come Back To Bed - John Mayer
He Doesn't Know Why - Fleet Foxes
Trani - Kings Of Leon
Bigger Isn't Better - The String Cheese Incident
Sun Giant - Fleet Foxes
City Love - John Mayer
Right Back - Sublime
Moonshine - Jack Johnson
Holes To Heaven - Jack Johnson

# give song recommendation for that user
ir.recommend(song_df['user_id'][5])

No. of unique songs for the user: 45
no. of unique songs in the training set: 5151
Non zero values in cooccurence_matrix :6844

A co-occurrence matrix is built between the songs the user has listened to and all songs in the training set; candidate songs are then scored and ranked from it
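Since the Recommenders source is not included here, the sketch below is an assumption about how a co-occurrence based score could be computed, not the module's real code. It uses Jaccard similarity between the sets of users who listened to two songs, which is one common choice; the function name jaccard_item_scores and the toy data are hypothetical.

```python
import pandas as pd

def jaccard_item_scores(df, user_col, item_col, user_id):
    """Score unseen items by their average Jaccard similarity to the user's items."""
    # Set of users who listened to each item
    users_by_item = df.groupby(item_col)[user_col].apply(set)
    # Items the target user has already listened to
    user_items = set(df.loc[df[user_col] == user_id, item_col])

    scores = {}
    for candidate, cand_users in users_by_item.items():
        if candidate in user_items:
            continue  # do not recommend songs the user already knows
        sims = []
        for owned in user_items:
            owned_users = users_by_item[owned]
            # Jaccard: shared listeners divided by all listeners of either song
            sims.append(len(cand_users & owned_users) / len(cand_users | owned_users))
        scores[candidate] = sum(sims) / len(sims) if sims else 0.0
    return pd.Series(scores, dtype=float).sort_values(ascending=False)

# Toy listening records (made-up values)
toy = pd.DataFrame({
    "user_id": ["u1", "u1", "u2", "u2", "u2", "u3", "u3", "u3", "u4", "u5"],
    "song":    ["A",  "B",  "A",  "B",  "C",  "A",  "B",  "C",  "D",  "A"],
})

scores = jaccard_item_scores(toy, "user_id", "song", "u5")
print(scores)
```

In this toy data user u5 has only listened to A, so B ranks first because most of A's listeners also listened to B, while D shares no listeners with A and scores zero.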

# give related songs for the given song names
ir.get_similar_items(['Oliver James - Fleet Foxes', 'The End - Pearl Jam'])

no. of unique songs in the training set: 5151
Non zero values in cooccurence_matrix :75

Display of songs related to the given song names, based on the co-occurrence matrix

Final Thoughts

The same workflow can be adapted to other datasets and parameters, and to other recommendation approaches, including content-based, collaborative and popularity-based.
You can change parts of the Recommenders file to suit your needs.
Larger datasets can take a long time to process; for quicker results, reduce the sample size of the data.

In this project tutorial, we have explored the Million Songs Dataset as a recommendation engine machine learning project. We built two recommendation engines on it: one based on popularity and one based on item similarity computed from the user's song history.

Get the project notebook from here

Thanks for reading the article!!!

Check out more project videos from the YouTube channel Hackers Realm
