Hackers Realm

Apr 28, 2022 · 4 min read

Million Songs Dataset Analysis using Python | Recommendation Engine | Machine Learning Tutorial

Updated: Jun 2, 2023

This tutorial explores the Million Songs Dataset with Python and shows how recommendation engine techniques turn listening histories into song recommendations. It walks through loading and merging the data, basic preprocessing, and two recommenders: one based on popularity and one based on item similarity. #MillionSongsDataset #Python #RecommendationEngine #MachineLearning #DataAnalysis #MusicRecommendation

Song Recommendation Engine

In this project tutorial, we are going to build a song recommendation engine on the Million Songs Dataset that recommends songs by overall popularity and by similarity to the songs in a user's listening history.


Dataset Information

The Million Songs Dataset consists of two files: triplet_file and metadata_file. The triplet_file contains user_id, song_id and listen_count (the number of times the user played the song). The metadata_file contains song_id, title, release, year and artist_name. The dataset is a mixture of songs from various websites, together with the ratings that users gave after listening to them.

There are three types of recommendation systems: content-based, collaborative and popularity-based.

Download the dataset here

Import modules

import pandas as pd
 
import numpy as np
 
import Recommenders as Recommenders

  • pandas - used to perform data manipulation and analysis

  • numpy - used to perform a wide variety of mathematical operations on arrays

  • Recommenders - custom python file for recommendation system

Loading the dataset

Let us load the first file for processing

song_df_1 = pd.read_csv('triplets_file.csv')
 
song_df_1.head()

  • user_id - unique identifier of the user in the dataset

  • song_id - unique identifier of the song in the dataset

  • listen_count - number of times the user listened to that song

Let us load the second file for processing

song_df_2 = pd.read_csv('song_data.csv')
 
song_df_2.head()

Song Dataset
  • title - full name of the song

  • release - name of the album on which the song was released

  • artist_name - name of the artist or band

  • year - year in which the song was released; if the release year is not given, it appears as 0

Now we combine the two dataframes into one

# combine both dataframes
 
song_df = pd.merge(song_df_1, song_df_2.drop_duplicates(['song_id']), on='song_id', how='left')
 
song_df.head()

Combined Dataset
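  • The two dataframes are joined on song_id with a left merge, so every row of the listening data is kept and the song metadata is attached to it

  • song_df_2.drop_duplicates(['song_id']) - keeps a single metadata row per song_id, so the merge does not create extra rows for songs that appear more than once in the metadata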
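To see why the de-duplication matters, here is a minimal sketch on made-up data. The dataframes `plays` and `meta` and their values are hypothetical stand-ins for the two files, not rows from the real dataset.

```python
import pandas as pd

# Toy stand-ins for the two files (values are made up for illustration)
plays = pd.DataFrame({
    "user_id": ["u1", "u1", "u2"],
    "song_id": ["s1", "s2", "s1"],
    "listen_count": [3, 1, 5],
})
meta = pd.DataFrame({
    "song_id": ["s1", "s1", "s2"],  # s1 appears twice in the metadata
    "title": ["Song A", "Song A (dup)", "Song B"],
})

# Without de-duplication, every play of s1 matches two metadata rows
naive = pd.merge(plays, meta, on="song_id", how="left")

# Keeping one metadata row per song_id preserves the row count of `plays`
clean = pd.merge(plays, meta.drop_duplicates(["song_id"]), on="song_id", how="left")

print(len(plays), len(naive), len(clean))
```

The naive merge grows from 3 to 5 rows, while the de-duplicated merge stays at 3, which is the same check the tutorial performs next with len().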

Let us see the length of the data

print(len(song_df_1), len(song_df_2))

2000000 1000000

  • the first file has two million user listening records

  • the second file has one million song records

len(song_df)

2000000

  • length of the combined data; it equals the length of the first file, so the left merge added no rows

Data Preprocessing

We are going to combine two columns into a new feature

# creating new feature combining title and artist name
 
song_df['song'] = song_df['title']+' - '+song_df['artist_name']
 
song_df.head()

  • The title and artist_name columns are combined into a new feature called song

  • The title and artist_name columns can then be dropped for cleaner results
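One detail worth knowing about this step: string concatenation in pandas propagates missing values. The sketch below uses a hypothetical two-row dataframe `toy`; the fillna guard is one possible workaround and is an assumption, not part of the tutorial.

```python
import pandas as pd

# Toy data with one missing title (made up for illustration)
toy = pd.DataFrame({
    "title": ["Song A", None],
    "artist_name": ["Artist X", "Artist Y"],
})

# Same pattern as the tutorial: a missing title makes the whole result missing
toy["song"] = toy["title"] + " - " + toy["artist_name"]

# One possible guard (an assumption, not the tutorial's code): fill gaps first
toy["song_filled"] = (
    toy["title"].fillna("Unknown") + " - " + toy["artist_name"].fillna("Unknown")
)

print(toy[["song", "song_filled"]])
```

Rows whose song value is missing are silently dropped by groupby, so it can be worth checking song_df['song'].isna().sum() before aggregating.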

# taking top 10k samples for quick results
 
song_df = song_df.head(10000)

  • Keeps only the first 10,000 rows of the dataset for quicker processing
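Taking the first rows is quick, but it only covers whichever users happen to be stored first. As a hedged alternative, a random sample with a fixed seed gives a reproducible subset spread over the whole file. The dataframe `toy` below is made up for illustration.

```python
import pandas as pd

# Toy data: 5 users with 4 rows each, stored user by user
toy = pd.DataFrame({
    "user_id": [f"u{i // 4}" for i in range(20)],
    "song_id": [f"s{i % 5}" for i in range(20)],
})

# First 8 rows as stored: only the first two users are covered
first_rows = toy.head(8)

# Reproducible random subset of the same size
random_rows = toy.sample(n=8, random_state=0)

print(sorted(first_rows["user_id"].unique()))
print(sorted(random_rows["user_id"].unique()))
```

Note that a random sample splits up each user's history, so for the item similarity recommender later on, keeping whole users (as head does here) can be the more sensible choice.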

# number of listen records per song ('count' counts rows, it does not sum listen_count)
 
song_grouped = song_df.groupby(['song']).agg({'listen_count':'count'}).reset_index()
 
song_grouped.head()

  • Each song is listed with the number of listen records it has in the dataset; 'count' counts rows, whereas 'sum' would add up the individual listen counts

grouped_sum = song_grouped['listen_count'].sum()
 
song_grouped['percentage'] = (song_grouped['listen_count'] / grouped_sum ) * 100
 
song_grouped.sort_values(['listen_count', 'song'], ascending=[False, True])

  • Songs are sorted from most to least listened; ties are ordered alphabetically by song

  • Percentage shows each song's share of all listen records in the data
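The difference between counting rows and summing plays can change the ranking. Here is a small sketch on made-up data; the column names `listeners` and `total_plays` are hypothetical and only used for this illustration.

```python
import pandas as pd

# Toy listening data (made up): song A has many listeners, song B one heavy listener
toy = pd.DataFrame({
    "user_id": ["u1", "u2", "u3", "u1"],
    "song": ["A", "A", "A", "B"],
    "listen_count": [10, 1, 1, 20],
})

stats = (
    toy.groupby("song")
       .agg(listeners=("listen_count", "count"),   # rows per song
            total_plays=("listen_count", "sum"))   # summed listen counts
       .reset_index()
)

# Share of all rows, as in the tutorial's percentage column
stats["percentage"] = stats["listeners"] / stats["listeners"].sum() * 100
stats = stats.sort_values(["listeners", "song"], ascending=[False, True])

print(stats)
```

By row count, song A comes first (3 rows versus 1); by total plays, song B would come first (20 versus 12). Which measure is the better popularity signal depends on the use case.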

Popularity Recommendation Engine

pr = Recommenders.popularity_recommender_py()

  • Initialization of the class for further processing

pr.create(song_df, 'user_id', 'song')

  • Builds the recommendation data by scoring each song by its popularity in the data

Now we will display the top 10 popular songs

# display the top 10 popular songs
 
pr.recommend(song_df['user_id'][5])

  • Display of top 10 recommended songs for a user

pr.recommend(song_df['user_id'][100])

  • Display of top 10 recommended songs for another user
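The Recommenders file is custom and is not shown in this article, so the following is only a minimal sketch of what a popularity-based ranking could look like. The function `popularity_top_n` is hypothetical, and scoring by the number of distinct users is an assumption rather than the file's confirmed logic.

```python
import pandas as pd

def popularity_top_n(df, user_col, item_col, n=10):
    """Rank items by the number of distinct users who played them."""
    scores = (
        df.groupby(item_col)[user_col]
          .nunique()
          .reset_index(name="score")
          .sort_values(["score", item_col], ascending=[False, True])
          .head(n)
          .reset_index(drop=True)
    )
    scores["rank"] = scores.index + 1
    return scores

# Toy data (made up for illustration)
toy = pd.DataFrame({
    "user_id": ["u1", "u2", "u3", "u1", "u2", "u3"],
    "song": ["A", "A", "A", "B", "B", "C"],
})

top = popularity_top_n(toy, "user_id", "song", n=2)
print(top)
```

In this sketch the ranking is computed without looking at any individual user's history, so every user would receive the same list.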

Item Similarity Recommendation

ir = Recommenders.item_similarity_recommender_py()
 
ir.create(song_df, 'user_id', 'song')

  • Initialization and creation of the class for processing

user_items = ir.get_user_items(song_df['user_id'][5])

  • Loading the song history of a particular user

Now we will display the song history of the user

# display the user's song history
 
for user_item in user_items:
    print(user_item)

The Cove - Jack Johnson
 
Entre Dos Aguas - Paco De Lucia
 
Stronger - Kanye West
 
Constellations - Jack Johnson
 
Learn To Fly - Foo Fighters
 
Apuesta Por El Rock 'N' Roll - Héroes del Silencio
 
Paper Gangsta - Lady GaGa
 
Stacked Actors - Foo Fighters
 
Sehr kosmisch - Harmonia
 
Heaven's gonna burn your eyes - Thievery Corporation feat. Emiliana Torrini
 
Let It Be Sung - Jack Johnson / Matt Costa / Zach Gill / Dan Lebowitz / Steve Adams
 
I'll Be Missing You (Featuring Faith Evans & 112)(Album Version) - Puff Daddy
 
Love Shack - The B-52's
 
Clarity - John Mayer
 
I?'m A Steady Rollin? Man - Robert Johnson
 
The Old Saloon - The Lonely Island
 
Behind The Sea [Live In Chicago] - Panic At The Disco
 
Champion - Kanye West
 
Breakout - Foo Fighters
 
Ragged Wood - Fleet Foxes
 
Mykonos - Fleet Foxes
 
Country Road - Jack Johnson / Paula Fuga
 
Oh No - Andrew Bird
 
Love Song For No One - John Mayer
 
Jewels And Gold - Angus & Julia Stone
 
Warning - Incubus
 
83 - John Mayer
 
Neon - John Mayer
 
The Middle - Jimmy Eat World
 
High and dry - Jorge Drexler
 
All That We Perceive - Thievery Corporation
 
The Christmas Song (LP Version) - King Curtis
 
Our Swords (Soundtrack Version) - Band Of Horses
 
Are You In? - Incubus
 
Drive - Incubus
 
Generator - Foo Fighters
 
Come Back To Bed - John Mayer
 
He Doesn't Know Why - Fleet Foxes
 
Trani - Kings Of Leon
 
Bigger Isn't Better - The String Cheese Incident
 
Sun Giant - Fleet Foxes
 
City Love - John Mayer
 
Right Back - Sublime
 
Moonshine - Jack Johnson
 
Holes To Heaven - Jack Johnson

# give song recommendation for that user
 
ir.recommend(song_df['user_id'][5])

No. of unique songs for the user: 45
 
no. of unique songs in the training set: 5151
 
Non zero values in cooccurence_matrix :6844

  • Based on the songs the user has listened to, a co-occurrence matrix is constructed, and the candidate songs are scored and ranked from it
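As a rough illustration of the idea, the sketch below scores each unheard song by its average Jaccard similarity to the songs in a user's history, where two songs are similar if many of the same users played both. The function `recommend_by_cooccurrence` is hypothetical and the Jaccard measure is an assumption; the custom Recommenders file may compute its scores differently.

```python
import pandas as pd

def recommend_by_cooccurrence(df, user, user_col="user_id", item_col="song", n=5):
    # Collect the set of users who played each song
    users_by_song = {}
    for u, s in zip(df[user_col], df[item_col]):
        users_by_song.setdefault(s, set()).add(u)

    user_songs = set(df.loc[df[user_col] == user, item_col])

    scores = {}
    for candidate, cand_users in users_by_song.items():
        if candidate in user_songs:
            continue  # skip songs already in the user's history
        # Jaccard similarity between the candidate and each song in the history
        sims = [
            len(cand_users & users_by_song[s]) / len(cand_users | users_by_song[s])
            for s in user_songs
        ]
        scores[candidate] = sum(sims) / len(sims) if sims else 0.0

    # Highest score first; ties broken alphabetically
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:n]

# Toy data (made up for illustration)
toy = pd.DataFrame({
    "user_id": ["u1", "u1", "u2", "u2", "u2", "u3", "u3", "u4"],
    "song":    ["A",  "B",  "A",  "B",  "C",  "B",  "C",  "D"],
})

ranked = recommend_by_cooccurrence(toy, "u1")
print(ranked)
```

For user u1, song C ranks above song D because C shares listeners with both songs in the history, while D shares none.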

# give songs related to the given song names
 
ir.get_similar_items(['Oliver James - Fleet Foxes', 'The End - Pearl Jam'])

no. of unique songs in the training set: 5151
 
Non zero values in cooccurence_matrix :75

  • Displays the songs most similar to the given songs, using a co-occurrence matrix built for those songs
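The co-occurrence matrix itself can be sketched with a user by song table. This is a simplified illustration with raw co-occurrence counts on made-up data; the names `user_song`, `cooc` and `query` are hypothetical, and the tutorial's Recommenders file may normalize the counts differently.

```python
import numpy as np
import pandas as pd

# Toy data (made up for illustration)
toy = pd.DataFrame({
    "user_id": ["u1", "u1", "u2", "u2", "u2", "u3", "u3", "u4"],
    "song":    ["A",  "B",  "A",  "B",  "C",  "B",  "C",  "D"],
})

# Binary user x song table: 1 if the user has played the song
user_song = pd.crosstab(toy["user_id"], toy["song"]).clip(upper=1)

# Entry (i, j) counts the users who played both song i and song j
m = user_song.to_numpy()
counts = m.T @ m
np.fill_diagonal(counts, 0)  # ignore each song's co-occurrence with itself
cooc = pd.DataFrame(counts, index=user_song.columns, columns=user_song.columns)

# Songs most often played by the same users as the query songs
query = ["A"]
similar = cooc.loc[query].sum(axis=0).drop(query).sort_values(ascending=False)

print(np.count_nonzero(counts))
print(similar)
```

The count of non-zero entries corresponds to the kind of figure printed in the output above: it grows with the number of song pairs that share at least one listener.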

Final Thoughts

  • This approach can be adapted to other datasets and parameters, and to other recommender types: content-based, collaborative and popularity-based.

  • You can change the classes in the Recommenders file to suit your needs.

  • Larger datasets can take a long time to process; for quicker results, reduce the sample size of the data.

In this project tutorial, we have explored the Million Songs Dataset as a recommendation engine machine learning project. We built two recommendation engines on it: one based on song popularity and one based on item similarity computed from the user's song history.

Get the project notebook from here

Thanks for reading the article!!!

Check out more project videos from the YouTube channel Hackers Realm
