Hackers Realm

Million Songs Dataset Analysis using Python | Recommendation Engine | Machine Learning Tutorial

Updated: Jun 2, 2023

Immerse yourself in the world of music analysis with Python! This tutorial explores the Million Songs Dataset, delving into recommendation engine techniques and machine learning algorithms to uncover music insights and create personalized recommendations. Enhance your skills in data analysis and machine learning, and unlock the power of music recommendation. Join this comprehensive project tutorial to dive into the vast world of the Million Songs Dataset and revolutionize the way we discover and enjoy music. #MillionSongsDataset #Python #RecommendationEngine #MachineLearning #DataAnalysis #MusicRecommendation

Song Recommendation Engine

In this project tutorial, we are going to build a song recommendation engine from the Million Songs Dataset. It recommends songs based on the user's song history and also lists recommended songs by popularity.

You can watch the step by step explanation video tutorial down below

Dataset Information

The Million Songs Dataset consists of two files: triplet_file and metadata_file. The triplet_file contains user_id, song_id and listen count. The metadata_file contains song_id, title, release, year and artist_name. The Million Songs Dataset is a mixture of songs from various websites, with the listen count acting as the rating that users gave to each song.

There are 3 types of recommendation systems: content-based, collaborative and popularity-based.

Download the dataset here

Import modules

import pandas as pd
import numpy as np
import Recommenders
  • pandas - used to perform data manipulation and analysis

  • numpy - used to perform a wide variety of mathematical operations on arrays

  • Recommenders - custom python file for recommendation system

Loading the dataset

Let us load the first file for processing

song_df_1 = pd.read_csv('triplets_file.csv')
  • user_id - unique value of the user in the dataset

  • song_id - unique identifier of the song in the dataset

  • listen_count - number of times the user listened to that song

Let us load the second file for processing

song_df_2 = pd.read_csv('song_data.csv')
Song Dataset
  • title - full name of the song

  • release - name of the album on which the song was released

  • artist_name - name of the artist or band

  • year - year in which the song was released

  • If the release year is not given, it will appear as 0
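Because a missing release year is stored as 0, any statistic over the year column would be skewed unless those zeros are treated as missing. The following is a minimal sketch of one way to handle this, assuming made-up rows in place of the real song data; the variable names are illustrative only.

```python
import pandas as pd

# Toy metadata rows (made-up values, not taken from the real file)
songs = pd.DataFrame({
    'song_id': ['S1', 'S2', 'S3'],
    'title': ['Song A', 'Song B', 'Song C'],
    'year': [2003, 0, 1999],
})

# Keep the year where it is non-zero, otherwise mark it as missing (NaN)
songs['year'] = songs['year'].where(songs['year'] != 0)

# Statistics can now skip the unknown years
known_years = songs['year'].dropna()
print(songs['year'].isna().sum(), 'song(s) with unknown year')
print('earliest known year:', known_years.min())
```

Note that the column becomes floating point once it holds NaN values.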

Now we combine the two files into one dataset

# combine both data
song_df = pd.merge(song_df_1, song_df_2.drop_duplicates(['song_id']), on='song_id', how='left')
Combined Dataset
  • Both files were merged into one dataset using song_id

  • song_df_2.drop_duplicates(['song_id']) - Removes duplicate song_id entries from the song data, so the merge does not duplicate listening records
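To see why the duplicates are dropped before merging, here is a small sketch with made-up rows. It compares a left merge against metadata that contains a repeated song_id with the same merge after de-duplication; the frame names triplets and metadata are illustrative, not part of the tutorial.

```python
import pandas as pd

# Toy listening records: three (user, song) rows
triplets = pd.DataFrame({
    'user_id': ['u1', 'u1', 'u2'],
    'song_id': ['S1', 'S2', 'S1'],
    'listen_count': [1, 2, 5],
})

# Toy metadata in which song S1 appears twice
metadata = pd.DataFrame({
    'song_id': ['S1', 'S1', 'S2'],
    'title': ['Song A', 'Song A', 'Song B'],
    'artist_name': ['Artist X', 'Artist X', 'Artist Y'],
})

# Without de-duplication every S1 listening record is matched twice
naive = pd.merge(triplets, metadata, on='song_id', how='left')

# With de-duplication the row count of the left frame is preserved
deduped = pd.merge(triplets, metadata.drop_duplicates(['song_id']),
                   on='song_id', how='left')

print(len(triplets), len(naive), len(deduped))
```

A left merge against a right-hand frame with unique keys always returns exactly as many rows as the left-hand frame.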

Let us see the length of the data

print(len(song_df_1), len(song_df_2))

2000000 1000000

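  • The first file has two million user listening records

  • The second file has one million song records

  • The combined data has the same length as the first file, since the left merge keeps every listening record and the de-duplicated song data adds no extra rows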

Data Preprocessing

We are going to combine two columns into a new feature

# creating new feature combining title and artist name
song_df['song'] = song_df['title']+' - '+song_df['artist_name']
New feature creation for song dataset
  • Two features are combined into a new feature called song

  • The title and artist_name columns can then be dropped for cleaner results
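Dropping the two source columns after building the combined feature could look like the sketch below. It uses a made-up frame with the same column names as the tutorial; the names df and slim are illustrative.

```python
import pandas as pd

# Toy frame with the same column names as the combined dataset
df = pd.DataFrame({
    'user_id': ['u1', 'u2'],
    'title': ['Song A', 'Song B'],
    'artist_name': ['Artist X', 'Artist Y'],
    'listen_count': [1, 3],
})

# Build the combined feature exactly as in the tutorial
df['song'] = df['title'] + ' - ' + df['artist_name']

# Drop the now redundant source columns; drop() returns a new frame
slim = df.drop(columns=['title', 'artist_name'])
print(slim.columns.tolist())
```

If either title or artist_name is missing for a row, the string concatenation yields a missing value for song as well, so it may be worth checking for missing values first.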

# taking top 10k samples for quick results
song_df = song_df.head(10000)
  • Shortening the data set for quicker processing

# number of listening records (user-song rows) per song
song_grouped = song_df.groupby(['song']).agg({'listen_count':'count'}).reset_index()
number of listening records per song
  • Songs listed with the number of listening records for each song; note that 'count' counts rows and does not add up the listen_count values
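The aggregation above uses 'count', which counts the rows for each song rather than adding up the listen_count values. The sketch below, on made-up data, contrasts the two aggregations; the names plays, by_rows and by_plays are illustrative.

```python
import pandas as pd

# Toy data: Song A has three casual listeners, Song B one heavy listener
plays = pd.DataFrame({
    'user_id': ['u1', 'u2', 'u3', 'u4'],
    'song': ['Song A - Artist X'] * 3 + ['Song B - Artist Y'],
    'listen_count': [1, 1, 1, 10],
})

# 'count' = number of rows per song (how many users listened)
by_rows = plays.groupby(['song']).agg({'listen_count': 'count'}).reset_index()

# 'sum' = total number of plays per song
by_plays = plays.groupby(['song']).agg({'listen_count': 'sum'}).reset_index()

print(by_rows)
print(by_plays)
```

Which one is the better popularity signal depends on the goal: counting rows rewards songs with many listeners, while summing rewards songs that a few users replay heavily.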

grouped_sum = song_grouped['listen_count'].sum()
song_grouped['percentage'] = (song_grouped['listen_count'] / grouped_sum) * 100
song_grouped.sort_values(['listen_count', 'song'], ascending=[False, True])
percentage of listening records per song
  • List of songs sorted from most to least listened (descending listen_count), with ties ordered alphabetically by song

  • Percentage shows each song's share of all listening records in the data
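The percentage and sorting steps can be checked on a tiny made-up table. This sketch also shows that sort_values returns a new DataFrame and leaves the original order untouched, so the result has to be assigned if it is needed later; the names song_counts and ranked are illustrative.

```python
import pandas as pd

# Toy per-song counts, deliberately not in sorted order
song_counts = pd.DataFrame({
    'song': ['Song C', 'Song A', 'Song B'],
    'listen_count': [1, 2, 2],
})

# Share of each song in the total, expressed as a percentage
total = song_counts['listen_count'].sum()
song_counts['percentage'] = song_counts['listen_count'] / total * 100

# Highest count first; ties are broken alphabetically by song
ranked = song_counts.sort_values(['listen_count', 'song'],
                                 ascending=[False, True])

print(ranked)
```

The percentages always add up to 100, which is a quick sanity check on the computation.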

Popularity Recommendation Engine

pr = Recommenders.popularity_recommender_py() 
  • Initialization of the class for further processing

pr.create(song_df, 'user_id', 'song')
  • Builds the popularity ranking from the listening data, using user_id as the user column and song as the item column

Now we will display the top 10 popular songs

# display the top 10 popular songs
ranking of songs based on score
  • Display of top 10 recommended songs for a user

recommendation of songs for userid
  • Display of top 10 recommended songs for another user
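The Recommenders module is a custom file whose source is not shown in this article, so the following is only a sketch of how a popularity recommender could work, not the module's actual code. The function name popularity_top_n, the choice of counting distinct users as the score, and the toy data are all assumptions made for illustration.

```python
import pandas as pd

def popularity_top_n(df, user_col, item_col, n=10):
    """Hypothetical helper: rank items by number of distinct listeners."""
    scores = (
        df.groupby(item_col)[user_col]
        .nunique()                      # distinct users per item
        .reset_index(name='score')
        .sort_values(['score', item_col], ascending=[False, True])
    )
    scores['rank'] = range(1, len(scores) + 1)
    return scores.head(n).reset_index(drop=True)

# Toy listening records (made-up)
plays = pd.DataFrame({
    'user_id': ['u1', 'u2', 'u3', 'u1', 'u2', 'u3'],
    'song': ['Song A', 'Song A', 'Song A', 'Song B', 'Song B', 'Song C'],
})

top = popularity_top_n(plays, 'user_id', 'song', n=2)
print(top)
```

In this sketch the ranking does not depend on who asks for it, so every user receives the same list. That is the defining trait of a pure popularity approach.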

Item Similarity Recommendation

ir = Recommenders.item_similarity_recommender_py()
ir.create(song_df, 'user_id', 'song')
  • Initialization and creation of the class for processing

user_items = ir.get_user_items(song_df['user_id'][5])
  • Loading the song history of a particular user

Now we will display the song history from the user

# display user songs history
for user_item in user_items:
    print(user_item)

The Cove - Jack Johnson
Entre Dos Aguas - Paco De Lucia
Stronger - Kanye West
Constellations - Jack Johnson
Learn To Fly - Foo Fighters
Apuesta Por El Rock 'N' Roll - Héroes del Silencio
Paper Gangsta - Lady GaGa
Stacked Actors - Foo Fighters
Sehr kosmisch - Harmonia
Heaven's gonna burn your eyes - Thievery Corporation feat. Emiliana Torrini
Let It Be Sung - Jack Johnson / Matt Costa / Zach Gill / Dan Lebowitz / Steve Adams
I'll Be Missing You (Featuring Faith Evans & 112)(Album Version) - Puff Daddy
Love Shack - The B-52's
Clarity - John Mayer
I?'m A Steady Rollin? Man - Robert Johnson
The Old Saloon - The Lonely Island
Behind The Sea [Live In Chicago] - Panic At The Disco
Champion - Kanye West
Breakout - Foo Fighters
Ragged Wood - Fleet Foxes
Mykonos - Fleet Foxes
Country Road - Jack Johnson / Paula Fuga
Oh No - Andrew Bird
Love Song For No One - John Mayer
Jewels And Gold - Angus & Julia Stone
Warning - Incubus
83 - John Mayer
Neon - John Mayer
The Middle - Jimmy Eat World
High and dry - Jorge Drexler
All That We Perceive - Thievery Corporation
The Christmas Song (LP Version) - King Curtis
Our Swords (Soundtrack Version) - Band Of Horses
Are You In? - Incubus
Drive - Incubus
Generator - Foo Fighters
Come Back To Bed - John Mayer
He Doesn't Know Why - Fleet Foxes
Trani - Kings Of Leon
Bigger Isn't Better - The String Cheese Incident
Sun Giant - Fleet Foxes
City Love - John Mayer
Right Back - Sublime
Moonshine - Jack Johnson
Holes To Heaven - Jack Johnson

# give song recommendation for that user

No. of unique songs for the user: 45
no. of unique songs in the training set: 5151
Non zero values in cooccurence_matrix :6844

song recommendations for the user
  • Based on the songs the user has listened to, a co-occurrence matrix is constructed, and the recommended songs are listed with their score and rank
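The idea behind a co-occurrence based recommendation can be sketched as follows. This is not the code of the Recommenders module, which is not shown in this article. The function name recommend_by_cooccurrence, the use of Jaccard similarity between the listener sets of two songs, and the averaging of those similarities into a score are all assumptions chosen for illustration.

```python
import pandas as pd

def recommend_by_cooccurrence(df, user_id, user_col='user_id',
                              item_col='song', n=10):
    """Hypothetical sketch: score unseen songs by listener overlap."""
    # Map each song to the set of users who listened to it
    users_by_song = df.groupby(item_col)[user_col].apply(set)
    # Songs the target user has already listened to
    history = set(df.loc[df[user_col] == user_id, item_col])
    if not history:
        return pd.DataFrame(columns=[item_col, 'score', 'rank'])

    rows = []
    for candidate, cand_users in users_by_song.items():
        if candidate in history:
            continue  # do not recommend songs the user already knows
        # Jaccard similarity of the candidate with each song in the history
        sims = [
            len(cand_users & users_by_song[s]) / len(cand_users | users_by_song[s])
            for s in history
        ]
        rows.append({item_col: candidate, 'score': sum(sims) / len(sims)})

    result = pd.DataFrame(rows, columns=[item_col, 'score'])
    result = (result.sort_values(['score', item_col], ascending=[False, True])
                    .head(n).reset_index(drop=True))
    result['rank'] = range(1, len(result) + 1)
    return result

# Toy listening records (made-up)
plays = pd.DataFrame({
    'user_id': ['u1', 'u1', 'u2', 'u2', 'u2', 'u3', 'u3', 'u4'],
    'song': ['Song A', 'Song B', 'Song A', 'Song B', 'Song C',
             'Song C', 'Song D', 'Song A'],
})

recs = recommend_by_cooccurrence(plays, 'u4', n=2)
print(recs)
```

Songs that share many listeners with the user's history score highest. The nested loop over candidates and history songs is also why this kind of recommender slows down quickly as the number of songs grows.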

# give related songs based on the given songs
ir.get_similar_items(['Oliver James - Fleet Foxes', 'The End - Pearl Jam'])

no. of unique songs in the training set: 5151
Non zero values in cooccurence_matrix :75

song recommendation based on song title
  • Display of the songs most similar to the given songs, computed from the co-occurrence matrix

Final Thoughts

  • This model can be reused in different ways depending on the dataset and parameters, including content-based, collaborative and popularity-based recommendation.

  • You can change certain structures in the Recommenders file to your preference.

  • Larger datasets can take a lot of time to process; for quicker results, you can reduce the sample size of the data.
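One way to reduce the sample size is to draw random rows instead of taking the first rows. The sketch below uses a made-up frame that is grouped by user, which is an assumption about how the real file might be ordered, and compares head with sample; the names full, first_rows and random_rows are illustrative.

```python
import pandas as pd

# Toy data: 50 users with 20 rows each, stored grouped by user
full = pd.DataFrame({
    'user_id': [f'u{i // 20}' for i in range(1000)],
    'song': [f'Song {i % 37}' for i in range(1000)],
    'listen_count': 1,
})

# First 100 rows: only the first few users are represented
first_rows = full.head(100)

# Random 100 rows: a fixed random_state makes the draw reproducible
random_rows = full.sample(n=100, random_state=42)

print(first_rows['user_id'].nunique(), random_rows['user_id'].nunique())
```

There is a trade-off here. Random rows cover more users, but each user's song history becomes thinner, which can weaken an item similarity recommender. Taking the first rows keeps complete histories for a smaller set of users.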

In this project tutorial, we have explored the Million Songs Dataset as a recommendation engine machine learning project. We built two recommendation engines on the dataset: one based on popularity and one based on item similarity derived from the users' song history.

Get the project notebook from here

Thanks for reading the article!!!

Check out more project videos from the YouTube channel Hackers Realm

