• Hackers Realm

Million Songs Dataset Analysis using Python | Recommendation Engine | Machine Learning Tutorial

The Million Songs Dataset project is a recommendation engine, which is a kind of information filtering system. Recommendation engines seek to predict or filter preferences according to the user’s choices, and they are useful in many domains such as products, movies, videos, news and songs.



In this project tutorial, we are going to build a song recommendation engine from the Million Songs Dataset that recommends songs based on the user's song history and also lists recommended songs by popularity.





Dataset Information


The Million Songs Dataset consists of two files: triplet_file and metadata_file. The triplet_file contains user_id, song_id and listen count. The metadata_file contains song_id, title, release, year and artist_name. The dataset is a mixture of songs from various websites, together with the ratings that users gave after listening to the songs.


There are three types of recommendation system: content-based, collaborative and popularity-based.


Download the dataset here


Import modules


import pandas as pd
import numpy as np
import Recommenders
  • pandas - used to perform data manipulation and analysis

  • numpy - used to perform a wide variety of mathematical operations on arrays

  • Recommenders - custom python file for recommendation system



Loading the dataset


Let us load the first file for processing

song_df_1 = pd.read_csv('triplets_file.csv')
song_df_1.head()
  • user_id - unique identifier of the user in the dataset

  • song_id - unique identifier of the song in the dataset

  • listen_count - number of times the user listened to that song


Let us load the second file for processing

song_df_2 = pd.read_csv('song_data.csv')
song_df_2.head()
  • title - full name of the song

  • release - name of the album on which the song was released

  • artist_name - name of the artist or band

  • year - year in which the song was released

  • If the release year is not given, it appears as 0



Now we combine the two datasets into one

# combine both data
song_df = pd.merge(song_df_1, song_df_2.drop_duplicates(['song_id']), on='song_id', how='left')
song_df.head()
  • The two datasets are merged into one, using song_id as the key

  • song_df_2.drop_duplicates(['song_id']) - removes duplicate song_id rows from the song metadata, so that the merge does not duplicate listening records
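The effect of dropping duplicates before the merge can be seen on a toy example. This is a minimal sketch with made-up data (the small frames `triplets` and `metadata` are hypothetical, not taken from the dataset): when the metadata has two rows for the same song_id, a left merge repeats every matching listening record, while de-duplicating first keeps one row per record.

```python
import pandas as pd

# Toy stand-ins for the two files (values are made up for illustration)
triplets = pd.DataFrame({
    "user_id": ["u1", "u2", "u3"],
    "song_id": ["s1", "s1", "s2"],
    "listen_count": [3, 1, 5],
})
metadata = pd.DataFrame({
    "song_id": ["s1", "s1", "s2"],  # s1 appears twice
    "title": ["Song A", "Song A (alt)", "Song B"],
    "artist_name": ["Artist X", "Artist X", "Artist Y"],
})

# Merging without de-duplication repeats each s1 listening record
naive = pd.merge(triplets, metadata, on="song_id", how="left")

# De-duplicating first keeps exactly one metadata row per song_id;
# validate="many_to_one" raises if the right side still has duplicate keys
clean = pd.merge(
    triplets,
    metadata.drop_duplicates(["song_id"]),
    on="song_id",
    how="left",
    validate="many_to_one",
)

print(len(triplets), len(naive), len(clean))
```

With unique keys on the right side, a left merge keeps the row count of the left frame, which is a quick sanity check to run after merging.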


Let us see the length of the data

print(len(song_df_1), len(song_df_2))

2000000 1000000

  • The first dataset has two million user listening records

  • The second dataset has one million song records

len(song_df)

2000000

  • Length of the combined data, which matches the first dataset because a left merge was used



Data Preprocessing


We are going to combine two columns into a new feature

# creating new feature combining title and artist name
song_df['song'] = song_df['title']+' - '+song_df['artist_name']
song_df.head()
  • The title and artist_name columns are combined into a new feature called song

  • The title and artist_name columns can then be dropped for cleaner results


# taking top 10k samples for quick results
song_df = song_df.head(10000)
  • Shortening the data set for quicker processing



# number of listening records for each song
song_grouped = song_df.groupby(['song']).agg({'listen_count':'count'}).reset_index()
song_grouped.head()
  • Songs listed with the number of listening records (user-song rows) for each; note that 'count' counts rows and does not sum the listen_count values


grouped_sum = song_grouped['listen_count'].sum()
song_grouped['percentage'] = (song_grouped['listen_count'] / grouped_sum ) * 100
song_grouped.sort_values(['listen_count', 'song'], ascending=[False, True])
  • List of songs sorted by listen count in descending order, with ties ordered alphabetically by song

  • Percentage shows each song's share of all listening records in the data
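The difference between counting records and summing plays is easy to miss, so here is a small sketch on made-up data. The frame `df` and the column names `records` and `total_plays` are assumptions chosen for illustration; only the `'count'` aggregation and the percentage formula come from the tutorial.

```python
import pandas as pd

# Made-up listening records for illustration
df = pd.DataFrame({
    "user_id": ["u1", "u2", "u3", "u1"],
    "song": ["A", "A", "B", "B"],
    "listen_count": [10, 1, 2, 2],
})

# 'count' = number of records per song; 'sum' = total plays per song
stats = (
    df.groupby("song")["listen_count"]
      .agg(records="count", total_plays="sum")
      .reset_index()
)

# Share of all records that belong to each song, in percent
stats["percentage"] = stats["records"] / stats["records"].sum() * 100

# Most-recorded songs first; ties broken alphabetically by song
stats = stats.sort_values(["records", "song"], ascending=[False, True])
print(stats)
```

Here both songs have two records, but song A has far more total plays, so the choice of aggregation changes what "popular" means.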



Popularity Recommendation Engine


pr = Recommenders.popularity_recommender_py() 
  • Initialization of the class for further processing

pr.create(song_df, 'user_id', 'song')
  • Creation of the popularity-based recommendation data from the listening records of the songs


Now we will display the top 10 popular songs

# display the top 10 popular songs
pr.recommend(song_df['user_id'][5])
  • Display of top 10 recommended songs for a user



pr.recommend(song_df['user_id'][100])
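  • Display of top 10 recommended songs for another user; since popularity-based recommendations do not depend on a user's listening history, the list should be the same for every user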
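The Recommenders file itself is not shown in this article. As a rough illustration of the idea, here is a minimal sketch of a popularity-based recommender, assuming each song is scored by the number of distinct users who listened to it. The class `SimplePopularityRecommender` and its internals are hypothetical and are not the tutorial's `popularity_recommender_py` class.

```python
import pandas as pd

class SimplePopularityRecommender:
    """Hypothetical stand-in; not the tutorial's Recommenders class."""

    def create(self, train_data, user_col, item_col):
        # Score each item by the number of distinct users who have it
        scores = (
            train_data.groupby(item_col)[user_col]
            .nunique()
            .reset_index(name="score")
        )
        # Highest score first; ties broken alphabetically by item name
        scores = scores.sort_values(["score", item_col], ascending=[False, True])
        scores["rank"] = range(1, len(scores) + 1)
        self.top_items = scores.head(10).reset_index(drop=True)

    def recommend(self, user_id):
        # The ranking ignores the user's history: every user gets the same list
        result = self.top_items.copy()
        result.insert(0, "user_id", user_id)
        return result


# Made-up data for illustration
toy = pd.DataFrame({
    "user_id": ["u1", "u2", "u3", "u1", "u2", "u3"],
    "song": ["A", "A", "A", "B", "B", "C"],
})
rec = SimplePopularityRecommender()
rec.create(toy, "user_id", "song")
print(rec.recommend("u1"))
```

In this sketch the user_id only labels the output, which is why two different users receive identical recommendations.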


Item Similarity Recommendation


ir = Recommenders.item_similarity_recommender_py()
ir.create(song_df, 'user_id', 'song')
  • Initialization and creation of the class for processing


user_items = ir.get_user_items(song_df['user_id'][5])
  • Loading the song history of a particular user



Now we will display the song history of the user

# display user songs history
for user_item in user_items:
    print(user_item)

The Cove - Jack Johnson
Entre Dos Aguas - Paco De Lucia
Stronger - Kanye West
Constellations - Jack Johnson
Learn To Fly - Foo Fighters
Apuesta Por El Rock 'N' Roll - Héroes del Silencio
Paper Gangsta - Lady GaGa
Stacked Actors - Foo Fighters
Sehr kosmisch - Harmonia
Heaven's gonna burn your eyes - Thievery Corporation feat. Emiliana Torrini
Let It Be Sung - Jack Johnson / Matt Costa / Zach Gill / Dan Lebowitz / Steve Adams
I'll Be Missing You (Featuring Faith Evans & 112)(Album Version) - Puff Daddy
Love Shack - The B-52's
Clarity - John Mayer
I?'m A Steady Rollin? Man - Robert Johnson
The Old Saloon - The Lonely Island
Behind The Sea [Live In Chicago] - Panic At The Disco
Champion - Kanye West
Breakout - Foo Fighters
Ragged Wood - Fleet Foxes
Mykonos - Fleet Foxes
Country Road - Jack Johnson / Paula Fuga
Oh No - Andrew Bird
Love Song For No One - John Mayer
Jewels And Gold - Angus & Julia Stone
Warning - Incubus
83 - John Mayer
Neon - John Mayer
The Middle - Jimmy Eat World
High and dry - Jorge Drexler
All That We Perceive - Thievery Corporation
The Christmas Song (LP Version) - King Curtis
Our Swords (Soundtrack Version) - Band Of Horses
Are You In? - Incubus
Drive - Incubus
Generator - Foo Fighters
Come Back To Bed - John Mayer
He Doesn't Know Why - Fleet Foxes
Trani - Kings Of Leon
Bigger Isn't Better - The String Cheese Incident
Sun Giant - Fleet Foxes
City Love - John Mayer
Right Back - Sublime
Moonshine - Jack Johnson
Holes To Heaven - Jack Johnson



# give song recommendation for that user
ir.recommend(song_df['user_id'][5])

No. of unique songs for the user: 45
no. of unique songs in the training set: 5151
Non zero values in cooccurence_matrix :6844

  • A co-occurrence matrix is constructed between the songs the user has listened to and all the songs in the training set; the recommended songs are then listed with their score and rank
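The internals of the item similarity recommender are not shown in this article. One common way to build such a recommender is sketched below, assuming Jaccard similarity between the sets of users who listened to each pair of songs, averaged over the user's own songs. The function `jaccard_recommend` and the toy data are hypothetical and may differ from what the Recommenders file actually does.

```python
import pandas as pd

def jaccard_recommend(train_data, user_id, user_col="user_id", item_col="song", top_n=10):
    """Hypothetical helper: rank unseen songs by their mean Jaccard
    similarity to the songs the user already has."""
    # Map each song to the set of users who listened to it
    listeners = {}
    for user, song in zip(train_data[user_col], train_data[item_col]):
        listeners.setdefault(song, set()).add(user)

    # Songs that appear in this user's history
    user_songs = {song for song, users in listeners.items() if user_id in users}

    scores = {}
    for candidate, cand_users in listeners.items():
        if candidate in user_songs:
            continue  # do not recommend songs the user already has
        sims = []
        for song in user_songs:
            song_users = listeners[song]
            union = len(cand_users | song_users)
            # Jaccard similarity: shared listeners / all listeners of either song
            sims.append(len(cand_users & song_users) / union if union else 0.0)
        scores[candidate] = sum(sims) / len(sims) if sims else 0.0

    # Highest score first; ties broken alphabetically by song name
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:top_n]


# Made-up data for illustration
toy = pd.DataFrame({
    "user_id": ["u1", "u1", "u2", "u2", "u2", "u3", "u3", "u4"],
    "song": ["A", "B", "A", "B", "C", "A", "C", "D"],
})
recs = jaccard_recommend(toy, "u1")
print(recs)
```

In the toy data, song C shares listeners with both of u1's songs and ranks first, while song D shares none and scores zero.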



# give related songs based on the given song names
ir.get_similar_items(['Oliver James - Fleet Foxes', 'The End - Pearl Jam'])

no. of unique songs in the training set: 5151
Non zero values in cooccurence_matrix :75


  • Display of the songs most similar to the given songs, based on the co-occurrence matrix



Final Thoughts

  • The same approach can be reused with other datasets and parameters, and with other types of recommendation system, including content-based, collaborative and popularity-based.

  • You can change certain structures in the Recommenders file to your preference.

  • A larger dataset can take a lot of time to process. For quicker results, you can reduce the sample size of the data.
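Reducing the sample size can be done in more than one way. The sketch below compares taking the first rows, as in this tutorial, with two random alternatives. The frame is made up to stand in for the merged song_df, and the sizes and the seed 42 are arbitrary assumptions.

```python
import pandas as pd

# Made-up frame standing in for the merged song_df
song_df = pd.DataFrame({
    "user_id": [f"u{i % 50}" for i in range(1000)],
    "song": [f"song {i % 120}" for i in range(1000)],
    "listen_count": [1 + i % 7 for i in range(1000)],
})

first_rows = song_df.head(100)                        # first 100 rows, as in the tutorial
random_rows = song_df.sample(n=100, random_state=42)  # reproducible random subset

# Alternative: keep all records for a random subset of users,
# so each sampled user's listening history stays complete
users = song_df["user_id"].drop_duplicates().sample(n=10, random_state=42)
by_user = song_df[song_df["user_id"].isin(users)]

print(len(first_rows), len(random_rows), by_user["user_id"].nunique())
```

Sampling by user may suit history-based recommenders better, because a row-level sample can cut a user's history into fragments.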


In this project tutorial, we have explored the Million Songs Dataset as a recommendation engine machine learning project. We built two recommendation engines on the dataset: one based on song popularity and one based on item similarity using the user's song history.


Get the project notebook from here


Thanks for reading the article!!!


Check out more project videos from the YouTube channel Hackers Realm
