
Speech Emotion Recognition using Python | Sound Classification | Deep Learning Project Tutorial

Updated: Jul 29, 2022

Speech Emotion Recognition is a deep learning sound classification project. The objective is to analyze speech audio and classify the emotion it conveys. The same approach can be applied to other sound-based recognition tasks such as speech, music, and songs.


In this project tutorial we are going to analyze various audio files, classify each one into its corresponding emotion class, and visualize the frequency content of the sounds through plots.



You can watch the step-by-step explanation in the video tutorial down below.


Dataset Information


A set of 200 target words was spoken in the carrier phrase "Say the word _" by two actresses (aged 26 and 64 years), and recordings were made of each actress portraying seven emotions (anger, disgust, fear, happiness, pleasant surprise, sadness, and neutral). There are 2800 data points (audio files) in total.


The dataset is organized so that each actress and emotion combination is contained within its own folder, and within that folder are the audio files for all 200 target words. The audio files are in WAV format.


Output Attributes

  • anger

  • disgust

  • fear

  • happiness

  • pleasant surprise

  • sadness

  • neutral

Download the dataset here



Import Modules


import pandas as pd
import numpy as np
import os
import seaborn as sns
import matplotlib.pyplot as plt
import librosa
import librosa.display
from IPython.display import Audio
import warnings
warnings.filterwarnings('ignore')
  • pandas - used to perform data manipulation and analysis

  • numpy - used to perform a wide variety of mathematical operations on arrays

  • matplotlib - used for data visualization and graphical plotting

  • os - used to traverse directories and handle file paths

  • seaborn - built on top of matplotlib with similar functionalities

  • librosa - used to analyze sound files

  • librosa.display - used to display sound data as images

  • Audio - used to display and hear the audio

  • warnings - used to control warning messages



Load the Dataset


paths = []
labels = []
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        paths.append(os.path.join(dirname, filename))
        label = filename.split('_')[-1]
        label = label.split('.')[0]
        labels.append(label.lower())
    if len(paths) == 2800:
        break
print('Dataset is Loaded')

Dataset is Loaded

  • The paths of the speech files have been loaded for further processing

  • Each filename was split so that its emotion suffix becomes the label

  • To keep the labels consistent, all of them were converted to lower case (a worked example follows below)
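As a quick worked example, here is how one of the TESS filenames (taken from the paths printed further below) is turned into its label:

# Worked example: deriving the emotion label from a TESS filename
filename = 'YAF_home_fear.wav'   # pattern: <actor>_<word>_<emotion>.wav
label = filename.split('_')[-1]  # 'fear.wav'
label = label.split('.')[0]      # 'fear'
print(label.lower())             # fear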


len(paths)

2800

  • No. of samples in the dataset



paths[:5]

['/kaggle/input/toronto-emotional-speech-set-tess/TESS Toronto emotional speech set data/YAF_fear/YAF_home_fear.wav', '/kaggle/input/toronto-emotional-speech-set-tess/TESS Toronto emotional speech set data/YAF_fear/YAF_youth_fear.wav', '/kaggle/input/toronto-emotional-speech-set-tess/TESS Toronto emotional speech set data/YAF_fear/YAF_near_fear.wav', '/kaggle/input/toronto-emotional-speech-set-tess/TESS Toronto emotional speech set data/YAF_fear/YAF_search_fear.wav', '/kaggle/input/toronto-emotional-speech-set-tess/TESS Toronto emotional speech set data/YAF_fear/YAF_pick_fear.wav']

  • First five file paths in the dataset


labels[:5]

['fear', 'fear', 'fear', 'fear', 'fear']

  • First five labels of the speech files in the dataset



Now we create a dataframe of the audio files and labels

## Create a dataframe
df = pd.DataFrame()
df['speech'] = paths
df['label'] = labels
df.head()
  • File path is the input data

  • Label is the output data



df['label'].value_counts()

fear       400
angry      400
disgust    400
neutral    400
sad        400
ps         400
happy      400
Name: label, dtype: int64

  • List of classes in the dataset and the number of samples per class


Exploratory Data Analysis


sns.countplot(x=df['label'])
  • All classes have an equal distribution

  • For an unequal distribution, you would need to balance the classes before training (see the sketch below)
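The TESS classes are already balanced, so nothing extra is needed here. If you adapt this tutorial to an imbalanced dataset, one common option is to pass class weights to Keras during training; a minimal sketch using scikit-learn's compute_class_weight on the label column is shown below:

from sklearn.utils.class_weight import compute_class_weight

# Weight each class inversely proportional to its frequency
classes = np.unique(df['label'])
weights = compute_class_weight(class_weight='balanced', classes=classes, y=df['label'])
class_weight = dict(zip(range(len(classes)), weights))
# Later, pass it to training: model.fit(..., class_weight=class_weight)

The class indices here follow the sorted label order, which matches the column order that OneHotEncoder produces later in the tutorial.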



Now we define the functions for the waveplot and spectrogram

def waveplot(data, sr, emotion):
    plt.figure(figsize=(10,4))
    plt.title(emotion, size=20)
    librosa.display.waveplot(data, sr=sr)
    plt.show()
    
def spectogram(data, sr, emotion):
    x = librosa.stft(data)
    xdb = librosa.amplitude_to_db(abs(x))
    plt.figure(figsize=(11,4))
    plt.title(emotion, size=20)
    librosa.display.specshow(xdb, sr=sr, x_axis='time', y_axis='hz')
    plt.colorbar()
  • The waveplot shows the waveform (amplitude over time) of the audio file

  • The spectrogram shows the frequency content of the audio file over time
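Note: librosa.display.waveplot was removed in librosa 0.10. If you are running a newer librosa release, waveshow is the drop-in replacement; a sketch of the same helper with that call is shown here:

def waveplot(data, sr, emotion):
    plt.figure(figsize=(10,4))
    plt.title(emotion, size=20)
    # waveshow replaces waveplot in librosa >= 0.10
    librosa.display.waveshow(data, sr=sr)
    plt.show()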



emotion = 'fear'
path = np.array(df['speech'][df['label']==emotion])[0]
data, sampling_rate = librosa.load(path)
waveplot(data, sampling_rate, emotion)
spectogram(data, sampling_rate, emotion)
Audio(path)



emotion = 'angry'
path = np.array(df['speech'][df['label']==emotion])[1]
data, sampling_rate = librosa.load(path)
waveplot(data, sampling_rate, emotion)
spectogram(data, sampling_rate, emotion)
Audio(path)


emotion = 'disgust'
path = np.array(df['speech'][df['label']==emotion])[0]
data, sampling_rate = librosa.load(path)
waveplot(data, sampling_rate, emotion)
spectogram(data, sampling_rate, emotion)
Audio(path)


emotion = 'neutral'
path = np.array(df['speech'][df['label']==emotion])[0]
data, sampling_rate = librosa.load(path)
waveplot(data, sampling_rate, emotion)
spectogram(data, sampling_rate, emotion)
Audio(path)


emotion = 'sad'
path = np.array(df['speech'][df['label']==emotion])[0]
data, sampling_rate = librosa.load(path)
waveplot(data, sampling_rate, emotion)
spectogram(data, sampling_rate, emotion)
Audio(path)


emotion = 'ps'
path = np.array(df['speech'][df['label']==emotion])[0]
data, sampling_rate = librosa.load(path)
waveplot(data, sampling_rate, emotion)
spectogram(data, sampling_rate, emotion)
Audio(path)


emotion = 'happy'
path = np.array(df['speech'][df['label']==emotion])[0]
data, sampling_rate = librosa.load(path)
waveplot(data, sampling_rate, emotion)
spectogram(data, sampling_rate, emotion)
Audio(path)
  • The waveplot and spectrogram of an audio file from each class are plotted

  • A sample audio clip of emotional speech from each class is displayed

  • Lower-pitched voices appear as darker colors

  • Higher-pitched voices appear as brighter colors



Feature Extraction


Now we define a feature extraction function for the audio files

def extract_mfcc(filename):
    y, sr = librosa.load(filename, duration=3, offset=0.5)
    mfcc = np.mean(librosa.feature.mfcc(y=y, sr=sr, n_mfcc=40).T, axis=0)
    return mfcc
  • The audio duration is capped at 3 seconds (with a 0.5 second offset) so every file contributes an equal-length segment

  • It extracts 40 Mel-frequency cepstral coefficients (MFCCs) and takes the mean across time as the final feature vector (illustrated below)
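To see what the averaging does, the intermediate shapes look like this (the exact number of frames depends on the clip length and librosa's default hop length):

y, sr = librosa.load(df['speech'][0], duration=3, offset=0.5)
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=40)
print(mfcc.shape)                      # (40, n_frames), roughly (40, 130) for ~3 seconds of audio
print(np.mean(mfcc.T, axis=0).shape)   # (40,) - one averaged value per coefficient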


extract_mfcc(df['speech'][0])

array([-285.2542 , 86.24267 , -2.7735834 , 22.61731 , -15.214631 , 11.602871 , 11.931779 , -2.5318177 , 0.65986294, 11.62756 , -17.814924 , -7.5654893 , 6.2167835 , -3.7255652 , -9.563306 , 3.899267 , -13.657834 , 14.420068 , 19.243341 , 23.024492 , 32.129776 , 16.585697 , -4.137755 , 1.2746525 , -11.517016 , 7.0145273 , -2.8494127 , -7.415011 , -11.150621 , -2.1190548 , -5.4515266 , 4.473824 , -11.377713 , -8.931878 , -3.8482094 , 4.950994 , -1.7254968 , 2.659218 , 11.390564 , 11.3327265 ], dtype=float32)

  • Feature values of an audio file



X_mfcc = df['speech'].apply(lambda x: extract_mfcc(x))
  • Returns extracted features from all the audio files


X_mfcc

0       [-285.2542, 86.24267, -2.7735834, 22.61731, -1...
1       [-348.23337, 35.60242, -4.365128, 15.534869, 6...
2       [-339.50308, 54.41241, -14.795754, 21.566118, ...
3       [-306.92944, 21.973307, -5.1588626, 7.6269317,...
4       [-344.88586, 47.05694, -24.83122, 20.24406, 1....
                              ...
2795    [-374.1317, 61.859463, -0.41998756, 9.31088, -...
2796    [-314.12222, 40.262157, -6.7909045, -3.2963052...
2797    [-357.65854, 78.49201, -15.684815, 3.644915, -...
2798    [-352.78336, 102.219765, -14.560364, -11.48181...
2799    [-389.80002, 54.120773, 0.8988281, -0.6595729,...
Name: speech, Length: 2800, dtype: object

  • Preview of the features extracted from each file

  • The more samples in the dataset, the longer the feature extraction takes (a progress-bar tip is sketched below)
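Since extracting features from all 2800 files can take a while, you can optionally wrap the apply call with a progress bar; this is a small sketch assuming the tqdm package is installed:

from tqdm import tqdm
tqdm.pandas()  # enables .progress_apply on pandas objects

X_mfcc = df['speech'].progress_apply(extract_mfcc)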



X = [x for x in X_mfcc]
X = np.array(X)
X.shape

(2800, 40)

  • Conversion of the list of per-file feature vectors into a 2D NumPy array of shape (samples, features)


## input split
X = np.expand_dims(X, -1)
X.shape

(2800, 40, 1)

  • The shape now represents the number of samples, the 40 features, and a trailing dimension of size 1 added to match the LSTM's expected input shape


from sklearn.preprocessing import OneHotEncoder
enc = OneHotEncoder()
y = enc.fit_transform(df[['label']])
y = y.toarray()
y.shape

(2800, 7)

  • The shape represents the number of samples and the number of output classes (see the note below on recovering label names)
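The column order of the one-hot matrix follows enc.categories_, so keeping the fitted encoder around lets you map one-hot rows (and later, model predictions converted back to one-hot form) to emotion names:

print(enc.categories_)               # the 7 emotion labels, in column order
print(enc.inverse_transform(y[:1]))  # [['fear']] for the first sample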



Create the LSTM Model

from keras.models import Sequential
from keras.layers import Dense, LSTM, Dropout

model = Sequential([
    LSTM(256, return_sequences=False, input_shape=(40,1)),
    Dropout(