Hackers Realm
Speech Emotion Recognition using Python | Sound Classification | Deep Learning Project Tutorial
Updated: Jul 29, 2022
Speech Emotion Recognition is a deep learning sound classification project. The objective is to analyze speech audio and classify the emotion it expresses. The same approach can be adapted to other sound-based recognition tasks involving speech, music, songs, and so on.

In this tutorial we will classify audio files into their corresponding emotion classes and visualize the frequency content of the sounds in plots.
Dataset Information
A set of 200 target words was spoken in the carrier phrase "Say the word _" by two actresses (aged 26 and 64 years), and recordings were made of the set portraying each of seven emotions (anger, disgust, fear, happiness, pleasant surprise, sadness, and neutral). There are 2800 data points (audio files) in total.
The dataset is organized so that each actress and each of her emotions has its own folder, and each folder holds the audio files for all 200 target words. The audio files are in WAV format.
Output Attributes
anger
disgust
fear
happiness
pleasant surprise
sadness
neutral
Download the dataset here
Import Modules
import pandas as pd
import numpy as np
import os
import seaborn as sns
import matplotlib.pyplot as plt
import librosa
import librosa.display
from IPython.display import Audio
import warnings
warnings.filterwarnings('ignore')
pandas - used to perform data manipulation and analysis
numpy - used to perform a wide variety of mathematical operations on arrays
matplotlib - used for data visualization and graphical plotting
os - used to handle files using system commands
seaborn - statistical data visualization library built on top of matplotlib
librosa - used to analyze sound files
librosa.display - used to display sound data as images
Audio - used to embed an audio player in the notebook so the clip can be heard
warnings - used to control warning messages (here, to suppress them)
Load the Dataset
paths = []
labels = []
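for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        paths.append(os.path.join(dirname, filename))
        # the emotion is the text after the last underscore, without the extension
        label = filename.split('_')[-1]
        label = label.split('.')[0]
        labels.append(label.lower())
    # stop walking once all 2800 files have been collected
    if len(paths) == 2800:
        break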
print('Dataset is Loaded')
Dataset is Loaded
The paths of the speech files have been loaded for further processing
Each label is taken from the file name: the text after the last underscore, without the extension
Labels are converted to lower case so that they are consistent
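To make the label-parsing step concrete, here is a minimal sketch that applies the same split logic to a single path. The helper name label_from_filename is hypothetical and is not part of the original code; the example path is one of the TESS paths shown in this tutorial.

```python
import os

def label_from_filename(path):
    """Return the lower-cased emotion label encoded in a TESS-style file name."""
    filename = os.path.basename(path)   # e.g. 'YAF_home_fear.wav'
    label = filename.split('_')[-1]     # 'fear.wav'
    label = label.split('.')[0]         # 'fear'
    return label.lower()

example = ('/kaggle/input/toronto-emotional-speech-set-tess/'
           'TESS Toronto emotional speech set data/YAF_fear/YAF_home_fear.wav')
print(label_from_filename(example))     # fear
```

This only works because the emotion is always the last underscore-separated part of the file name; a dataset with a different naming scheme would need its own parsing rule.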
len(paths)
2800
Number of samples in the dataset
paths[:5]
['/kaggle/input/toronto-emotional-speech-set-tess/TESS Toronto emotional speech set data/YAF_fear/YAF_home_fear.wav', '/kaggle/input/toronto-emotional-speech-set-tess/TESS Toronto emotional speech set data/YAF_fear/YAF_youth_fear.wav', '/kaggle/input/toronto-emotional-speech-set-tess/TESS Toronto emotional speech set data/YAF_fear/YAF_near_fear.wav', '/kaggle/input/toronto-emotional-speech-set-tess/TESS Toronto emotional speech set data/YAF_fear/YAF_search_fear.wav', '/kaggle/input/toronto-emotional-speech-set-tess/TESS Toronto emotional speech set data/YAF_fear/YAF_pick_fear.wav']
First five file paths in the dataset
labels[:5]
['fear', 'fear', 'fear', 'fear', 'fear']
First five labels of the speech files in the dataset
Now we create a dataframe of the audio files and labels
## Create a dataframe
df = pd.DataFrame()
df['speech'] = paths
df['label'] = labels
df.head()

File path is the input data
Label is the output data
df['label'].value_counts()
fear       400
angry      400
disgust    400
neutral    400
sad        400
ps         400
happy      400
Name: label, dtype: int64
List of classes in the dataset and the number of samples per class (ps stands for pleasant surprise)
Exploratory Data Analysis
sns.countplot(data=df, x='label')

All seven classes are equally represented, with 400 samples each
If the distribution were unequal, you would need to balance the classes before training
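If the classes were not balanced, one simple option is to undersample every class down to the size of the smallest one. The sketch below assumes that approach and runs on made-up toy labels; undersample is a hypothetical helper, and other strategies (class weights, oversampling) may suit a real project better.

```python
import random
from collections import Counter, defaultdict

def undersample(paths, labels, seed=0):
    """Randomly keep the same number of items per class (the smallest class size)."""
    by_class = defaultdict(list)
    for path, label in zip(paths, labels):
        by_class[label].append(path)
    n_keep = min(len(items) for items in by_class.values())
    rng = random.Random(seed)            # fixed seed so the result is repeatable
    new_paths, new_labels = [], []
    for label, items in by_class.items():
        for path in rng.sample(items, n_keep):
            new_paths.append(path)
            new_labels.append(label)
    return new_paths, new_labels

# Toy, made-up data: 5 'fear' clips but only 2 'sad' clips
toy_labels = ['fear'] * 5 + ['sad'] * 2
toy_paths = [f'clip_{i}.wav' for i in range(len(toy_labels))]

balanced_paths, balanced_labels = undersample(toy_paths, toy_labels)
print(Counter(balanced_labels))
```

Undersampling throws data away, so it is usually a last resort for small datasets. The TESS data used here needs none of this because every class already has 400 samples.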
Now we define the functions for the waveplot and spectrogram
def waveplot(data, sr, emotion):
    # plot the amplitude of the signal over time
    plt.figure(figsize=(10,4))
    plt.title(emotion, size=20)
    # waveshow replaces waveplot, which was removed in librosa 0.10
    librosa.display.waveshow(data, sr=sr)
    plt.show()

def spectogram(data, sr, emotion):
    # short-time Fourier transform, converted to decibels
    x = librosa.stft(data)
    xdb = librosa.amplitude_to_db(abs(x))
    plt.figure(figsize=(11,4))
    plt.title(emotion, size=20)
    librosa.display.specshow(xdb, sr=sr, x_axis='time', y_axis='hz')
    plt.colorbar()
The waveplot shows the waveform (amplitude over time) of the audio file
The spectrogram shows how the energy at each frequency changes over time
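To see what a spectrogram encodes, the sketch below builds a synthetic 440 Hz tone (an assumption made purely for illustration, not a file from the dataset) and checks that one windowed Fourier frame has its strongest component near 440 Hz. It uses only NumPy so it runs without an audio file; in the tutorial itself librosa.stft does this work for every frame of the clip.

```python
import numpy as np

sr = 22050                                  # librosa.load's default sampling rate
t = np.arange(sr) / sr                      # one second of time stamps
tone = np.sin(2 * np.pi * 440.0 * t)        # synthetic 440 Hz sine wave

n_fft = 2048                                # frame length (librosa.stft's default)
frame = tone[:n_fft] * np.hanning(n_fft)    # one Hann-windowed frame
magnitude = np.abs(np.fft.rfft(frame))      # amplitude per frequency bin
freqs = np.fft.rfftfreq(n_fft, d=1 / sr)    # bin center frequencies in Hz

peak_hz = freqs[np.argmax(magnitude)]
print(f'strongest frequency is about {peak_hz:.0f} Hz')
```

A spectrogram repeats this calculation for many overlapping frames and stacks the results side by side, which is why its x-axis is time and its y-axis is frequency.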
emotion = 'fear'
path = np.array(df['speech'][df['label']==emotion])[0]
data, sampling_rate = librosa.load(path)
waveplot(data, sampling_rate, emotion)
spectogram(data, sampling_rate, emotion)
Audio(path)


emotion = 'angry'
path = np.array(df['speech'][df['label']==emotion])[1]
data, sampling_rate = librosa.load(path)
waveplot(data, sampling_rate, emotion)
spectogram(data, sampling_rate, emotion)
Audio(path)


emotion = 'disgust'
path = np.array(df['speech'][df['label']==emotion])[0]
data, sampling_rate = librosa.load(path)
waveplot(data, sampling_rate, emotion)
spectogram(data, sampling_rate, emotion)
Audio(path)


emotion = 'neutral'
path = np.array(df['speech'][df['label']==emotion])[0]
data, sampling_rate = librosa.load(path)
waveplot(data, sampling_rate, emotion)
spectogram(data, sampling_rate, emotion)
Audio(path)


emotion = 'sad'
path = np.array(df['speech'][df['label']==emotion])[0]
data, sampling_rate = librosa.load(path)
waveplot(data, sampling_rate, emotion)
spectogram(data, sampling_rate, emotion)
Audio(path)


emotion = 'ps'
path = np.array(df['speech'][df['label']==emotion])[0]
data, sampling_rate = librosa.load(path)
waveplot(data, sampling_rate, emotion)
spectogram(data, sampling_rate, emotion)
Audio(path)


emotion = 'happy'
path = np.array(df['speech'][df['label']==emotion])[0]
data, sampling_rate = librosa.load(path)
waveplot(data, sampling_rate, emotion)
spectogram(data, sampling_rate, emotion)
Audio(path)


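A waveplot and spectrogram of one audio file from each class are plotted
A sample audio clip from each class is displayed for playback
In the spectrogram, darker colors mark frequencies with less energy and brighter colors mark frequencies with more energy
Higher pitched voices show more energy higher up the frequency axis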
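The colors in the spectrogram are driven by the decibel values computed from the STFT amplitudes, so a short numeric sketch may help. The function to_db below is a hypothetical, simplified stand-in for a decibel conversion (20 times log10 of the amplitude relative to a reference); it is an illustration only and leaves out details of librosa.amplitude_to_db such as its clipping options.

```python
import numpy as np

def to_db(amplitude, ref=1.0, floor=1e-5):
    """Convert linear amplitudes to decibels relative to `ref`."""
    # clamp tiny values so that log10 never sees zero
    amplitude = np.maximum(np.asarray(amplitude, dtype=float), floor)
    return 20.0 * np.log10(amplitude / ref)

amps = np.array([0.01, 0.1, 1.0])
print(to_db(amps))   # each 10x step in amplitude adds 20 dB
```

Because the scale is logarithmic, quiet components stay visible next to loud ones, which is what makes the plotted spectrogram readable.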
Feature Extraction
Now we define a feature extraction function for the audio files
def extract_mfcc(filename):
    y, sr = librosa.load(filename, duration=3, offset=0.5)
    mfcc = np.mean(librosa.feature.mfcc(y=y, sr=sr, n_mfcc=40).T, axis=0)
    return mfcc
Each file is loaded from an offset of 0.5 seconds and capped at a duration of 3 seconds
The function extracts 40 Mel-frequency cepstral coefficients (MFCCs) per frame and averages them over time, so every file yields a 40-value feature vector
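The averaging step is what gives every file a feature vector of the same length, whatever the clip duration. The sketch below illustrates this with random arrays standing in for the output of librosa.feature.mfcc, which has shape (n_mfcc, n_frames); the frame counts and the helper name mean_pool are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for MFCC matrices of shape (n_mfcc, n_frames).
# Longer clips simply produce more frames.
short_clip = rng.normal(size=(40, 87))
long_clip = rng.normal(size=(40, 130))

def mean_pool(mfcc):
    """Average each coefficient over time, as extract_mfcc does."""
    return np.mean(mfcc.T, axis=0)

print(mean_pool(short_clip).shape, mean_pool(long_clip).shape)   # (40,) (40,)
```

Averaging discards how the coefficients evolve over time, which keeps the model input small at the cost of some temporal detail.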
extract_mfcc(df['speech'][0])
array([-285.2542 , 86.24267 , -2.7735834 , 22.61731 , -15.214631 , 11.602871 , 11.931779 , -2.5318177 , 0.65986294, 11.62756 , -17.814924 , -7.5654893 , 6.2167835 , -3.7255652 , -9.563306 , 3.899267 , -13.657834 , 14.420068 , 19.243341 , 23.024492 , 32.129776 , 16.585697 , -4.137755 , 1.2746525 , -11.517016 , 7.0145273 , -2.8494127 , -7.415011 , -11.150621 , -2.1190548 , -5.4515266 , 4.473824 , -11.377713 , -8.931878 , -3.8482094 , 4.950994 , -1.7254968 , 2.659218 , 11.390564 , 11.3327265 ], dtype=float32)
Feature values of an audio file
X_mfcc = df['speech'].apply(lambda x: extract_mfcc(x))
Returns extracted features from all the audio files
X_mfcc
0       [-285.2542, 86.24267, -2.7735834, 22.61731, -1...
1       [-348.23337, 35.60242, -4.365128, 15.534869, 6...
2       [-339.50308, 54.41241, -14.795754, 21.566118, ...
3       [-306.92944, 21.973307, -5.1588626, 7.6269317,...
4       [-344.88586, 47.05694, -24.83122, 20.24406, 1....
                              ...
2795    [-374.1317, 61.859463, -0.41998756, 9.31088, -...
2796    [-314.12222, 40.262157, -6.7909045, -3.2963052...
2797    [-357.65854, 78.49201, -15.684815, 3.644915, -...
2798    [-352.78336, 102.219765, -14.560364, -11.48181...
2799    [-389.80002, 54.120773, 0.8988281, -0.6595729,...
Name: speech, Length: 2800, dtype: object
Preview of the feature vectors extracted from the audio files
The more samples in the dataset, the longer the processing time
X = [x for x in X_mfcc]
X = np.array(X)
X.shape
(2800, 40)
Conversion of the list into a two-dimensional array of shape (samples, features)
## input split
X = np.expand_dims(X, -1)
X.shape
(2800, 40, 1)
The shape represents the number of samples, the 40 MFCC values per sample (treated as time steps by the LSTM), and a single feature per step
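The extra axis is there because a Keras LSTM layer expects input of shape (samples, time steps, features per step). Here is a minimal sketch of the reshaping, using a placeholder array of zeros rather than the real MFCC values:

```python
import numpy as np

features = np.zeros((2800, 40), dtype=np.float32)   # placeholder for the MFCC matrix
sequences = np.expand_dims(features, -1)             # add a trailing axis of length 1
print(features.shape, '->', sequences.shape)

# Indexing with np.newaxis gives the same shape
same_shape = features[..., np.newaxis].shape
print(same_shape)
```

No values change in this step; only the way the array is indexed does.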
from sklearn.preprocessing import OneHotEncoder
enc = OneHotEncoder()
y = enc.fit_transform(df[['label']])
y = y.toarray()
y.shape
(2800, 7)
The shape represents the number of samples and number of output classes
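To show what one-hot encoding produces, the sketch below encodes a few toy labels with NumPy alone and then decodes them again. It is an illustration under the assumption that the classes are ordered alphabetically, which is also how OneHotEncoder orders categories it infers from the data; the toy labels are made up.

```python
import numpy as np

labels = np.array(['fear', 'angry', 'sad', 'fear'])   # toy labels for illustration

# Sorted class names, plus the class index of every sample
classes, indices = np.unique(labels, return_inverse=True)
one_hot = np.eye(len(classes))[indices]               # one row per sample

print(classes)
print(one_hot)

# Recover the label names from the one-hot rows
decoded = classes[one_hot.argmax(axis=1)]
print(decoded)
```

The same argmax step can be applied to a model's predicted probabilities to turn them back into emotion names.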
Create the LSTM Model
from keras.models import Sequential
from keras.layers import Dense, LSTM, Dropout
model = Sequential([
LSTM(256, return_sequences=False, input_shape=(40,1)),
Dropout(