Hackers Realm

Urban Sound Analysis using Python | Classification | Deep Learning Project Tutorial

Updated: Jun 2, 2023

Dive into the world of urban sound analysis with Python! This tutorial explores classification and deep learning techniques to analyze and classify urban sounds. Learn to build models that can distinguish between various sounds in urban environments, opening doors to applications in noise pollution monitoring, smart cities, and more. Enhance your skills in audio processing, machine learning, and unlock the potential of urban sound analysis. Join this comprehensive project tutorial to unravel the secrets hidden within the sounds of the city. #UrbanSoundAnalysis #Python #Classification #DeepLearning #AudioProcessing #MachineLearning #NoisePollution


Urban Sound Analysis

In this project tutorial, we are going to analyze various audio files, classify each one into its corresponding class, and visualize the sound waveforms through plots.



You can watch the step-by-step explanation video tutorial down below


Dataset Information

This dataset contains 8732 labeled sound excerpts (<=4s) of urban sounds from 10 classes:

  • air_conditioner

  • car_horn

  • children_playing

  • dog_bark

  • drilling

  • engine_idling

  • gun_shot

  • jackhammer

  • siren

  • street_music

Download the dataset here


Mounting Drive


We are mounting the sound dataset from Google Drive

from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive

  • The files must be uploaded to your Google Drive account for this to work.

  • An authorization link is provided; click the link to get the authorization code and paste it into the input box.

Let us verify which directory we are working in

!pwd

/content


Unzip data


Now we unzip the train dataset from the drive

!unzip 'drive/MyDrive/Colab Notebooks/train.zip'
  • The dataset file is around 3GB

Streaming output truncated to the last 5000 lines.

inflating: Train/1674.wav inflating: Train/1675.wav inflating: Train/1677.wav inflating: Train/1678.wav inflating: Train/1679.wav inflating: Train/168.wav inflating: Train/1680.wav inflating: Train/1681.wav inflating: Train/1686.wav inflating: Train/1687.wav

...

  • For brevity, only 10 of the extracted sound files are listed above
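
To confirm the extraction completed, we can count the extracted files (a quick sanity check, assuming the archive unzipped into a Train folder as shown above):

import glob

# count how many wav files were extracted into the Train folder
len(glob.glob('Train/*.wav'))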


Import modules

import pandas as pd
import numpy as np
import librosa
import librosa.display
import glob
import IPython.display as ipd
import random
%pylab inline

import warnings
warnings.filterwarnings('ignore')
  • pandas - used to perform data manipulation and analysis

  • numpy - used to perform a wide variety of mathematical operations on arrays

  • librosa - used to analyze music and sound files

  • librosa.display - used to display sound data as images

  • glob - used to find all pathnames matching a specific pattern

  • IPython.display - used to display and hear the audio

  • random - used to pick random samples from the data

  • %pylab inline - to enable the inline plotting

  • warnings - to manipulate warnings details
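
Note that %pylab inline is what brings plt into scope for the plotting calls used later. If you prefer explicit imports (or %pylab is unavailable in your IPython version), a minimal equivalent setup is:

import matplotlib.pyplot as plt
%matplotlib inline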


Loading the dataset


Now we load the dataset for processing

df = pd.read_csv('Urban Sound Dataset.csv')
df.head()
Urban Sound Dataset
  • ID - Name of the audio file

  • Class - Name of the output class the audio file belongs to


Let us display an audio file

ipd.Audio('Train/1.wav')
  • Displays an interactive audio player so you can listen to the file


Exploratory Data Analysis


In this step we will visualize different audio samples of the data through wave plots.


We will load the audio file into an array

data, sampling_rate = librosa.load('Train/1.wav')
  • sampling_rate - number of samples per second


Now we will view the data array

data

array([-0.09602016, -0.14303702, 0.05203498, ..., -0.01646687, -0.00915894, 0.09742922], dtype=float32)

  • The audio file is loaded into an array of floating-point values

  • Each value is an amplitude sample of the waveform
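
Since the array holds one amplitude value per sample, the clip duration follows directly from the array length and the sampling rate (a quick check on the same file loaded above):

# duration in seconds = number of samples / samples per second
duration = len(data) / sampling_rate
print(duration)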


Next we will view the sampling rate

sampling_rate

22050

  • librosa resamples audio to 22050 samples per second by default
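
If you want to keep the recording's native sampling rate instead of resampling to 22050 Hz, you can pass sr=None (an optional variation, not used in the rest of the tutorial):

# load without resampling
data_native, native_sr = librosa.load('Train/1.wav', sr=None)
print(native_sr)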


Now we plot some graphs of the audio files

plt.figure(figsize=(12,4))
librosa.display.waveplot(data, sr=sampling_rate)
Wave Plot
  • figsize=(12,4) - size of the plot graph

  • librosa.display.waveplot() - displays the waveform of the audio data over time (an equivalent call for newer librosa versions is sketched below)
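
On librosa 0.10 and later, waveplot has been removed in favour of waveshow; if you are on a newer version, an equivalent call would be:

plt.figure(figsize=(12,4))
librosa.display.waveshow(data, sr=sampling_rate)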



index = random.choice(df.index)

print('Class:', df['Class'][index])
data, sampling_rate = librosa.load('Train/'+str(df['ID'][index]) + '.wav')

plt.figure(figsize=(12,4))
librosa.display.waveplot(data, sr=sampling_rate)

Class: dog_bark

Wave Plot of Dog Bark
  • A randomly picked audio file from the training data

  • librosa.load('Train/'+str(df['ID'][index]) + '.wav') - builds the full path of the audio file from its ID and appends the .wav extension before loading

  • The graph displays the wave plot of a dog bark sample from the data


index = random.choice(df.index)

print('Class:', df['Class'][index])
data, sampling_rate = librosa.load('Train/'+str(df['ID'][index]) + '.wav')

plt.figure(figsize=(12,4))
librosa.display.waveplot(data, sr=sampling_rate)

Class: gun_shot

Wave Plot of Gun Shot
  • Another randomly picked audio sample

  • Graph display of a gun shot data sample


index = random.choice(df.index)

print('Class:', df['Class'][index])
data, sampling_rate = librosa.load('Train/'+str(df['ID'][index]) + '.wav')

plt.figure(figsize=(12,4))
librosa.display.waveplot(data, sr=sampling_rate)

Class: car_horn

Wave Plot of Car Horn
  • Graph display of a car horn data sample



Now we will view the distribution of the classes in the data set

import seaborn as sns
plt.figure(figsize=(12,7))
sns.countplot(df['Class'])
Distribution of Class Labels
  • seaborn - built on top of matplotlib with similar functionalities

  • Visualization through a bar graph for the no. of samples for each class.
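
The exact number of samples per class can also be printed directly, which complements the bar chart:

df['Class'].value_counts()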



Input Split


The data is currently stored as audio files; we need to extract features from each file into an array so that the input (features) and output (labels) can be loaded directly for training.

import os

def parser(row):
    # path of the file
    file_name = os.path.join('Train', str(row.ID) + '.wav')
    # load the audio file
    x, sample_rate = librosa.load(file_name, res_type='kaiser_fast')
    # extract features from the data
    mfccs = np.mean(librosa.feature.mfcc(y=x, sr=sample_rate, n_mfcc=40).T, axis=0)
    
    feature = mfccs
    label = row.Class
    
    return [feature, label]
  • import os - used to build the file path for each audio clip

  • res_type='kaiser_fast' - a faster resampling method that speeds up loading the audio

  • librosa.feature.mfcc() - Mel-frequency cepstral coefficients technique to extract audio file features

  • feature - array of the features extracted from the data

  • label - name of the class of the extracted data
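
To see what the MFCC step actually produces, it helps to inspect the shapes before and after averaging over time (an illustrative check on a single file, using the same parameters as the parser):

x, sample_rate = librosa.load('Train/1.wav', res_type='kaiser_fast')
mfcc = librosa.feature.mfcc(y=x, sr=sample_rate, n_mfcc=40)
print(mfcc.shape)                      # (40, number of frames)
print(np.mean(mfcc.T, axis=0).shape)   # (40,) fixed-length feature vector per file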


Now we will load the data for training

data = df.apply(parser, axis=1)
data.columns = ['feature','label']
  • Assigning the columns to display the features and the corresponding label of the data


data[0]

[array([-82.12358939, 139.50591598, -42.43086489, 24.82786139, -11.62076447, 23.49708426, -12.19458986, 25.89713885, -9.40527728, 21.21042898, -7.36882138, 14.25433903, -8.67870015, 7.75023765, -10.1241154 , 3.2581183 , -11.35261914, 2.80096779, -7.04601346, 3.91331351, -2.3349743 , 2.01242254, -2.79394367, 4.12927394, -1.62076864, 4.32620082, -1.03440959, -1.23297714, -3.11085341, 0.32044827, -1.787786 , 0.44295495, -1.79164752, -0.76361758, -1.24246428, -0.27664012, 0.65718559, -0.50237115, -2.60428533, -1.05346291]), 'siren']

  • The first element is the array of 40 extracted MFCC features

  • The second element is the class label ('siren' in this case)


Now we split the parsed data into input features (X) and output labels (y)

# input split
X = np.array(list(zip(*data))[0])
y = np.array(list(zip(*data))[1])
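
A quick shape check confirms that each sample has been reduced to a 40-value feature vector:

print(X.shape)   # (number of samples, 40)
print(y[:5])     # labels are still text at this point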


Label encoder


We will transform the 10 class labels from text attributes to numerical attributes

from sklearn.preprocessing import LabelEncoder
from keras.utils import np_utils

le = LabelEncoder()
y = np_utils.to_categorical(le.fit_transform(y))
  • Each class label is first converted to an integer and then one-hot encoded into categorical columns


y.shape

(5435, 10)

  • Shape of the data set for training, indicating 5435 samples of training data with 10 classes


y[0]

array([0., 0., 0., 0., 0., 0., 0., 0., 1., 0.], dtype=float32)

  • Single sample of the data in numerical columns of the classes

  • The column corresponding to the sample's output class is set to 1 and all others to 0
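
The fitted encoder also lets us map a one-hot vector back to a readable class name, which is useful when interpreting predictions later (a small sketch using the le object fitted above):

print(le.classes_)                               # class name behind each one-hot column
print(le.inverse_transform([np.argmax(y[0])]))   # decode the first sample back to its label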


Model Training


Let us create the model for training

from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation, Flatten

num_classes = 10

# model creation
model = Sequential()

model.add(Dense(256, input_shape=(40,)))
model.add(Activation('relu'))
model.add(Dropout(0.3))

model.add(Dense(512))
model.add(Activation('relu'))
model.add(Dropout(0.3))

model.add(Dense(256))
model.add(Activation('relu'))
model.add(Dropout(0.3))

model.add(Dense(num_classes))
model.add(Activation('softmax'))

model.compile(loss='categorical_crossentropy', metrics='accuracy', optimizer='adam')
  • Dense - fully connected layer

  • Dropout - regularization layer that randomly drops a fraction of the units during training to avoid overfitting

  • Activation - applies an activation function (relu for the hidden layers, softmax for the output) to the previous layer

  • Flatten - convert a 2D array into a 1D array

  • loss='categorical_crossentropy' - loss function for multi-class classification with one-hot encoded labels

  • optimizer='adam' - adaptively adjusts the learning rate for the model during training
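
Before training, it is worth printing a summary of the network to verify the layer sizes and the total number of trainable parameters:

model.summary()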


Now we will train the data

# train the model
model.fit(X, y, batch_size=32, epochs=30, validation_split=0.25)

Epoch 1/30

1149/1149 [==============================] - 10s 3ms/step - loss: 0.4816 - accuracy: 0.8475 - val_loss: 0.1202 - val_accuracy: 0.9637

Epoch 2/30

1149/1149 [==============================] - 4s 3ms/step - loss: 0.1336 - accuracy: 0.9605 - val_loss: 0.0848 - val_accuracy: 0.9743

Epoch 3/30

1149/1149 [==============================] - 4s 3ms/step - loss: 0.0863 - accuracy: 0.9732 - val_loss: 0.0807 - val_accuracy: 0.9742

Epoch 4/30

1149/1149 [==============================] - 4s 3ms/step - loss: 0.0685 - accuracy: 0.9783 - val_loss: 0.0734 - val_accuracy: 0.9788

Epoch 5/30

1149/1149 [==============================] - 4s 3ms/step - loss: 0.0543 - accuracy: 0.9825 - val_loss: 0.0690 - val_accuracy: 0.9809

Epoch 6/30

1149/1149 [==============================] - 4s 3ms/step - loss: 0.0461 - accuracy: 0.9844 - val_loss: 0.0684 - val_accuracy: 0.9808

Epoch 7/30

1149/1149 [==============================] - 4s 3ms/step - loss: 0.0360 - accuracy: 0.9873 - val_loss: 0.0743 - val_accuracy: 0.9798

Epoch 8/30

1149/1149 [==============================] - 4s 3ms/step - loss: 0.0318 - accuracy: 0.9884 - val_loss: 0.0733 - val_accuracy: 0.9811

Epoch 9/30

1149/1149 [==============================] - 4s 3ms/step - loss: 0.0319 - accuracy: 0.9891 - val_loss: 0.0658 - val_accuracy: 0.9838

Epoch 10/30

1149/1149 [==============================] - 4s 3ms/step - loss: 0.0242 - accuracy: 0.9919 - val_loss: 0.0728 - val_accuracy: 0.9827



Epoch 11/30

1149/1149 [==============================] - 4s 3ms/step - loss: 0.0218 - accuracy: 0.9926 - val_loss: 0.0815 - val_accuracy: 0.9818

Epoch 12/30

1149/1149 [==============================] - 4s 3ms/step - loss: 0.0286 - accuracy: 0.9895 - val_loss: 0.0766 - val_accuracy: 0.9829

Epoch 13/30

1149/1149 [==============================] - 4s 3ms/step - loss: 0.0199 - accuracy: 0.9928 - val_loss: 0.0762 - val_accuracy: 0.9820

Epoch 14/30

1149/1149 [==============================] - 4s 3ms/step - loss: 0.0239 - accuracy: 0.9918 - val_loss: 0.0754 - val_accuracy: 0.9836

Epoch 15/30

1149/1149 [==============================] - 4s 3ms/step - loss: 0.0160 - accuracy: 0.9938 - val_loss: 0.0865 - val_accuracy: 0.9820

Epoch 16/30

1149/1149 [==============================] - 4s 3ms/step - loss: 0.0196 - accuracy: 0.9935 - val_loss: 0.0842 - val_accuracy: 0.9822

Epoch 17/30

1149/1149 [==============================] - 4s 3ms/step - loss: 0.0152 - accuracy: 0.9951 - val_loss: 0.0825 - val_accuracy: 0.9828

Epoch 18/30

1149/1149 [==============================] - 4s 3ms/step - loss: 0.0155 - accuracy: 0.9943 - val_loss: 0.0889 - val_accuracy: 0.9817

Epoch 19/30

1149/1149 [==============================] - 4s 3ms/step - loss: 0.0207 - accuracy: 0.9930 - val_loss: 0.0886 - val_accuracy: 0.9822

Epoch 20/30

1149/1149 [==============================] - 4s 3ms/step - loss: 0.0122 - accuracy: 0.9955 - val_loss: 0.0958 - val_accuracy: 0.9822



Epoch 21/30

1149/1149 [==============================] - 4s 3ms/step - loss: 0.0135 - accuracy: 0.9957 - val_loss: 0.0986 - val_accuracy: 0.9824

Epoch 22/30

1149/1149 [==============================] - 4s 3ms/step - loss: 0.0166 - accuracy: 0.9949 - val_loss: 0.0987 - val_accuracy: 0.9824

Epoch 23/30

1149/1149 [==============================] - 4s 3ms/step - loss: 0.0153 - accuracy: 0.9949 - val_loss: 0.0917 - val_accuracy: 0.9832

Epoch 24/30

1149/1149 [==============================] - 4s 3ms/step - loss: 0.0147 - accuracy: 0.9950 - val_loss: 0.0967 - val_accuracy: 0.9838

Epoch 25/30

1149/1149 [==============================] - 4s 3ms/step - loss: 0.0112 - accuracy: 0.9957 - val_loss: 0.1057 - val_accuracy: 0.9816

Epoch 26/30

1149/1149 [==============================] - 4s 3ms/step - loss: 0.0134 - accuracy: 0.9959 - val_loss: 0.1024 - val_accuracy: 0.9830

Epoch 27/30

1149/1149 [==============================] - 4s 3ms/step - loss: 0.0085 - accuracy: 0.9968 - val_loss: 0.1256 - val_accuracy: 0.9795

Epoch 28/30

1149/1149 [==============================] - 4s 3ms/step - loss: 0.0127 - accuracy: 0.9958 - val_loss: 0.1099 - val_accuracy: 0.9832

Epoch 29/30

1149/1149 [==============================] - 4s 3ms/step - loss: 0.0136 - accuracy: 0.9952 - val_loss: 0.1043 - val_accuracy: 0.9824

Epoch 30/30

1149/1149 [==============================] - 4s 3ms/step - loss: 0.0132 - accuracy: 0.9959 - val_loss: 0.1162 - val_accuracy: 0.9827

  • Display of the results after training the data

  • batch_size=32 - amount of data to process per iteration

  • epochs=30 - no. of iterations for training

  • validation_split=0.25 - fraction of the training data held out for validation

  • The training accuracy increases steadily per epoch, while the validation accuracy levels off after the first few epochs

  • Both training and validation accuracy reached more than 98 percent
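
After training, the same feature extraction can be reused to classify a clip end to end; here is a minimal inference sketch (using a training file for illustration, for a real test you would use unseen audio):

# extract the same 40 MFCC features from an audio file
x, sr = librosa.load('Train/1.wav', res_type='kaiser_fast')
features = np.mean(librosa.feature.mfcc(y=x, sr=sr, n_mfcc=40).T, axis=0)

# predict class probabilities and map the best one back to its label
probs = model.predict(features.reshape(1, -1))
print(le.inverse_transform([np.argmax(probs)]))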


Final Thoughts

  • Deep learning models give higher accuracy on this task than classical machine learning algorithms

  • Sound features are extracted and used for training

  • Training with more data will generally give you better accuracy

  • This model can be adapted to other datasets and tasks, such as speech recognition or other sound-related problems, by changing the data and parameters


In this project tutorial, we have explored the Urban Sound dataset as a deep learning classification project. Different urban sounds were identified and classified with exploratory data analysis.


Get the project notebook from here


Thanks for reading the article!!!


Check out more project videos from the YouTube channel Hackers Realm
