  • Hackers Realm

Extract Features from Audio File | MFCC | Deep Learning | Python

In recent years, audio analysis and processing has been transformed by deep learning. Extracting meaningful information from raw audio has long been a challenge, given the intricate and dynamic nature of sound. Deep learning models can pick up patterns in audio recordings that are hard to describe by hand, which enables applications ranging from speech recognition and music generation to environmental sound classification. This tutorial shows how to extract MFCC features, a common input representation for such models, from audio files in Python.

Extract Features from Audio - MFCC

MFCC stands for Mel-Frequency Cepstral Coefficients. It is a widely used feature extraction technique in the field of audio signal processing, particularly for tasks like speech and music analysis, recognition, and classification. MFCCs are derived from the spectral characteristics of an audio signal and are designed to mimic the human auditory system's perception of sound.

You can watch the video-based tutorial with a step-by-step explanation below.

Import Modules

import librosa
import librosa.display
import IPython.display as ipd
import os
import numpy as np
  • librosa - provides a wide range of tools for analyzing and processing audio and music signals.

  • librosa.display - provides functions for visualizing various audio-related data and representations.

  • IPython.display - module provided by IPython, an interactive computing environment primarily used with Jupyter Notebooks, that allows you to generate and display various types of media directly within the notebook interface.

  • os - provides a way to interact with the operating system's file system, including tasks like file and directory manipulation, working with paths, and more.

  • numpy - provides support for working with arrays, matrices, and numerical operations.

Display the available audio files

Next, list the files in a directory named 'audio data'.

for file in os.listdir('audio data/'):
    print(file)

  • os.listdir('audio data/'): This function call returns a list of all the items (files and subdirectories) present in the directory 'audio data/'.

  • Inside the loop, print(file) prints the name of each item (file or subdirectory) in the directory. The variable file takes on each item's name in the iteration.
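Because os.listdir returns every entry, the listing can include subdirectories or non-audio files. As a minimal sketch, the listing could be narrowed to .wav files like this. The helper name list_wav_files and the throwaway demo directory are hypothetical and not part of the original tutorial.

```python
import os
import tempfile

def list_wav_files(directory):
    """Return the sorted names of the .wav files directly inside `directory`.

    Hypothetical helper for illustration; not part of the original tutorial.
    """
    names = []
    for name in os.listdir(directory):
        full_path = os.path.join(directory, name)
        # keep regular files only and match the extension case-insensitively
        if os.path.isfile(full_path) and name.lower().endswith('.wav'):
            names.append(name)
    return sorted(names)

# Demo on a temporary directory so the sketch runs without the tutorial's data
with tempfile.TemporaryDirectory() as demo_dir:
    for name in ('b.wav', 'a.WAV', 'notes.txt'):
        open(os.path.join(demo_dir, name), 'w').close()
    os.mkdir(os.path.join(demo_dir, 'subfolder'))
    found = list_wav_files(demo_dir)

print(found)
```

In the notebook, the same helper could be called as list_wav_files('audio data/') in place of os.listdir('audio data/').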

Next, use ipd.Audio from the IPython.display module (imported above as ipd) to play an audio file directly in a Jupyter Notebook.

ipd.Audio('audio data/OAF_back_happy.wav')
  • This assumes that you have the 'audio data' directory containing the audio file 'OAF_back_happy.wav' in the same directory as your Jupyter Notebook or Python script.

Define a function to extract features

Define a Python function called feature_extraction that is designed to extract certain features from an audio file using the Librosa library.

def feature_extraction(file_path):
    # load the audio file
    x, sample_rate = librosa.load(file_path, res_type='kaiser_fast')
    # extract features from the audio
    mfcc = np.mean(librosa.feature.mfcc(y=x, sr=sample_rate, n_mfcc=50).T, axis=0)
    return mfcc
  • librosa and numpy: The function relies on the librosa and numpy (np) modules imported at the top of the notebook. Librosa is a Python package used for analyzing and extracting features from audio and music signals.

  • def feature_extraction(file_path): This line defines a function named feature_extraction that takes a single parameter file_path, which should be the path to an audio file.

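  • x, sample_rate = librosa.load(file_path, res_type='kaiser_fast'): This line uses Librosa's load function to load the audio file specified by file_path. It returns two values: x, which is the audio data as a NumPy array, and sample_rate, which is the sampling rate of the audio. By default, load converts the audio to mono and resamples it to 22,050 Hz; res_type='kaiser_fast' selects a faster resampling method at a small cost in quality. Depending on your librosa version, this resampling mode may require the optional resampy package.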

  • mfcc = np.mean(librosa.feature.mfcc(y=x, sr=sample_rate, n_mfcc=50).T, axis=0): This line calculates the Mel-frequency cepstral coefficients (MFCCs) from the audio data. MFCCs are commonly used features in audio processing and analysis.

  • librosa.feature.mfcc(y=x, sr=sample_rate, n_mfcc=50): This computes the MFCCs of the audio data x at the sampling rate sample_rate. The parameter n_mfcc sets the number of coefficients to compute. The result is a matrix of shape (n_mfcc, number of frames).

  • .T: Transposes the MFCC matrix, giving shape (number of frames, n_mfcc).

  • np.mean(..., axis=0): Takes the mean along axis 0 of the transposed matrix, that is, across the time frames. This results in a vector with one mean value per coefficient (50 values here).

  • return mfcc: This line returns the calculated MFCC vector as the output of the feature_extraction function.
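To make the transpose-and-average step concrete, here is a small sketch on a made-up matrix laid out like the MFCC output, with coefficients as rows and time frames as columns. The values are invented for illustration and no audio is loaded.

```python
import numpy as np

# Stand-in for the MFCC matrix: 3 coefficients (rows) x 4 frames (columns).
# The numbers are made up for illustration only.
fake_mfcc = np.array([
    [1.0, 2.0, 3.0, 4.0],      # coefficient 0 across 4 frames
    [10.0, 10.0, 10.0, 10.0],  # coefficient 1 across 4 frames
    [-2.0, 0.0, 2.0, 0.0],     # coefficient 2 across 4 frames
])

# Same operation as in feature_extraction: transpose to (frames, coefficients),
# then average over the frame axis.
mean_per_coefficient = np.mean(fake_mfcc.T, axis=0)

# Equivalent without the transpose: average along the frame axis directly.
same_result = fake_mfcc.mean(axis=1)

print(fake_mfcc.T.shape)     # shape after the transpose: 4 frames x 3 coefficients
print(mean_per_coefficient)  # one mean per coefficient: 2.5, 10.0 and 0.0
```

Averaging over time discards the order of the frames, so two clips with the same sounds in a different order would produce the same vector.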

Next extract features using the feature_extraction function you defined earlier from multiple audio files within a specified directory.

features = {}
directory = 'audio data/'
for audio in os.listdir(directory):
    audio_path = directory+audio
    features[audio_path] = feature_extraction(audio_path)
  • features = {}: Initializes an empty dictionary named features where the extracted audio features will be stored.

  • directory = 'audio data/': Specifies the directory containing the audio files. This is assumed to be a directory named 'audio data' in the same location as the script.

  • for audio in os.listdir(directory): This line initiates a loop that iterates through the list of items (audio files) in the specified directory.

  • audio_path = directory + audio: Constructs the full path to the current audio file by combining the directory and the audio filename.

  • features[audio_path] = feature_extraction(audio_path): Calls the feature_extraction function on the current audio file and stores the result (extracted features) in the features dictionary. The key for the dictionary entry is the audio_path, and the value is the extracted feature vector.

  • After this loop completes, the features dictionary will contain entries for each audio file in the specified directory, where the key is the file path, and the value is the corresponding extracted feature vector.
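If the features are meant for a machine learning model, the dictionary is usually converted into a 2-D array with one row per file. The sketch below assumes the file names follow the pattern seen in this tutorial (for example OAF_back_happy.wav) and treats the last underscore-separated token as the label. That naming convention, the helper name build_dataset, and the toy vectors are assumptions for illustration.

```python
import os
import numpy as np

def build_dataset(features):
    """Stack a {path: feature_vector} dict into (X, labels, paths).

    Hypothetical helper; assumes every vector has the same length and that
    the label is the last underscore-separated token of the file name.
    """
    paths = sorted(features)
    X = np.stack([features[p] for p in paths])  # shape: (n_files, n_features)
    labels = []
    for p in paths:
        stem = os.path.splitext(os.path.basename(p))[0]  # e.g. 'OAF_back_happy'
        labels.append(stem.split('_')[-1])               # e.g. 'happy'
    return X, labels, paths

# Toy stand-ins for real MFCC vectors so the sketch runs without audio files
toy_features = {
    'audio data/OAF_back_sad.wav': np.zeros(50, dtype=np.float32),
    'audio data/OAF_back_happy.wav': np.ones(50, dtype=np.float32),
}
X, labels, paths = build_dataset(toy_features)
print(X.shape, labels)
```

Sorting the paths keeps the row order reproducible, since os.listdir does not guarantee any particular order.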

Next, let us print the path of the last audio file processed by the loop.

audio_path
'audio data/OAF_back_sad.wav'

Next let us retrieve the extracted features for a specific audio file.

features[audio_path], len(features[audio_path])

(array([-5.45112976e+02,  8.45765152e+01,  1.97867851e+01,  1.57587433e+01,
         1.19505682e+01,  1.99414787e+01, -1.66443958e+01, -5.83508873e+00,
        -1.49142656e+01,  7.49133253e+00, -1.26599941e+01,  1.03757305e+01,
        -8.21155357e+00,  1.39499397e+01,  3.85002089e+00, -1.90467656e+00,
        -9.66936052e-01,  1.13953471e+00,  3.03179502e+00, -3.28641486e+00,
         2.70385575e+00,  2.46525741e+00, -4.16511345e+00,  8.95555496e-01,
        -7.91851473e+00, -3.65912080e-01,  1.03952038e+00,  6.57844543e-03,
        -1.79344571e+00,  8.81427479e+00,  6.59980965e+00,  1.05787868e+01,
         1.09505825e+01,  9.41102600e+00,  6.45796394e+00,  5.72548151e+00,
         9.25910664e+00,  7.19050741e+00,  1.38176470e+01,  1.14543543e+01,
         9.49003029e+00,  5.20508289e+00,  4.15053225e+00,  5.35622215e+00,
         5.44810677e+00,  3.08560395e+00,  1.56520414e+00,  9.63827789e-01,
         3.19523668e+00,  7.85421312e-01], dtype=float32),
 50)

  • features[audio_path]: This part accesses the value associated with the audio_path key in the features dictionary. In your case, it retrieves the extracted features (which are stored as a vector) for the audio file represented by audio_path.

  • len(features[audio_path]): This part returns the number of elements in the feature vector, which equals the n_mfcc value passed to feature_extraction (50 here).

  • It returns a tuple containing two values: The extracted features vector for the specific audio file represented by audio_path and the length (number of elements) in the extracted features vector.

  • This can be useful for checking the length of the features vector and understanding the dimensions of the feature data associated with a particular audio file.
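The same length check can be applied to every entry at once. The following is a minimal sketch, assuming a dictionary shaped like features and an expected length of 50. The helper name find_bad_vectors and the toy entries are hypothetical.

```python
import numpy as np

def find_bad_vectors(features, expected_len=50):
    """Return the keys whose vectors have the wrong shape or non-finite values.

    Hypothetical helper for illustration; not part of the original tutorial.
    """
    bad = []
    for key, vec in features.items():
        vec = np.asarray(vec)
        if vec.shape != (expected_len,) or not np.all(np.isfinite(vec)):
            bad.append(key)
    return bad

# Toy entries standing in for real extracted features
toy_features = {
    'ok.wav': np.zeros(50, dtype=np.float32),
    'too_short.wav': np.zeros(13, dtype=np.float32),
    'has_nan.wav': np.full(50, np.nan, dtype=np.float32),
}
bad_keys = find_bad_vectors(toy_features)
print(bad_keys)
```

An empty list means every vector has the expected length and contains only finite numbers.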

Final Thoughts

  • MFCCs are particularly effective for capturing the spectral envelope and timbral characteristics of audio signals. They are built on the mel scale, which approximates how humans perceive frequency, and they are less sensitive to pitch than a raw spectrum, although they are still affected by loudness and background noise.

  • MFCCs provide a reduced-dimensional representation of audio compared to the raw waveform or spectrogram. This is beneficial for efficient storage and processing of audio data, especially in machine learning applications.

  • Before extracting MFCCs, it's essential to preprocess the audio data by resampling, normalizing, and potentially removing any unwanted noise. Librosa and similar libraries offer tools for these tasks.

  • The number of MFCC coefficients (n_mfcc), the window size, and other parameters can affect the quality and quantity of features extracted. Experimenting with different values can help find the best configuration for your specific application.

  • Often, the mean or other statistical measures are taken across the MFCC coefficients over time frames to create a single feature vector for the entire audio file.

  • Extracted MFCC features can be used as input for various machine learning algorithms like classification, clustering, or regression tasks. They provide a more meaningful representation of audio data than using raw waveforms directly.

  • Understanding the domain of your audio data is crucial. The choice of features, including MFCCs, should align with the characteristics that are relevant to your analysis or application.

  • Libraries like Librosa make it easier to compute MFCCs and other audio features. However, it's important to understand the underlying calculations and adjust parameters as needed.

  • While MFCCs are powerful, they might not capture all aspects of an audio signal, especially for complex sound events. Other features like chroma, spectral contrast, or rhythm patterns could complement MFCCs.

  • Always evaluate the effectiveness of extracted features for your specific task. This could involve testing different feature sets, experimenting with algorithms, and using appropriate evaluation metrics.
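As one example of the statistical summaries mentioned in the list above, the per-coefficient standard deviation can be appended to the mean to keep some information about how the sound changes over time. This is a sketch on a made-up matrix; the helper name summarize_frames is hypothetical, and the input is assumed to have coefficients as rows and frames as columns.

```python
import numpy as np

def summarize_frames(mfcc_matrix):
    """Collapse an (n_mfcc, n_frames) matrix into one vector.

    The first half holds the per-coefficient means, the second half the
    per-coefficient standard deviations. Hypothetical helper for illustration.
    """
    mfcc_matrix = np.asarray(mfcc_matrix, dtype=np.float64)
    means = mfcc_matrix.mean(axis=1)  # average over frames
    stds = mfcc_matrix.std(axis=1)    # spread over frames
    return np.concatenate([means, stds])

# Made-up matrix: 2 coefficients x 2 frames
toy = np.array([
    [1.0, 3.0],
    [5.0, 5.0],
])
summary = summarize_frames(toy)
print(summary)  # means 2.0 and 5.0, followed by standard deviations 1.0 and 0.0
```

With n_mfcc=50 this would give a 100-element vector per file, so any downstream model would need to expect that size.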

In summary, MFCCs are a valuable tool for extracting meaningful features from audio files. However, their effectiveness depends on proper preprocessing, parameter tuning, and alignment with the goals of your analysis or application. Combining MFCCs with other audio features and appropriate machine learning techniques can lead to accurate and robust audio analysis systems.
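The preprocessing mentioned above can include amplitude normalization. Below is a minimal sketch of peak normalization in plain NumPy, applied to a made-up signal. The helper name peak_normalize is hypothetical; librosa's librosa.util.normalize offers similar functionality.

```python
import numpy as np

def peak_normalize(signal, eps=1e-12):
    """Scale a 1-D signal so that its largest absolute sample is 1.0.

    Hypothetical helper for illustration. Empty or silent input is
    returned unchanged to avoid dividing by zero.
    """
    signal = np.asarray(signal, dtype=np.float32)
    if signal.size == 0:
        return signal
    peak = np.max(np.abs(signal))
    if peak < eps:
        return signal  # silent input: nothing to scale
    return signal / peak

# Made-up quiet signal standing in for loaded audio samples
quiet = np.array([0.0, 0.1, -0.25, 0.05], dtype=np.float32)
normalized = peak_normalize(quiet)
print(normalized)
```

In the tutorial's function, this step would sit between librosa.load and librosa.feature.mfcc, applied to x.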

Get the project notebook from here

Thanks for reading the article!!!

Check out more project videos from the YouTube channel Hackers Realm
