Speech Emotion Recognition using Machine Learning

Umair Ayub · Published in Analytics Vidhya · 5 min read · Sep 28, 2020

As human beings, speech is among the most natural ways we express ourselves. We depend on it so heavily that we recognize its importance when we resort to other forms of communication, such as emails and text messages, where we often add emojis to convey the emotions behind a message. Because emotions play such a vital role in communication, detecting and analyzing them is of great importance in today's digital world of remote communication. Emotion detection is a challenging task because emotions are subjective; there is no common consensus on how to measure or categorize them.

Speech Emotion Recognition, abbreviated as SER, is the task of recognizing human emotions and affective states from speech. Such a system can be used in a wide variety of application areas, such as interactive voice-based assistants or caller-agent conversation analysis.

SER - Objective

To build a model that recognizes emotions from speech using Python's librosa library.

Data Source

For this project, we will be using the RAVDESS dataset, the Ryerson Audio-Visual Database of Emotional Speech and Song. The dataset contains 7,356 files, each rated 10 times by 247 individuals on emotional validity, intensity, and genuineness.

Libraries Used

The main library we will use is librosa, a Python library for analyzing audio and music; we will also use SoundFile and PyAudio. Librosa offers a flat package layout, standardized interfaces and names, backwards compatibility, modular functions, and readable code.
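For reference, a typical environment for this post might use the following imports (exact versions and package choices may differ from the author's setup):

```python
import librosa                  # audio and music analysis, feature extraction
import soundfile                # reading .wav files
import numpy as np              # numerical arrays
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
```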

Features Used In This Study

From the audio data we extract three key features used in this study, namely MFCC (Mel-Frequency Cepstral Coefficients), Mel Spectrogram, and Chroma. Librosa was used to extract all three.

MFCC: MFCCs are by far the most widely researched and used feature for this task. They represent the short-term power spectrum of a sound.

Mel Spectrogram: a spectrogram whose frequency axis is mapped onto the Mel scale.

Chroma: a chroma vector is typically a 12-element feature vector indicating how much energy of each of the 12 pitch classes of the standard chromatic scale is present in the signal.
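As a quick illustration, this is roughly how the three features can be computed for a single file with librosa; the file path is hypothetical and the shape comments reflect librosa's defaults, not the original code:

```python
# Load one (hypothetical) RAVDESS file at its native sample rate
y_audio, sr = librosa.load("ravdess-data/Actor_01/03-01-02-01-01-01-01.wav", sr=None)

mfcc   = librosa.feature.mfcc(y=y_audio, sr=sr, n_mfcc=40)   # shape: (40, n_frames)
mel    = librosa.feature.melspectrogram(y=y_audio, sr=sr)    # shape: (128, n_frames)
chroma = librosa.feature.chroma_stft(y=y_audio, sr=sr)       # shape: (12, n_frames)
```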

Extracting The Features

We define a function to extract the MFCC, Chroma, and Mel features from a sound file. The function takes four parameters: the file name and three Boolean flags, one per feature. We open the sound file with the SoundFile library, read the samples into X, and also get the sample rate. If chroma is True, we compute the Short-Time Fourier Transform of X.

We let result be an empty NumPy array. Then, for each of the three features that is enabled, we call the corresponding function from librosa.feature, take its mean over time, stack it onto result with numpy's hstack(), and finally return result.
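A minimal sketch of such a function, assuming the flag names mfcc, chroma, and mel and 40 MFCC coefficients (an assumption, not the author's stated setting), could look like this:

```python
def extract_feature(file_name, mfcc=True, chroma=True, mel=True):
    # Open the sound file and read the samples and the sample rate
    with soundfile.SoundFile(file_name) as sound_file:
        X = sound_file.read(dtype="float32")
        sample_rate = sound_file.samplerate
        if chroma:
            # Chroma is computed from the magnitude of the short-time Fourier transform
            stft = np.abs(librosa.stft(X))
        result = np.array([])
        if mfcc:
            mfccs = np.mean(librosa.feature.mfcc(y=X, sr=sample_rate, n_mfcc=40).T, axis=0)
            result = np.hstack((result, mfccs))
        if chroma:
            chroma_feature = np.mean(librosa.feature.chroma_stft(S=stft, sr=sample_rate).T, axis=0)
            result = np.hstack((result, chroma_feature))
        if mel:
            mel_feature = np.mean(librosa.feature.melspectrogram(y=X, sr=sample_rate).T, axis=0)
            result = np.hstack((result, mel_feature))
    return result
```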

Process Followed

We define a dictionary to map the numeric codes in the RAVDESS file names to the emotions available in the dataset, and a list holding the emotions we want to observe: calm, happy, fearful, and disgust. A sketch of both follows below.
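Based on the RAVDESS naming convention, the dictionary and list could look like this (the variable names are my own):

```python
# Emotion codes used in the third field of each RAVDESS file name
emotions = {
    '01': 'neutral', '02': 'calm',    '03': 'happy',   '04': 'sad',
    '05': 'angry',   '06': 'fearful', '07': 'disgust', '08': 'surprised'
}

# The subset of emotions we want to observe in this project
observed_emotions = ['calm', 'happy', 'fearful', 'disgust']
```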

Loading Data

We load the data with a function that takes the relative size of the test set as a parameter. x and y start out as empty lists. We use the glob() function from the glob module to get the pathnames of all the sound files in our dataset.

For each such path, we get the base name of the file and extract the emotion code by splitting the name around '-' and taking the third value.

Using our emotions dictionary, this code is turned into an emotion, and the function checks whether that emotion is in our list of observed emotions; if not, it continues to the next file. Otherwise it calls extract_feature, stores the returned value in feature, and appends the feature to x and the emotion to y. So the list x holds the features and y holds the emotions. Finally, we call train_test_split with these, the test size, and a random state value, and return the result. A sketch of the whole function follows below.
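Putting those steps together, a minimal sketch of the loading function might look like this; the data directory and the random_state value are assumptions, not the author's exact code:

```python
import glob
import os

def load_data(test_size=0.25):
    x, y = [], []
    # Path pattern is hypothetical; point it at wherever the RAVDESS .wav files live
    for file in glob.glob("ravdess-data/Actor_*/*.wav"):
        file_name = os.path.basename(file)
        # The third field of the RAVDESS file name encodes the emotion
        emotion = emotions[file_name.split("-")[2]]
        if emotion not in observed_emotions:
            continue
        feature = extract_feature(file, mfcc=True, chroma=True, mel=True)
        x.append(feature)
        y.append(emotion)
    return train_test_split(np.array(x), y, test_size=test_size, random_state=9)
```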

Train-Test Split

We split our dataset into training and testing data, keeping the test size at 25% of the total data.
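With the loading function sketched above, the split is a single call:

```python
x_train, x_test, y_train, y_test = load_data(test_size=0.25)
```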

Classification Task

For our speech emotion recognition system we will use MLPClassifier, scikit-learn's Multi-Layer Perceptron classifier. It optimizes the log-loss function using LBFGS or stochastic gradient descent, and internally it is a feedforward artificial neural network.

We then train our model and get the predictions from it.
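A minimal sketch of training and prediction, with illustrative hyperparameters rather than the author's exact settings:

```python
# Hyperparameter values here are illustrative assumptions
model = MLPClassifier(hidden_layer_sizes=(300,), alpha=0.01, batch_size=256,
                      learning_rate='adaptive', max_iter=500)
model.fit(x_train, y_train)        # train on the training split
y_pred = model.predict(x_test)     # predict emotions for the test split
```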

Model Evaluation

To calculate the accuracy of our model, we call the accuracy_score() function we imported from sklearn. We also look at the confusion matrix for a better understanding of the model's predictions.
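In code, the evaluation might look like this (a sketch; the printout format is my own):

```python
accuracy = accuracy_score(y_true=y_test, y_pred=y_pred)
print("Accuracy: {:.2f}%".format(accuracy * 100))

# Rows are true emotions, columns are predicted emotions
print(confusion_matrix(y_test, y_pred, labels=observed_emotions))
```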

As the code output shows, the model achieves an accuracy score of 68.75%.

The full code can be found in my GitHub repository:

Feel free to reach out to me on LinkedIn!


Umair Ayub
Analytics Vidhya

Author of “Machine Learning - A Comprehensive Approach”. Interested in Data Science, Machine Learning, and Blockchains.