Emotion recognition with convolutional neural network

Recognizing emotions has always been an exciting challenge for scientists. Recently I have been working on an experimental Speech Emotion Recognition (SER) project to understand the potential of this technology. I picked one of the most popular SER repositories on GitHub and made it the basis of my project.

Before we dig into the project, it is worth recalling the main bottlenecks of SER.

Main obstacles

Emotions are subjective, and even people interpret them differently. The very concept of "emotion" is hard to define;

Annotating audio is difficult. Should we tag every single word, each sentence, or the whole conversation? And exactly which set of emotions should recognition use?

Collecting data is also not easy. A lot of audio can be collected from movies and news broadcasts. However, both sources are "biased": news is supposed to be neutral, and actors' emotions are acted. It is hard to find an "objective" source of audio data.

Labeling the data requires a lot of human time. Unlike drawing bounding boxes on images, it takes specially trained annotators to listen to entire audio recordings, analyze them, and provide labels. And those labels then need to be rated by many other people, because the assessments are subjective.

Project Description

The project uses a convolutional neural network to recognize emotions in audio recordings. And yes, the repository owner did not cite any sources.

Data description

The repository uses two datasets, RAVDESS and SAVEE; I only adapted RAVDESS for my model. RAVDESS contains two types of data: speech and song.

The RAVDESS dataset (The Ryerson Audio-Visual Database of Emotional Speech and Song):

12 actors and 12 actresses recorded their speech and songs in their own performance;

actor #18 has no recorded songs;

the Disgust, Neutral and Surprised emotions are missing from the song data.

Breakdown by emotions

Feature extraction

For speech recognition tasks, mel-frequency cepstral coefficients (MFCCs) are the state-of-the-art feature, despite the fact that they appeared back in the 1980s.

A quote from the MFCC Tutorial:

This shape determines what sound comes out. If we can determine the shape accurately, it should give us an accurate representation of the phoneme being produced. The shape of the vocal tract manifests itself in the envelope of the short-time power spectrum, and the job of MFCCs is to accurately represent this envelope.



We use MFCCs as the input feature. If you are interested in learning more about what MFCCs are, this tutorial is for you. Loading the data and converting it to MFCC format can be done easily with the Python package librosa.

Default model architecture

The author built the CNN model with the Keras package, creating 7 layers: six Conv1D layers followed by one Dense layer.

The author commented out layers 4 and 5 in the latest release (September 18, 2018), and the saved weights file of this model no longer matches the provided network, so I could not reproduce the reported accuracy of 72%.
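For illustration, here is a minimal Keras sketch of a network of the shape described: six Conv1D layers plus one Dense softmax head over (216, 1) MFCC features. The filter counts, kernel sizes, dropout and pooling are my assumptions, not the author's exact values.

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv1D, MaxPooling1D, Dropout, Flatten, Dense

def build_model(n_classes=10, input_len=216):
    # Six Conv1D layers + one Dense layer, as in the architecture described above
    return Sequential([
        Conv1D(256, 5, padding='same', activation='relu', input_shape=(input_len, 1)),
        Conv1D(128, 5, padding='same', activation='relu'),
        Dropout(0.1),
        MaxPooling1D(pool_size=8),
        Conv1D(128, 5, padding='same', activation='relu'),
        Conv1D(128, 5, padding='same', activation='relu'),
        Conv1D(128, 5, padding='same', activation='relu'),
        Conv1D(128, 5, padding='same', activation='relu'),
        Flatten(),
        Dense(n_classes, activation='softmax'),
    ])

model = build_model()
```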

The model is simply trained with batch_size=16 and epochs=700, without any learning-rate schedule, etc.

# Compile model
model.compile(loss='categorical_crossentropy', optimizer=opt, metrics=['accuracy'])

# Fit model
cnnhistory = model.fit(x_traincnn, y_train, batch_size=16, epochs=700, validation_data=(x_testcnn, y_test))

Here categorical_crossentropy is the loss function, and the metric is accuracy.
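To make the loss concrete, here is a small NumPy sketch of categorical cross-entropy. This is my own illustration of the formula, not code from the repository.

```python
import numpy as np

def categorical_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean over samples of -sum_k y_true[k] * log(y_pred[k])."""
    y_pred = np.clip(y_pred, eps, 1.0 - eps)  # avoid log(0)
    return float(-np.sum(y_true * np.log(y_pred), axis=-1).mean())

# A confident correct prediction has a low loss; a 50/50 guess has loss ln 2
low = categorical_crossentropy(np.array([[0., 1.]]), np.array([[0.01, 0.99]]))
guess = categorical_crossentropy(np.array([[0., 1.]]), np.array([[0.5, 0.5]]))
```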

My experiment

Exploratory data analysis

In the RAVDESS dataset, each actor performs 8 emotions, speaking and singing 2 sentences, 2 times each. As a result, each actor yields 4 examples of each emotion, except for the neutral, disgust and surprised cases noted above. Each audio clip lasts about 4 seconds, and the first and last seconds are usually silent.

Typical sentences :


After I picked the data of 1 actor and 1 actress and listened to all of their recordings, I realized that men and women express their emotions differently. For example:

male anger (Angry) is simply louder;

male happiness (Happy) and sadness (Sad) are distinguished by laughing and crying tones during the "silence";

female happiness (Happy), anger (Angry) and sadness (Sad) are louder;

female disgust (Disgust) contains retching sounds.

Repeat experiment

The author removed the neutral, disgust and surprised classes to turn RAVDESS into a 10-class recognition dataset. Trying to repeat the author's experiment, I got the following result:

However, I found that there was data leakage: the validation dataset was identical to the test dataset. Therefore, I redid the data split, isolating the data of two actors and two actresses so that they are never seen during training:

actors 1 to 20 are used for the Train/Valid sets in an 8:2 ratio;

actors 21 to 24 are isolated for the test;

Train set shape: (1248, 216, 1);

Valid set shape: (312, 216, 1);

Test set shape: (320, 216, 1) (isolated).
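A sketch of such a leak-free split, relying on the fact that RAVDESS filenames end in the actor id (e.g. 03-01-05-01-01-01-21.wav). The helper name and its exact signature are mine.

```python
def split_by_actor(filenames, test_actors=range(21, 25)):
    """Route files of actors 21-24 to the test set; the rest go to train/valid."""
    train_valid, test = [], []
    for name in filenames:
        # The last dash-separated field before the extension is the actor id
        actor = int(name.rsplit('.', 1)[0].split('-')[-1])
        (test if actor in test_actors else train_valid).append(name)
    return train_valid, test

files = ['03-01-05-01-01-01-01.wav',
         '03-01-03-01-02-02-21.wav',
         '03-01-04-01-01-01-24.wav']
train_valid, test = split_by_actor(files)
```

Splitting by actor id rather than by random shuffling guarantees that no voice from the test set ever appears in training, which is exactly the leak being fixed.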

I retrained the model, and here is the result:

Performance test

From the Train/Valid Loss graph, it is clear that there is no convergence for the selected 10 classes. Therefore, I decided to reduce the complexity of the model and leave only the male emotions. I isolated two actors for the test set, and put the rest into the train/valid sets in an 8:2 ratio. This ensures that there is no imbalance in the dataset. Then I trained on the male and female data separately to conduct the test.

Male dataset

Train Set – 640 samples from actors 1-10;

Valid Set – 160 samples from actors 1-10;

Test Set – 160 samples from actors 11-12.

Baseline: Men

Female dataset

Train Set – 608 samples from actresses 1-10;

Valid Set – 152 samples from actresses 1-10;

Test Set – 160 samples from actresses 11-12.

Baseline: Women

As you can see, the confusion matrices are different.

Men: anger (Angry) and happiness (Happy) are the dominant predicted classes in the model, but they are not confused with each other.

Women: sadness (Sad) and happiness (Happy) are the dominant predicted classes; anger (Angry) and happiness (Happy) are easily confused.

Recalling the observations from the exploratory analysis, I suspect that female anger (Angry) and happiness (Happy) are confusingly similar because the way both are expressed is simply to raise the voice.

On top of that, I wondered what would happen if I simplified the model even further, leaving only the Positive, Neutral and Negative classes, or just Positive and Negative. In short, I grouped the emotions into 2 and 3 classes, respectively.

2 classes:

Positive: happiness (Happy), calm (Calm);

Negative: anger (Angry), fear (Fearful), sadness (Sad).

3 classes:

Positive: happiness (Happy);

Neutral: calm (Calm), neutral (Neutral);

Negative: anger (Angry), fear (Fearful), sadness (Sad).
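The grouping above can be expressed as simple label maps. This is my own sketch; the lowercase label strings are assumptions about how the emotions are encoded.

```python
# 2-class grouping: positive vs. negative
TWO_CLASS = {'happy': 'positive', 'calm': 'positive',
             'angry': 'negative', 'fearful': 'negative', 'sad': 'negative'}

# 3-class grouping: positive / neutral / negative
THREE_CLASS = {'happy': 'positive',
               'calm': 'neutral', 'neutral': 'neutral',
               'angry': 'negative', 'fearful': 'negative', 'sad': 'negative'}

def regroup(labels, mapping):
    # Collapse fine-grained emotion labels into the coarser classes
    return [mapping[label] for label in labels]

coarse = regroup(['happy', 'sad', 'calm'], TWO_CLASS)
```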

Before starting the experiment, I set up the model architecture using the male data, performing 5-class recognition.

