Before diving into the project, it helps to recall the main difficulties of speech emotion recognition (SER).
Emotions are subjective, and even people interpret them differently. The very concept of “emotion” is hard to define.
Annotating audio is difficult. Should we label every single word, every sentence, or the whole conversation? And exactly which set of emotions should be used for recognition?
Collecting data is not easy either. A lot of audio can be gathered from movies and news. However, both sources are “biased”: news has to be neutral, and actors’ emotions are acted. It is hard to find an “objective” source of audio data.
Labeling data takes a lot of people and time. Unlike drawing bounding boxes on images, it requires specially trained personnel to listen to entire recordings, analyze them and annotate them. These annotations then have to be rated by many other people, because the judgments are subjective.
The project uses a convolutional neural network to recognize emotions in audio recordings. And yes, the repository owner did not cite any sources.
The repository used two datasets, RAVDESS and SAVEE; I only adapted RAVDESS for my model. RAVDESS contains two types of data: speech and song.
The RAVDESS dataset (The Ryerson Audio-Visual Database of Emotional Speech and Song):
12 actors and 12 actresses recorded speech and song performances;
actor #18 has no recorded songs;
the Disgust, Neutral and Surprised emotions are missing from the song data.
Breakdown by emotions
In speech recognition tasks, mel-frequency cepstral coefficients (MFCCs) are still a state-of-the-art feature, even though they were introduced in the 1980s.
Quote from the MFCC Tutorial:
This shape determines what sound comes out. If we can determine the shape accurately, this should give us an accurate representation of the phoneme being produced. The shape of the vocal tract manifests itself in the envelope of the short-time power spectrum, and the job of MFCCs is to accurately represent this envelope.
We use MFCCs as the input feature. If you want to learn more about what MFCCs are, that tutorial is a good place to start. Loading the data and converting it to MFCCs is easy with the Python librosa package.
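As a rough illustration of that step, here is a minimal sketch of MFCC extraction with librosa. The helper names, the file path and the value n_mfcc=13 are my own assumptions for the example, not settings taken from the repository. The small `expected_frames` helper only shows how the number of time steps follows from the clip length: for instance, 2.5 s of audio at 44.1 kHz (110250 samples) with librosa's default hop of 512 gives 216 frames, which happens to match the 216 time steps in the set shapes reported later in this article, although the actual loading parameters are not stated here.

```python
def expected_frames(n_samples, hop_length=512):
    """Number of frames librosa produces with its default centered framing."""
    return 1 + n_samples // hop_length


def extract_mfcc(path, n_mfcc=13):
    """Load an audio file and return its MFCC matrix, shape (n_mfcc, n_frames)."""
    import librosa  # imported lazily; requires the librosa package

    # sr=None keeps the file's native sampling rate instead of resampling
    signal, sample_rate = librosa.load(path, sr=None)
    return librosa.feature.mfcc(y=signal, sr=sample_rate, n_mfcc=n_mfcc)


# Assumed example: 2.5 s at 44100 Hz is 110250 samples
print(expected_frames(110250))
```

Calling `extract_mfcc("some_clip.wav")` (a hypothetical path) would return one column of coefficients per frame; how those columns are reduced to a single input vector is a modelling choice that this sketch does not make.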
Default model architecture
The author built the CNN model with the Keras package, using 7 layers: six Conv1D layers and one Dense layer.
The author commented out layers 4 and 5 in the latest release (September 18, 2018), and the final model file does not match the provided network, so I could not reproduce the reported accuracy of 72%.
The model is simply trained with batch_size=16 and epochs=700, without any training schedule, etc.
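To make the layer count concrete, here is a hedged sketch of what such a network could look like in Keras. Only the overall structure (six Conv1D layers followed by one Dense layer), the input length of 216 and the 10 output classes come from this article; the filter counts, kernel size, padding and activations are placeholders I picked for illustration and are not the author's values.

```python
# Assumed hyperparameters: the article only states the number and type of layers.
CONV_FILTERS = [128, 128, 128, 128, 128, 128]  # one entry per Conv1D layer
KERNEL_SIZE = 5


def build_model(n_frames=216, n_classes=10):
    """Build a 1D CNN: six Conv1D layers followed by one Dense softmax layer."""
    from tensorflow import keras  # imported lazily; requires TensorFlow
    from tensorflow.keras import layers

    model = keras.Sequential()
    model.add(keras.Input(shape=(n_frames, 1)))
    for filters in CONV_FILTERS:
        model.add(layers.Conv1D(filters, KERNEL_SIZE, padding="same", activation="relu"))
    model.add(layers.Flatten())  # flatten the time axis before the classifier
    model.add(layers.Dense(n_classes, activation="softmax"))
    return model
```

With `padding="same"` the time axis keeps its length through every convolution, so the Dense layer sees 216 × 128 values in this particular sketch.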
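# Compile Model (the optimizer is not given in this article; `opt` is a placeholder for it)
model.compile(loss='categorical_crossentropy', optimizer=opt, metrics=['accuracy'])
# Fit Model (note that the test set is passed as validation data)
cnnhistory = model.fit(x_traincnn, y_train, batch_size=16, epochs=700, validation_data=(x_testcnn, y_test))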
Here categorical_crossentropy is the loss function, and accuracy is the evaluation metric.
Exploratory data analysis
In the RAVDESS dataset, each actor performs 8 emotions by speaking and singing 2 sentences, 2 times each. As a result, each actor provides 4 examples of each emotion, except for the above-mentioned neutral, disgust and surprise. Each recording lasts about 4 seconds, and the first and last seconds are mostly silence.
Typical sentences:
I picked the recordings of one actor and one actress, listened to all of them, and noticed that men and women express their emotions differently. For example:
male Angry is simply louder;
male Happy and Sad are marked by laughing and crying tones during the “silent” periods;
female Happy, Angry and Sad are louder;
female Disgust contains a vomiting sound.
The author removed the neutral, disgust and surprised classes, turning RAVDESS into a 10-class recognition dataset (presumably the 5 remaining emotions for each gender). Trying to reproduce the author’s experiment, I got the following result:
However, I found a data leak: the validation set is identical to the test set. I therefore redid the split, holding out two actors and two actresses so that the model never sees them before testing:
actors 1 to 20 are used for the Train / Valid sets in an 8:2 ratio;
actors 21 to 24 are held out for testing;
Train Set shape: (1248, 216, 1);
Valid Set shape: (312, 216, 1);
Test Set shape: (320, 216, 1) – (isolated).
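The split above can be sketched in a few lines of plain Python. This is only an illustration under my own assumptions: the function name, the random 8:2 split within actors 1 to 20 and the per-actor clip counts (80 clips each, 40 for actor 18, who has no songs) are chosen so that the toy data reproduces the set sizes listed above; they are not taken from the repository.

```python
import random


def split_by_actor(samples, test_actors, valid_fraction=0.2, seed=0):
    """Split (actor_id, item) pairs so that test actors never appear in train/valid."""
    test_actors = set(test_actors)
    test = [s for s in samples if s[0] in test_actors]
    rest = [s for s in samples if s[0] not in test_actors]

    random.Random(seed).shuffle(rest)  # fixed seed keeps the split reproducible
    n_valid = round(len(rest) * valid_fraction)
    valid, train = rest[:n_valid], rest[n_valid:]
    return train, valid, test


# Toy data with assumed counts: 80 clips per actor, 40 for actor 18.
samples = [(actor, f"clip_{actor}_{i}")
           for actor in range(1, 25)
           for i in range(40 if actor == 18 else 80)]

train, valid, test = split_by_actor(samples, test_actors=range(21, 25))
print(len(train), len(valid), len(test))  # 1248 312 320
```

The important property is that the held-out actors are removed before any shuffling, so no recording of a test speaker can end up in the training or validation data.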
I re-trained the model and here’s the result:
From the Train / Valid Loss graph, it is clear that there is no convergence for the selected 10 classes. Therefore, I decided to reduce the complexity of the model and leave only male emotions. I isolated two actors as the test set and put the rest in the train / valid set in an 8:2 ratio. This ensures that there is no imbalance in the dataset. Then I trained on the male and female data separately to run the test.
Male data:
Train Set – 640 samples from actors 1-10;
Valid Set – 160 samples from actors 1-10;
Test Set – 160 samples from actors 11-12.
Female data:
Train Set – 608 samples from actresses 1-10;
Valid Set – 152 samples from actresses 1-10;
Test Set – 160 samples from actresses 11-12.
As you can see, the confusion matrices are different.
Men: Angry and Happy are the dominant predicted classes in the model, but they are not confused with each other.
Women: Sad and Happy are the dominant predicted classes in the model; Angry and Happy are easily confused.
Recalling the observations from the exploratory data analysis, I suspect that female Angry and Happy are confusingly similar because both are expressed simply by raising the voice.
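For readers who want to check this kind of confusion on their own predictions, here is a small self-contained sketch of a confusion matrix and of finding the most confused pair of classes. The function names are hypothetical and the labels in the example are invented toy values, not the results of my experiments.

```python
def confusion_matrix(y_true, y_pred, labels):
    """Return counts as nested lists: rows are true classes, columns are predictions."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for true, pred in zip(y_true, y_pred):
        matrix[index[true]][index[pred]] += 1
    return matrix


def most_confused_pair(matrix, labels):
    """Return the (true, predicted) pair with the largest off-diagonal count."""
    n = len(labels)
    best = max((matrix[i][j], labels[i], labels[j])
               for i in range(n) for j in range(n) if i != j)
    return best[1], best[2]


# Invented toy predictions, for illustration only
labels = ["angry", "happy", "sad"]
y_true = ["angry", "angry", "angry", "happy", "happy", "sad"]
y_pred = ["happy", "happy", "angry", "happy", "angry", "sad"]

matrix = confusion_matrix(y_true, y_pred, labels)
print(matrix)                              # [[1, 2, 0], [1, 1, 0], [0, 0, 1]]
print(most_confused_pair(matrix, labels))  # ('angry', 'happy')
```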
On top of that, I wondered what would happen if I simplified the task even further, keeping only Positive and Negative classes, or only Positive, Neutral and Negative. In short, I grouped the emotions into 2 and 3 classes, respectively.
2 classes:
Positive: Happy, Calm;
Negative: Angry, Fearful, Sad.
3 classes:
Positive: Happy;
Neutral: Calm, Neutral;
Negative: Angry, Fearful, Sad.
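The two groupings above amount to a simple label mapping. The sketch below shows one way to express it; the dictionary and function names are my own, and dropping samples whose emotion is not covered by a mapping (for example Neutral in the 2-class setup) is an assumption about how uncovered labels would be handled.

```python
TWO_CLASS = {
    "happy": "positive", "calm": "positive",
    "angry": "negative", "fearful": "negative", "sad": "negative",
}
THREE_CLASS = {
    "happy": "positive",
    "calm": "neutral", "neutral": "neutral",
    "angry": "negative", "fearful": "negative", "sad": "negative",
}


def regroup(samples, mapping):
    """Relabel (item, emotion) pairs; pairs with an unmapped emotion are dropped."""
    return [(item, mapping[emotion]) for item, emotion in samples if emotion in mapping]


# Hypothetical clips, for illustration only
clips = [("a.wav", "happy"), ("b.wav", "calm"), ("c.wav", "neutral"), ("d.wav", "sad")]
print(regroup(clips, TWO_CLASS))
print(regroup(clips, THREE_CLASS))
```

Keeping the item and its label together in one pair means features and labels cannot get out of step when some samples are dropped.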
Before starting the experiment, I tuned the model architecture on the male data, doing 5-class recognition.