Awais Asghar, Sarmad Sohaib, Saman Iftikhar, Muhammad Shafi, Kiran Fatima.
Abstract
Emotion recognition from acoustic signals plays a vital role in the field of audio and speech processing. Speech interfaces offer humans an informal and comfortable means to communicate with machines. Emotion recognition from speech signals has a variety of applications in human-computer interaction (HCI) and human behavior analysis. In this work, we develop the first emotional speech database of the Urdu language. We also develop a system to classify five emotions (sadness, happiness, neutral, disgust, and anger) using different machine learning algorithms. The Mel Frequency Cepstrum Coefficient (MFCC), Linear Prediction Coefficient (LPC), energy, spectral flux, spectral centroid, spectral roll-off, and zero-crossing were used as speech descriptors. The classification tests were performed on the emotional speech corpus collected from 20 different subjects. To evaluate the quality of the recorded emotions, subjective listening tests were conducted. The recognition rate on the complete Urdu emotional speech corpus was 66.5% with K-nearest neighbors. The disgust emotion had a lower recognition rate than the other emotions. Removing the disgust emotion improved the performance of the classifier to 76.5%.
Keywords: Emotion recognition; Human behavior analysis; Human computer interaction; Linear prediction coefficient (LPC); Machine learning algorithms; Mel frequency cepstrum coefficient (MFCC); Speech descriptors; Urdu
Year: 2022 PMID: 35634125 PMCID: PMC9138108 DOI: 10.7717/peerj-cs.954
Source DB: PubMed Journal: PeerJ Comput Sci ISSN: 2376-5992
Figure 1. Emotion recognition pipeline.
Summary of literature on emotion recognition from different languages.
| Papers with year | Dataset used | Emotions recognized | Technique used | Achieved accuracy |
|---|---|---|---|---|
| | Berlin EmoDB | Anger, disgust, fear, joy, neutral, surprise and sadness | SVM and multivariate linear regression (MLR) | 83% |
| | EmoDB dataset | Neutral, anger and sad | Deep belief network (DBN) and Stacked encoder | 65% |
| | Chinese emotional speech | Anger, scared, happiness, sadness, neutral and surprise | Wavelet-kernel sparse SVM | 80.95% |
| | IEMOCAP dataset | Anger, happiness, sadness, and neutral | RNN with 3 layers | 71.04% |
| | EmotAss dataset | Anger, happiness, neutral and sadness | CNN and RNN with | 45.12% |
| | IEMOCAP dataset | Anger, happiness, neutral, sadness, surprise, fear and disgust | LSTM | 70.6% |
| | IEMOCAP dataset | Neutral, sadness, frustration and anger | CNN | 47% |
| | Urdu language emotional speech dataset | Anger, happiness, neutral and sadness | SVM, logistic regression and random forest | 83.4% |
| | Spontaneous emotional | Happiness, sadness, anger and neutral | CNN and ResNet | 78.7% |
| | INTERSPEECH 2009, ABC and EmoDB | Happiness, sadness, neutral, fear, surprise, disgust and anger | Emotion discriminative and domain invariant feature learning method (EDFLM) | 65.62% |
| | IEMOCAP dataset | Neutral, happiness, sadness, anger and silence | RNN and CNN | 64.78% |
| | Acted and spontaneous | Sadness, happiness, anger, neutral, joy, fear, and surprise | SVM and k-NN | 81% |
| | IEMOCAP dataset | Neutral, anger, sadness, and happiness | Recurrent neural network (RNN) and SVM | 63.5% |
| | Chinese emotional speech dataset | Sadness, joy, anger, neutral, fear, and surprise | SVM and deep learning | 84.54% |
| | Chinese emotional speech dataset | Sadness, joy, anger, neutral, fear, and surprise | Combination of SVM and deep learning | 95.8% |
| | Malayalam language emotional speech dataset | Neutral, anger, happiness and sad | ANN and SVM | 78.2% |
| | English emotion speech | Sadness, happiness, anger and neutral | Artificial neural network | 85% |
| | District name of Pakistan dataset | | SVM and GMM | 71% |
| | SAVEE and Malayalam | Anger, happiness, neutral and sadness | Support vector machine | 75% |
| | Urdu language emotional speech dataset | Anger, sadness, happiness and neutral | Decision tree and J48 | 40% |
| | ENTERFACE and SAVEE | Boredom, disgust, sadness, joy, anger and neutral | Deep neural network (DNN) | 60.53% |
| | Provisional language of Pakistan emotional speech dataset | Comfort, happiness, sadness and neutral | Multilayer perceptron (MLP), Naïve Bayes and SMO | 75% |
| | Polish emotional speech | Sadness, happiness, anger and neutral | k-NN | 70% |
Figure 2. Flow chart of Urdu emotional speech dataset creation and validation.
Number of samples per emotion.
| Emotion | Number of samples |
|---|---|
| Angry | 500 |
| Disgust | 400 |
| Happy | 500 |
| Neutral | 450 |
| Sad | 450 |
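As a rough point of reference for the accuracies reported in this record, the sample counts above fix the corpus size and the accuracy of a trivial classifier that always predicts the most frequent class. The sketch below only does that arithmetic; the helper name `majority_baseline` is introduced here for illustration and is not from the paper.

```python
# Sample counts per emotion, copied from the table above.
samples = {"Angry": 500, "Disgust": 400, "Happy": 500, "Neutral": 450, "Sad": 450}


def majority_baseline(counts):
    """Accuracy of always predicting the most frequent class."""
    return max(counts.values()) / sum(counts.values())


total = sum(samples.values())
# Same corpus with the disgust class dropped.
without_disgust = {k: v for k, v in samples.items() if k != "Disgust"}

print(total)                                         # 2300
print(round(majority_baseline(samples), 3))          # 0.217
print(round(majority_baseline(without_disgust), 3))  # 0.263
```

The classes are close to balanced, so plain accuracy is a reasonable headline metric here, and the baseline shifts only slightly when disgust is removed.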
Chosen Urdu language utterances with English translation.
| Sentences in Urdu | English translation |
|---|---|
| Pakistan kesa hai? | How is Pakistan? |
| qareeb tareen hospital kahan hai? | Where is the nearest hospital? |
| kapre fridg pr parey hein | The clothes are lying on the fridge. |
| tum kahan gaye they? | Where did you go? |
| kahan ho ajj kal? | Where are you nowadays? |
Recognition rate of each emotion during validation process.
| Emotion | Recognition rate |
|---|---|
| Angry | 96% |
| Sadness | 94% |
| Neutral | 92% |
| Happy | 80% |
| Disgust | 76% |
Figure 3. Pre-processing flow of speech signal.
Feature dimensions.
| Features name | Features dimensions |
|---|---|
| MFCC | 13 |
| Mean of MFCC | 13 |
| Standard deviation of MFCC | 13 |
| LPC | 10 |
| Mean of LPC | 10 |
| Spectral flux | 01 |
| Spectral centroid | 01 |
| Spectral rolloff | 01 |
| Zero crossing | 01 |
| Energy | 01 |
| Total feature vector | 64 |
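To make the scalar descriptors in the table more concrete, here is a minimal pure-Python sketch of three of them (zero crossings, energy, spectral centroid) together with a check that the listed dimensions add up to 64. These are common textbook definitions and may differ from the authors' exact choices (frame length, windowing, normalization are not stated in this record). MFCC, LPC, spectral flux and spectral roll-off are not implemented, and the 8 kHz sample rate and 1 kHz test tone are assumptions made only for the example.

```python
import cmath
import math

# Feature layout copied from the table above: name -> dimensions.
LAYOUT = {
    "MFCC": 13, "Mean of MFCC": 13, "Standard deviation of MFCC": 13,
    "LPC": 10, "Mean of LPC": 10,
    "Spectral flux": 1, "Spectral centroid": 1, "Spectral rolloff": 1,
    "Zero crossing": 1, "Energy": 1,
}


def zero_crossings(frame):
    """Count sign changes between consecutive samples."""
    return sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))


def energy(frame):
    """Sum of squared sample values."""
    return sum(x * x for x in frame)


def magnitude_spectrum(frame):
    """Magnitudes of DFT bins 0..N/2, using a naive O(N^2) DFT."""
    n = len(frame)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * t / n)
                for t, x in enumerate(frame)))
        for k in range(n // 2 + 1)
    ]


def spectral_centroid(frame, sample_rate):
    """Magnitude-weighted mean frequency in Hz."""
    mags = magnitude_spectrum(frame)
    n = len(frame)
    total = sum(mags)
    if total == 0:
        return 0.0
    return sum((k * sample_rate / n) * m for k, m in enumerate(mags)) / total


print(sum(LAYOUT.values()))            # 64
print(zero_crossings([1, -1, 1, -1]))  # 3
print(energy([1, -1, 1, -1]))          # 4

# Hypothetical test tone: 1 kHz sine, 64 samples at 8 kHz (exactly 8 cycles).
tone = [math.sin(2 * math.pi * 1000 * t / 8000) for t in range(64)]
centroid = spectral_centroid(tone, 8000)  # close to 1000 Hz
```

In practice an FFT routine from a numerical library would replace the naive DFT; it is used here only to keep the sketch dependency-free.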
Figure 4. Proposed emotion recognition system for Urdu speech signals.
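The best-performing classifier named in the abstract is k-nearest neighbors. The following is a minimal sketch of that technique on toy data, not the authors' implementation: the value `k=3`, the Euclidean distance, and the two-dimensional example vectors are all assumptions chosen for illustration (the real feature vectors have 64 dimensions, and the record does not state the value of k or the distance metric).

```python
import math
from collections import Counter


def knn_predict(train_x, train_y, query, k=3):
    """Label a query vector by majority vote among its k nearest neighbors."""
    # Pair every training vector's Euclidean distance with its label.
    dists = sorted(
        (math.dist(x, query), label) for x, label in zip(train_x, train_y)
    )
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]


# Hypothetical 2-D feature vectors standing in for real speech features.
train_x = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
train_y = ["sad", "sad", "sad", "angry", "angry", "angry"]

print(knn_predict(train_x, train_y, (0.5, 0.5)))  # sad
print(knn_predict(train_x, train_y, (5.5, 5.5)))  # angry
```

Because k-NN relies on raw distances, features with very different scales (for example energy versus MFCC means) are usually standardized first; whether the authors did so is not stated here.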
Comparison of performance of classification algorithms on emotional speech corpus with disgust emotions.
| ML techniques | Male only: accuracy | Male only: precision | Male only: recall | Female only: accuracy | Female only: precision | Female only: recall | Complete dataset: accuracy | Complete dataset: precision | Complete dataset: recall |
|---|---|---|---|---|---|---|---|---|---|
| One- | 69.5% | 71% | 69% | 68.4% | 71% | 68% | 60.6% | 62% | 61% |
| One- | 70% | 71% | 70% | 65.6% | 67% | 66% | 62.2% | 64% | 62% |
| k-NN | 73% | 73% | 72% | 66.4% | 69% | 66% | 66.2% | 67% | 66% |
| Random Forest | 66.5% | 67% | 66% | 58.8% | 62% | 59% | 60.8% | 64% | 61% |
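For readers who want to relate accuracy, precision and recall figures of this kind to a confusion matrix, the sketch below computes all three from a matrix of counts. It assumes macro averaging (an unweighted mean over classes); the record does not say which averaging the authors used, and the 2x2 matrix is a made-up toy example, not data from the paper.

```python
def metrics_from_confusion(cm):
    """Return (accuracy, macro precision, macro recall).

    cm[i][j] is the number of samples of true class i predicted as class j.
    """
    n = len(cm)
    total = sum(sum(row) for row in cm)
    accuracy = sum(cm[i][i] for i in range(n)) / total
    precisions, recalls = [], []
    for i in range(n):
        row_sum = sum(cm[i])                        # samples truly in class i
        col_sum = sum(cm[r][i] for r in range(n))   # samples predicted as i
        recalls.append(cm[i][i] / row_sum if row_sum else 0.0)
        precisions.append(cm[i][i] / col_sum if col_sum else 0.0)
    return accuracy, sum(precisions) / n, sum(recalls) / n


# Hypothetical two-class confusion matrix.
toy = [[8, 2],
       [1, 9]]
acc, prec, rec = metrics_from_confusion(toy)
print(round(acc, 3), round(prec, 3), round(rec, 3))  # 0.85 0.854 0.85
```

With nearly balanced classes, macro-averaged recall stays close to accuracy, which matches the pattern in the table above.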
Comparison of performance of classification algorithms on emotional speech corpus without disgust emotions.
| ML techniques | Male only: accuracy | Male only: precision | Male only: recall | Female only: accuracy | Female only: precision | Female only: recall | Complete dataset: accuracy | Complete dataset: precision | Complete dataset: recall |
|---|---|---|---|---|---|---|---|---|---|
| One- | 75% | 75% | 74% | 78.5% | 81% | 79% | 70.2% | 72% | 70% |
| One- | 79.5% | 80% | 79% | 77.5% | 78% | 79% | 70.7% | 72% | 71% |
| k-NN | 82.5% | 84% | 83% | 76% | 76% | 76% | 76.5% | 77% | 77% |
| Random Forest | 74% | 74% | 73% | 71% | 72% | 71% | 71.5% | 73% | 71% |
Comparison with related work.
| Papers | Languages | Training technique | Feature extraction techniques | Emotions | Classifier used | Accuracy |
|---|---|---|---|---|---|---|
| | English and German | Speaker dependent | RNN | Anger, happiness, neutral and sadness | RNN with three | 71.04% |
| | Polish | Speaker dependent and independent | MFCC, BFCC, RASTA, energy, formants, LPC and HFCC | Sadness, happiness, anger, neutral, joy, fear and surprise | SVM and k-NN | 81% |
| | Malayalam | Speaker dependent | MFCC, STE and pitch | Neutral, anger, happiness and | ANN and SVM | 78% |
| | Urdu | Speaker dependent | Duration, intensity, pitch and formants | Anger, sadness, happiness and | Naive Bayes | 76% |
| | Urdu | Speaker dependent | Intensity, pitch and formants | Anger, sadness, happiness and | SMO, MLP, J48 and Naive Bayes | 75% |
| | Urdu | Speaker independent | Low-level descriptors (LLDs) | Happiness, sadness, anger and neutral | SVM, logistic regression and RF | 83% |
| | English | Speaker dependent | MFCC, pitch and energy | Anger, neutral, sadness and | SVM | 70% |
| Our work | Urdu (with disgust emotion) | Speaker dependent | MFCC, LPC, energy, pitch, zero crossing, spectral flux, spectral centroid, spectral roll-off | Anger, disgust, happiness, | k-Nearest Neighbors | 73% |
| Our work | Urdu (without disgust emotion) | Speaker dependent | MFCC, LPC, energy, pitch, zero crossing, spectral flux, spectral centroid, spectral roll-off | Anger, happiness, sadness and neutral | k-Nearest Neighbors | 82.5% |
Figure 5. ROC curve of k-NN with disgust emotion.
Figure 6. ROC curve of k-NN without disgust emotion.
Figure 7. Confusion matrix of k-NN for complete dataset including disgust.
Figure 8. Confusion matrix of k-NN for complete dataset without disgust.