Pavol Partila, Miroslav Voznak, and Jaromir Tovarek.
Abstract
The impact of the classification method and feature selection on speech emotion recognition accuracy is discussed in this paper. Selecting the correct parameters in combination with the classifier is an important part of reducing the computational complexity of the system, a step that is especially necessary for systems deployed in real-time applications. The motivation for developing and improving speech emotion recognition systems is their wide applicability in today's automatic voice-controlled systems. The Berlin database of emotional recordings was used in this experiment. The classification accuracy of artificial neural networks, k-nearest neighbours, and the Gaussian mixture model is measured with respect to the selection of prosodic, spectral, and voice quality features. The purpose was to find an optimal combination of methods and feature groups for stress detection in human speech. The research contribution lies in the design of the speech emotion recognition system with respect to its accuracy and efficiency.
Year: 2015 PMID: 26346654 PMCID: PMC4539500 DOI: 10.1155/2015/573068
Source DB: PubMed Journal: ScientificWorldJournal ISSN: 1537-744X
Figure 1: Block diagram of the speech emotion recognition system (SERS). The system consists of a database used for training and testing and of blocks describing the functions of the algorithm. Different views on the problem are represented by separate blocks: (a) stress versus neutral state classification, (b) classification of each kind of emotion, and (c) other approaches. The scenario option is represented by the control block.
Figure 2: Artificial neural network architecture with hidden layers and output classes.
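The architecture in Figure 2 (input features, hidden layers, and one output unit per class) can be illustrated by a forward pass with tanh hidden activations and a softmax output. This is a minimal sketch for illustration, not the authors' implementation; the layer sizes, activation choice, and weight values are assumptions.

```python
import math

def forward(x, weights, biases):
    """One forward pass through a fully connected network:
    tanh on hidden layers, softmax on the output layer.
    weights[i] is a list of rows (one per neuron), biases[i] a bias vector."""
    a = x
    for i, (W, b) in enumerate(zip(weights, biases)):
        z = [sum(w * v for w, v in zip(row, a)) + bb for row, bb in zip(W, b)]
        if i < len(weights) - 1:
            a = [math.tanh(v) for v in z]               # hidden layer
        else:
            e = [math.exp(v - max(z)) for v in z]        # numerically stable softmax
            a = [v / sum(e) for v in e]                  # class probabilities
    return a

# Toy 2-input / 2-hidden / 2-output network with identity-like weights
weights = [[[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]]]
biases = [[0.0, 0.0], [0.0, 0.0]]
probs = forward([1.0, 0.0], weights, biases)
```

In a trained network the two outputs would correspond to the "neutral state" and "stress" classes, and the weights would be fitted by backpropagation.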
Performance results of classifiers for LPC features.
| LPC | ANN | | | k-NN | | | GMM | | |
|---|---|---|---|---|---|---|---|---|---|
| Neutral state | 1718 | 920 | 65.1% | 2233 | 3241 | 40.8% | 5240 | 6023 | 46.5% |
| Stress | 7542 | 23410 | 75.6% | 7027 | 21089 | 75.0% | 4020 | 18307 | 81.9% |
| | 18.6% | 96.2% | 74.8% | 24.1% | 86.7% | 69.4% | 56.6% | 75.2% | 73.4% |
| | Neutral state | Stress | | Neutral state | Stress | | Neutral state | Stress | |
Performance results of classifiers for LSP features.
| LSP | ANN | | | k-NN | | | GMM | | |
|---|---|---|---|---|---|---|---|---|---|
| Neutral state | 3466 | 1814 | 66.1% | 3409 | 3369 | 50.3% | 5778 | 5452 | 51.5% |
| Stress | 5794 | 14820 | 71.7% | 5851 | 13265 | 69.4% | 3482 | 18818 | 84.4% |
| | 37.4% | 89.1% | 70.6% | 36.8% | 79.7% | 64.4% | 62.4% | 77.6% | 73.4% |
| | Neutral state | Stress | | Neutral state | Stress | | Neutral state | Stress | |
Performance results of classifiers for prosodic features.
| Prosodic | ANN | | | k-NN | | | GMM | | |
|---|---|---|---|---|---|---|---|---|---|
| Neutral state | 4470 | 1699 | 72.5% | 3953 | 2503 | 61.2% | 6483 | 4774 | 57.6% |
| Stress | 4790 | 22629 | 82.5% | 5307 | 21825 | 80.4% | 2777 | 19556 | 87.6% |
| | 48.3% | 93.0% | 80.7% | 42.7% | 89.7% | 76.7% | 70.0% | 80.4% | 70.1% |
| | Neutral state | Stress | | Neutral state | Stress | | Neutral state | Stress | |
Performance results of classifiers for MFCC features (including dynamic and acceleration coefficients).
| MFCC | ANN | | | k-NN | | | GMM | | |
|---|---|---|---|---|---|---|---|---|---|
| Neutral state | 7445 | 1901 | 79.7% | 4919 | 2209 | 69.0% | 6305 | 3507 | 63.3% |
| Stress | 1684 | 22587 | 93.1% | 4210 | 22279 | 84.1% | 2824 | 20981 | 88.1% |
| | 81.6% | 92.2% | 89.3% | 53.9% | 91.0% | 80.9% | 69.1% | 85.7% | 81.2% |
| | Neutral state | Stress | | Neutral state | Stress | | Neutral state | Stress | |
Confusion matrix: description of cells.
| Classifier | Target classes | | |
|---|---|---|---|
| Output classes | True positive | False positive | Positive predictive value |
| | False negative | True negative | Negative predictive value |
| | Sensitivity | Specificity | Precision |
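The derived cells can be computed from the four raw counts. As a minimal sketch (not the authors' code), the following reproduces the ANN/LPC column of the first performance table from its counts (TP = 1718, FP = 920, FN = 7542, TN = 23410):

```python
def confusion_metrics(tp, fp, fn, tn):
    """Derive the cell values of the confusion-matrix layout from raw counts."""
    return {
        "ppv": tp / (tp + fp),              # positive predictive value
        "npv": tn / (fn + tn),              # negative predictive value
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (fp + tn),
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
    }

# ANN / LPC counts from the first performance table
m = confusion_metrics(tp=1718, fp=920, fn=7542, tn=23410)
# m["ppv"] ≈ 65.1%, m["sensitivity"] ≈ 18.6%, m["specificity"] ≈ 96.2%
```

These values match the 65.1%, 75.6%, 18.6%, 96.2%, and 74.8% entries of the LPC table above.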
Figure 3: Receiver operating characteristic of the GMM, k-NN, and ANN classifiers for neutral-state-versus-stress recognition with LPC, LSP, prosodic, and MFCC features.
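An ROC curve like those in Figure 3 is obtained by sweeping a decision threshold over the classifier's scores and recording the false-positive and true-positive rates. A minimal pure-Python sketch (illustrative only; the scores and labels below are made up):

```python
def roc_points(scores, labels):
    """Return (FPR, TPR) pairs by sweeping a threshold over the scores.
    labels: 1 for the positive class (stress), 0 for the negative (neutral)."""
    pos = sum(labels)
    neg = len(labels) - pos
    points = []
    for thr in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= thr and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= thr and y == 0)
        points.append((fp / neg, tp / pos))
    return points

# Hypothetical classifier scores for four utterances
pts = roc_points([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0])
```

Plotting these points for each classifier/feature pair gives curves like Figure 3; a curve hugging the top-left corner indicates better separation of stress from the neutral state.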
Gaussian mixture model classification accuracy for different combinations of emotions [%].
| Train 2/test | Train 1 | ||||||
|---|---|---|---|---|---|---|---|
| Anger | Boredom | Disgust | Fear | Happiness | Sadness | Neutral state | |
| Anger | — | 91.7 | 85.7 | 83.4 | 70 | 96.4 | 90.8 |
| Boredom | 76.3 | — | 64.9 | 66.2 | 71.9 | 65.3 | 59 |
| Disgust | 59.1 | 64.3 | — | 62.5 | 56 | 78.5 | 64.7 |
| Fear | 52.4 | 72.6 | 59.7 | — | 47.2 | 81.8 | 70.4 |
| Happiness | 42.7 | 83.9 | 73.7 | 73.9 | — | 90.5 | 82.4 |
| Sadness | 87.1 | 67 | 75.5 | 73 | 85.5 | — | 75 |
| Neutral state | 78.6 | 53.3 | 69.1 | 63.9 | 74 | 65.7 | — |
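The GMM decision rule behind the table above fits one model per emotion and assigns an utterance to the emotion whose model gives the highest likelihood. As a hedged sketch, the following implements the single-component, diagonal-covariance special case in pure Python; the paper's actual mixtures, component counts, and features are not reproduced here, and the sample vectors are invented:

```python
import math

def fit_gauss(samples):
    """Fit a diagonal Gaussian (the 1-component special case of a GMM)."""
    n, d = len(samples), len(samples[0])
    mean = [sum(s[j] for s in samples) / n for j in range(d)]
    var = [sum((s[j] - mean[j]) ** 2 for s in samples) / n + 1e-6 for j in range(d)]
    return mean, var

def log_lik(x, model):
    """Log-likelihood of feature vector x under a diagonal Gaussian."""
    mean, var = model
    return sum(-0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mean, var))

def classify(x, models):
    """Pick the emotion whose model scores x highest."""
    return max(models, key=lambda label: log_lik(x, models[label]))

# Toy 2-D feature vectors for two emotions (hypothetical values)
models = {
    "anger": fit_gauss([[0.0, 0.0], [0.2, 0.2], [0.1, 0.3]]),
    "sadness": fit_gauss([[5.0, 5.0], [5.2, 4.8], [4.9, 5.1]]),
}
```

A full GMM would mix several such Gaussians per emotion (fitted by EM) and sum their weighted likelihoods before taking the maximum.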
K-nearest neighbours classification accuracy for different combinations of emotions [%].
| Train 2/test | Train 1 | ||||||
|---|---|---|---|---|---|---|---|
| Anger | Boredom | Disgust | Fear | Happiness | Sadness | Neutral state | |
| Anger | — | 91.2 | 90.3 | 89.8 | 81.1 | 94.7 | 92.4 |
| Boredom | 92 | — | 70.4 | 67.7 | 75.9 | 61.6 | 64.6 |
| Disgust | 49 | 55 | — | 59.7 | 51.8 | 71.2 | 56.6 |
| Fear | 46.9 | 60.3 | 56.5 | — | 47.8 | 75.1 | 68.2 |
| Happiness | 31.9 | 78 | 72.7 | 73.1 | — | 86.1 | 78.7 |
| Sadness | 89.2 | 70.5 | 84.8 | 79.7 | 89.6 | — | 81.1 |
| Neutral state | 79.8 | 40.5 | 70.6 | 66 | 77.4 | 65.7 | — |
Feed-forward backpropagation neural network classification accuracy for different combinations of emotions [%].
| Train 2/test | Train 1 | ||||||
|---|---|---|---|---|---|---|---|
| Anger | Boredom | Disgust | Fear | Happiness | Sadness | Neutral state | |
| Anger | — | 95.8 | 93.7 | 92.4 | 87.9 | 98.1 | 96.7 |
| Boredom | 92 | — | 87.6 | 83.4 | 91.2 | 75.1 | 77.1 |
| Disgust | 79.8 | 79.1 | — | 68.2 | 77.7 | 86.4 | 77.6 |
| Fear | 69.6 | 70.6 | 73.7 | — | 68.2 | 82.2 | 76.2 |
| Happiness | 32.3 | 88.3 | 79.8 | 83.9 | — | 95 | 88.9 |
| Sadness | 97.9 | 81.5 | 92.6 | 95.3 | 97.5 | — | 85.2 |
| Neutral state | 93 | 49.9 | 86 | 85.1 | 88.2 | 52.4 | — |