Abstract
The classification of emotional speech is widely studied in speech-related research on human-computer interaction (HCI). This paper presents a novel feature extraction method based on multi-resolution texture image information (MRTII). The MRTII feature set is derived from multi-resolution texture analysis for the characterization and classification of different emotions in a speech signal. The motivation is that emotions carry different intensity values in different frequency bands. In terms of human visual perception, the multi-resolution texture properties of the emotional speech spectrogram should form a good feature set for emotion classification in speech. Furthermore, multi-resolution texture analysis discriminates between emotions more clearly than uniform-resolution texture analysis. To provide high emotional discrimination accuracy, especially in real-life conditions, an acoustic activity detection (AAD) algorithm is applied within the MRTII-based feature extraction. Considering the presence of many blended emotions in real life, this paper makes use of two corpora of naturally occurring dialogs recorded in real-life call centers. Compared with traditional Mel-scale frequency cepstral coefficients (MFCC) and state-of-the-art features, the MRTII features improve the correct classification rates of the proposed systems across different-language databases. Experimental results show that the proposed MRTII-based features, inspired by human visual perception of the spectrogram image, provide significant classification gains for real-life emotion recognition in speech.
Year: 2015 PMID: 25594590 PMCID: PMC4327087 DOI: 10.3390/s150101458
Source DB: PubMed Journal: Sensors (Basel) ISSN: 1424-8220 Impact factor: 3.576
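The abstract outlines a pipeline of treating the spectrogram as a texture image, decomposing it at multiple resolutions, and summarizing each subband with texture statistics. A minimal sketch of that flow is given below; it is an illustration, not the authors' implementation, and the simple mean/std/energy statistics stand in for the paper's actual texture descriptors.

```python
import numpy as np
import pywt
from scipy.signal import spectrogram

def mrtii_features(x, fs, levels=4, wavelet="db1"):
    """Texture statistics over a multi-resolution spectrogram decomposition."""
    # 1. Spectrogram of the utterance, used as a 2-D texture image.
    _, _, sxx = spectrogram(x, fs=fs, nperseg=256, noverlap=128)
    img = np.log1p(sxx)                       # compress dynamic range

    # 2. Multi-resolution analysis: L-level 2-D wavelet decomposition.
    coeffs = pywt.wavedec2(img, wavelet=wavelet, level=levels)

    # 3. Per-subband statistics as stand-in texture descriptors.
    feats = []
    approx = coeffs[0]                        # coarsest LL approximation
    feats += [approx.mean(), approx.std(), np.mean(approx**2)]
    for detail in coeffs[1:]:                 # (LH, HL, HH) at each level
        for band in detail:
            feats += [band.mean(), band.std(), np.mean(band**2)]
    return np.asarray(feats)
```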
Figure 1. The flowchart for deriving the proposed MRTII-based feature extraction approach.
Figure 2. Spectrogram image decomposition: (a) one-level; (b) two-level.
Figure 3. The first preferred channel in the 4-level tree-structured wavelet transform domain for three types of emotion: Anger, Fear and Neutral.
The first 4 dominant channels for five types of emotion.

| 1st channel | 2nd channel | 3rd channel | 4th channel |
|---|---|---|---|
| LL1, LL2, HL3, LH4 | LL1, LL2, HH3, LL4 | LL1, HL2, HL3, HH4 | LL1, HL2, HH3, HH4 |
| LL1, LL2, LL3, LL4 | LL1, LL2, LL3, LH4 | LL1, LL2, LH3, LL4 | LL1, LL2, HH3, HL4 |
| LL1, LL2, HL3, LH4 | LL1, LL2, HH3, LH4 | LL1, HL2, HH3, HL4 | LL1, HL2, HL3, HH4 |
| LL1, HL2, LH3, LL4 | LL1, HL2, LH3, LH4 | LH1, HL2, LL3, LL4 | LH1, HL2, LL3, LH4 |
| LL1, HL2, LH3, LL4 | LL1, HL2, LH3, LL4 | LL1, HL2, HH3, HL4 | LH1, HL2, HL3, HH4 |
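The channel labels above are paths through a 4-level tree-structured (wavelet packet) decomposition of the spectrogram image: e.g., LL1, LL2, HL3, LH4 takes the LL subband at levels 1 and 2, HL at level 3, and LH at level 4. A sketch of ranking terminal subbands by energy to pick dominant channels follows; the energy criterion and the pywt-to-paper subband name mapping are assumptions, since the paper's selection rule is not reproduced here.

```python
import numpy as np
import pywt

# Assumed mapping from pywt 2-D packet labels to the LL/LH/HL/HH notation.
SUBBAND = {"a": "LL", "h": "LH", "v": "HL", "d": "HH"}

def dominant_channels(img, level=4, wavelet="db1", top=4):
    """Rank depth-4 wavelet-packet subbands of a spectrogram image by energy."""
    wp = pywt.WaveletPacket2D(data=img, wavelet=wavelet, maxlevel=level)
    nodes = wp.get_level(level)               # all terminal subbands
    energy = {n.path: float(np.sum(n.data**2)) for n in nodes}
    ranked = sorted(energy, key=energy.get, reverse=True)[:top]
    # Translate e.g. "aavh" -> "LL1, LL2, HL3, LH4" as in the table above.
    return [", ".join(f"{SUBBAND[c]}{i + 1}" for i, c in enumerate(path))
            for path in ranked]
```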
The proportion of each emotion label in the dialog Corpus 1 and Corpus 2.

| Corpus | | | | | | |
|---|---|---|---|---|---|---|
| Corpus 1 | 5.7% | 1.5% | 2.4% | 3.5% | 3.05% | 83.85% |
| Corpus 1 | 1.2% | 0.4% | 0.3% | 5.2% | 2.74% | 94.16% |
| Corpus 2 | 9.23% | 5.8% | 6.8% | 0.3% | 1.64% | 76.23% |
| Corpus 2 | 1.8% | 1.0% | 1.2% | 2.6% | 1.54% | 91.86% |
Prosodic features set.

| Feature group | Features |
|---|---|
| F0 (8 features) | mean, std, max value, relative position of max, min value, relative position of min, range, number of local max points |
| dF0 (8 features) | mean of positive, mean of negative, std, max value, relative position of max, min value, relative position of min, ratio of positive |
| logE (3 features) | std, max value, relative position of max |
| dlogE (8 features) | mean of positive, mean of negative, std, max value, relative position of max, min value, relative position of min, ratio of positive |
| Duration (3 features) | speaking rate, std of voiced duration, mean pause time |
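For illustration, the F0 and delta-contour statistics in this table can be computed roughly as follows, assuming an F0 contour has already been extracted (zeros marking unvoiced frames) and that the same delta statistics apply to both dF0 and dlogE:

```python
import numpy as np

def f0_stats(f0):
    """The 8 F0 statistics: voiced frames only (zeros = unvoiced)."""
    v = f0[f0 > 0]
    rel = lambda i: i / max(len(v) - 1, 1)    # relative position in [0, 1]
    n_local_max = int(np.sum((v[1:-1] > v[:-2]) & (v[1:-1] > v[2:])))
    return [v.mean(), v.std(), v.max(), rel(v.argmax()),
            v.min(), rel(v.argmin()), v.max() - v.min(), n_local_max]

def delta_stats(contour):
    """The 8 delta statistics used for dF0 and dlogE."""
    d = np.diff(contour)
    pos, neg = d[d > 0], d[d < 0]
    rel = lambda i: i / max(len(d) - 1, 1)
    return [pos.mean() if pos.size else 0.0,  # mean of positive deltas
            neg.mean() if neg.size else 0.0,  # mean of negative deltas
            d.std(), d.max(), rel(d.argmax()),
            d.min(), rel(d.argmin()),
            pos.size / max(d.size, 1)]        # ratio of positive deltas
```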
Description of the collected speech database.

| Database | Ha | Fe | Sa | An |
|---|---|---|---|---|
| EMO-DB | 71 | 69 | 62 | 127 |
| eNTERFACE | 207 | 215 | 210 | 215 |
| KHUSC-EmoDB | 102 | 102 | 102 | 102 |
The LLD features used in the INTERSPEECH 2009 emotion challenge [37].

| LLD (16 × 2) | Functionals (12) |
|---|---|
| (delta) ZCR | Mean |
| (delta) RMS energy | Standard deviation |
| (delta) F0 | Kurtosis, skewness |
| (delta) HNR | Extremes: value, rel. position, range |
| (delta) MFCC 1–12 | Linear regression: offset, slope, MSE |
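This scheme pairs 16 LLD contours and their deltas (32 contours) with 12 functionals each, giving 384 features per utterance. A sketch of the functional computation, assuming the LLD contours themselves (ZCR, RMS energy, F0, HNR, MFCC 1-12) are extracted elsewhere:

```python
import numpy as np
from scipy.stats import kurtosis, skew

def functionals(c):
    """The 12 functionals applied to one LLD contour."""
    c = np.asarray(c, dtype=float)
    t = np.arange(len(c))
    slope, offset = np.polyfit(t, c, 1)       # linear regression
    mse = np.mean((np.polyval([slope, offset], t) - c) ** 2)
    return [c.mean(), c.std(), kurtosis(c), skew(c),
            c.max(), c.argmax() / max(len(c) - 1, 1),   # extremes
            c.min(), c.argmin() / max(len(c) - 1, 1),
            c.max() - c.min(),
            offset, slope, mse]

def is09_features(llds):
    """llds: the 16 contours (ZCR, RMS energy, F0, HNR, MFCC 1-12)."""
    feats = []
    for c in llds:
        feats += functionals(c)               # contour itself
        feats += functionals(np.diff(c))      # its delta
    return np.asarray(feats)                  # 16 x 2 x 12 = 384 features
```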
Performance comparison: the average percentage of classification accuracy (APCA) with and without AAD.

| Database | Without AAD | With AAD |
|---|---|---|
| EMO-DB | 66.54% | 69.23% |
| eNTERFACE | 60.21% | 64.58% |
| KHUSC-EmoDB | 59.58% | 62.36% |
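The AAD step removes non-speech frames before feature extraction. A minimal energy-threshold sketch is shown below; the paper's AAD algorithm is more elaborate, and the frame sizes and margin_db threshold here are illustrative assumptions.

```python
import numpy as np

def aad_mask(x, frame=400, hop=160, margin_db=12.0):
    """Mark frames whose log energy is within margin_db of the peak."""
    x = np.asarray(x, dtype=float)
    frames = np.lib.stride_tricks.sliding_window_view(x, frame)[::hop]
    log_e = 10 * np.log10(np.mean(frames**2, axis=1) + 1e-10)
    return log_e > log_e.max() - margin_db    # True = active (speech) frame
```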
The APCA for the Mixed database using MRTII with/without cubic curve.

| Database | Without cubic curve | With cubic curve |
|---|---|---|
| EMO-DB | 78.42% | 84.53% |
| eNTERFACE | 77.58% | 81.34% |
| KHUSC-EmoDB | 76.24% | 80.15% |
| Mixed | 76.82% | 84.58% |
The APCA for the Mixed database using MFCC and MRTII features combined with SVM, KNN and LDA classifiers.

| Classifier | MFCC | MRTII |
|---|---|---|
| SVM | 69.23% | 86.23% |
| KNN | 64.58% | 84.76% |
| LDA | 61.14% | 83.85% |
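A comparison of this kind can be run with off-the-shelf classifiers. A sketch using scikit-learn stand-ins for the SVM, KNN and LDA back-ends on precomputed feature matrices (hyperparameters are illustrative, not the paper's settings):

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score

def compare_classifiers(X, y, folds=5):
    """Mean cross-validated accuracy for each back-end classifier."""
    models = {"SVM": SVC(kernel="rbf"),
              "KNN": KNeighborsClassifier(n_neighbors=5),
              "LDA": LinearDiscriminantAnalysis()}
    return {name: cross_val_score(model, X, y, cv=folds).mean()
            for name, model in models.items()}
```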
Confusion table for the artificial databases using MFCC, Prosodic, LLD and MRTII features. Columns are grouped left to right as MFCC, Prosodic, LLD and MRTII, with Ha/Fe/Sa/An within each group.

| Database | Ha | Fe | Sa | An | Ha | Fe | Sa | An | Ha | Fe | Sa | An | Ha | Fe | Sa | An |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| EMO-DB | 63.59 | 60.25 | 57.28 | 66.48 | 85.37 | 82.56 | 89.39 | 86.49 | 90.12 | 88.72 | 90.65 | 89.37 | 90.54 | 89.28 | 88.43 | 91.32 |
| eNTERFACE | 57.94 | 59.31 | 64.52 | 60.28 | 82.49 | 79.38 | 87.26 | 85.21 | 87.29 | 85.83 | 88.39 | 86.58 | 88.28 | 84.67 | 89.48 | 87.92 |
| KHUSC-EmoDB | 50.48 | 56.29 | 55.38 | 61.28 | 80.27 | 75.38 | 82.39 | 83.67 | 84.22 | 82.95 | 85.93 | 85.27 | 84.28 | 84.48 | 83.02 | 86.58 |

Average: MFCC 58.01%, Prosodic 82.08%, LLD 85.80%, MRTII 86.64%.
Ha: Happiness; Fe: Fear; Sa: Sadness; An: Anger.
Figure 4. Comparison of the recognition rates using different feature extractions.
Confusion table for the real-life corpora using MFCC, Prosodic, LLD and MRTII features. Columns are grouped left to right as MFCC, Prosodic, LLD and MRTII, with Ha/Fe/Sa/An within each group.

| Corpus | Ha | Fe | Sa | An | Ha | Fe | Sa | An | Ha | Fe | Sa | An | Ha | Fe | Sa | An |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 121 agent-client dialogs in MCSC | 42.8 | 58.3 | 57.3 | 58.7 | 64.8 | 65.3 | 66.8 | 65.7 | 69.5 | 70.1 | 72.6 | 70.3 | 75.4 | 74.4 | 74.8 | 76.5 |
| 68 agent-client dialogs in HECC | 43.6 | 54.7 | 53.6 | 54.2 | 60.3 | 62.6 | 65.5 | 66.8 | 67.9 | 68.4 | 69.2 | 70.9 | 71.5 | 72.6 | 71.7 | 72.5 |

Average: MFCC 52.90%, Prosodic 64.73%, LLD 69.86%, MRTII 73.68%.
Ha: Happiness; Fe: Fear; Sa: Sadness; An: Anger.