Lianzhang Zhu, Leiming Chen, Dehai Zhao, Jiehan Zhou, Weishan Zhang.
Abstract
Accurate emotion recognition from speech is important for applications such as smart healthcare, smart entertainment, and other smart services. High-accuracy emotion recognition from Chinese speech is challenging due to the complexities of the Chinese language. In this paper, we explore how to improve the accuracy of speech emotion recognition, covering both speech signal feature extraction and emotion classification methods. Five types of features are extracted from each speech sample: mel-frequency cepstral coefficients (MFCC), pitch, formants, short-term zero-crossing rate, and short-term energy. By comparing statistical features with deep features extracted by a Deep Belief Network (DBN), we attempt to find the features that best identify the emotional status of speech. We propose a novel classification method that combines a DBN and a support vector machine (SVM) instead of using either alone. In addition, a conjugate gradient method is applied to train the DBN in order to speed up the training process. Gender-dependent experiments are conducted using an emotional speech database created by the Chinese Academy of Sciences. The results show that DBN features reflect emotional status better than hand-crafted features, and our new classification approach achieves an accuracy of 95.8%, which is higher than using either the DBN or the SVM separately. The results also show that a DBN can work very well on small training databases if it is properly designed.
Keywords: Deep Belief Networks; speech emotion recognition; speech features; support vector machine
Year: 2017 PMID: 28737705 PMCID: PMC5539696 DOI: 10.3390/s17071694
Source DB: PubMed Journal: Sensors (Basel) ISSN: 1424-8220 Impact factor: 3.576
Figure 1. Process of extracting speech features.
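Two of the five features named in the abstract, short-term energy and short-term zero-crossing rate, are simple frame-level statistics. The sketch below is a minimal illustration of how they are typically computed; the frame length (25 ms) and hop (10 ms) are common defaults assumed here, not values taken from the paper.

```python
import numpy as np

def frame_signal(x, frame_len=400, hop=160):
    """Split a 1-D signal into overlapping frames (one frame per row)."""
    n_frames = 1 + (len(x) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return x[idx]

def short_term_energy(frames):
    """Sum of squared samples in each frame."""
    return np.sum(frames ** 2, axis=1)

def short_term_zcr(frames):
    """Fraction of adjacent-sample sign changes in each frame."""
    signs = np.sign(frames)
    return np.mean(np.abs(np.diff(signs, axis=1)) > 0, axis=1)

# toy example: 1 s of a 100 Hz sine sampled at 16 kHz
sr = 16000
t = np.arange(sr) / sr
x = np.sin(2 * np.pi * 100 * t)
frames = frame_signal(x)          # 400-sample frames, 160-sample hop
energy = short_term_energy(frames)
zcr = short_term_zcr(frames)
```

A low-frequency voiced sound like this sine yields high energy and low zero-crossing rate; unvoiced fricatives show the opposite pattern, which is why the two features complement each other.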
Figure 2. Process of extracting Mel-Frequency Cepstral Coefficients (MFCC).
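The standard MFCC pipeline depicted in Figure 2 runs pre-emphasis, framing and windowing, power spectrum, mel filterbank, log, and DCT. The following is a minimal NumPy/SciPy sketch of that pipeline with commonly used parameter values (0.97 pre-emphasis, 26 filters, 13 coefficients); the paper's exact settings are not given here, so these are assumptions.

```python
import numpy as np
from scipy.fftpack import dct

def hz_to_mel(f):
    return 2595 * np.log10(1 + f / 700)

def mel_to_hz(m):
    return 700 * (10 ** (m / 2595) - 1)

def mfcc(x, sr=16000, frame_len=400, hop=160, n_fft=512, n_filt=26, n_ceps=13):
    # 1. pre-emphasis boosts high frequencies
    x = np.append(x[0], x[1:] - 0.97 * x[:-1])
    # 2. framing + Hamming window
    n_frames = 1 + (len(x) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = x[idx] * np.hamming(frame_len)
    # 3. power spectrum of each frame
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft
    # 4. triangular mel-spaced filterbank
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_filt + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_filt, n_fft // 2 + 1))
    for i in range(1, n_filt + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        fbank[i - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[i - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    # 5. log filterbank energies, then DCT decorrelates them
    feats = np.log(power @ fbank.T + 1e-10)
    return dct(feats, type=2, axis=1, norm='ortho')[:, :n_ceps]

# toy input: 1 s of a 440 Hz tone at 16 kHz
x = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)
feats = mfcc(x)
```

Each row of `feats` is one frame's 13-dimensional cepstral vector; statistics over these rows (mean, variance, etc.) are the kind of per-utterance features the paper compares against DBN features.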
Figure 3. Structure of a Deep Belief Network (DBN).
Figure 4. Structure combining a support vector machine (SVM) and a DBN. Speech features are converted into deep features by a pre-trained DBN; these are the feature vectors output by the last hidden layer of the DBN. The feature vectors act as the input of the SVM and are used to train it. The output of the SVM classifier is the emotion status corresponding to the input speech sample.
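The Figure 4 architecture, a pre-trained deep network whose last hidden layer feeds an SVM, can be approximated with scikit-learn's stacked `BernoulliRBM` layers (the building block of a DBN) piped into an `SVC`. This is a sketch of the idea on synthetic data, not the authors' implementation; layer sizes and hyperparameters are illustrative assumptions.

```python
import numpy as np
from sklearn.neural_network import BernoulliRBM
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# toy stand-in for speech feature vectors (e.g. MFCC statistics), 2 emotion classes
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.3, 0.1, (100, 40)),
               rng.normal(0.7, 0.1, (100, 40))])
y = np.repeat([0, 1], 100)

# two unsupervised RBM layers play the role of the DBN; their final
# hidden activations become the deep features handed to the SVM
model = Pipeline([
    ("scale", MinMaxScaler()),  # RBMs expect inputs in [0, 1]
    ("rbm1", BernoulliRBM(n_components=64, learning_rate=0.05,
                          n_iter=20, random_state=0)),
    ("rbm2", BernoulliRBM(n_components=32, learning_rate=0.05,
                          n_iter=20, random_state=0)),
    ("svm", SVC(kernel="rbf", C=1.0)),
])
model.fit(X, y)
acc = model.score(X, y)
```

In the pipeline, `fit` trains each RBM unsupervised on the previous layer's output before the SVM ever sees a label, mirroring the pre-train-then-classify split described in the caption.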
Attributes of the Chinese speech database.
| Corpus | Access | Language | Size | Subject | Emotions |
|---|---|---|---|---|---|
| CASIA | Commercially available | Chinese | 400 utterances (4 actors × 6 emotions) | 4 professional actors | ang, fea, hap, neu, sad, sur |
Figure 5. Dataset structure.
Accuracy of the male and female groups using SVM.
| Emotion | Male (%) | Female (%) |
|---|---|---|
| Angry | 90 | 82 |
| Fear | 88 | 86 |
| Happy | 84 | 82 |
| Neutral | 80 | 88 |
| Sad | 86 | 88 |
| Surprise | 82 | 82 |
| Average | 85 | 84.6 |
Figure 6. DBN training phase.
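The abstract notes that a conjugate gradient method is used to speed up DBN training. As a minimal illustration of the principle (not the paper's DBN), the sketch below fine-tunes a single logistic output layer with SciPy's nonlinear conjugate-gradient optimizer, which uses curvature-aware search directions instead of plain gradient steps; the data and layer are toy assumptions.

```python
import numpy as np
from scipy.optimize import minimize

# toy 2-class data standing in for DBN top-layer features
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-1, 1, (50, 10)),
               rng.normal(1, 1, (50, 10))])
y = np.repeat([0.0, 1.0], 50)

def loss_and_grad(w):
    """Logistic loss and its gradient; w = [weights, bias]."""
    z = X @ w[:-1] + w[-1]
    p = 1 / (1 + np.exp(-z))
    loss = -np.mean(y * np.log(p + 1e-12) + (1 - y) * np.log(1 - p + 1e-12))
    err = (p - y) / len(y)
    grad = np.append(X.T @ err, err.sum())
    return loss, grad

# nonlinear conjugate gradient in place of plain gradient descent
res = minimize(loss_and_grad, np.zeros(11), jac=True, method="CG")
w = res.x
acc = np.mean(((X @ w[:-1] + w[-1]) > 0) == y)
```

Because conjugate gradient reuses information from previous search directions, it typically needs far fewer passes than fixed-step gradient descent to reach the same loss, which is the speed-up the paper targets.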
Accuracy of the male and female groups using DBN.
| Emotion | Male (%) | Female (%) |
|---|---|---|
| Angry | 92 | 94 |
| Fear | 94 | 96 |
| Happy | 90 | 96 |
| Neutral | 98 | 96 |
| Sad | 98 | 90 |
| Surprise | 96 | 96 |
| Average | 94.6 | 94.6 |
Accuracy of the male and female groups using the combined DBN and SVM algorithm.
| Emotion | Male (%) | Female (%) |
|---|---|---|
| Angry | 96 | 98 |
| Fear | 98 | 96 |
| Happy | 94 | 96 |
| Neutral | 96 | 96 |
| Sad | 94 | 94 |
| Surprise | 98 | 94 |
| Average | 96 | 95.6 |
Summary of related work.
| Author | Database Size | Number of Emotions | Subjects | Gender-Dependent | Classifier | Features |
|---|---|---|---|---|---|---|
| Amiya Kumar | 4200 utterances | 7 | 5 native speakers | No | SVM | Statistic |
| Zhaocheng Huang | 12 h | 4 | professional actors | No | GMM | Intrinsic |
| Zixing Zhang | 18,216 utterances | 11 | 51 children | No | SVM | Intrinsic |
| Arti Rawat | 10 utterances | 5 | 5 people | No | NN | Intrinsic |
| Ya Li | 535 utterances | 7 | 10 professional actors | No | SVM | Statistic |
| Jinkyu Lee | 12 h | 4 | professional actors | Yes | RNN | Intrinsic |
| Kunching Wang | 1584 utterances | 4 | professional actors | No | SVM | Statistic |
| Weishan Zhang | 1200 utterances | 6 | professional actors | Yes | DBN | Statistic |