Ming-Che Lee, Sheng-Cheng Yeh, Jia-Wei Chang, Zhen-Yi Chen.
Abstract
In recent years, the use of artificial intelligence for emotion recognition has attracted much attention. Emotion recognition has broad industrial applicability and good development potential. This research applies voice emotion recognition technology to Chinese speech. Its main purpose is to move increasingly popular smart-home voice assistants and AI service robots from touch-based interfaces toward voice operation. A specifically designed Deep Neural Network (DNN) model is proposed to build a Chinese speech emotion recognition system, using 29 acoustic features from acoustic theory as the training attributes of the model. The research also proposes a variety of audio adjustment methods to enlarge the dataset and improve training accuracy, including waveform adjustment, pitch adjustment, and pre-emphasis. The system achieved an average emotion recognition accuracy of 88.9% on the CASIA Chinese emotion corpus. The results show that the proposed deep learning model and audio adjustment methods can effectively identify the emotions of short Chinese sentences and can be applied to Chinese voice assistants or integrated with other dialogue applications.
Keywords: acoustic features; deep neural network; emotion recognition
Year: 2022 PMID: 35808238 PMCID: PMC9269147 DOI: 10.3390/s22134744
Source DB: PubMed Journal: Sensors (Basel) ISSN: 1424-8220 Impact factor: 3.847
Figure 1. Spectral centroid waveform.
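The spectral centroid shown in Figure 1 is one of the 29 acoustic features used as training attributes. As a minimal sketch (the paper's exact feature-extraction pipeline is not reproduced here), a frame's centroid is the amplitude-weighted mean frequency of its magnitude spectrum:

```python
import numpy as np

def spectral_centroid(frame: np.ndarray, sample_rate: int) -> float:
    """Amplitude-weighted mean frequency of one audio frame."""
    magnitude = np.abs(np.fft.rfft(frame))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sample_rate)
    if magnitude.sum() == 0.0:
        return 0.0
    return float(np.sum(freqs * magnitude) / np.sum(magnitude))

# Sanity check: a pure 440 Hz tone should have its centroid near 440 Hz.
sr = 16000
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 440.0 * t)
print(round(spectral_centroid(tone, sr)))  # → 440
```

For a whole utterance the centroid is typically computed per frame and tracked over time, which is what the "waveform" in the figure title suggests.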
Figure 2. Chromagram obtained from a voice recording.
Figure 3. Training flowchart of the proposed emotion recognition model.
Figure 4. The proposed deep neural network model.
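The exact layer configuration of the model in Figure 4 is not given in this excerpt. The following is a minimal numpy sketch of a feed-forward DNN of the shape the abstract describes (29 acoustic features in, five emotion classes out); the hidden sizes 64 and 32 are illustrative assumptions, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# 29 acoustic features in, 5 emotion classes out
# (Angry, Fearful, Happy, Calm, Sad).
w1, b1 = rng.normal(size=(29, 64)) * 0.1, np.zeros(64)
w2, b2 = rng.normal(size=(64, 32)) * 0.1, np.zeros(32)
w3, b3 = rng.normal(size=(32, 5)) * 0.1, np.zeros(5)

def forward(features: np.ndarray) -> np.ndarray:
    """Forward pass: two hidden ReLU layers, softmax output."""
    h = relu(features @ w1 + b1)
    h = relu(h @ w2 + b2)
    return softmax(h @ w3 + b3)

probs = forward(rng.normal(size=(1, 29)))
print(probs.shape)  # → (1, 5), each row summing to 1
```

Training such a model against the augmented feature vectors would use standard cross-entropy minimization; the sketch only shows the inference shape.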
Figure 5. (a) Original waveform; (b) waveform after 60% rotation.
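The "rotation" of Figure 5 circularly shifts the waveform, producing a new training sample with unchanged spectral content. A minimal sketch, assuming the rotation amount is expressed as a fraction of the signal length:

```python
import numpy as np

def rotate_waveform(samples: np.ndarray, fraction: float) -> np.ndarray:
    """Circularly shift the waveform by `fraction` of its length."""
    shift = int(len(samples) * fraction)
    return np.roll(samples, shift)

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
print(rotate_waveform(x, 0.6))  # → [3. 4. 5. 1. 2.]
```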
Figure 6. Waveforms of the frequency-adjusted and original data.
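One simple way to realize a frequency adjustment like that of Figure 6 (whether this matches the paper's exact method is an assumption) is to resample the signal by linear interpolation while keeping the playback rate fixed, which scales every frequency component by a constant factor:

```python
import numpy as np

def adjust_frequency(samples: np.ndarray, factor: float) -> np.ndarray:
    """Resample so that, played back at the original rate, every
    frequency component is scaled by `factor` (e.g. 0.9 = -10%)."""
    positions = np.arange(int(len(samples) / factor)) * factor
    return np.interp(positions, np.arange(len(samples)), samples)

# A 200 Hz tone scaled by 1.5 should peak near 300 Hz.
sr = 8000
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 200.0 * t)
shifted = adjust_frequency(tone, 1.5)
spectrum = np.abs(np.fft.rfft(shifted, n=sr))  # 1 Hz bins
peak_hz = int(np.argmax(spectrum))  # → near 300
```

Note that this also shortens the clip by the same factor; pitch adjustment (Figure 7) differs in that it changes pitch while preserving duration, which needs a more involved method such as a phase vocoder.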
Figure 7. Waveforms of the pitch-adjusted and original data.
Figure 8. Spectrograms of the original data and of the data after pre-emphasis.
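The pre-emphasis of Figure 8 is the standard first-order high-pass filter y[n] = x[n] - alpha * x[n-1], which boosts high frequencies before feature extraction. The coefficient alpha = 0.97 below is the common default, not a value taken from the paper:

```python
import numpy as np

def pre_emphasize(samples: np.ndarray, alpha: float = 0.97) -> np.ndarray:
    """Apply y[n] = x[n] - alpha * x[n-1]; y[0] = x[0]."""
    return np.append(samples[0], samples[1:] - alpha * samples[:-1])

x = np.array([1.0, 1.0, 1.0, 1.0])
print(pre_emphasize(x, 0.97))  # constant (DC) input is suppressed to 0.03
```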
Figure 9. Results of the original method.
Figure 10. Results of the pre-emphasis procedure.
Comparison of original method and pre-emphasized results.
| Model | Angry | Fearful | Happy | Calm | Sad | Average |
|---|---|---|---|---|---|---|
| Original | 76.3% | 68.9% | 73.3% | 55.6% | 57.0% | 66.2% |
| Pre-emphasized | 86.7% | 78.5% | 83.0% | 90.4% | 79.3% | 83.6% |
Comparison of original method and rotating results.
| Model | Angry | Fearful | Happy | Calm | Sad | Average |
|---|---|---|---|---|---|---|
| Original | 76.3% | 68.9% | 73.3% | 55.6% | 57.0% | 66.2% |
| | 83.7% | 74.8% | 91.9% | 86.7% | 80.7% | 83.6% |
| | 83.7% | 77.8% | 88.9% | 87.4% | 83.0% | 84.1% |
| | 83.0% | 68.9% | 88.1% | 87.4% | 84.4% | 82.4% |
| | 87.4% | 80.0% | 91.1% | 88.9% | 76.3% | 84.7% |
| | 80.7% | 73.3% | 89.6% | 85.9% | 78.5% | 81.6% |
| | 83.7% | 75.6% | 91.1% | 87.4% | 80.0% | 83.6% |
| | 82.2% | 69.6% | 88.1% | 85.2% | 86.7% | 82.4% |
| | 83.0% | 73.3% | 89.6% | 85.2% | 83.7% | 83.0% |
| | 82.2% | 72.6% | 91.1% | 87.4% | 79.3% | 82.5% |
Comparison of original method and sound frequency adjustment.
| Model | Angry | Fearful | Happy | Calm | Sad | Average |
|---|---|---|---|---|---|---|
| Original | 76.3% | 68.9% | 73.3% | 55.6% | 57.0% | 66.2% |
| | 84.4% | 77.8% | 89.6% | 93.3% | 78.5% | 84.7% |
| | 89.6% | 77.0% | 91.9% | 93.3% | 79.3% | 86.2% |
| | 83.7% | 76.3% | 94.1% | 91.9% | 82.2% | 85.6% |
| | 88.1% | 77.0% | 92.6% | 90.4% | 82.2% | 86.1% |
| | 85.9% | 78.5% | 93.3% | 91.1% | 83.7% | 86.5% |
| | 85.2% | 72.6% | 94.8% | 88.9% | 81.5% | 84.6% |
| | 85.2% | 71.9% | 85.9% | 90.4% | 82.2% | 83.1% |
| | 85.9% | 77.0% | 89.6% | 92.6% | 80.7% | 85.2% |
| | 86.7% | 80.7% | 88.9% | 87.4% | 77.8% | 84.3% |
| | 80.7% | 74.8% | 90.4% | 86.7% | 79.3% | 82.4% |
| | 82.2% | 76.3% | 88.9% | 85.2% | 84.4% | 83.4% |
| | 77.2% | 71.1% | 90.0% | 83.3% | 83.9% | 81.1% |
Comparison of original method and sound pitch adjustment.
| Model | Angry | Fearful | Happy | Calm | Sad | Average |
|---|---|---|---|---|---|---|
| Original | 76.3% | 68.9% | 73.3% | 55.6% | 57.0% | 66.2% |
| | 79.7% | 73.7% | 89.4% | 84.3% | 84.2% | 82.3% |
| | 77.0% | 76.3% | 91.1% | 88.1% | 72.6% | 81.0% |
| | 77.0% | 77.8% | 86.7% | 86.7% | 75.6% | 80.7% |
| | 82.2% | 75.6% | 85.2% | 80.7% | 79.3% | 80.6% |
| | 83.7% | 74.1% | 86.7% | 86.7% | 78.5% | 81.9% |
| | 83.7% | 75.6% | 90.4% | 86.7% | 80.7% | 83.4% |
| | 82.2% | 68.9% | 85.9% | 85.9% | 84.4% | 81.5% |
| | 80.0% | 77.0% | 82.2% | 85.9% | 79.3% | 80.9% |
| | 77.0% | 72.6% | 85.2% | 89.6% | 78.5% | 80.6% |
| | 75.6% | 74.8% | 89.6% | 85.9% | 78.5% | 80.9% |
| | 85.2% | 80.0% | 87.4% | 87.4% | 74.8% | 83.0% |
| | 81.5% | 76.3% | 88.1% | 88.9% | 79.3% | 82.8% |
Mixed adjustment data test results.
| Model | Angry | Fearful | Happy | Calm | Sad | Average |
|---|---|---|---|---|---|---|
| Original | 76.3% | 68.9% | 73.3% | 55.6% | 57.0% | 66.2% |
| | 85.2% | 79.3% | 96.3% | 91.1% | 79.3% | 86.2% |
| | 83.7% | 77.8% | 93.3% | 93.3% | 83.0% | 86.2% |
| | 90.3% | 79.4% | 95.3% | 92.2% | 77.8% | 87.0% |
| | 93.3% | 82.2% | 97.8% | 93.3% | 77.8% | 88.9% |
The confusion matrix of the best sound adjustment model.
| Actual \ Predicted | Happy | Sad | Angry | Calm | Fearful |
|---|---|---|---|---|---|
| Happy | 93.3% | 0.0% | 4.4% | 2.2% | 0.0% |
| Sad | 0.0% | 82.2% | 0.0% | 0.0% | 17.8% |
| Angry | 2.2% | 0.0% | 97.8% | 0.0% | 0.0% |
| Calm | 0.0% | 0.0% | 0.0% | 93.3% | 6.7% |
| Fearful | 0.0% | 20.0% | 0.0% | 2.2% | 77.8% |
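As a quick consistency check, the 88.9% average accuracy reported in the abstract equals the mean of the confusion-matrix diagonal, taking rows as the actual class in the order Happy, Sad, Angry, Calm, Fearful (an order inferred from the dominant entries of the table above):

```python
import numpy as np

# Rows = actual class, columns = predicted class, in the order
# Happy, Sad, Angry, Calm, Fearful; values in percent.
confusion = np.array([
    [93.3,  0.0,  4.4,  2.2,  0.0],
    [ 0.0, 82.2,  0.0,  0.0, 17.8],
    [ 2.2,  0.0, 97.8,  0.0,  0.0],
    [ 0.0,  0.0,  0.0, 93.3,  6.7],
    [ 0.0, 20.0,  0.0,  2.2, 77.8],
])
per_class = np.diag(confusion)  # per-emotion recall
average = float(per_class.mean())
print(round(average, 1))  # → 88.9
```

The off-diagonal mass also shows where the model struggles: Sad and Fearful are confused with each other far more than with any other pair.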
Comparison of accuracy between KNN, GoogLeNet, and the original method of this research.
| Method | Training Time | Accuracy | Accuracy | Accuracy |
|---|---|---|---|---|
| KNN | 1.5 s | 81.1% | - | 71.2% |
| GoogLeNet | 13.8 min | - | 65.1% | 51.2% |
| Proposed DNN | 25.4 s | 93.3% | 72.8% | 66.2% |
Comparison of accuracy between KNN, GoogLeNet, and the proposed approach with 40% pre-emphasis.
| Method | Training Time | Accuracy | Accuracy | Accuracy |
|---|---|---|---|---|
| KNN | 5.4 s | 82.5% | - | 76.6% |
| GoogLeNet | 43.5 min | - | 75.6% | 66.5% |
| Proposed DNN | 32.7 s | 95.2% | 88.1% | 86.2% |
Comparison of accuracy between KNN, GoogLeNet, and the proposed approach with rotation 40% and FM-10%.
| Method | Training Time | Accuracy | Accuracy | Accuracy |
|---|---|---|---|---|
| KNN | 5.3 s | 82.2% | - | 75.7% |
| GoogLeNet | 41.2 min | - | 72.4% | 66.7% |
| Proposed DNN | 64.9 s | 97.0% | 92.7% | 88.9% |
Comparison of accuracy between KNN, GoogLeNet, and the proposed approach with rotation 40%, FM-10%, and pre-emphasis.
| Method | Training Time | Accuracy | Accuracy | Accuracy |
|---|---|---|---|---|
| KNN | 9.7 s | 84.1% | - | 77.9% |
| GoogLeNet | 56.8 min | - | 81.0% | 68.7% |
| Proposed DNN | 50.9 s | 94.4% | 89.1% | 86.2% |