Dong Liu, Zhiyong Wang, Lifeng Wang, Longxi Chen.
Abstract
Redundant information and noise introduced during single-modal feature extraction make it difficult for traditional learning algorithms to achieve ideal recognition performance. A multimodal fusion emotion recognition method for speech and facial expressions based on deep learning is proposed. Firstly, a dedicated feature extraction method is set up for each single modality: speech features are extracted with a convolutional neural network-long short-term memory (CNN-LSTM) network, and facial-expression features in the video are extracted with an Inception-ResNet-v2 network. Then, a long short-term memory (LSTM) network is used to capture the correlations between and within modalities. After chi-square-test feature selection, the single-modality features are spliced into a unified fusion feature. Finally, the fused features output by the LSTM are fed to a LIBSVM classifier to realize the final emotion recognition. Experimental results show that the recognition accuracy of the proposed method on the MOSI and MELD datasets is 87.56% and 90.06%, respectively, better than that of the comparison methods. This lays a theoretical foundation for the application of multimodal fusion in emotion recognition.
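As a rough illustration of the fusion pipeline the abstract describes, the following Python sketch assumes the two single-modality extractors have already produced fixed-length feature vectors per segment; all array shapes, the selection size k, and the SVM kernel are illustrative assumptions rather than the paper's configuration, and scikit-learn's SVC is used as a stand-in for LIBSVM.

```python
# Illustrative-only shapes and labels; the speech/face feature matrices stand in
# for the outputs of the CNN-LSTM and Inception-ResNet-v2 extractors.
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
speech_feats = rng.random((200, 256))    # placeholder per-segment speech features
face_feats = rng.random((200, 1536))     # placeholder per-segment expression features
labels = rng.integers(0, 2, 200)         # placeholder emotion labels

fused = np.concatenate([speech_feats, face_feats], axis=1)       # feature splicing
fused = MinMaxScaler().fit_transform(fused)                      # chi2 needs non-negative input
fused = SelectKBest(chi2, k=512).fit_transform(fused, labels)    # chi-square feature selection
clf = SVC(kernel="rbf").fit(fused, labels)                       # SVC is built on LIBSVM
print("training accuracy:", clf.score(fused, labels))
```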
Keywords: LibSVM classifier; deep learning; emotion recognition; expression; long short-term memory; multimodal fusion; voice
Year: 2021 PMID: 34305565 PMCID: PMC8300695 DOI: 10.3389/fnbot.2021.697634
Source DB: PubMed Journal: Front Neurorobot ISSN: 1662-5218 Impact factor: 2.650
Figure 1. Multimodal emotion recognition model based on deep learning.
Figure 2. Model flow graph of CNN-LSTM speech feature extraction.
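A minimal Keras-style sketch of a CNN-LSTM speech branch in the spirit of Figure 2; the log-mel-spectrogram input shape and every layer size are assumptions, not the paper's stated configuration.

```python
# Hypothetical CNN-LSTM speech feature extractor (Keras); the spectrogram input
# shape (time, mel_bins, 1) and all layer sizes are illustrative assumptions.
import tensorflow as tf
from tensorflow.keras import layers

def build_speech_extractor(time_steps=300, mel_bins=64, feat_dim=256):
    inp = layers.Input(shape=(time_steps, mel_bins, 1))
    x = layers.Conv2D(32, 3, padding="same", activation="relu")(inp)
    x = layers.MaxPooling2D(pool_size=(1, 2))(x)   # pool over frequency only
    x = layers.Conv2D(64, 3, padding="same", activation="relu")(x)
    x = layers.MaxPooling2D(pool_size=(1, 2))(x)
    x = layers.Reshape((time_steps, -1))(x)        # restore a (time, features) sequence
    x = layers.LSTM(feat_dim)(x)                   # temporal modelling of the utterance
    return tf.keras.Model(inp, x, name="cnn_lstm_speech")

speech_extractor = build_speech_extractor()        # yields one 256-d vector per utterance
```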
Figure 3. Network model flow diagram of Inception-ResNet-v2.
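The Figure 3 extractor can be approximated with the pretrained Inception-ResNet-v2 shipped with Keras; using ImageNet weights, 299x299 face crops, and global average pooling here is an assumption rather than the paper's stated setup.

```python
# Pretrained Inception-ResNet-v2 as a facial-expression feature extractor;
# ImageNet weights, 299x299 crops, and average pooling are assumptions.
import numpy as np
from tensorflow.keras.applications import InceptionResNetV2
from tensorflow.keras.applications.inception_resnet_v2 import preprocess_input

backbone = InceptionResNetV2(include_top=False, weights="imagenet", pooling="avg")

def face_features(frames: np.ndarray) -> np.ndarray:
    """(n, 299, 299, 3) face crops -> (n, 1536) feature vectors."""
    return backbone.predict(preprocess_input(frames.astype("float32")), verbose=0)
```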
Figure 4. RNN used to capture intra-modal dependencies.
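A sketch of how an LSTM can capture dependencies within a modality, as Figure 4 suggests: it consumes the per-segment feature sequence of one video and emits a contextualized vector per segment; the sequence length and layer widths are assumed.

```python
# LSTM over the per-segment feature sequence of one video, so that each
# segment's representation reflects its neighbours; sizes are assumptions.
import tensorflow as tf
from tensorflow.keras import layers

def build_context_lstm(max_segments=30, feat_dim=256, out_dim=128):
    inp = layers.Input(shape=(max_segments, feat_dim))
    out = layers.LSTM(out_dim, return_sequences=True)(inp)   # one contextual vector per segment
    return tf.keras.Model(inp, out, name="intra_modal_lstm")
```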
Figure 5. LIBSVM architecture.
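For the Figure 5 classification stage, scikit-learn's SVC (which is built on the LIBSVM library) is one way to train the final classifier; the train/test split and the C and gamma values below are assumptions.

```python
# Final classification with an SVM; scikit-learn's SVC wraps LIBSVM.
# The split ratio and the C / gamma values are illustrative assumptions.
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

def train_final_classifier(fused_feats, labels):
    X_tr, X_te, y_tr, y_te = train_test_split(
        fused_feats, labels, test_size=0.2, random_state=42)
    clf = SVC(kernel="rbf", C=10.0, gamma="scale").fit(X_tr, y_tr)
    return clf, clf.score(X_te, y_te)
```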
MOSI dataset statistics.
| Parameter | Value |
| --- | --- |
| Number of video segments | 3,702 |
| Number of emotion segments | 2,199 |
| Total number of videos | 93 |
| Average length of emotion segment | 4.2 s |
| Average number of words per emotion segment | 12 |
| Average number of emotion segments per video | 23.2 |
| Total number of distinct speakers | 89 |
Figure 6. Selection of network weights.
Recognition rate of output features from each layer of the Inception-ResNet-v2 network.
| Layer | Recognition rate (%) |
| --- | --- |
| Convolution layer features | 66.09 |
| AvgPool layer features | 63.84 |
| Dropout layer features | 79.18 |
| Fully connected layer features | 85.75 |
Figure 7. Change of the recognition model's loss function.
Figure 8. Multimodal emotion recognition with and without the chi-square test.
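The comparison in Figure 8 could be reproduced along these lines, keeping the k fused features most associated with the labels via the chi-square statistic; the SVC classifier, 5-fold cross-validation, and k=512 are assumptions.

```python
# With/without chi-square selection, in the spirit of Figure 8; the classifier,
# the CV setup, and k are assumptions rather than the paper's protocol.
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

def compare_chi2(fused_feats, labels, k=512):
    scaled = MinMaxScaler().fit_transform(fused_feats)              # chi2 needs non-negative input
    selected = SelectKBest(chi2, k=k).fit_transform(scaled, labels)
    acc_without = cross_val_score(SVC(), scaled, labels, cv=5).mean()
    acc_with = cross_val_score(SVC(), selected, labels, cv=5).mean()
    return acc_without, acc_with
```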
Accuracy of emotion recognition based on different signal features.
| Emotion | Speech features (%) | Expression features (%) | Fused features (%) |
| --- | --- | --- | --- |
| Happy | 81.25 | 83.17 | 90.89 |
| Angry | 87.83 | 80.55 | 91.26 |
| Disgust | 58.36 | 62.49 | 76.72 |
| Sad | 84.29 | 82.45 | 87.13 |
| Fear | 82.13 | 85.86 | 89.28 |
| All emotions | 83.72 | 80.19 | 88.64 |
Figure 9. Comparison of recognition accuracy of different methods. (A) MOSI dataset and (B) MELD dataset.