Wei Bi, Yongzhen Xie, Zheng Dong, Hongshen Li.
Abstract
Emotion recognition (ER) is a key component of intelligent human-computer interaction systems. People express their feelings through a variety of signals, such as words and facial expressions. A business ecosystem is an economic community of interacting organizations and individuals; over time, its members develop their capabilities and roles together and tend to organize themselves around one or more central enterprises. This paper studies a multimodal ER method based on an attention mechanism. By recognizing consumers' emotions across modalities and analyzing market trends, the method characterizes consumers' current emotional state and the development direction of enterprises, so as to inform the most appropriate response or plan. The paper first describes related methods in multimodal ER and deep learning in detail and briefly outlines the meaning of enterprise strategy in the business ecosystem. Two datasets, CMU-MOSI and CMU-MOSEI, are then selected to design a multimodal ER scheme based on the self-attention mechanism, and the accuracies of single-modal and multimodal ER are compared. The experimental results show that the average recognition accuracy for the happy category under multimodal ER reaches 91.5%.
Keywords: attention mechanism; business ecosystem; deep learning; enterprise strategic management; multimodal emotion recognition
Year: 2022 PMID: 35310264 PMCID: PMC8927019 DOI: 10.3389/fpsyg.2022.857891
Source DB: PubMed Journal: Front Psychol ISSN: 1664-1078
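The abstract's central step, fusing text, audio, and visual features with self-attention, can be pictured with a short sketch. This is a minimal illustration and not the authors' implementation: the module name, projection sizes, and the per-modality feature dimensions (300-d text, 74-d audio, 35-d visual, which are common CMU-MOSI/MOSEI feature sizes) are all assumptions.

```python
# Hypothetical sketch of self-attention fusion over modality feature sequences;
# all names and dimensions are assumptions, not the paper's implementation.
import torch
import torch.nn as nn

class SelfAttentionFusion(nn.Module):
    """Fuse text/audio/visual sequences with a shared self-attention layer."""
    def __init__(self, d_text=300, d_audio=74, d_visual=35, d_model=128, n_heads=4):
        super().__init__()
        # Project each modality into a common feature space.
        self.proj_t = nn.Linear(d_text, d_model)
        self.proj_a = nn.Linear(d_audio, d_model)
        self.proj_v = nn.Linear(d_visual, d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.classifier = nn.Linear(d_model, 2)  # negative / positive

    def forward(self, text, audio, visual):
        # text: (B, Tt, d_text), audio: (B, Ta, d_audio), visual: (B, Tv, d_visual)
        x = torch.cat([self.proj_t(text), self.proj_a(audio),
                       self.proj_v(visual)], dim=1)
        h, _ = self.attn(x, x, x)   # self-attention across all modality tokens
        return self.classifier(h.mean(dim=1))  # average-pool, then classify

model = SelfAttentionFusion()
logits = model(torch.randn(8, 20, 300), torch.randn(8, 20, 74), torch.randn(8, 20, 35))
print(logits.shape)  # torch.Size([8, 2])
```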
FIGURE 1. Data layer fusion flow chart.
FIGURE 3. Flow chart of decision-level fusion.
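Figures 1 and 3 contrast data-layer (early) fusion with decision-level (late) fusion. The sketch below shows the two strategies in their simplest form, with plain linear layers standing in for the paper's feature extractors and classifiers; the feature dimensions are illustrative assumptions.

```python
# Minimal contrast of the two fusion strategies in Figures 1 and 3;
# linear layers are stand-ins for the paper's actual models.
import torch
import torch.nn as nn

class EarlyFusion(nn.Module):
    """Data-layer fusion: concatenate modality features, then classify once."""
    def __init__(self, dims=(300, 74, 35), n_classes=2):
        super().__init__()
        self.head = nn.Linear(sum(dims), n_classes)

    def forward(self, feats):  # feats: list of (B, d_i) modality vectors
        return self.head(torch.cat(feats, dim=-1))

class LateFusion(nn.Module):
    """Decision-level fusion: classify each modality, then average the logits."""
    def __init__(self, dims=(300, 74, 35), n_classes=2):
        super().__init__()
        self.heads = nn.ModuleList(nn.Linear(d, n_classes) for d in dims)

    def forward(self, feats):
        return torch.stack([h(f) for h, f in zip(self.heads, feats)]).mean(dim=0)

feats = [torch.randn(8, 300), torch.randn(8, 74), torch.randn(8, 35)]
print(EarlyFusion()(feats).shape, LateFusion()(feats).shape)  # both (8, 2)
```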
FIGURE 4. Long short-term memory network (LSTM) cell structure.
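The cell structure in Figure 4 corresponds to the standard LSTM gate equations below. This is the common formulation; the paper's notation may differ cosmetically.

```latex
% Standard LSTM cell update (the structure Figure 4 depicts).
\begin{aligned}
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) && \text{forget gate}\\
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) && \text{input gate}\\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) && \text{output gate}\\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c) && \text{candidate state}\\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t && \text{cell state}\\
h_t &= o_t \odot \tanh(c_t) && \text{hidden state}
\end{aligned}
```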
FIGURE 5. Convolutional neural network (CNN) basic network structure diagram.
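The basic CNN pattern of Figure 5 (convolution, pooling, dense classifier) can be sketched as a small 1D network over feature sequences; all layer sizes here are illustrative assumptions rather than the paper's configuration.

```python
# Minimal 1D CNN following the conv -> pool -> dense pattern of Figure 5;
# channel counts and kernel sizes are illustrative assumptions.
import torch
import torch.nn as nn

cnn = nn.Sequential(
    nn.Conv1d(in_channels=300, out_channels=64, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool1d(kernel_size=2),
    nn.Conv1d(64, 64, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.AdaptiveMaxPool1d(1),  # global max pool over the time axis
    nn.Flatten(),
    nn.Linear(64, 2),         # negative / positive
)
logits = cnn(torch.randn(8, 300, 20))  # (batch, features, time)
print(logits.shape)  # torch.Size([8, 2])
```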
Statistical overview of the CMU-MOSI dataset.
| Statistic | Value |
| Total number of video clip samples | 2,199 |
| Number of different speakers | 89 |
| Total number of videos | 93 |
| Average number of video clips per video | 23.2 |
| Average length of a video clip sample | 4.2 s |
| Total number of words in the video clip samples | 26,295 |
| Total number of unique words in the video clip samples | 3,107 |
Statistical overview of the CMU-MOSEI dataset.
| Statistic | Value |
| Total number of video clips | 23,453 |
| Total number of videos | 3,228 |
| Number of different speakers | 1,000 |
| Number of different topics | 250 |
| Average number of video clips per video | 7.3 |
| Average duration of each video clip | 7.28 s |
| Total number of words in the video clip samples | 447,143 |
| Total number of unique words in the video clip samples | 23,026 |
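As a quick arithmetic check on the two tables, the per-video clip averages implied by the totals can be recomputed; the MOSEI totals reproduce the reported 7.3, while the MOSI totals give roughly 23.6 against the reported 23.2.

```python
# Sanity check of the per-video averages implied by the table totals;
# the numbers below are copied from the two tables above.
mosi_clips, mosi_videos = 2199, 93
mosei_clips, mosei_videos = 23453, 3228
print(round(mosi_clips / mosi_videos, 1))    # 23.6 (table reports 23.2)
print(round(mosei_clips / mosei_videos, 1))  # 7.3 (matches the table)
```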
FIGURE 6. Accuracy of single-modal emotion analysis on MOSI and MOSEI, respectively.
FIGURE 7. Accuracy of multimodal sentiment analysis on MOSI and MOSEI, respectively.
FIGURE 8. Single-modal and bi-modal emotion recognition (ER) results on the MOSI dataset.
MOSI test set confusion matrix.
| Emotion category | Negative | Positive | Total |
| Negative | 306 | 65 | 375 |
| Positive | 90 | 225 | 311 |
| Total | 396 | 290 | 686 |
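The matrix supports a few summary metrics. Reading rows as true labels and columns as predictions (an assumption about orientation), and working from the cell values (which sum to 371 and 315 per row, slightly off the printed row totals but matching the grand total of 686):

```python
# Metrics recoverable from the MOSI test set confusion matrix above,
# assuming rows are true labels and columns are predictions.
tn, fp = 306, 65   # true-negative row: correct rejections, false alarms
fn, tp = 90, 225   # true-positive row: misses, hits
total = tn + fp + fn + tp            # 686, matching the grand total
accuracy  = (tn + tp) / total        # ~0.774
precision = tp / (tp + fp)           # ~0.776
recall    = tp / (tp + fn)           # ~0.714
print(f"acc={accuracy:.3f} prec={precision:.3f} rec={recall:.3f}")
```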
FIGURE 9. Emotion recognition (ER) results on the MOSEI dataset.