| Literature DB >> 35111209 |
Jing He, Haonan Yang, Changfan Zhang, Hongrun Chen, Yifu Xu.
Abstract
Multimodal sentiment analysis (MSA) aims to infer emotion from linguistic, auditory, and visual sequences, so multimodal representation learning and fusion are key to the task. However, existing MSA methods struggle to fully capture the interactions among heterogeneous modalities. To address this problem, a new framework, the dynamic invariant-specific representation fusion network (DISRFN), is put forward in this study. First, to make effective use of redundant information, joint domain-separation representations of all modalities are obtained through an improved joint domain-separation network. Then, a hierarchical graph fusion network (HGFN) dynamically fuses these representations so that the interactions among the modalities guide sentiment prediction. Comparative experiments on the popular MSA datasets MOSI and MOSEI, together with studies of the fusion strategy, loss-function ablation, and similarity-loss analysis, verify the effectiveness of the DISRFN framework and its loss functions.
Year: 2022 PMID: 35111209 PMCID: PMC8803469 DOI: 10.1155/2022/2105593
Source DB: PubMed Journal: Comput Intell Neurosci
Figure 1 The framework of DISRFN. Note: Bi-LSTM: bidirectional long short-term memory network; BERT: bidirectional encoder representations from transformers; MLP: multilayer perceptron; audio encoder (decoder): encoder (decoder) of the auditory modality; linguistic encoder (decoder): encoder (decoder) of the linguistic modality; visual encoder (decoder): encoder (decoder) of the visual modality; shared encoder: encoder shared by the three modalities; HGFN: hierarchical graph fusion network.
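The joint domain separation in Figure 1 can be illustrated minimally: each modality has a private encoder, all three modalities share one encoder, and a per-modality decoder reconstructs the input from the concatenated invariant and specific codes. The one-layer NumPy encoders, dimensions, and weight scales below are illustrative stand-ins, not the paper's architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

def make_encoder(w):
    """A one-layer stand-in for the paper's encoder networks."""
    return lambda x: np.tanh(x @ w)

d_in, d_h = 32, 16
modalities = ("linguistic", "audio", "visual")

# One private (modality-specific) encoder per modality, plus a single
# encoder shared by all three modalities.
private = {m: make_encoder(0.1 * rng.standard_normal((d_in, d_h))) for m in modalities}
shared = make_encoder(0.1 * rng.standard_normal((d_in, d_h)))
# One decoder per modality reconstructs the input from [invariant; specific].
decoder = {m: 0.1 * rng.standard_normal((2 * d_h, d_in)) for m in modalities}

def separate(x, m):
    """Return (invariant, specific, reconstruction) for a batch x of modality m."""
    h_inv = shared(x)        # modality-invariant representation
    h_spec = private[m](x)   # modality-specific representation
    x_hat = np.concatenate([h_inv, h_spec], axis=-1) @ decoder[m]
    return h_inv, h_spec, x_hat

x = rng.standard_normal((4, d_in))        # batch of 4 utterance-level features
h_inv, h_spec, x_hat = separate(x, "audio")
recon_loss = np.mean((x_hat - x) ** 2)    # feeds the reconstruction loss term
```

In the full model, the invariant and specific representations of all three modalities would then be passed on to HGFN for fusion.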
Figure 2 The framework of HGFN.
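As a rough illustration of the hierarchy in Figure 2, unimodal vertices are first combined pairwise into bimodal vertices, and all vertices are then combined into a trimodal summary vertex. The mean-pooling `fuse` below is only a placeholder for HGFN's learned, dynamically weighted fusion:

```python
import numpy as np
from itertools import combinations

def fuse(vectors):
    """Placeholder fusion op: mean pooling instead of a learned fusion network."""
    return np.mean(vectors, axis=0)

def hierarchical_graph_fusion(uni):
    """uni: {modality_name: feature_vector}. Returns (bimodal dict, trimodal vector)."""
    # Level 2: one vertex per pair of modalities.
    bi = {"+".join(sorted(p)): fuse([uni[m] for m in p])
          for p in combinations(uni, 2)}
    # Level 3: a single trimodal vertex summarizing all lower-level vertices.
    tri = fuse(list(uni.values()) + list(bi.values()))
    return bi, tri

uni = {m: np.ones(8) * i for i, m in enumerate(("linguistic", "audio", "visual"), 1)}
bi, tri = hierarchical_graph_fusion(uni)
```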
Hyperparameter settings in this article.
| Hyperparameter | MOSI | MOSEI |
|---|---|---|
| CMD K | 5 | 5 |
| Batch_size | 16 | 16 |
| (name lost in extraction) | 0.3 | 0.4 |
| (name lost in extraction) | 1.0 | 0.8 |
| (name lost in extraction) | 0.4 | 0.4 |
| (name lost in extraction) | 0.1 | 0.01 |
| Drop | 0.4 | 0.1 |
| Hid | 256 | 256 |
| P_h | 64 | 50 |
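The "CMD K" entry above is presumably the highest moment order used by the central moment discrepancy (CMD), a common choice for a representation-similarity loss. A minimal NumPy sketch, assuming features bounded in [a, b]:

```python
import numpy as np

def cmd(x, y, k=5, a=0.0, b=1.0):
    """Central moment discrepancy between sample matrices x, y of shape
    (n_samples, dim), with features assumed bounded in [a, b].
    k plays the role of the 'CMD K' hyperparameter (5 in the table)."""
    span = abs(b - a)
    mx, my = x.mean(axis=0), y.mean(axis=0)
    d = np.linalg.norm(mx - my) / span              # mean (first moment) term
    cx, cy = x - mx, y - my
    for order in range(2, k + 1):                   # central moments 2..k
        d += np.linalg.norm((cx ** order).mean(axis=0)
                            - (cy ** order).mean(axis=0)) / span ** order
    return d
```

Identical sample sets give a discrepancy of zero, and shifted or reshaped distributions give a positive value, which is what makes CMD usable as a similarity loss between modality representations.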
Figure 3 The parameter setting of modules.
Comparison experiments of multimodal models on MOSI.
| Model | MAE | Mul_Acc2 | Mul_Acc5 | Corr | F1_Score | CPU_Clock |
|---|---|---|---|---|---|---|
| TFN | 1.016 | 0.765 | 0.386 | 0.604 | 0.765 | 0.404 |
| LMF | 1.009 | 0.767 | 0.362 | 0.604 | 0.769 | 0.395 |
| MFN | 1.007 | 0.773 | 0.329 | 0.632 | 0.773 | 0.379 |
| ARGF | 0.857 | 0.814 | 0.423 | 0.712 | 0.815 | 0.147 |
| Graph-MFN | 1.003 | 0.784 | 0.360 | 0.623 | 0.785 | 0.454 |
| MARM | 1.028 | 0.756 | 0.351 | 0.625 | 0.755 | 0.345 |
| LSTHM | 1.087 | 0.745 | 0.375 | 0.608 | 0.744 | 1.527 |
| LSTHM | 1.056 | 0.750 | 0.370 | 0.581 | 0.752 | 1.524 |
| LSTHM | 0.992 | 0.758 | 0.401 | 0.626 | 0.757 | 0.357 |
| LSTHM | 1.092 | 0.764 | 0.332 | 0.569 | 0.764 | 0.708 |
| MISA | 0.827 | 0.819 | 0.440 | 0.726 | 0.819 | 0.839 |
Comparison experiments of multimodal models on MOSEI.
| Model | MAE | Mul_Acc2 | Mul_Acc5 | Corr | F1_Score | CPU_Clock |
|---|---|---|---|---|---|---|
| TFN | 0.714 | 0.760 | 0.443 | 0.507 | 0.761 | 0.417 |
| LMF | 0.729 | 0.761 | 0.436 | 0.520 | 0.760 | 0.412 |
| MFN | 0.715 | 0.773 | 0.432 | 0.530 | 0.772 | 0.418 |
| Graph-MFN | 0.714 | 0.765 | 0.448 | 0.526 | 0.766 | 0.46 |
| MARM | 0.708 | 0.772 | 0.449 | 0.530 | 0.773 | 0.363 |
| LSTHM | 0.852 | 0.733 | 0.383 | 0.403 | 0.733 | 1.585 |
| LSTHM | 0.861 | 0.704 | 0.383 | 0.383 | 0.721 | 1.6 |
| LSTHM | 0.837 | 0.748 | 0.391 | 0.437 | 0.748 | 0.369 |
| LSTHM | 0.905 | 0.722 | 0.383 | 0.405 | 0.723 | 0.715 |
| MISA | 0.600 | 0.858 | 0.538 | 0.776 | 0.857 | 0.975 |
Experiments with different fusion methods.
| Method | MAE (↓) | Mul_Acc2 (↑) | Mul_Acc5 (↑) | Corr (↑) | F1_Score (p/g) (↑) |
|---|---|---|---|---|---|
| JDSN-AttFusion | 0.924 | 0.791 | 0.378 | 0.687 | 0.782 |
| JDSN-concat | 0.839 | 0.814 | 0.443 | 0.724 | 0.813 |
| JDSN-DFG | 0.825 | 0.816 | 0.459 | 0.727 | 0.817 |
Figure 4 Visualization of graph fusion in the MOSI sentiment analysis task.
Figure 5 Visualization of sentiment semantic distribution under different loss configurations. Notes: (1) without the loss function Ldiff; (2) without Lrecon; (3) without Lsim; (4) without Ltrip; (5) full loss configuration.
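The five configurations in Figure 5 correspond to dropping one auxiliary term at a time from a weighted objective of roughly this shape; the weight values below are placeholders, not the paper's tuned settings:

```python
def total_loss(task, sim, diff, recon, trip,
               w_sim=1.0, w_diff=0.3, w_recon=0.4, w_trip=0.1):
    """Task loss plus the four auxiliary losses named in Figure 5.
    Setting one weight to 0 reproduces one ablation configuration."""
    return task + w_sim * sim + w_diff * diff + w_recon * recon + w_trip * trip
```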
Ablation study of the loss functions.
| Method | MAE | Mul_Acc2 | Mul_Acc5 | Corr | F1_Score |
|---|---|---|---|---|---|
| Without diff loss | 0.868 | 0.811 | 0.404 | 0.728 | 0.816 |
| Without sim loss | 0.999 | 0.784 | 0.351 | 0.723 | 0.782 |
| Without recon loss | 0.833 | 0.817 | 0.464 | 0.711 | 0.816 |
| Without CosineTriplet loss | 0.857 | 0.799 | 0.469 | 0.705 | 0.798 |
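The "CosineTriplet loss" in the ablation above is, as the name suggests, a triplet loss measured in cosine distance: it pulls an anchor toward a same-sentiment sample and pushes it from a different-sentiment one. A hedged sketch; the margin value is hypothetical, not taken from the paper:

```python
import numpy as np

def cosine_triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge triplet loss in cosine distance over batches of row vectors.
    `margin` here is a placeholder value."""
    def cos_dist(a, b):
        a = a / np.linalg.norm(a, axis=-1, keepdims=True)
        b = b / np.linalg.norm(b, axis=-1, keepdims=True)
        return 1.0 - np.sum(a * b, axis=-1)   # 0 when aligned, 2 when opposite
    gap = cos_dist(anchor, positive) - cos_dist(anchor, negative) + margin
    return np.maximum(gap, 0.0).mean()        # hinge: only violations contribute
```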
Figure 6 Visualization of performance comparison under different similarity losses.