| Literature DB >> 35885233 |
Feng Liu1,2,3, Si-Yuan Shen2, Zi-Wang Fu3, Han-Yang Wang2, Ai-Min Zhou1,2,4, Jia-Yin Qi5.
Abstract
Speech emotion recognition (SER) aims to recognize human emotional states from utterances that carry both acoustic and linguistic information, and it has found applications in a wide range of areas. Since textual and audio patterns both play essential roles in SER, various works have proposed novel modality-fusion methods to exploit text and audio signals effectively. However, the high performance of most existing models depends on a large number of learnable parameters, and such models work well only on data of a fixed length. Minimizing computational overhead and improving generalization to unseen data of various lengths, while maintaining a certain level of recognition accuracy, is therefore a pressing practical problem. In this paper, we propose LGCCT, a light gated and crossed complementation transformer for multimodal speech emotion recognition. First, our model fuses modality information efficiently: acoustic features are extracted by a CNN-BiLSTM and textual features by a BiLSTM, and a cross-attention module generates the modality-fused representation. A gate-control mechanism then balances the integration of the original modality representation and the modality-fused representation. Second, we account for the degree of attention focus: the entropy of the attention distribution over the same token should converge to the same value independent of sequence length. To improve generalization across testing-sequence lengths, we therefore adopt a length-scaled dot product to compute the attention score, which admits a theoretical interpretation in terms of entropy. The length-scaled dot product is cheap to compute but effective. Experiments are conducted on the benchmark dataset CMU-MOSEI.
Compared to the baseline models, our model achieves an 81.0% F1 score with only 0.432 M parameters, improving the balance between performance and parameter count. Moreover, the ablation study confirms the effectiveness of our model and its scalability to various input-sequence lengths, with a relative improvement of almost 20% over the baseline without the length-scaled dot product.
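The length-scaled dot product described in the abstract can be illustrated with a short sketch. The NumPy code below is an assumption-laden approximation, not the authors' implementation: it rescales the attention logits by log(n)/log(train_len) on top of the usual 1/sqrt(d) factor, so the entropy of the attention weights stays roughly constant when the test-time key length n differs from the training length (taken here as 50, the aligned sequence length used by the paper).

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def length_scaled_attention(q, k, v, train_len=50):
    """Dot-product attention with length-scaled logits (sketch only).

    The extra log(n) / log(train_len) factor is one common way to keep
    attention entropy approximately length-invariant; the exact constant
    used by LGCCT is an assumption here.
    """
    n, d = k.shape
    scale = np.log(n) / (np.log(train_len) * np.sqrt(d))
    weights = softmax((q @ k.T) * scale, axis=-1)  # rows sum to 1
    return weights @ v
```

At n = train_len the factor log(n)/log(train_len) is 1, so the operation reduces to standard scaled dot-product attention, which is why it adds essentially no cost.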
Keywords: computational affection; cross-attention; entropy invariance; gate control; lightweight model; multimodal speech emotion recognition
Year: 2022 PMID: 35885233 PMCID: PMC9316084 DOI: 10.3390/e24071010
Source DB: PubMed Journal: Entropy (Basel) ISSN: 1099-4300 Impact factor: 2.738
Figure 1. The overall architecture of LGCCT. CNN-BiLSTM and BiLSTM extract acoustic features and text features, respectively. At the heart of the model, the cross-attention module with a gate-control mechanism fuses the modality information. The transformer encoder layers reinforce the modality-fused representation.
Figure 2. The cross-attention module fuses the modality information.
Figure 3. The architecture of the Transformer.
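The gate-control fusion shown in Figures 1 and 2 can be sketched as follows. This is a minimal NumPy illustration under assumptions: the weight shapes and the exact blending formula g * x + (1 - g) * x_fused are hypothetical, not taken from the paper's code.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_fusion(x, x_fused, w_gate, b_gate):
    """Blend an original modality representation with its cross-attended
    counterpart via a learned sigmoid gate (illustrative sketch).

    x       : original modality features, shape (seq, dim)
    x_fused : cross-attention output,     shape (seq, dim)
    w_gate  : gate weights, shape (2 * dim, dim); b_gate: shape (dim,)
    """
    # Gate g in (0, 1) is computed from both representations jointly.
    g = sigmoid(np.concatenate([x, x_fused], axis=-1) @ w_gate + b_gate)
    # Elementwise convex combination: each output lies between x and x_fused.
    return g * x + (1.0 - g) * x_fused
```

Because g is in (0, 1), the output is an elementwise convex combination, which is one way to realize the "balanced integration" of the original and modality-fused representations described in the abstract.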
Detailed dimensions of LGCCT.
| Notation | Meaning | Value |
|---|---|---|
| | Aligned input-sequence length | 50 |
| | Word-embedding dimension | 300 |
| | Audio feature dimension | 74 |
| | Encoded text feature dimension by BiLSTM | 32 |
| | Encoded audio feature dimension by CNN-BiLSTM | 32 |
| | Hidden state dimension | 30 |
| | Length-scale logits | |
The performance and the number of parameters on the CMU-MOSEI dataset.
| Method | #Params (M) | Acc-7 (%) | Acc-2 (%) | F1 (%) |
|---|---|---|---|---|
| MulT | 0.961 | 48.2 | 80.2 | 79.7 |
| MCTN | 0.247 | 47.64 | 78.87 | 77.86 |
| MISA ** | 110.915 | 53.31 | 80.81 | 80.26 |
| BBFN ** | 110.548 | 51.7 | 85.5 | 85.5 |
| EF-LSTM * | 0.56 | 47.4 | 78.2 | 77.9 |
| LF-LSTM * | 1.22 | 48.8 | 80.6 | 80.6 |
| RAVEN * | 1.19 | 45.5 | 75.4 | 75.7 |
| LGCCT (ours) | 0.432 | 47.5 | 81.0 | 81.1 |

\* With tri-modality, namely audio, video, and text. ** With pretrained BERT.
Figure 4. Comparison of the scores of different models on CMU-MOSEI. The proposed LGCCT achieves the best performance with an order-of-magnitude smaller model size.
Ablation study on the CMU-MOSEI dataset.
| Model | #Params (M) | Acc-7 (%) | Acc-2 (%) | F1 (%) |
|---|---|---|---|---|
| LGCCT | 0.432 | 47.5 | 81.0 | 81.1 |
| | 0.429 | 42.9 | 76.7 | 76.3 |
| | 0.354 | 40.9 | 70.7 | 70.8 |
| | 0.203 | 40.3 | 75.6 | 78.0 |
Accuracy comparisons on CMU-MOSEI with different length distributions.
| Type | All = 50 (Train All) | Part = 30 (Train Part) | Part = 30 (Train All) | Part = 40 (Train Part) | Part = 40 (Train All) |
|---|---|---|---|---|---|
| Length-scaled | 80.8 | 75.7 | 65.2 | 74.4 | 67.7 |
| Unscaled | 81.1 | 62.8 | 58.3 | 77.0 | 71.9 |
F1-score comparisons on CMU-MOSEI with different length distributions.
| Type | All = 50 (Train All) | Part = 30 (Train Part) | Part = 30 (Train All) | Part = 40 (Train Part) | Part = 40 (Train All) |
|---|---|---|---|---|---|
| Length-scaled | 80.7 | 76.2 | 75.3 | 76.8 | 74.8 |
| Unscaled | 81.0 | 77.2 | 57.8 | 78.3 | 72.3 |