Jiangfeng Li1, Zijian Zhang2, Bowen Wang1, Qinpei Zhao1, Chenxi Zhang1.
Abstract
Internet users benefit from abstractive summarization technologies, which let them grasp an article by reading its summary instead of the full text. However, existing technologies for analyzing articles that contain both text and images are hampered by the semantic gap between vision and language. They focus on aggregating features and neglect the heterogeneity of each modality. At the same time, ignoring the intrinsic data properties within each modality and the semantic information in cross-modal correlations results in poorly learned representations. Therefore, we propose a novel Inter- and Intra-modal Contrastive Hybrid (ITCH) learning framework, which learns to automatically align the multimodal information and maintains the semantic consistency of input/output flows. Moreover, ITCH can be used as a component that makes a model suitable for both supervised and unsupervised learning approaches. Experiments on two public datasets, MMS and MSMO, show that ITCH outperforms the current baselines.
Keywords: contrastive learning; cross-modal fusion; multimodal abstractive summarization; supervised and unsupervised learning
Year: 2022 PMID: 35741485 PMCID: PMC9222507 DOI: 10.3390/e24060764
Source DB: PubMed Journal: Entropy (Basel) ISSN: 1099-4300 Impact factor: 2.738
Figure 1. Illustration of the standard multimodal abstractive summarization framework, which consists of a multimodal encoder and a textual decoder. The decoder generates a target summary after extracting the visual semantic features and merging them together.
Figure 2. Schematic illustration of the ITCH framework with supervised learning. It comprises four components. (1) Feature Extractor: the visual and textual features are embedded by their own domain encoders, i.e., ViT and BERT, respectively. (2) Cross-Modal Fusion Module: the self-attention mechanism with and residual connection. (3) Textual Decoder: a traditional transformer-based decoder is used to reconstruct a summary. (4) Hybrid Contrastive Objectives: apart from using the common reconstruction loss for summary generation, an inter-modal contrastive objective is designed to maintain the distance among bi-modal inputs, and an intra-modal contrastive objective is used to gather information between input sentences and output utterances.
Figure 3. Illustration of the cross-modal fusion module.
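The fusion step can be pictured with a minimal sketch. It assumes the module concatenates text and image features and applies single-head scaled dot-product self-attention followed by a residual connection. Learned projections and normalization are omitted, the function names are hypothetical, and the toy vectors merely stand in for BERT and ViT outputs.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention_with_residual(tokens):
    """Single-head scaled dot-product self-attention plus a residual connection.

    `tokens` is a list of equal-length feature vectors. Learned query/key/value
    projections and normalization are left out (an assumption of this sketch).
    """
    dim = len(tokens[0])
    scale = math.sqrt(dim)
    fused = []
    for query in tokens:
        scores = [sum(q * k for q, k in zip(query, key)) / scale for key in tokens]
        weights = softmax(scores)
        attended = [sum(w * value[i] for w, value in zip(weights, tokens))
                    for i in range(dim)]
        # Residual connection: add the attended vector back onto the input.
        fused.append([x + a for x, a in zip(query, attended)])
    return fused

def cross_modal_fusion(text_feats, image_feats):
    """Concatenate both modalities so every token attends across modalities."""
    return self_attention_with_residual(text_feats + image_feats)

# Toy stand-ins for two text token features and one image patch feature.
text_feats = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
image_feats = [[0.0, 0.0, 1.0, 0.0]]
fused = cross_modal_fusion(text_feats, image_feats)
```

After fusion, the first text vector carries a nonzero component in the dimension that only the image feature occupied, which is the cross-modal mixing the module is meant to provide.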
Figure 4. Visualization of the inter- and intra-modal contrastive losses. The positive pairs in the inter-modal loss are denoted by pink (input sentences) and blue (input image) points, and those in the intra-modal loss by pink (input sentences) and green (target summary) points. The negative examples are denoted by red triangles.
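A contrastive objective of this kind is often written in the InfoNCE form. The sketch below assumes that form, cosine similarity, and an unweighted sum of the two terms. None of these choices is confirmed by the captions, and the function names are hypothetical. The default temperature of 0.1 matches the value listed in the hyperparameter table.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE loss: -log(exp(s_pos / t) / sum over all candidates of exp(s / t))."""
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, neg) / temperature for neg in negatives]
    m = max(logits)  # subtract the maximum for numerical stability
    log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_denominator - logits[0]

def hybrid_contrastive_loss(text, image, summary, negatives):
    """Inter-modal term (text vs. image) plus intra-modal term (text vs. summary)."""
    inter = info_nce(text, image, negatives)
    intra = info_nce(text, summary, negatives)
    return inter + intra  # equal weighting is an assumption of this sketch

# Toy embeddings: the positives point roughly the same way as the text vector.
text, image, summary = [1.0, 0.0], [0.9, 0.1], [0.8, 0.2]
negatives = [[-1.0, 0.0], [0.0, -1.0]]
aligned = hybrid_contrastive_loss(text, image, summary, negatives)
misaligned = hybrid_contrastive_loss(text, [-1.0, 0.0], [0.0, -1.0], [image, summary])
```

The loss is small when the positives lie close to the anchor and the negatives lie far away, and it grows when those roles are swapped.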
Figure 5. Structure of the unsupervised learning method. It reuses the ITCH structure from the supervised approach as a component and adds a transformer model, with encoder TransEnc and decoder TransDec, to reconstruct the input text, following the existing Compress-then-Reconstruct approach (CTNR).
Dataset statistics of MMS and MSMO. Each image is paired with captions. #Train, #Valid and #Test denote the number of examples in each group. #MaxLength denotes the maximum number of tokens in captions for MMS and MSMO, respectively.
| Datasets | #Train | #Valid | #Test | #MaxLength |
|---|---|---|---|---|
| MMS | 62,000 | 2000 | 2000 | 439 |
| MSMO | 240,000 | 3000 | 3000 | 492 |
Hyperparameter settings.
| Symbol | Annotation | Value | Symbol | Annotation | Value |
|---|---|---|---|---|---|
| E | Word embedding size | 300 | EP | Number of epochs | 30 |
| V | Vocabulary size | 30,004 | DR | Dropout rate | 0.3 |
| | Dimension of feature | 768 | | Adjustable factor | −0.15 |
| | Batch size | 128 | | Temperature parameter | 0.1 |
| LR0 | For pre-trained modules | 2 × 10 | LR1 | For other modules | 2 × 10 |
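For reuse in code, the legible settings above could be collected in one configuration object. This is only a sketch: the class and field names are hypothetical, and the two learning rates are left out because their exponents are not legible in the table.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ITCHConfig:
    """Hyperparameters copied from the table; field names are illustrative."""
    word_embedding_size: int = 300   # E
    vocabulary_size: int = 30_004    # V
    feature_dim: int = 768           # dimension of feature
    adjustable_factor: float = -0.15
    batch_size: int = 128
    temperature: float = 0.1         # temperature parameter
    num_epochs: int = 30             # EP
    dropout_rate: float = 0.3        # DR
    # LR0 and LR1 are omitted: their exponents are missing from the table.

config = ITCHConfig()
```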
The metrics for the human evaluation.
| Fluency (F): Points | Fluency (F): Explanations | Relevance (R): Points | Relevance (R): Explanations |
|---|---|---|---|
| 1 | | 1 | |
| 2 | | 2 | |
| 3 | | 3 | |
| 4 | | 4 | |
Performance of ITCH and baselines on the MSMO dataset in terms of ROUGE, Relevance, and Human evaluation. ✓ marks whether a method is uni-modal or bi-modal. A bold value denotes the best performance.
| Types | Resource (uni-/bi-) | Methods | ROUGE R-1 | ROUGE R-2 | ROUGE R-L | Relevance Average | Relevance Extrema | Relevance Greedy | Human F | Human R |
|---|---|---|---|---|---|---|---|---|---|---|
| Unsupervised | ✓ | LexRank | 32.54 | 9.96 | 28.02 | 0.277 | 0.204 | 0.278 | 2.72 | 3.02 |
| Unsupervised | ✓ | W2VLSTM | 29.86 | 13.11 | 27.68 | 0.278 | 0.201 | 0.296 | 2.54 | 2.82 |
| Unsupervised | ✓ | Seq3 | 38.16 | 13.58 | 32.07 | 0.347 | 0.245 | 0.342 | 3.14 | 3.28 |
| Unsupervised | ✓ | GuideRank | 37.13 | 15.03 | 36.18 | 0.332 | 0.231 | 0.341 | 3.12 | 3.31 |
| Unsupervised | ✓ | CTNR | 40.11 | 16.97 | 39.71 | 0.372 | 0.271 | 0.386 | 3.32 | 3.44 |
| Unsupervised | ✓ | MMR | 41.72 | 17.33 | 39.81 | 0.381 | 0.269 | 0.391 | 3.39 | 3.39 |
| Unsupervised | ✓ | ITCH | | | | | | | | |
| Supervised | ✓ | S2S | 32.32 | 12.44 | 29.65 | 0.292 | 0.209 | 0.287 | 3.24 | 3.32 |
| Supervised | ✓ | PointerNet | 34.62 | 13.72 | 30.05 | 0.339 | 0.267 | 0.352 | 3.21 | 3.41 |
| Supervised | ✓ | UniLM | 42.32 | 22.04 | 40.03 | 0.443 | 0.308 | 0.438 | | 3.54 |
| Supervised | ✓ | Doubly-Attn | 41.11 | 21.75 | 39.92 | 0.434 | 0.297 | 0.433 | 3.46 | 3.52 |
| Supervised | ✓ | Select | 46.25 | 24.68 | 44.02 | 0.466 | 0.331 | 0.471 | 3.62 | 3.59 |
| Supervised | ✓ | ITCH | | | | | | | 3.67 | |
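For readers unfamiliar with the ROUGE columns, the following minimal sketch shows how R-1, R-2, and R-L can be computed for one candidate and one reference. It assumes whitespace tokenization and the F1 variant of each score. The source does not state which variant or toolkit was used, and the two sentences are toy examples.

```python
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def f1(overlap, cand_total, ref_total):
    """Harmonic mean of precision and recall; 0.0 when there is no overlap."""
    if overlap == 0 or cand_total == 0 or ref_total == 0:
        return 0.0
    precision = overlap / cand_total
    recall = overlap / ref_total
    return 2 * precision * recall / (precision + recall)

def rouge_n(candidate, reference, n):
    """ROUGE-N F1 based on clipped n-gram overlap."""
    cand, ref = ngrams(candidate, n), ngrams(reference, n)
    overlap = sum((cand & ref).values())  # Counter intersection clips counts
    return f1(overlap, sum(cand.values()), sum(ref.values()))

def lcs_length(a, b):
    """Length of the longest common subsequence (dynamic programming)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def rouge_l(candidate, reference):
    """ROUGE-L F1 based on the longest common subsequence."""
    return f1(lcs_length(candidate, reference), len(candidate), len(reference))

reference = "singapore has suffered from the zika virus".split()
candidate = "singapore is known to suffer from the virus".split()
scores = {
    "R-1": rouge_n(candidate, reference, 1),
    "R-2": rouge_n(candidate, reference, 2),
    "R-L": rouge_l(candidate, reference),
}
```

The tables report these scores multiplied by 100.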
Performance of ITCH and baselines on MMS dataset with ROUGE/Relevance/Human. Symbol “-” denotes that no ready-made results and no code are provided.
| Types | Resource (uni-/bi-) | Methods | ROUGE R-1 | ROUGE R-2 | ROUGE R-L | Relevance Average | Relevance Extrema | Relevance Greedy | Human F | Human R |
|---|---|---|---|---|---|---|---|---|---|---|
| Unsupervised | ✓ | LexRank | 36.52 | 9.16 | 27.66 | 0.264 | 0.192 | 0.271 | 2.86 | 3.17 |
| Unsupervised | ✓ | W2VLSTM | 29.14 | 9.77 | 28.11 | 0.272 | 0.202 | 0.283 | 2.71 | 2.92 |
| Unsupervised | ✓ | Seq3 | 36.42 | 10.22 | 34.91 | 0.339 | 0.221 | 0.341 | 3.21 | 3.24 |
| Unsupervised | ✓ | GuideRank | 35.31 | 10.11 | 33.91 | 0.302 | 0.211 | 0.312 | 3.01 | 3.15 |
| Unsupervised | ✓ | CTNR | 39.69 | 13.16 | 39.22 | 0.371 | 0.254 | 0.357 | 3.37 | 3.37 |
| Unsupervised | ✓ | MMR | 41.29 | 16.75 | 38.29 | 0.382 | 0.269 | 0.393 | 3.43 | |
| Unsupervised | ✓ | ITCH | | | | | | | | |
| Supervised | ✓ | S2S | 30.81 | 11.72 | 28.23 | 0.285 | 0.202 | 0.278 | 3.24 | 3.32 |
| Supervised | ✓ | PointerNet | 35.61 | 14.64 | 33.62 | 0.345 | 0.271 | 0.355 | 3.32 | 3.46 |
| Supervised | ✓ | UniLM | 41.82 | 20.82 | 39.83 | 0.451 | 0.311 | 0.459 | | 3.57 |
| Supervised | ✓ | Doubly-Attn | 39.82 | 19.72 | 38.21 | 0.438 | 0.302 | 0.431 | 3.44 | 3.54 |
| Supervised | ✓ | Select | 45.63 | 23.68 | 42.97 | 0.466 | 0.327 | 0.473 | 3.64 | 3.55 |
| Supervised | ✓ | MMAF | 47.28 | 24.85 | 44.48 | 0.472 | 0.336 | 0.480 | - | - |
| Supervised | ✓ | MMCF | 46.84 | 24.25 | 43.76 | 0.470 | 0.335 | 0.476 | - | - |
| Supervised | ✓ | ITCH | | | | | | | 3.68 | |
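The Relevance columns (Average, Extrema, Greedy) are read here as the embedding-based similarity metrics commonly used for generated text. That reading is an assumption, as are the toy two-dimensional word vectors and the function names in the sketch below.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def mean_vector(vecs):
    """Dimension-wise mean of a list of word vectors."""
    return [sum(col) / len(vecs) for col in zip(*vecs)]

def extrema_vector(vecs):
    """Per dimension, keep the value with the largest absolute magnitude."""
    return [max(col, key=abs) for col in zip(*vecs)]

def embedding_average(cand, ref):
    return cosine(mean_vector(cand), mean_vector(ref))

def vector_extrema(cand, ref):
    return cosine(extrema_vector(cand), extrema_vector(ref))

def greedy_matching(cand, ref):
    """Match each word to its most similar counterpart, averaged both ways."""
    def one_way(src, tgt):
        return sum(max(cosine(s, t) for t in tgt) for s in src) / len(src)
    return (one_way(cand, ref) + one_way(ref, cand)) / 2

# Hypothetical toy embeddings; real evaluations use pretrained word vectors.
EMB = {"zika": [0.9, 0.1], "virus": [0.8, 0.3],
       "dengue": [0.7, 0.2], "spray": [0.1, 0.9]}
ref_vecs = [EMB[w] for w in ["zika", "virus"]]
cand_vecs = [EMB[w] for w in ["dengue", "spray"]]
relevance = {
    "Average": embedding_average(cand_vecs, ref_vecs),
    "Extrema": vector_extrema(cand_vecs, ref_vecs),
    "Greedy": greedy_matching(cand_vecs, ref_vecs),
}
```

All three scores are cosine similarities, so they are bounded by 1 and reach it when the candidate and reference contain the same words.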
The effect of the hyperparameters on the MSMO dataset. (The symbol ↑ denotes that the value has improved, and ↓ denotes that the value has decreased.)
| Methods | Unsupervised ROUGE | Unsupervised Relevance | Unsupervised M | Supervised ROUGE | Supervised Relevance | Supervised M |
|---|---|---|---|---|---|---|
| ITCH | 36.47 | 0.363 | 0.547 | 39.99 | 0.434 | 0.623 |
| Default | ||||||
| ITCH ( | 36.31 ↓ | 0.357 ↓ | 0.541 ↓ | 39.49 ↓ | 0.431 ↓ | 0.618 ↓ |
| ITCH ( | 35.23 ↓ | 0.343 ↓ | 0.511 ↓ | 39.02 ↓ | 0.424 ↓ | 0.611 ↓ |
| ITCH ( | 33.28 ↓ | 0.318 ↓ | 0.436 ↓ | 37.66 ↓ | 0.397 ↓ | 0.528 ↓ |
| ITCH ( | 31.74 ↓ | 0.394 ↓ | 0.402 ↓ | 36.81 ↓ | 0.389 ↓ | 0.503 ↓ |
| Default | ||||||
| ITCH ( | 35.62 ↓ | 0.347 ↓ | 0.556 ↑ | 38.74 ↓ | 0.418 ↓ | 0.634 ↑ |
| ITCH ( | 35.17 ↓ | 0.339 ↓ | 0.493 ↓ | 37.87 ↓ | 0.401 ↓ | 0.589 ↓ |
The effect of the cross-modal fusion module and hybrid contrastive losses. (Symbol “-X” denotes that module X is removed).
| Unsupervised Methods | Unsupervised ROUGE | Unsupervised Relevance | Unsupervised M | Supervised Methods | Supervised ROUGE | Supervised Relevance | Supervised M |
|---|---|---|---|---|---|---|---|
| ITCH | 36.47 | 0.363 | 0.547 | ITCH | 39.99 | 0.434 | 0.623 |
| MMR | 32.95 | 0.347 | 0.382 | Select | 38.32 | 0.423 | 0.452 |
| - | 34.18 | 0.359 | 0.389 | - | 38.66 | 0.427 | 0.471 |
| - | 33.71 | 0.350 | 0.443 | - | 38.89 | 0.428 | 0.581 |
| - | 30.11 | 0.336 | 0.301 | - | 36.91 | 0.396 | 0.449 |
| - CrossFusion | 32.71 | 0.343 | 0.408 | - CrossFusion | 38.51 | 0.425 | 0.518 |
A case study from the MMS dataset. The bi-modal inputs and the target summary are given in the top table. The summaries generated by ITCH and the baselines are shown in the bottom table, together with their ROUGE and M scores.
| Text | Zika is primarily spread by mosquitoes but can also | Image |
|---|---|---|
| Target summary | Singapore has suffered from the Zika virus and dengue virus, both of them are mosquitoborne disease with high | |

| Types | Methods | Summary | R-1 | R-2 | R-L | M |
|---|---|---|---|---|---|---|
| Unsupervised | LexRank | Singapore is known to suffer widely from virus, a mosquitoborne tropical disease that trigger high fevers, vomiting and skin rashes in infected. | 25.8 | 4.0 | 24.1 | 0.36 |
| Unsupervised | GuideRank | Singapore is the Asian country with active transmission of < | 29.3 | 11.7 | 20.7 | 0.42 |
| Unsupervised | CTNR | Singapore is known to suffer from < | 32.2 | 13.6 | 27.6 | 0.40 |
| Unsupervised | ITCH | The Singapore in < | 48.3 | 25.0 | 44.8 | 0.52 |
| Supervised | PointerNet | The Singapore’s government aggressive spray and information prevent zika and dengue virus. | 33.7 | 21.5 | 27.5 | 0.41 |
| Supervised | UniLM | Zika is primarily spread by < | 42.0 | 19.7 | 38.1 | 0.44 |
| Supervised | Select | The Singapore taking spraying and information campaign to prevent <unk> virus. It suffering from virus that are fevers and transmission. | 52.1 | 27.9 | 29.6 | 0.51 |
| Supervised | ITCH | The Singapore take aggressive spraying, indoor spraying and information campaign to prevent < | | | | |