Yang Xiao, Guxue Gao, Liejun Wang, Huicheng Lai.
Abstract
Violence detection aims to locate violent content in video frames. Improving the accuracy of violence detection is of great importance for security. However, current methods do not make full use of multi-modal visual and audio information, which limits detection accuracy. We found that the violence detection accuracy for different kinds of videos is related to changes in optical flow. With this in mind, we propose an optical flow-aware-based multi-modal fusion network (OAMFN) for violence detection. Specifically, we use three different fusion strategies to fully integrate multi-modal features. First, the main branch concatenates RGB features with audio features, while the optical flow branch concatenates optical flow features with RGB features and with audio features, respectively. Then, the cross-modal information fusion module integrates the features of the different combinations and applies weights to them to capture cross-modal information in audio and video. After that, the channel attention module extracts valuable information by weighting the integrated features. Furthermore, an optical flow-aware-based score fusion strategy is introduced to fuse the features of the different modalities from the two branches. On the XD-Violence dataset, our multi-modal fusion network achieves an AP of 83.09%, 1.40% higher than the state-of-the-art methods, in offline detection, and an AP of 78.09%, 4.42% higher than the state-of-the-art methods, in online detection.
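The abstract's pipeline (branch-wise concatenation of modality features, then score-level fusion driven by optical-flow change) can be sketched as follows. This is an illustrative stand-in, not the paper's implementation: the function names, the sigmoid weighting of the normalized flow magnitude, and the `tau` temperature are all assumptions for the sketch.

```python
import numpy as np

def channel_concat(feat_a, feat_b):
    """Concatenate two per-segment feature sets along the channel axis,
    as the main and optical flow branches do with RGB/audio/flow features."""
    return np.concatenate([feat_a, feat_b], axis=-1)

def flow_aware_score_fusion(main_scores, flow_scores, flow_mag, tau=1.0):
    """Fuse per-segment violence scores from the two branches.

    Here the mixing weight grows with the optical-flow change (a sigmoid of
    the mean-centered flow magnitude) -- an assumed stand-in for the paper's
    optical flow-aware-based score fusion strategy.
    """
    w = 1.0 / (1.0 + np.exp(-(flow_mag - flow_mag.mean()) / tau))
    return w * flow_scores + (1.0 - w) * main_scores

rng = np.random.default_rng(0)
rgb = rng.random((8, 1024))      # 8 video segments, 1024-dim RGB features
audio = rng.random((8, 128))     # 128-dim audio features
main_in = channel_concat(rgb, audio)   # main-branch input, shape (8, 1152)
main_scores = rng.random(8)      # stand-in per-segment branch predictions
flow_scores = rng.random(8)
flow_mag = rng.random(8)         # per-segment optical-flow change
fused = flow_aware_score_fusion(main_scores, flow_scores, flow_mag)
```

Because the fusion is a per-segment convex combination, each fused score stays between the two branch scores, with segments showing larger flow change leaning toward the optical flow branch.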
Keywords: adaptive fusion; multi-modal fusion; optical flow-aware; violence detection
Year: 2022 PMID: 35885162 PMCID: PMC9316342 DOI: 10.3390/e24070939
Source DB: PubMed Journal: Entropy (Basel) ISSN: 1099-4300 Impact factor: 2.738
Figure 1. Our proposed OAMFN. The cross-modal information fusion module captures cross-modal information and fuses multi-modal features; the channel attention module selects meaningful information; and the prediction module generates prediction scores and fuses the two branches via the optical flow-aware-based score fusion strategy.
Figure 2. Structure of the channel attention module.
Figure 3. Schematic diagram of the feature extraction module.
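Channel attention of the kind shown in Figure 2 is commonly built squeeze-and-excitation style: pool the fused features, pass them through a small bottleneck, and gate each channel with a sigmoid weight. A minimal sketch under that assumption (the exact structure in Figure 2 may differ; `w1`, `w2`, and the reduction ratio `r` are illustrative):

```python
import numpy as np

def channel_attention(x, w1, w2):
    """SE-style channel attention over fused features x of shape
    (segments, channels): squeeze -> bottleneck FC + ReLU -> sigmoid gate."""
    s = x.mean(axis=0)                    # squeeze: average over segments
    h = np.maximum(s @ w1, 0.0)           # excitation: FC -> ReLU
    g = 1.0 / (1.0 + np.exp(-(h @ w2)))   # FC -> per-channel gate in (0, 1)
    return x * g                          # reweight each channel

rng = np.random.default_rng(0)
C, r = 16, 4                              # channels, reduction ratio
x = rng.random((8, C))
w1 = rng.standard_normal((C, C // r)) * 0.1
w2 = rng.standard_normal((C // r, C)) * 0.1
y = channel_attention(x, w1, w2)
```

Since the gate lies in (0, 1), the module can only attenuate channels, which is how it suppresses less informative parts of the fused representation.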
Ablation study: a comparison of the modules in our method.
| Cross-Attention | Channel Attention | Optical Flow-Aware Fusion | AP (%) |
|---|---|---|---|
| 🗸 | | | 80.69 |
| | 🗸 | | 81.39 |
| | | 🗸 | 80.80 |
| 🗸 | 🗸 | | 82.58 |
| 🗸 | | 🗸 | 81.87 |
| | 🗸 | 🗸 | 82.43 |
| 🗸 | 🗸 | 🗸 | 83.09 |
A comparison of the AP performance with the existing methods on the XD-Violence dataset.
| Supervision | Method | Feature | Online AP (%) | Offline AP (%) |
|---|---|---|---|---|
| Unsupervised | SVM | – | – | 50.78 |
| | OCSVM [ ] | – | – | 27.25 |
| | Hasan et al. [ ] | – | – | 30.77 |
| Weakly Supervised | Sultani et al. [ ] | RGB | – | 73.20 |
| | Wu et al. [22] | RGB + Audio | 73.67 | 78.64 |
| | Tian et al. [ ] | RGB | – | 77.81 |
| | CRFD [ ] | RGB | – | 75.90 |
| | Pang et al. [ ] | RGB + Audio | – | 81.69 |
| | MSL [ ] | RGB | – | 78.59 |
| | Ours (without OASFM) | RGB + Flow + Audio | | |
| | Ours (with OASFM) | RGB + Flow + Audio | 78.09 | 83.09 |
Figure 4. A comparison of the offline AP performance with Wu et al. [22] on the violent classes.
Figure 5. Qualitative results on the testing videos from the XD-Violence dataset.