Batyrkhan Omarov1,2,3,4, Sergazi Narynov1, Zhandos Zhumanov1,4, Aidana Gumar1,5, Mariyam Khassanova1,5.
Abstract
We investigate and analyze methods for violence detection in this study to thoroughly characterize the current state and anticipate the emerging trends of violence detection research. In this systematic review, we provide a comprehensive assessment of the video violence detection problems described in state-of-the-art research. This work aims to survey state-of-the-art methods in video violence detection and the datasets used to develop and train real-time video violence detection frameworks, and to discuss and identify open issues in this problem. In this study, we analyzed 80 research papers selected from 154 after the identification, screening, and eligibility phases. As research sources, we used five digital libraries and three highly ranked computer vision conferences, covering publications between 2015 and 2021. We begin by briefly introducing the core ideas and problems of video-based violence detection; we then divide current techniques into three categories based on their methodologies: conventional methods, end-to-end deep learning-based methods, and machine learning-based methods. Finally, we present public datasets for testing the performance of video-based violence detection methods and compare their results. In addition, we summarize the open issues in video violence detection and evaluate its future tendencies. © 2022 Omarov et al.
Keywords: Artificial intelligence; Computer vision; Datasets; Deep learning; Machine learning; Video features; Violence detection
Year: 2022 PMID: 35494848 PMCID: PMC9044356 DOI: 10.7717/peerj-cs.920
Source DB: PubMed Journal: PeerJ Comput Sci ISSN: 2376-5992
Inclusion and exclusion criteria.
| I/E | Criteria | Explanation |
|---|---|---|
| Inclusion | Review paper | The paper proposes some type of review, such as a literature review, systematic review, or survey. |
| Inclusion | Research paper | The paper aims to solve specific research problems related to video surveillance security systems. |
| Exclusion | Duplicated papers | The same paper appears multiple times. |
| Exclusion | Non-research papers | The paper is not a research article; it might be editorial notes, comments, etc. |
| Exclusion | Non-related papers | The topic under study goes beyond the research context of this work. |
| Exclusion | Non-English papers | The paper is not written in English. |
| Exclusion | Implicitly related papers | The paper does not directly express a research focus on video surveillance security systems. |
Figure 1. Systematic literature review flowchart.
Research questions and their motivations.
| ID | Research Question | Motivation |
|---|---|---|
| RQ1 | What kind of video-based violence detection techniques are applied in state-of-the-art research? | Identify state-of-the-art methods and techniques in intelligent video surveillance |
| RQ2 | What kind of video features and descriptors are used in video violence detection? | Identify commonly used and state-of-the-art features and descriptors in video violence detection |
| RQ3 | What datasets are used to train models for video violence detection? | Identify datasets commonly used in intelligent video surveillance |
| RQ4 | What challenges and open questions exist to identify violence in videos? | Identify challenges and open issues in intelligent video surveillance |
Figure 2. Year-wise distribution of violence detection papers.
Figure 3. Distribution of violence detection methods.
Figure 4. Fundamental stages of video-based violence detection.
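The fundamental stages shared by the surveyed systems are feature extraction from the video stream followed by classification of the extracted features. As a minimal sketch of that two-stage pipeline (not any specific method from the reviewed papers; the motion-energy feature, the fixed threshold, and the synthetic clips are illustrative assumptions), the idea can be expressed as:

```python
import numpy as np

def extract_motion_feature(frames):
    """Feature-extraction stage: mean absolute frame difference per frame
    pair, a crude motion-energy descriptor over a (time, H, W) clip."""
    diffs = np.abs(np.diff(frames.astype(float), axis=0))
    return diffs.reshape(diffs.shape[0], -1).mean(axis=1)

def classify(feature_vec, threshold=20.0):
    """Classification stage (toy): flag a clip as violent when average
    motion energy exceeds a threshold; real systems use SVMs or CNNs."""
    return bool(feature_vec.mean() > threshold)

# Synthetic clips: a near-static scene vs. one with heavy frame-to-frame change.
rng = np.random.default_rng(0)
calm = rng.integers(0, 5, size=(8, 32, 32))
agitated = rng.integers(0, 255, size=(8, 32, 32))
print(classify(extract_motion_feature(calm)))      # low motion energy
print(classify(extract_motion_feature(agitated)))  # high motion energy
```

The surveyed methods differ mainly in what replaces each stage: handcrafted descriptors (ViF, STIP, MoBSIFT) or learned features for the first, and SVMs, random forests, or deep networks for the second.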
Violence detection techniques that use machine learning.
| Approach | Detection technique | Feature extraction | Classification | Scene type | Reported performance |
|---|---|---|---|---|---|
| Motion blob acceleration measure vector method for detection of fast fighting in video | Ellipse detection method | An algorithm to find the acceleration | Spatio-temporal features used for classification | Both crowded and less crowded | Accuracy about 90% |
| FightNet for violent interaction detection | Temporal Segment Network | Image acceleration | Softmax | Both crowded and uncrowded | 97% on Hockey, 100% on Movies dataset |
| RIMOC method focusing on the speed and direction of an object on the basis of HOF | Covariance matrix method based on STV | Spatio-temporal vector method (STV) | STV uses supervised learning | Both crowded and uncrowded | 97% accuracy for normal situations |
| Multiview fight detection method | YOLO-v3 network | Optical flow | Random forest | Both crowded and uncrowded | 97.66% accuracy, 97.66% F1-score |
| Two-step detection of violence and faces in video using the ViF descriptor and normalization algorithms | ViF object recognition CUDA method and KLT face detector | Horn–Schunck method for histograms | Interpolation classification | Less crowded | 97% (frame rates from 14 to 35 fps) |
| HL-Net to simultaneously capture long-range relations and local distance relations | HLC approximator | CNN-based model | Weak supervision | Both crowded and uncrowded | 78.64% |
| SVM method for recognition based on statistical theory of frames | Vector normalization method | Macro-block technique for feature extraction | Region motion and description for video classification | Crowded | 96.1% accuracy |
| A cascaded method of violence detection based on MoBSIFT and movement filtering | MoBSIFT | Motion boundary histogram | SVM, random forest, and AdaBoost | Both crowded and uncrowded | 90.2% accuracy on Hockey, 91% on Movies dataset |
| Lagrangian fields of direction and a bag-of-words framework to recognize violence in videos | Global compensation of object motion | Lagrangian theory and the STIP method to extract motion features | Late fusion for classification | Crowded | 91% to 94% accuracy |
Violence detection techniques using SVM.
| Approach | Technique | Feature extraction | Scene type | Reported performance |
|---|---|---|---|---|
| A video-based DT–SVM school violence detection algorithm | Motion Co-occurrence Feature (MCF) | Optical flow extraction | Crowded | 97.6% |
| GMOF framework with tracking and detection modules | Gaussian mixture model | OHFO for optical flow extraction | Crowded | 82%–89% accuracy |
| Violence detection using Oriented ViF | Optical flow method | Combination of ViF and OViF descriptors | Crowded | 90% |
| Violence detection based on autocorrelation of gradients | Motion boundary histograms | Frame-based feature extraction | Crowded | 91.38% accuracy on Crowd Violence; 90.40% on Hockey dataset |
| Framework comprising preprocessing, activity detection, and image retrieval; it identifies abnormal events and images from a database of images | Optical flow and temporal difference for object detection; CBIR method for retrieving images | Gaussian function for video feature analysis | Less crowded | 97% accuracy |
| Sparsity-based naive Bayes approach for anomaly detection in real surveillance videos | Sparsity-based naive Bayes | C3D feature extraction | Both crowded and uncrowded | 64.7% F1-score; 52.1% precision; 85.3% recall on UCF dataset |
| SGT-based and SVM-based multi-temporal framework to detect violent events in multi-camera surveillance | Late fusion | Multi-temporal analysis (MtA) | A variety of fight scenes, from two to fifteen people, with various movements | 78.3% (SGT-based, BEHAVE), 70.2% (SVM-based, BEHAVE), 87.2% (SGT-based, NUS-HGA), and 69.9% (SGT-based, YouTube) |
| An architecture to identify violence in video surveillance systems using ViF and LBP | Shape and motion analysis | ViF and Local Binary Pattern (LBP) descriptors | Both crowded and non-crowded | 89.1% accuracy on Hockey dataset, 88.2% on Violent-Flow dataset |
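Several of the rows above feed a ViF-style descriptor to an SVM. The ViF (Violent Flows) idea is to histogram how much the optical-flow magnitude *changes* between consecutive frames, yielding a fixed-length vector regardless of clip size. A hedged sketch of that descriptor (using frame differences as a stand-in for real optical-flow magnitudes; the function name, bin count, and synthetic clip are illustrative assumptions, not the original implementation):

```python
import numpy as np

def vif_like_descriptor(frames, bins=16):
    """ViF-inspired descriptor: histogram of the temporal change in
    per-pixel motion magnitude, normalized to a probability vector."""
    mags = np.abs(np.diff(frames.astype(float), axis=0))  # |flow| proxy
    change = np.abs(np.diff(mags, axis=0))                # magnitude change over time
    hist, _ = np.histogram(change, bins=bins, range=(0, 255))
    return hist / max(hist.sum(), 1)                      # fixed-length, normalized

rng = np.random.default_rng(1)
clip = rng.integers(0, 255, size=(10, 24, 24))            # (frames, H, W)
desc = vif_like_descriptor(clip)
print(desc.shape)  # (16,)
```

The fixed length is what makes the descriptor usable with a linear SVM: every clip, whatever its duration or resolution, maps to the same 16-dimensional space.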
Violence detection using deep learning techniques.
| Approach | Technique | Feature extraction | Scene type | Reported performance |
|---|---|---|---|---|
| Violence detection using a 3D CNN | 3D convolution used to obtain spatial information | Backpropagation method | Crowded | 91% accuracy |
| Deep architecture for place recognition | VGG-VLAD method for image retrieval | Backpropagation method for feature extraction | Crowded | 87%–96% accuracy |
| Framework for football stadiums comprising big data analysis and deep learning through a bidirectional LSTM | Bidirectional LSTM | HOG, SVM | Crowded | 94.5% accuracy |
| Violent scene detection using a CNN and deep audio features | MFB | CNN | Crowded | Approximately 90% accuracy |
| A multi-stream CNN using handcrafted features | A deep violence detection framework based on specific features (speed of movement and a representative image) derived from handcrafted methods | CNN | Both crowded and uncrowded | – |
| Detecting violent videos using ConvLSTM | CNN along with ConvLSTM | CNN | Crowded | Approximately 97% |
| Deep violence detection framework based on specific features derived from handcrafted methods | Discriminative features with a novel differential motion energy image | CNN | Both crowded and uncrowded | – |
| Detecting violent human behavior by integrating trajectories and a deep CNN | Deep CNN | Optical flow method | Crowded | 98% accuracy |
| ViolenceNet: dense multi-head self-attention with a bidirectional convolutional LSTM | 3D DenseNet | Optical flow method | Crowded | 95.6%–100% accuracy |
| Violence detection method based on a bi-channel CNN and an SVM | Linear SVM | Bi-channel CNN | Both crowded and uncrowded | 95.90 ± 3.53% accuracy on Hockey Fight, 93.25 ± 2.34% on Violent Crowd |
| Trajectory-pooled deep convolutional networks | ConvNet model containing 17 convolution-pool-norm layers and two fully connected layers | Deep ConvNet model | Both crowded and uncrowded | 92.5% accuracy on Crowd Violence, 98.6% on Hockey Fight dataset |
| Violence detection using spatiotemporal features | Pre-trained MobileNet CNN model | 3D CNN | Crowded | Approximately 97% accuracy |
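Several entries above rely on 3D CNNs, whose defining operation is a convolution over time as well as space, so a single kernel can respond to motion directly. A minimal framework-free sketch of that core operation (naive valid-mode 3D cross-correlation; the kernel and test volumes are illustrative, not taken from any surveyed network):

```python
import numpy as np

def conv3d_valid(volume, kernel):
    """Naive valid-mode 3D convolution (cross-correlation) over a
    (time, height, width) volume -- the core op of a 3D CNN."""
    T, H, W = volume.shape
    t, h, w = kernel.shape
    out = np.zeros((T - t + 1, H - h + 1, W - w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            for k in range(out.shape[2]):
                out[i, j, k] = np.sum(volume[i:i+t, j:j+h, k:k+w] * kernel)
    return out

# A temporal-difference kernel: responds to change between consecutive
# frames, i.e., motion -- something a purely 2D convolution cannot see.
kernel = np.zeros((2, 3, 3))
kernel[0], kernel[1] = -1, 1

static = np.ones((4, 5, 5))                                # no motion
moving = np.stack([np.full((5, 5), i) for i in range(4)])  # brightness ramps
print(np.abs(conv3d_valid(static, kernel)).max())  # 0.0 -- silent on static scenes
print(np.abs(conv3d_valid(moving, kernel)).max())  # 9.0 -- fires on motion
```

This is why the 3D-CNN rows above can skip a separate optical-flow stage: the temporal dimension of the learned kernels plays that role.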
Video features used in the selected studies.
| Video features |
|---|
| Motion, space, and time |
| Motion blobs, edges, and image corners |
| Motion blobs |
| Optical flow, motion, and moving blob |
| Motion, direction, and speed |
| MoDI and WMoDI, motion region marking |
| Optical flow, magnitude |
| Optical flow and audio features |
| Motion vector and direction |
| Optical flow, MBH, movement filtering |
| Spatial, temporal, and motion |
| Rectangular frame and optical flow |
| Spatiotemporal and motion |
| Spatio-temporal autocorrelation of gradients |
| Motion region and optical flow |
| Multiple handcrafted features |
| Spatial-temporal, C3D |
| Movement, direction, and speed |
| Speed, direction, centroid, and dimensions |
| Spatiotemporal features |
| Spatio-temporal |
| Spatial, temporal, and spatiotemporal streams |
| Spatio-temporal features |
| Appearance, motion, optical flow |
| Direction and motion information |
| Handcrafted and trajectory-deep features |
| Spatiotemporal features |
| Motion, acceleration, and magnitude |
| Spatiotemporal, acceleration, and motion |
| Time-domain and frequency-domain |
| STIP, optical flow |
Datasets for violence detection in video.
| Dataset | Reference | Characteristics | Published year |
|---|---|---|---|
| Movies | – | 200 video clips totaling 6 min | 2011 |
| Hockey | – | 1,000 video clips totaling 27 min | 2011 |
| UCF-101 | – | 13,000 clips | 2012 |
| CASIA Action | – | 8 classes of single-person activities containing 1,446 video clips | 2007 |
| UT-Interaction | – | 20 video sequences at a resolution of 720×480 at 30 fps | 2012 |
| The Boss | – | 10,000 images for training and 1,000 for testing | 2011 |
| XD-Violence | – | 4,574 videos with a duration of 217 h, covering 6 types of violent and 9 types of non-violent videos | 2020 |
| VVAR10 | – | 296 positive and 277 negative instances | 2016 |
| UCF Sports | – | – | 2014 |
| UCF50 | – | 50 actions, 100 min of videos | 2010 |
| HMDB51 | – | 6,766 manually annotated videos divided into 51 classes | 2011 |
| BEHAVE | – | 200,000 frames | 2010 |
| CAVIAR | – | 28 video clips | 2004 |
| Violent-Flow | – | 30 video clips | 2012 |
| UCF | – | 128 h of video; 1,900 surveillance videos covering 13 classes | 2018 |
| Crowd Violence | – | – | 2012 |
| Finnish emotional | – | 132 samples | 2021 |
| Chinese emotional | – | 370 samples | 2018 |
| Pittsburgh | – | – | 2013 |
| Tokyo 24/7 | – | – | 2015 |
| Violent interaction | – | 2,314 movies with 1,077 fights and 1,237 no-fights | 2019 |
| MediaEval | – | 10,000 clips | 2014 |
| Violent-Flows | – | – | 2015 |
| Weizmann | – | 9 actions, 9 clips | 2005 |
| KTH | – | 6 actions, 100 clips | 2004 |
| Violent Crowd | – | 246 short video sequences, with lengths varying from 50 to 150 frames | 2012 |
| London Metropolitan Police | Cheng et al. (2012) | – | 2012 |