Romas Vijeikis, Vidas Raudonis, Gintaras Dervinis
Abstract
Intelligent video surveillance systems are rapidly being introduced in public places. The adoption of computer vision and machine learning techniques enables various applications for the collected video features; one of the major applications is safety monitoring. The efficacy of a violence detection system is measured by its efficiency and accuracy. In this paper, we present a novel architecture for violence detection from video surveillance cameras. Our proposed model combines a U-Net-like network with a MobileNet V2 encoder for spatial feature extraction with an LSTM for temporal feature extraction and classification. The proposed model is computationally light and still achieves good results: experiments showed an average accuracy of 0.82 ± 2% and an average precision of 0.81 ± 3% on a complex real-world security camera footage dataset based on RWF-2000.
Keywords: LSTM; U-Net; computer vision; deep learning; intelligent video surveillance; violence detection; violent behavior
Year: 2022 PMID: 35336387 PMCID: PMC8950857 DOI: 10.3390/s22062216
Source DB: PubMed Journal: Sensors (Basel) ISSN: 1424-8220 Impact factor: 3.576
Summary of some related studies in violence detection.
| Reference | Detection Methods | Feature Extraction | Strength |
|---|---|---|---|
| Gao et al. | SVM and AdaBoost | Oriented violent flows (OViF) | The proposed OViF combined with LTP achieved accuracies of 87.50% and 88.00% on the Hockey Fights and Violent Flow datasets, respectively. |
| Peixoto et al. | Inception v4, C3D, and a CNN-LSTM | Mel-frequency cepstral coefficients (MFCCs) | Combining visual and audio features improved accuracy by roughly 6% over the audio-only baseline, from 72.8% to 78.5%. |
| Accattoli et al. | C3D + SVM | BoW or sparse coding information | Better detection rate; the proposed method is general and can be applied in various scenarios. |
| Zhou et al. | SVM | Low-level features: local histogram of oriented gradient (LHOG), bag-of-words (BoW), and local histogram of optical flow (LHOF) descriptors | The proposed feature extraction yielded an effective model for automatic detection of violent behaviors in comparison with state-of-the-art algorithms. |
| Mohtavipour et al. | CNN model | Differential motion energy image (DMEI) and DMOF | The proposed model improved violence detection, with an accuracy of approximately 100% in both crowded and uncrowded environments. |
Figure 1. U-Net and MobileNet V2 network model. Green shows copies of the encoder (MobileNet V2) feature maps concatenated to the decoder feature maps.
Figure 2. Proposed model architecture.
Figure 3. MobileNet V2 comparison with other state-of-the-art classifiers in terms of accuracy (Elgendi et al. [37]).
Summary of the proposed model's layers and parameters.
| Layer | Output Shape | No. of Parameters |
|---|---|---|
| Time distribution (U-Net features extractor) | (30, 64, 64, 1) | 1,907,041 |
| LSTM | (128) | 2,163,200 |
| Dense | (32) | 4128 |
| Dense | (2) | 66 |
| Total parameters | | 4,074,435 |
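The layer parameter counts in the table can be cross-checked with the standard formulas for LSTM and dense layers. A minimal sketch, assuming the TimeDistributed U-Net extractor's (64, 64, 1) per-frame output is flattened to a 4096-dimensional vector before entering the LSTM:

```python
def lstm_params(input_dim: int, units: int) -> int:
    # Each of the 4 LSTM gates has input weights, recurrent weights, and a bias.
    return 4 * units * (input_dim + units + 1)

def dense_params(input_dim: int, units: int) -> int:
    # Weight matrix plus one bias per unit.
    return input_dim * units + units

unet_extractor = 1_907_041            # as reported in the table
lstm = lstm_params(64 * 64 * 1, 128)  # 2,163,200
dense_1 = dense_params(128, 32)       # 4,128
dense_2 = dense_params(32, 2)         # 66

total = unet_extractor + lstm + dense_1 + dense_2
print(f"{total:,}")  # 4,074,435, matching the table
```

The LSTM count only matches under the flattening assumption above (4 × 128 × (4096 + 128 + 1) = 2,163,200), which is consistent with the (30, 64, 64, 1) extractor output feeding 30 timesteps of 4096 features each.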
Figure 4. Examples of violent (on the left) and nonviolent (on the right) scenes from the RWF-2000 dataset.
Figure 5. Examples of violent (on the left) and nonviolent (on the right) scenes from the Movie Fights dataset.
Figure 6. Examples of violent (on the left) and nonviolent (on the right) scenes from the Hockey Fights dataset.
Summary of datasets used.
| Dataset | Size of Dataset | Frame Rate |
|---|---|---|
| RWF-2000 | 1600 videos | 30 fps |
| Movie Fights | 200 videos | 25–30 fps |
| Hockey Fights | 1000 videos | 25 fps |
Summary of results for each dataset.
| Dataset | Avg. Inference Time, s | Avg. Accuracy, % | Avg. Precision, % | Avg. F1 Score |
|---|---|---|---|---|
| RWF-2000 | 0.046 ± 15% | 82.0 ± 3% | 81.2 ± 3% | 0.782 ± 5% |
| Movie Fights | 0.056 ± 10% | 99.5 ± 2% | 100 ± 0% | 0.995 ± 2% |
| Hockey Fights | 0.022 ± 2% | 96.1 ± 1% | 97.3 ± 2% | 0.961 ± 1% |
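The table reports precision and F1 but not recall; recall can be recovered by rearranging F1 = 2PR/(P + R) to R = F1 · P / (2P − F1). A quick sketch using the RWF-2000 averages, taking the reported means at face value and ignoring the stated error margins:

```python
def implied_recall(precision: float, f1: float) -> float:
    # Rearranged from F1 = 2 * P * R / (P + R).
    return f1 * precision / (2 * precision - f1)

# RWF-2000 averages from the table above
p, f1 = 0.812, 0.782
print(round(implied_recall(p, f1), 3))  # ~0.754
```

An implied recall of roughly 0.75 against a precision of 0.812 suggests the model on RWF-2000 misses somewhat more violent clips than it falsely flags.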
Figure 7. Precision–recall curve on the RWF-2000 dataset.
Proposed model compared to existing studies in terms of accuracy and number of model parameters.
| Method | Accuracy on RWF-2000 Dataset, % | Accuracy on Hockey Fights Dataset, % | Accuracy on Movie Fights Dataset, % | No. of Parameters in the Model |
|---|---|---|---|---|
| ViolenceNet Optical Flow (Rendón-Segador et al.) | - | 99.2 ± 0.6% | 100 ± 0% | 4.5 M |
| Efficient 3D CNN (Li et al.) | - | 98.3 ± 0.81% | 100 ± 0% | 7.4 M |
| Xception + Bi-LSTM + attention for 5 frames (Akti et al.) | - | 98 ± 0% | 100 ± 0% | 9 M |
| Xception + Bi-LSTM + attention for 10 frames (Akti et al.) | - | 97.5 ± 0% | 100 ± 0% | 9 M |
| ViolenceNet Pseudo-Optical Flow (Rendón-Segador et al.) | - | 97.5 ± 1% | 100 ± 0% | 4.5 M |
| C3D (Tran et al.) | - | 87.4 ± 1.2% | 93.6 ± 1.2% | 78.0 M |
| AlexNet + LSTM RNN (Sudhakaran and Lanz) | - | 97.1 ± 3% | 100 ± 0% | 9.6 M |
| End-to-end CNN-LSTM (AlDahoul et al.) | 73.35 ± 3% | - | - | 1.266 M |
| Hough Forests + 2D CNN (Serrano et al.) | - | 94.6 ± 0% | 99 ± 0% | not specified |
| Three Streams LSTM (Dong et al.) | - | 93.9 ± 0% | - | not specified |
| MoSIFT (Xu et al.) | - | 93.6 ± 1.67% | - | not specified |
| Histograms of frequency-based motion intensities + AdaBoost (Deniz et al.) | - | 90.1 ± 0% | 98.9 ± 0% | not specified |
| ResNet50 + ConvLSTM (Sharma and Baghel) | - | 89 ± 0% | 92 ± 0% | not specified |
| Fine-tuned MobileNet model (Khan et al.) | - | 87 ± 0% | 99.5 ± 0% | not specified |
| Motion Blobs + Random Forest (Gracia et al.) | - | 82.4 ± 0% | 96.9 ± 0% | not specified |