Shabana Habib, Altaf Hussain, Waleed Albattah, Muhammad Islam, Sheroz Khan, Rehan Ullah Khan, Khalil Khan.
Abstract
Background and motivation: Every year, millions of Muslims from around the world travel to Mecca to perform the Hajj. To maintain the security of the pilgrims, the Saudi government has installed about 5000 closed-circuit television (CCTV) cameras to monitor crowd activity. Problem: These cameras generate an enormous amount of visual data, and monitoring it manually or offline requires considerable human resources. There is therefore an urgent need for an intelligent, automatic system that can efficiently monitor crowds and identify abnormal activity.
Keywords: CCTV; CNN; Hajj pilgrims monitoring; LSTM; crowd monitoring; lightweight; violent activity recognition
Year: 2021 PMID: 34960386 PMCID: PMC8703748 DOI: 10.3390/s21248291
Source DB: PubMed Journal: Sensors (Basel) ISSN: 1424-8220 Impact factor: 3.576
Tragic stampedes during Hajj [3].
| Date | Event and Location | Casualties |
|---|---|---|
| 24 September 2015 | Stampede at the junction of streets 204 and 223 in Mina | 2110 |
| 12 January 2006 | Stampede at Jamarat Bridge in Mina | 364 |
| 1 February 2004 | 27-min stampede during Jamarat stoning | 251 |
| 11 February 2003 | Stampede at Jamarat in Mina | 14 |
| 5 March 2001 | Stampede at Jamarat in Mina | 35 |
| 9 April 1998 | Stampede and fall from overpass at Jamarat in Mina | 118 |
| 15 April 1997 | Fire fueled by high winds in tent city, Mina | 343 |
| 23 May 1994 | Stampede at Jamarat in Mina | 270 |
| 2 July 1990 | Stampede/suffocation in the tunnel leading to Haram | 1426 |
| 31 July 1987 | Security forces break up anti-US demo by Iranian Hajis | 402 |
Figure 1. The proposed framework for violent activity recognition.
Figure 2. Visual representation of the YOLOv4 model on the pilgrim dataset.
Internal details of the proposed MobileNetV2 model for activity recognition.
| Layer (Type) | Output Shape | Number of Parameters | Connected to |
|---|---|---|---|
| global_average_pooling2d | (None, 1280) | 0 | out_relu[0][0] |
| Dense | (None, 1024) | 1,311,744 | global_average_pooling2d[0][0] |
| Dense | (None, 1024) | 1,049,600 | dense[0][0] |
| Dense | (None, 512) | 524,800 | dense_1[0][0] |
| Dense | (None, 2) | 1026 | dense_2[0][0] |
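The parameter counts in this table follow directly from the layer widths: a Dense layer with `n_in` inputs and `n_out` units has `n_in * n_out` weights plus `n_out` biases, and global average pooling has no trainable weights. The short check below recomputes them from the widths listed in the table; the helper name `dense_params` is illustrative only.

```python
def dense_params(n_in: int, n_out: int) -> int:
    """Weights (n_in x n_out) plus one bias per output unit."""
    return n_in * n_out + n_out


# Widths from the table: 1280-d pooled MobileNetV2 features, three hidden
# Dense layers (1024, 1024, 512) and a 2-way output layer.
widths = [1280, 1024, 1024, 512, 2]

per_layer = [dense_params(a, b) for a, b in zip(widths, widths[1:])]
total = sum(per_layer)

print(per_layer)  # [1311744, 1049600, 524800, 1026]
print(total)      # 2887170
```

The head therefore adds roughly 2.9 million trainable parameters on top of the MobileNetV2 backbone.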
Description of the input and output parameters used in the proposed temporal feature extraction model.
| Variable or Symbol | Meaning |
|---|---|
| T | Time |
|  | Input gate |
|  | Forget gate |
|  | Output gate |
|  | Recurring unit |
|  | Input at the current time step |
|  | Hidden state of the previous time step |
|  | Current hidden state |
|  | Memory cell |
|  | Final representation of the entire sequence |
|  | Bias |
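The quantities in the table above are those of a standard LSTM cell. For reference, that cell can be written as follows. The notation here is conventional and assumed, not taken from the source: $x_t$ is the current input, $h_{t-1}$ the previous hidden state, $i_t$, $f_t$, $o_t$ the input, forget and output gates, $\tilde{c}_t$ the candidate (recurring) unit, $c_t$ the memory cell, $W$ and $U$ the input and recurrent weight matrices, $b$ the biases, $\sigma$ the logistic sigmoid and $\odot$ element-wise multiplication.

```latex
\begin{aligned}
i_t &= \sigma\left(W_i x_t + U_i h_{t-1} + b_i\right) \\
f_t &= \sigma\left(W_f x_t + U_f h_{t-1} + b_f\right) \\
o_t &= \sigma\left(W_o x_t + U_o h_{t-1} + b_o\right) \\
\tilde{c}_t &= \tanh\left(W_c x_t + U_c h_{t-1} + b_c\right) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
h_t &= o_t \odot \tanh\left(c_t\right)
\end{aligned}
```

For a sequence of length $T$, the last hidden state $h_T$ is a common choice for the final representation of the entire sequence. Since each of the four gate equations has its own $W$ ($n_h \times n_{in}$), $U$ ($n_h \times n_h$) and $b$ ($n_h$), a layer with $n_h$ hidden units has $4\,(n_{in} + n_h + 1)\,n_h$ parameters.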
Detailed summary of the spatio-temporal model for abnormal activity recognition.
| Layer (Type) | Output Shape | Number of Parameters |
|---|---|---|
| InputLayer | (None, 30, 1000) | 0 |
| LSTM | (None, 30, 256) | 1,287,168 |
| LSTM | (None, 30, 128) | 197,120 |
| Batch Normalisation | (None, 30, 128) | 512 |
| LSTM | (None, 30, 64) | 49,408 |
| Flatten | (None, 1920) | 0 |
| Batch Normalisation | (None, 1920) | 7680 |
| Dense | (None, 256) | 491,776 |
| Dropout | (None, 256) | 0 |
| Dense | (None, 128) | 32,896 |
| Dropout | (None, 128) | 0 |
| Dense | (None, 2) | 258 |
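Every count in this table can be reproduced from the layer widths, which is a quick way to confirm it was transcribed correctly. An LSTM layer has four gates, each with input weights, recurrent weights and a bias; Batch Normalisation stores four vectors per feature (trainable scale and shift plus moving mean and variance); Flatten and Dropout add nothing. The sketch below recomputes each entry; the layer names in the dictionary are hypothetical.

```python
def lstm_params(n_in: int, units: int) -> int:
    """Four gates, each with an input matrix (n_in x units),
    a recurrent matrix (units x units) and a bias vector (units)."""
    return 4 * (n_in * units + units * units + units)


def dense_params(n_in: int, n_out: int) -> int:
    """Weight matrix plus one bias per output unit."""
    return n_in * n_out + n_out


def batchnorm_params(features: int) -> int:
    """gamma and beta (trainable) plus moving mean and variance (non-trainable)."""
    return 4 * features


steps, feat = 30, 1000   # InputLayer shape (None, 30, 1000)
flat = steps * 64        # Flatten after the 64-unit LSTM: 1920

# Layer names are illustrative, not taken from the source.
params = {
    "lstm": lstm_params(feat, 256),
    "lstm_1": lstm_params(256, 128),
    "batch_norm": batchnorm_params(128),
    "lstm_2": lstm_params(128, 64),
    "batch_norm_1": batchnorm_params(flat),
    "dense": dense_params(flat, 256),
    "dense_1": dense_params(256, 128),
    "dense_2": dense_params(128, 2),
}
for name, count in params.items():
    print(f"{name:>12}: {count:,}")
print("total:", f"{sum(params.values()):,}")  # total: 2,066,818
```

Most of the roughly 2.07 million parameters sit in the first LSTM layer, because its input matrix maps the 1000-dimensional per-frame features to 256 units for each of the four gates.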
Figure 3. Visual representation of the Hockey Fight and Surveillance Fight datasets.
Description and statistics of the Hockey Fight dataset.
| Dataset | Details |
|---|---|
| Dataset | Hockey Fight [ |
| Samples | 1000 |
| Resolution | 360 × 288 × 3 |
| Violent Scenes | 500 |
| Nonviolent Scenes | 500 |
Description and statistics of the Surveillance Fight dataset.
| Dataset | Details |
|---|---|
| Dataset | Surveillance Fight [ |
| Samples | 300 |
| Resolution | 480 × 360, 1280 × 720 |
| Violent Scenes | 150 |
| Nonviolent Scenes | 150 |
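The spatio-temporal model expects a fixed-length input of 30 time steps per clip (InputLayer shape (None, 30, 1000)), while clips in these datasets vary in length and resolution. The source does not say how the 30 frames are chosen. The sketch below assumes uniform temporal sampling, taking the middle frame of each of 30 equal segments; the function name is hypothetical.

```python
from typing import List


def sample_frame_indices(num_frames: int, num_steps: int = 30) -> List[int]:
    """Return `num_steps` frame indices spread evenly over a clip.

    Assumption: uniform temporal sampling, taking the middle frame of each
    of `num_steps` equal-length segments. Clips with fewer than `num_steps`
    frames repeat frames rather than being padded.
    """
    if num_frames <= 0:
        raise ValueError("clip has no frames")
    return [min(num_frames - 1, int((k + 0.5) * num_frames / num_steps))
            for k in range(num_steps)]


print(sample_frame_indices(60)[:5])  # [1, 3, 5, 7, 9]
```

Each selected frame would then be passed through the per-frame CNN to produce one feature vector per time step. Repeating frames for short clips keeps the sequence length fixed without introducing artificial all-zero steps.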
Performance of the proposed sequential learning model on the Hockey Fight dataset.
| Model | Category | Precision | Recall | F1-Score | Support | Accuracy (%) |
|---|---|---|---|---|---|---|
| AlexNet | Violent Activity | 0.67 | 0.98 | 0.80 | 101 | 75 |
|  | Normal Activity | 0.96 | 0.52 | 0.67 | 99 |  |
| VGG-16 | Violent Activity | 0.92 | 0.76 | 0.83 | 101 | 84 |
|  | Normal Activity | 0.79 | 0.93 | 0.86 | 99 |  |
| Proposed | Violent Activity | 0.96 | 0.97 | 0.97 | 101 | 96 |
|  | Normal Activity | 0.97 | 0.96 | 0.96 | 99 |  |
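Per-class precision, recall, F1-score and support, plus overall accuracy, all follow from a confusion matrix. The source does not report confusion matrices, so the counts below are hypothetical; they were chosen because they reproduce the AlexNet rows of the table (support 101 and 99). The function name is illustrative.

```python
from typing import Dict, List, Tuple


def classification_report(cm: List[List[int]], labels: List[str]
                          ) -> Tuple[Dict[str, Dict[str, float]], float]:
    """Per-class precision/recall/F1/support and overall accuracy.

    cm[i][j] counts samples whose true class is i and predicted class is j.
    """
    n = len(labels)
    total = sum(sum(row) for row in cm)
    report = {}
    for k, name in enumerate(labels):
        tp = cm[k][k]
        predicted_k = sum(cm[i][k] for i in range(n))  # column sum
        support = sum(cm[k])                           # row sum
        precision = tp / predicted_k if predicted_k else 0.0
        recall = tp / support if support else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        report[name] = {"precision": precision, "recall": recall,
                        "f1": f1, "support": support}
    accuracy = sum(cm[k][k] for k in range(n)) / total
    return report, accuracy


# Hypothetical counts (not reported in the source).
labels = ["Violent Activity", "Normal Activity"]
cm = [[99, 2],    # true violent: 99 predicted violent, 2 predicted normal
      [48, 51]]   # true normal: 48 predicted violent, 51 predicted normal
report, accuracy = classification_report(cm, labels)
for name in labels:
    r = report[name]
    print(name, round(r["precision"], 2), round(r["recall"], 2),
          round(r["f1"], 2), r["support"])
print("accuracy", accuracy)  # accuracy 0.75
```

The example also shows why AlexNet's violent-class recall (0.98) is misleading on its own: the model labels most clips as violent, which drags down both violent-class precision and normal-class recall.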
Experimental results of the first sequential model on the Hockey Fight dataset.
| LR | Category | Precision | Recall | F1-Score | Accuracy |
|---|---|---|---|---|---|
| 0.000001 | Violent Activity | 0.909091 | 0.9375 | 0.923077 | 0.925 |
|  | Normal Activity | 0.940594 | 0.913462 | 0.926829 |  |
| 0.00001 | Violent Activity | 0.927835 | 0.9375 | 0.932642 | 0.935 |
|  | Normal Activity | 0.941748 | 0.932692 | 0.937198 |  |
| 0.0001 | Violent Activity | 0.909091 | 0.9375 | 0.923077 | 0.925 |
|  | Normal Activity | 0.940594 | 0.913462 | 0.926829 |  |
| 0.001 | Violent Activity | 0.843137 | 0.895833 | 0.868687 | 0.87 |
|  | Normal Activity | 0.897959 | 0.846154 | 0.871287 |  |
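Picking a learning rate from a sweep like this amounts to an argmax over a chosen metric. The sketch below uses macro-averaged F1 (the unweighted mean of the two per-class F1 scores) with values copied from the table above; the source does not state which selection criterion was used, so this is an assumption.

```python
# Per-class F1 scores (Violent Activity, Normal Activity) for each
# learning rate, copied from the table above.
f1_by_lr = {
    0.000001: (0.923077, 0.926829),
    0.00001: (0.932642, 0.937198),
    0.0001: (0.923077, 0.926829),
    0.001: (0.868687, 0.871287),
}


def macro_f1(per_class):
    """Unweighted mean of per-class F1 scores."""
    return sum(per_class) / len(per_class)


best_lr = max(f1_by_lr, key=lambda lr: macro_f1(f1_by_lr[lr]))
print(best_lr)  # 1e-05
```

Macro-averaging weights both classes equally, which matters more on test splits where the classes are not balanced.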
Detailed summary of the second experiment, in which a bidirectional LSTM was used to learn the complex patterns of violent activity.
| LR | Category | Precision | Recall | F1-Score | Accuracy |
|---|---|---|---|---|---|
| 0.000001 | Violent Activity | 0.818182 | 0.9375 | 0.873786 | 0.87 |
|  | Normal Activity | 0.933333 | 0.807692 | 0.865979 |  |
| 0.00001 | Violent Activity | 0.936842 | 0.927083 | 0.931937 | 0.935 |
|  | Normal Activity | 0.933333 | 0.942308 | 0.937799 |  |
| 0.0001 | Violent Activity | 0.927835 | 0.9375 | 0.932642 | 0.935 |
|  | Normal Activity | 0.941748 | 0.932692 | 0.937198 |  |
| 0.001 | Violent Activity | 0.927083 | 0.927083 | 0.927083 | 0.93 |
|  | Normal Activity | 0.932692 | 0.932692 | 0.932692 |  |
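A bidirectional recurrent layer runs one recurrent network over the sequence in its original order and a second one over the reversed sequence. It then pairs their states at each time step, so every position sees both past and future context. The framework-free sketch below shows only that wiring: a running sum stands in for the LSTM step, and the function names are hypothetical. In Keras-style bidirectional layers the paired states are, by default, concatenated, which doubles the output width.

```python
from typing import Any, Callable, List, Sequence, Tuple

Step = Callable[[Any, Any], Any]  # (previous state, input) -> new state


def run_rnn(step: Step, h0: Any, xs: Sequence[Any]) -> List[Any]:
    """Apply a recurrent step over xs and collect the state after every step."""
    states, h = [], h0
    for x in xs:
        h = step(h, x)
        states.append(h)
    return states


def bidirectional(step_fwd: Step, step_bwd: Step, h0: Any,
                  xs: Sequence[Any]) -> List[Tuple[Any, Any]]:
    """Pair forward and backward states per time step.

    The backward pass reads the reversed sequence; its outputs are reversed
    again so that position t combines the past (forward) and future
    (backward) context of x_t.
    """
    fwd = run_rnn(step_fwd, h0, xs)
    bwd = run_rnn(step_bwd, h0, list(reversed(xs)))[::-1]
    return list(zip(fwd, bwd))


# Toy "cell": a running sum, standing in for an LSTM step.
pairs = bidirectional(lambda h, x: h + x, lambda h, x: h + x, 0, [1, 2, 3])
print(pairs)  # [(1, 6), (3, 5), (6, 3)]
```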
A detailed summary of the residual LSTM architecture.
| LR | Category | Precision | Recall | F1-Score | Accuracy |
|---|---|---|---|---|---|
| 0.000001 | Violent Activity | 0.89899 | 0.927083 | 0.912821 | 0.915 |
|  | Normal Activity | 0.930693 | 0.903846 | 0.917073 |  |
| 0.00001 | Violent Activity | 0.908163 | 0.927083 | 0.917526 | 0.92 |
|  | Normal Activity | 0.931373 | 0.913462 | 0.92233 |  |
| 0.0001 | Violent Activity | 0.90099 | 0.947917 | 0.923858 | 0.925 |
|  | Normal Activity | 0.949495 | 0.903846 | 0.926108 |  |
| 0.001 | Violent Activity | 0.908163 | 0.927083 | 0.917526 | 0.92 |
|  | Normal Activity | 0.931373 | 0.913462 | 0.92233 |  |
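The source does not spell out how the residual LSTM is wired. A common formulation, assumed here, adds each layer's input sequence to its output sequence step by step. In this formulation $y^{(0)}_t = x_t$, $h^{(l)}_t$ is the hidden state of the $l$-th LSTM layer, and $P^{(l)}$ is the identity when the input and hidden widths match, otherwise a learned linear projection:

```latex
\begin{aligned}
h^{(l)}_t &= \mathrm{LSTM}^{(l)}\left(y^{(l-1)}_{1:t}\right)_t \\
y^{(l)}_t &= h^{(l)}_t + P^{(l)}\, y^{(l-1)}_t
\end{aligned}
```

The skip path gives gradients a direct route through the stack, which is the usual motivation for residual connections in deeper recurrent models.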
Performance of the lightweight LSTM model.
| LR | Category | Precision | Recall | F1-Score | Accuracy |
|---|---|---|---|---|---|
| 0.000001 | Violent Activity | 0.897959 | 0.916667 | 0.907216 | 0.91 |
|  | Normal Activity | 0.921569 | 0.903846 | 0.912621 |  |
| 0.00001 | Violent Activity | 0.919192 | 0.947917 | 0.933333 | 0.935 |
|  | Normal Activity | 0.950495 | 0.923077 | 0.936585 |  |
| 0.0001 | Violent Activity | 0.988989899 | 0.977083333 | 0.982820513 | 0.982965 |
|  | Normal Activity | 0.980693069 | 0.988461538 | 0.990731707 |  |
| 0.001 | Violent Activity | 0.866667 | 0.677083 | 0.760234 | 0.795 |
|  | Normal Activity | 0.752 | 0.903846 | 0.820961 |  |
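Because the F1-score is the harmonic mean of precision and recall, it always lies between the two. Recomputing it from a row's precision and recall is therefore a quick consistency check on tables like the one above. For example, recomputing the Normal Activity F1 at LR = 0.0001 from its listed precision and recall gives roughly 0.985.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; defined as 0 when both are 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Violent Activity row at LR = 0.000001.
print(round(f1_score(0.897959, 0.916667), 4))  # 0.9072

# Normal Activity row at LR = 0.0001.
print(round(f1_score(0.980693069, 0.988461538), 4))
```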
Performance of the lightweight LSTM model on the Surveillance Fight dataset.
| LR | Category | Precision | Recall | F1-Score | Accuracy |
|---|---|---|---|---|---|
| 0.000001 | Violent Activity | 0.56338 | 0.833333 | 0.672269 | 0.589474 |
|  | Normal Activity | 0.666667 | 0.340426 | 0.450704 |  |
| 0.00001 | Violent Activity | 0.621212 | 0.854167 | 0.719298 | 0.663158 |
|  | Normal Activity | 0.758621 | 0.468085 | 0.578947 |  |
| 0.0001 | Violent Activity | 0.8 | 0.833333 | 0.816327 | 0.810526 |
|  | Normal Activity | 0.822222 | 0.787234 | 0.804348 |  |
| 0.001 | Violent Activity | 0.527473 | 1 | 0.690647 | 0.547368 |
|  | Normal Activity | 1 | 0.085106 | 0.156863 |  |
Five-fold cross-validation (k = 5) on the Surveillance Fight dataset.
| Experiment | Accuracy (%) | Loss |
|---|---|---|
| Fold 1 | 77.89 | 0.6172 |
| Fold 2 | 76.32 | 0.9400 |
| Fold 3 | 81.05 | 0.5332 |
| Fold 4 | 73.68 | 0.8440 |
| Fold 5 | 76.60 | 0.6725 |
| Average over all folds | 77.11 | 0.7214 |
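In k-fold cross-validation the data is split into k disjoint folds, and each fold serves once as the test set while the other k - 1 folds are used for training. The source does not say whether its folds were shuffled or stratified by class. The sketch below assumes a single seeded shuffle followed by near-equal contiguous folds; the function name and seed are hypothetical.

```python
import random
from typing import List, Tuple


def k_fold_indices(n: int, k: int = 5, seed: int = 0
                   ) -> List[Tuple[List[int], List[int]]]:
    """Shuffle 0..n-1 once, cut it into k folds whose sizes differ by at
    most one, and return (train, test) index lists, one pair per fold."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    base, extra = divmod(n, k)
    folds, start = [], 0
    for f in range(k):
        size = base + (1 if f < extra else 0)  # first `extra` folds get one more
        folds.append(idx[start:start + size])
        start += size
    splits = []
    for f in range(k):
        test = folds[f]
        train = [i for g, fold in enumerate(folds) if g != f for i in fold]
        splits.append((train, test))
    return splits


splits = k_fold_indices(300, k=5)
print([len(test) for _, test in splits])  # [60, 60, 60, 60, 60]
```

With only two classes, a stratified variant that splits each class separately would keep the violent/nonviolent ratio equal across folds; this plain version does not guarantee that.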
Performance of the proposed model on the Hockey Fight dataset under five-fold cross-validation (k = 5).
| Experiment | Accuracy (%) | Loss |
|---|---|---|
| Fold 1 | 94.50 | 0.3105 |
| Fold 2 | 95.50 | 0.4459 |
| Fold 3 | 88.50 | 0.3992 |
| Fold 4 | 87.75 | 1.0237 |
| Fold 5 | 91.50 | 0.3647 |
| Average over all folds | 91.55 | 0.5088 |
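Reporting the spread across folds alongside the mean shows how much the folds disagree. On the Hockey Fight dataset, fold accuracies range from 87.75% to 95.50%. The sketch below summarises the per-fold accuracies from the table above as mean and sample standard deviation using Python's `statistics` module.

```python
import statistics

# Per-fold test accuracies (%) on the Hockey Fight dataset, from the table above.
fold_acc = [94.50, 95.50, 88.50, 87.75, 91.50]

mean = statistics.mean(fold_acc)
std = statistics.stdev(fold_acc)  # sample standard deviation across folds
print(f"{mean:.2f} ± {std:.2f} %")
```

A standard deviation of about 3.5 points is large compared with the gaps between several methods in the comparison table, so single-split accuracies should be read with that variability in mind.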
Figure 4. Computational complexity of the models: (a) the CNN model; (b) the spatio-temporal models.
Comparative analysis of the proposed model with existing methods.
| Method | Hockey Fight (%) | Surveillance Fight (%) |
|---|---|---|
| Motion Blobs and Random Forest [ | 82.40 | -- |
| VIF [ | 82.90 | -- |
| ViF, OViF, AdaBoost, and SVM [ | 87.50 | -- |
| Fine-Tune MobileNet [ | 87.00 | -- |
| Improved Fisher Vectors [ | 93.70 | -- |
| Hough Forests and 2D CNN [ | 94.60 | -- |
| HOG3D + KELM [ | 95.05 | -- |
| Two streams (Optical Flow + Darknet-19) [ | 98 | 74 |
| Proposed CNN-LSTM model | 98.00 | 81.05 |