Rahul Nijhawan, Sharik Ali Ansari, Sunil Kumar, Fawaz Alassery, Sayed M. El-Kenawy.
Abstract
Increased mass shootings and terrorist activities severely impact society both mentally and physically. The development of real-time, cost-effective automated weapon-detection systems increases the sense of public safety. Most previously proposed methods are vision-based: they visually analyze the presence of a gun in a camera frame. This research instead focuses on gun-type detection (rifle, handgun, none) based on the audio of the shot. Mel-frequency-based audio features are used, and we compare convolution-based and fully self-attention-based (transformer) architectures. We find that the transformer architecture generalizes better on audio features. Experimental results using the proposed transformer methodology on audio clips of gunshots show a classification accuracy of 93.87%, with training and validation losses of 0.2509 and 0.1991, respectively. Based on these experiments, we are convinced that our model can be used effectively both as a standalone system and in association with visual gun-detection systems for better security.
Year: 2022 PMID: 35918405 PMCID: PMC9345922 DOI: 10.1038/s41598-022-17497-1
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.996
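As a concrete illustration of the Mel-frequency features the abstract refers to, the following is a minimal sketch of extracting a log-Mel spectrogram and MFCCs from a gunshot clip. It assumes the librosa library; the file path, sample rate, and frame parameters are illustrative choices, not the authors' exact settings.

```python
# Minimal sketch of Mel-frequency feature extraction for a gunshot clip.
# Assumes librosa; the file name and all parameters below are illustrative.
import librosa
import numpy as np

# Load a mono audio clip (hypothetical path), resampled to 22.05 kHz.
y, sr = librosa.load("gunshot_clip.wav", sr=22050, mono=True)

# Log-scaled Mel spectrogram: shape (n_mels, n_frames).
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=2048,
                                     hop_length=512, n_mels=128)
log_mel = librosa.power_to_db(mel, ref=np.max)

# MFCCs derived from the log-Mel spectrogram: shape (n_mfcc, n_frames).
mfcc = librosa.feature.mfcc(S=log_mel, sr=sr, n_mfcc=40)

print(log_mel.shape, mfcc.shape)
```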
Figure 1. The heatmap represents mass-shooting data collected by Mother Jones for the years 1981 to 2021; redder areas indicate more mass shootings in that area. Software: Location History Visualizer, Source: https://locationhistoryvisualizer.com/heatmap/.
This table summarizes techniques employed in previous relevant audio-related research, along with each work's focus (purpose) and publication year.
| Techniques | Purpose | Year | Accuracy (%) |
|---|---|---|---|
| Neural network, SVM, KNN, decision tree | Audio classification | 2003–2007 | 60.0–80.4 |
| Multi-layer perceptron | Audio classification | 2008 | 70.1 |
| One-class SVM | Audio classification | 2009 | 76.3 |
| Neural network | Feature extraction | 2010 | 80.0 |
| Deep neural network | Audio classification | 2013 | 85.2–8 |
| CNN, hierarchical neural network | Audio classification | 2014, 2015 | 86.1 |
| LSTM, RNN | Audio classification | 2016 | 88.2–89.3 |
| CNN | Audio tagging, deep feature extraction | 2017 | 90.3–91.5 |
| Deep unsupervised learning, unsupervised learning, weakly supervised learning, attention network | Audio event detection, audio representation, audio classification | 2018 | 89.0–92.0 |
| Few-shot attention, graph neural network, adversarial feature, capsule network | Audio classification | 2019 | 89.2–91.5 |
| Attention-based networks, DNN ensemble | Audio classification | 2020 | 90.2–91.8 |
| Attention-based networks, zero-shot federated learning | Audio classification | 2021 | 91.0–92.5 |
| Proposed approach | Audio classification | 2022 | 93.8 |
The table provides a general outline of prior literature; the most relevant previous works are delineated in greater detail in the literature review section.
Figure 2. The figure shows the division of an image into patches of size 32 × 32 (per the ViT-32 configuration). The outputs of the linear projection layer are combined with a positional embedding and a learnable class embedding for classification. The diagram of the transformer encoder is derived from the work of Vaswani et al.[43].
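As a rough illustration of the pipeline in Fig. 2, below is a minimal PyTorch sketch of the patch-embedding step: 32 × 32 patches are linearly projected, a learnable class token is prepended, and a positional embedding is added. The 224 × 224 single-channel input and the embedding width are assumptions for illustration, not the paper's verified settings.

```python
# Minimal PyTorch sketch of the patch embedding in Fig. 2 (assumed dimensions).
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    def __init__(self, img_size=224, patch=32, in_ch=1, dim=768):
        super().__init__()
        self.num_patches = (img_size // patch) ** 2  # 7 x 7 = 49 patches
        # A conv whose stride equals its kernel size is the standard
        # "linear projection of flattened patches".
        self.proj = nn.Conv2d(in_ch, dim, kernel_size=patch, stride=patch)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))  # learnable class embedding
        self.pos_embed = nn.Parameter(torch.zeros(1, self.num_patches + 1, dim))

    def forward(self, x):                              # x: (B, 1, 224, 224)
        x = self.proj(x).flatten(2).transpose(1, 2)    # (B, 49, dim)
        cls = self.cls_token.expand(x.size(0), -1, -1)
        x = torch.cat([cls, x], dim=1)                 # prepend class token
        return x + self.pos_embed                      # add positional embedding
```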
Figure 3. The diagram shows the ViT-32 model architecture, which contains 24 transformer encoder blocks; the encoder block is shown in detail in Fig. 2. Arrows indicate forward propagation.
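Below is a minimal sketch of how the 24-block encoder stack in Fig. 3 could be wired up with PyTorch's built-in encoder layer, reusing the PatchEmbedding sketch above. The width, head count, and feed-forward size follow common ViT-Large conventions and are assumptions, not the paper's confirmed configuration.

```python
# Sketch of a 24-block ViT-style classifier for 3 classes (rifle/handgun/none).
# Reuses the PatchEmbedding class from the previous sketch; widths are assumed.
import torch.nn as nn

class ViTClassifier(nn.Module):
    def __init__(self, dim=768, depth=24, heads=12, num_classes=3):
        super().__init__()
        self.embed = PatchEmbedding(dim=dim)  # from the sketch above
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           dim_feedforward=4 * dim,
                                           batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, x):
        tokens = self.encoder(self.embed(x))
        return self.head(tokens[:, 0])  # classify on the class token
```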
5-fold cross-validation audio classification results, in percent.
| Class | Accuracy | Recall | Precision | F1 score |
|---|---|---|---|---|
| Rifle | 91.52 | 89.44 | 89.01 | 88.40 |
| Handgun | 90.85 | 89.76 | 88.65 | 89.32 |
| Random noise | 91.21 | 90.12 | 88.43 | 90.41 |
| All classes | 90.01 | 90.34 | 89.87 | 90.34 |
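For reference, per-class figures like those in the table can be computed with scikit-learn across five stratified folds, along the lines of the sketch below. Here `features`, `labels`, and `train_model` are hypothetical placeholders for the dataset arrays and the model-fitting routine.

```python
# Sketch of 5-fold per-class metrics; `features`, `labels`, and
# `train_model` are hypothetical placeholders, not the paper's code.
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import precision_recall_fscore_support, accuracy_score

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
per_fold = []
for train_idx, test_idx in skf.split(features, labels):
    model = train_model(features[train_idx], labels[train_idx])  # hypothetical trainer
    preds = model.predict(features[test_idx])
    # average=None returns one value per class (rifle / handgun / random noise).
    p, r, f1, _ = precision_recall_fscore_support(labels[test_idx], preds, average=None)
    per_fold.append((accuracy_score(labels[test_idx], preds), p, r, f1))

# Overall accuracy averaged across the five folds.
print(np.mean([acc for acc, *_ in per_fold]))
```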
Comparison of our proposed approach with state-of-the-art algorithms (testing-accuracy range) on different available datasets.
| Model | Our dataset (%) | TREC (%) | Urban (%) |
|---|---|---|---|
| ResNet50 (baseline: raw audio) | 76.0–78.0 | 71.0–73.0 | 70.0–73.4 |
| Capsule network | 82.2–83.6 | 83.1–84.3 | 80.1–81.9 |
| DNN ensemble | 83.0–84.5 | 82.2–83.4 | 84.0–85.0 |
| Zero-shot federated learning | 83.5–86.0 | 82.9–84.8 | 83.0–85.0 |
| ResNet50 + MFCC + Mel-spectrogram | 84.0–87.0 | 83.0–84.5 | 82.9–85.0 |
| ViT + MFCC + Mel-spectrogram (proposed approach) | 89.0–90.0 | 88.0–89.5 | 87.4–89.0 |
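One plausible way to combine MFCC and Mel-spectrogram features into a fixed-size 2D input that both ResNet50 and ViT-32 can consume is sketched below; the bilinear resizing, min-max normalization, and channel layout are illustrative assumptions, not the paper's exact recipe.

```python
# Illustrative fusion of MFCC + log-Mel features into a 3-channel "image"
# suitable for ImageNet-style backbones. All design choices are assumed.
import numpy as np
import torch
import torch.nn.functional as F

def to_model_input(log_mel: np.ndarray, mfcc: np.ndarray, size: int = 224) -> torch.Tensor:
    feats = []
    for f in (log_mel, mfcc):
        t = torch.from_numpy(f).float()[None, None]  # (1, 1, freq, time)
        t = F.interpolate(t, size=(size, size), mode="bilinear", align_corners=False)
        # Min-max normalize each feature map to [0, 1].
        t = (t - t.min()) / (t.max() - t.min() + 1e-8)
        feats.append(t)
    # Stack as channels, duplicating one map to fill 3 channels.
    return torch.cat([feats[0], feats[1], feats[0]], dim=1)  # (1, 3, 224, 224)
```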
Figure 4. Graph (A) shows the training and validation accuracy curves for ResNet50 on our dataset; graph (B) shows the corresponding loss curves. Graph (C) shows the training and validation accuracy curves for ViT-32 on our dataset; graph (D) shows the corresponding loss curves. In all graphs, the x-axis shows the number of epochs.
ResNet50 vs. ViT-32: top training accuracy and loss range in nearby epochs.
| Rifle vs Handgun shot audio dataset | Top accuracy (%) | Loss range in nearby epochs |
|---|---|---|
| ResNet50 + MFCC + Mel-spectrogram | 93.87 | 0.0004–0.0400 |
| ViT-32 + MFCC + Mel-spectrogram | 93.87 | 0.2768–1.538 |
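The curves in Fig. 4 and the top-accuracy/loss-range figures above come from per-epoch bookkeeping during training; a minimal PyTorch sketch of that bookkeeping follows. `model`, `optimizer`, `train_loader`, `val_loader`, and `num_epochs` are hypothetical placeholders for the actual training setup.

```python
# Sketch of per-epoch train/validation tracking behind curves like Fig. 4.
# model, optimizer, train_loader, val_loader, num_epochs are placeholders.
import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss()
history = {"train_loss": [], "val_loss": [], "val_acc": []}

for epoch in range(num_epochs):
    model.train()
    losses = []
    for x, y in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        optimizer.step()
        losses.append(loss.item())
    history["train_loss"].append(sum(losses) / len(losses))

    model.eval()
    correct = total = 0
    val_losses = []
    with torch.no_grad():
        for x, y in val_loader:
            logits = model(x)
            val_losses.append(criterion(logits, y).item())
            correct += (logits.argmax(1) == y).sum().item()
            total += y.numel()
    history["val_loss"].append(sum(val_losses) / len(val_losses))
    history["val_acc"].append(100.0 * correct / total)

best_acc = max(history["val_acc"])  # "top accuracy" as reported in the table
```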