Dongri Shan1, Yalu Xu1, Peng Zhang2, Xiaofang Wang2, Dongmei He2, Chenglong Zhang1, Maohui Zhou1, Guoqi Yu1.
Abstract
Object detection is one of the most important and challenging branches of computer vision. It is widely used in daily life, for example in surveillance security and autonomous driving. We propose a novel dual-path multi-scale object detection paradigm that extracts richer feature information for the detection task and mitigates the multi-scale detection problem, and on this basis we design a single-stage general object detection algorithm called the Dual-Path Single-Shot Detector (DPSSD). The dual path, consisting of a residual path and a concatenation path, makes shallow features easier to exploit and thereby improves detection accuracy. Our improved dual-path network adapts better to multi-scale object detection tasks, and we combine it with a feature fusion module to obtain a multi-scale feature learning paradigm called the "Dual-Path Feature Pyramid". We trained the models on the PASCAL VOC and COCO datasets with 320-pixel and 512-pixel inputs, respectively, and ran inference experiments to validate the structures in the neural network. The experimental results show that our algorithm has an advantage over other anchor-based single-stage object detectors and achieves an advanced level of mean average precision. Researchers can replicate the reported results of this paper.
Keywords: convolution neural networks; multi-scale; object detection; single-stage
Year: 2022 PMID: 35746398 PMCID: PMC9227523 DOI: 10.3390/s22124616
Source DB: PubMed Journal: Sensors (Basel) ISSN: 1424-8220 Impact factor: 3.847
Figure 1. Five paradigms of multi-scale object detection. (a) Image Pyramid: learns multiple detectors from images at different scales. (b) Prediction Pyramid: predicts on multiple feature maps. (c) Integrated Features: predicts on a single feature map generated from multiple features. (d) Feature Pyramid: combines the structures of the prediction pyramid and integrated features. (e) Dual-Path Feature Pyramid: uses the structure of the prediction pyramid together with two methods of feature fusion.
Figure 2. The architecture of the Dual-Path Single-Shot Detector. We designed a dual-path network and a feature fusion module to obtain six fused high-level and low-level features. Finally, classification and bounding-box regression are carried out by 1 × 1 convolution. The figure shows that several layers of the base network are extracted as the features for predicting objects of different sizes.
Dual-path network architecture.
| Layers | Sub-Layers | Size/Stride | Groups | Output |
|---|---|---|---|---|
| Conv1_1 | conv | 7 × 7/2 | 1 | 80 × 80 × 64 |
| | maxpool | 3 × 3/2 | 1 | |
| Conv2_1 | conv_a | 1 × 1/1 | 1 | 80 × 80 × 256 |
| | conv_b | 1 × 1/1 | 1 | |
| | conv_b_1 | 3 × 3/1 | 32 | |
| | conv_b_2 | 1 × 1/1 | 1 | |
| Conv2_2 | conv_b | 1 × 1/1 | 1 | 80 × 80 × 256 |
| | conv_b_1 | 3 × 3/1 | 32 | |
| | conv_b_2 | 1 × 1/1 | 1 | |
| Conv3_1 | conv_a | 1 × 1/2 | 1 | 40 × 40 × 512 |
| | conv_b | 1 × 1/1 | 1 | |
| | conv_b_1 | 3 × 3/2 | 32 | |
| | conv_b_2 | 1 × 1/1 | 1 | |
| Conv3_2 | conv_b | 1 × 1/1 | 1 | 40 × 40 × 512 |
| | conv_b_1 | 3 × 3/1 | 32 | |
| | conv_b_2 | 1 × 1/1 | 1 | |
| Conv4_1 | conv_a | 1 × 1/2 | 1 | 20 × 20 × 1024 |
| | conv_b | 1 × 1/1 | 1 | |
| | conv_b_1 | 3 × 3/2 | 32 | |
| | conv_b_2 | 1 × 1/1 | 1 | |
| Conv4_2 | conv_b | 1 × 1/1 | 1 | 20 × 20 × 1024 |
| | conv_b_1 | 3 × 3/1 | 32 | |
| | conv_b_2 | 1 × 1/1 | 1 | |
| Conv5_1 | conv_a | 1 × 1/1 | 1 | 20 × 20 × 1024 |
| | conv_b | 1 × 1/1 | 1 | |
| | conv_b_1 | 3 × 3/1 | 32 | |
| | conv_b_2 | 1 × 1/1 | 1 | |
| Conv5_2 | conv_b | 1 × 1/1 | 1 | 20 × 20 × 1024 |
| | conv_b_1 | 3 × 3/1 | 32 | |
| | conv_b_2 | 1 × 1/1 | 1 | |
| Conv6_1 | conv_a | 1 × 1/2 | 1 | 10 × 10 × 1024 |
| | conv_b | 1 × 1/1 | 1 | |
| | conv_b_1 | 3 × 3/2 | 32 | |
| | conv_b_2 | 1 × 1/1 | 1 | |
| Conv7_1 | conv_a | 1 × 1/2 | 1 | 5 × 5 × 1024 |
| | conv_b | 1 × 1/1 | 1 | |
| | conv_b_1 | 3 × 3/2 | 32 | |
| | conv_b_2 | 1 × 1/1 | 1 | |
| Conv8_1 | conv_a | 1 × 1/2 | 1 | 3 × 3 × 1024 |
| | conv_b | 1 × 1/1 | 1 | |
| | conv_b_1 | 3 × 3/2 | 32 | |
| | conv_b_2 | 1 × 1/1 | 1 | |
| Conv9_1 | conv_a | 1 × 1/2 | 1 | 1 × 1 × 1024 |
| | conv_b | 1 × 1/1 | 1 | |
| | conv_b_1 | 3 × 3/2 | 32 | |
| | conv_b_2 | 1 × 1/1 | 1 | |
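The output resolutions in the table follow the standard convolution output-size formula, out = ⌊(n + 2p − k)/s⌋ + 1. A minimal sketch of the resolution chain for a 320 × 320 input; the padding values are our assumption, chosen to reproduce the table (the last layer uses no padding, as is common for SSD-style extra layers):

```python
def conv_out(n, k, s, p):
    """Spatial output size of a k x k convolution with stride s, padding p."""
    return (n + 2 * p - k) // s + 1

# Resolution-changing layers from the table, 320 x 320 input.
# (kernel, stride, padding) -- the padding values are assumptions.
stages = [
    ("Conv1_1 conv 7x7/2",    7, 2, 3),  # 320 -> 160
    ("Conv1_1 maxpool 3x3/2", 3, 2, 1),  # 160 -> 80
    ("Conv3_1 3x3/2",         3, 2, 1),  # 80  -> 40
    ("Conv4_1 3x3/2",         3, 2, 1),  # 40  -> 20
    ("Conv6_1 3x3/2",         3, 2, 1),  # 20  -> 10
    ("Conv7_1 3x3/2",         3, 2, 1),  # 10  -> 5
    ("Conv8_1 3x3/2",         3, 2, 1),  # 5   -> 3
    ("Conv9_1 3x3/2",         3, 2, 0),  # 3   -> 1 (no padding)
]

n, sizes = 320, []
for _, k, s, p in stages:
    n = conv_out(n, k, s, p)
    sizes.append(n)
print(sizes)  # [160, 80, 40, 20, 10, 5, 3, 1]
```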
Figure 3. Two paradigms of the dual-path block: (a) Feature segmentation is realized by channel merging. (b) The 1 × 1 convolution operation is used for feature segmentation.
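To make the dual-path bookkeeping concrete, here is a minimal sketch of the channel-merging view in Figure 3a, with plain Python lists standing in for channel tensors and a toy transform in place of the grouped convolutions; the names `dual_path_step` and `toy_transform` are ours, not the paper's:

```python
def dual_path_step(residual, dense, transform):
    # Block input: residual-path channels concatenated with dense-path
    # channels (plain list concat stands in for channel concatenation).
    r = len(residual)
    out = transform(residual + dense)
    # Residual path: element-wise sum with the first r output channels.
    new_residual = [a + b for a, b in zip(residual, out[:r])]
    # Concatenation path: remaining output channels are appended, so the
    # dense path grows with every block.
    new_dense = dense + out[r:]
    return new_residual, new_dense

# Toy transform: double every channel and emit r + 2 outputs
# (r residual updates plus 2 new dense channels).
def toy_transform(channels, r=3):
    return [2.0 * c for c in channels][: r + 2]

res, dense = [1.0, 1.0, 1.0], [0.5, 0.25]
res, dense = dual_path_step(res, dense, toy_transform)
print(res)    # [3.0, 3.0, 3.0]  (each channel: 1.0 + 2 * 1.0)
print(dense)  # [0.5, 0.25, 1.0, 0.5]  (two new channels appended)
```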
Figure 4. Four paradigms of the feature fusion module. (a) Using a two-layer convolution operation, features are fused by sum. (b) Using a two-layer convolution operation, features are fused by product. (c) Using a one-layer convolution operation, features are fused by sum. (d) Changing the number of channels of the fused features and using a two-layer convolution operation, features are fused by sum.
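The sum and product fusions in Figure 4 are element-wise operations on two same-shape feature maps; a minimal sketch with flattened maps as Python lists (the helper name `fuse` is ours):

```python
def fuse(shallow, deep, mode="sum"):
    """Element-wise fusion of two same-shape feature maps (flattened)."""
    if len(shallow) != len(deep):
        raise ValueError("feature maps must have the same shape")
    if mode == "sum":
        return [a + b for a, b in zip(shallow, deep)]
    if mode == "product":
        return [a * b for a, b in zip(shallow, deep)]
    raise ValueError(f"unknown fusion mode: {mode}")

shallow = [1.0, 2.0, 3.0]  # shallow feature (after its conv layers)
deep = [0.5, 0.5, 2.0]     # deep feature, upsampled to the same size
print(fuse(shallow, deep, "sum"))      # [1.5, 2.5, 5.0]
print(fuse(shallow, deep, "product"))  # [0.5, 1.0, 6.0]
```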
Figure 5. The paradigm of the deconvolution operation.
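Deconvolution (transposed convolution) upsamples the deep feature map so it can be fused with a shallower one; its output size is out = (n − 1)s − 2p + k. A quick sketch; the kernel/stride/padding choice of 2/2/0 for exact 2× upsampling is our assumption, not a detail from the paper:

```python
def deconv_out(n, k, s, p):
    """Spatial output size of a transposed convolution (no output padding):
    out = (n - 1) * s - 2p + k."""
    return (n - 1) * s - 2 * p + k

# A kernel-2, stride-2, padding-0 deconvolution doubles each map size,
# matching a deep pyramid level to the shallower level above it.
for n in (5, 10, 20):
    print(n, "->", deconv_out(n, 2, 2, 0))  # 5 -> 10, 10 -> 20, 20 -> 40
```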
Figure 6. Two paradigms of the prediction module: (a) prediction module with a residual connection; (b) prediction module without a residual connection.
Ablation study on the PASCAL VOC 2007 test set.
| Method | mAP | Anchor Boxes | Input Resolution |
|---|---|---|---|
| DPN (a) + PM (a) | 78.9 | 17,080 | 320 × 320 |
| DPN (a) + FFM (a) + PM (a) | 81.2 | 17,080 | 320 × 320 |
| DPN (a) + FFM (b) + PM (a) | 80.6 | 17,080 | 320 × 320 |
| DPN (a) + FFM (c) + PM (a) | 80.8 | 17,080 | 320 × 320 |
| DPN (a) + FFM (d) + PM (a) | 80.5 | 17,080 | 320 × 320 |
| DPN (a) + FFM (a) + PM (b) | 80.9 | 17,080 | 320 × 320 |
| DPN (b) + FFM (a) + PM (a) | 67.1 | 17,080 | 320 × 320 |
PASCAL VOC2007 test detection results. All models were trained on the joint training set of VOC 2007 trainval and 2012 trainval and were tested on the VOC 2007 test dataset.
| Method | SSD300 | SSD512 | STDN300 | STDN321 | STDN513 | DSSD321 | DSSD513 | DPSSD320 (Ours) | DPSSD512 (Ours) |
|---|---|---|---|---|---|---|---|---|---|
| Network | VGG | VGG | DenseNet-169 | DenseNet-169 | DenseNet-169 | Residual-101 | Residual-101 | DPN | DPN |
| mAP | 77.5 | 79.5 | 78.1 | 79.3 | 80.9 | 78.6 | 81.5 | 81.2 | 82.9 |
| aero | 79.5 | 84.8 | 81.1 | 81.2 | 86.1 | 81.9 | 86.6 | 88.5 | 87.9 |
| bike | 83.9 | 85.1 | 86.9 | 88.3 | 89.3 | 84.9 | 86.2 | 87.0 | 88.0 |
| bird | 76.0 | 81.5 | 76.4 | 78.1 | 79.5 | 80.5 | 82.6 | 82.3 | 87.1 |
| boat | 69.6 | 73.0 | 69.2 | 72.2 | 74.3 | 68.4 | 74.9 | 76.2 | 79.9 |
| bottle | 50.5 | 57.8 | 52.4 | 54.3 | 61.9 | 53.9 | 62.5 | 56.5 | 66.3 |
| bus | 87.0 | 87.8 | 87.7 | 87.6 | 88.5 | 85.6 | 89.0 | 88.7 | 88.5 |
| car | 85.7 | 88.3 | 84.2 | 86.5 | 88.3 | 86.2 | 88.7 | 88.2 | 89.0 |
| cat | 88.1 | 87.4 | 88.3 | 88.8 | 89.4 | 88.9 | 88.8 | 88.4 | 88.4 |
| chair | 60.3 | 63.5 | 60.2 | 63.5 | 67.4 | 61.1 | 65.2 | 67.4 | 71.2 |
| cow | 81.5 | 85.4 | 81.3 | 83.2 | 86.5 | 83.5 | 87.0 | 84.6 | 87.3 |
| table | 77.0 | 73.2 | 77.6 | 79.4 | 79.5 | 78.7 | 78.7 | 77.3 | 79.2 |
| dog | 86.1 | 86.2 | 86.6 | 86.1 | 86.4 | 86.7 | 88.2 | 86.7 | 88.0 |
| horse | 87.5 | 86.7 | 88.9 | 89.3 | 89.2 | 88.7 | 89.0 | 89.0 | 89.1 |
| mbike | 83.9 | 83.9 | 87.8 | 88.0 | 88.5 | 86.7 | 87.5 | 87.8 | 87.3 |
| person | 79.4 | 82.5 | 76.8 | 77.3 | 79.3 | 79.7 | 83.7 | 80.9 | 85.0 |
| plant | 52.3 | 55.6 | 51.8 | 52.5 | 53.0 | 51.7 | 51.1 | 59.5 | 59.0 |
| sheep | 77.9 | 81.7 | 78.4 | 80.3 | 77.9 | 78.0 | 86.3 | 84.3 | 86.1 |
| sofa | 79.5 | 79.0 | 81.3 | 80.8 | 81.4 | 80.9 | 81.6 | 83.7 | 81.9 |
| train | 87.6 | 86.6 | 87.5 | 86.3 | 86.6 | 87.2 | 85.7 | 87.0 | 86.2 |
| tv | 76.8 | 80.0 | 77.8 | 82.1 | 85.5 | 79.4 | 83.7 | 80.6 | 82.8 |
Figure 7. Accuracy and speed on PASCAL VOC 2007.
COCO test-dev2015 detection results.
| Method | Data | Network | Avg. Precision, IoU: | Avg. Precision, Area: | Avg. Recall, #Dets: | Avg. Recall, Area: | ||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.5:0.95 | 0.5 | 0.75 | S | M | L | 1 | 10 | 100 | S | M | L | |||
| SSD300 | trainval35k | VGG | 25.1 | 43.1 | 25.8 | 6.6 | 25.9 | 41.4 | 23.7 | 35.1 | 37.2 | 11.2 | 40.4 | 58.4 |
| SSD512 | trainval35k | VGG | 28.8 | 48.5 | 30.3 | 10.9 | 31.8 | 43.5 | 26.1 | 39.5 | 42.0 | 16.5 | 46.6 | 60.8 |
| DSSD321 | trainval35k | Residual-101 | 28.0 | 46.1 | 29.2 | 7.4 | 28.1 | 47.6 | 25.5 | 37.1 | 39.4 | 12.7 | 42.0 | 62.6 |
| DSSD513 | trainval35k | Residual-101 | 33.2 | 53.3 | 35.2 | 13.0 | 35.4 | 51.1 | 28.9 | 43.5 | 46.2 | 21.8 | 49.1 | 66.4 |
| STDN300 | trainval | DenseNet-169 | 28.0 | 45.6 | 29.4 | 7.9 | 29.7 | 45.1 | 24.4 | 36.1 | 38.7 | 12.5 | 42.7 | 60.1 |
| STDN513 | trainval | DenseNet-169 | 31.8 | 51.0 | 33.6 | 14.4 | 36.1 | 43.4 | 27.0 | 40.1 | 41.9 | 18.3 | 48.3 | 57.2 |
| DPSSD320 (ours) | trainval35k | DPN | 30.6 | 50.2 | 32.2 | 10.3 | 32.0 | 47.6 | 26.8 | 39.5 | 41.5 | 16.1 | 44.9 | 62.6 |
| DPSSD512 (ours) | trainval35k | DPN | 33.9 | 53.8 | 36.3 | 14.5 | 37.5 | 48.7 | 28.7 | 43.4 | 45.7 | 20.6 | 51.2 | 64.3 |
The speed and accuracy of the algorithms are summarized as follows. The training data are the combination of VOC 2007 trainval and VOC 2012 trainval.
| Method | Base Network | mAP | Speed (FPS) | Anchor Boxes | GPU | Input Resolution |
|---|---|---|---|---|---|---|
| SSD300 | VGG16 | 77.5 | 46 | 8732 | Titan X | 300 × 300 |
| SSD512 | VGG16 | 79.5 | 19 | 24,564 | Titan X | 512 × 512 |
| SSD300 (copied) | VGG16 | 77.6 | 49 | 8732 | Titan Xp | 300 × 300 |
| SSD512 (copied) | VGG16 | 79.7 | 24 | 24,564 | Titan Xp | 512 × 512 |
| DSSD321 | Residual-101 | 78.6 | 9.5 | 17,080 | Titan X | 321 × 321 |
| DSSD513 | Residual-101 | 81.5 | 5.5 | 43,688 | Titan X | 513 × 513 |
| DSSD321 (copied) | Residual-101 | 78.7 | 12.7 | 17,080 | Titan Xp | 321 × 321 |
| DSSD513 (copied) | Residual-101 | 81.3 | 9.8 | 43,688 | Titan Xp | 513 × 513 |
| STDN300 | DenseNet-169 | 78.1 | 41.5 | 13,888 | Titan Xp | 300 × 300 |
| STDN321 | DenseNet-169 | 79.2 | 40.1 | 17,080 | Titan Xp | 321 × 321 |
| STDN513 | DenseNet-169 | 80.9 | 28.6 | 43,680 | Titan Xp | 513 × 513 |
| DPSSD320 (ours) | DPN | 81.2 | 30.7 | 17,080 | Titan Xp | 320 × 320 |
| DPSSD512 (ours) | DPN | 82.9 | 21.3 | 43,680 | Titan Xp | 512 × 512 |
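The DPSSD anchor-box counts in the table are consistent with eight anchors per location summed over six prediction maps; a quick check sketch (eight anchors per location and the feature-map sizes for the 512-pixel model are our inferences, not figures stated in this record):

```python
def total_anchors(map_sizes, anchors_per_location=8):
    """Anchor boxes = anchors per location x total feature-map cells."""
    return anchors_per_location * sum(s * s for s in map_sizes)

# Six prediction maps of DPSSD320 (from the architecture table), and the
# sizes we infer for DPSSD512 by doubling the final-stride resolutions.
print(total_anchors([40, 20, 10, 5, 3, 1]))   # 17080, matches DPSSD320
print(total_anchors([64, 32, 16, 8, 4, 2]))   # 43680, matches DPSSD512
```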