Wei Li¹, Kai Liu², Lizhe Zhang¹, Fei Cheng¹.
Abstract
Object detection is an important component of computer vision. Most recent successful object detection methods are based on convolutional neural networks (CNNs). To improve the performance of these networks, researchers have designed many different architectures, finding that CNN performance benefits from carefully increasing the depth and width of the structure along the spatial dimension. Some researchers have exploited the cardinality dimension, and others have found that skip and dense connections also benefit performance. Recently, attention mechanisms on the channel dimension have gained popularity. SENet uses global average pooling to generate the input feature vector of its channel-wise attention unit. In this work, we argue that channel-wise attention can benefit from both global average pooling and global max pooling. We design three novel attention units to improve CNN performance: an adaptive channel-wise attention unit, an adaptive spatial-wise attention unit and an adaptive domain attention unit. Instead of concatenating the two attention vectors generated by the two channel-wise attention sub-units, we weight the two vectors based on the output data of the sub-units. We integrate the proposed mechanism into the YOLOv3 and MobileNetv2 frameworks and test the resulting networks on the KITTI and Pascal VOC datasets. The experimental results show that YOLOv3 with the proposed attention mechanism outperforms the original YOLOv3 by mAP margins of 2.9% and 1.2% on KITTI and Pascal VOC, respectively, and that MobileNetv2 with the proposed mechanism outperforms the original MobileNetv2 by an mAP margin of 1.7% on Pascal VOC.
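The abstract's core idea, feeding two channel-attention sub-units from global average pooling (GAP) and global max pooling (GMP) and then weighting their output vectors rather than concatenating them, can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the paper's implementation: the softmax gating rule used to weight the two attention vectors is assumed, since the abstract does not specify it.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def attention_subunit(v, w1, w2):
    # SE-style bottleneck: C -> C/r with ReLU, then C/r -> C with sigmoid.
    return sigmoid(w2 @ np.maximum(w1 @ v, 0.0))

def adaptive_channel_attention(x, params):
    """Recalibrate feature maps x of shape (C, H, W) channel by channel."""
    gap = x.mean(axis=(1, 2))   # global average pooling -> (C,)
    gmp = x.max(axis=(1, 2))    # global max pooling -> (C,)
    a_avg = attention_subunit(gap, *params["avg"])
    a_max = attention_subunit(gmp, *params["max"])
    # Assumed adaptive weighting: softmax over scalar summaries of the two
    # sub-unit outputs (the paper's exact rule is given in the full text).
    logits = np.array([a_avg.mean(), a_max.mean()])
    w = np.exp(logits) / np.exp(logits).sum()
    attn = w[0] * a_avg + w[1] * a_max   # per-channel gates in (0, 1)
    return x * attn[:, None, None]       # recalibrated feature maps
```

Because the gates lie in (0, 1), the unit can only attenuate channels, never amplify them, which matches the squeeze-and-excitation recalibration it extends.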
Year: 2020 PMID: 32647299 PMCID: PMC7347846 DOI: 10.1038/s41598-020-67529-x
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
Figure 1Flowchart of YOLOv3 architecture with adaptive attention.
Figure 2Adaptive channel-wise attention units. r and s are compression ratios.
Figure 3Flowchart of the domain attention unit. s is the compression ratio.
Figure 4Flowchart of the spatial attention unit. t is the compression ratio.
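Figure 4's spatial attention unit gates locations rather than channels. The details live in the figure, not the text, so the sketch below is an assumed CBAM-style variant: pool across the channel axis with mean and max, mix the two maps with scalar weights (a stand-in for the unit's convolution and compression ratio t), and sigmoid-gate each spatial position.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def spatial_attention(x, w_avg, w_max, bias):
    """Gate each spatial location of feature maps x with shape (C, H, W)."""
    avg_map = x.mean(axis=0)   # channel-wise mean pooling -> (H, W)
    max_map = x.max(axis=0)    # channel-wise max pooling -> (H, W)
    gate = sigmoid(w_avg * avg_map + w_max * max_map + bias)  # (H, W)
    return x * gate[None, :, :]   # same gate applied to every channel
```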
Figure 5Detailed intra-block connections of ‘ACBL’ and ‘ARes*N’.
Evaluation results on the KITTI dataset.
| Method | Car | Pedestrian | Cyclist | mAP | Inference time (ms) | Model size (M) | GFLOPs |
|---|---|---|---|---|---|---|---|
| YOLOv3 (original) | 91.1 | 74.6 | 79.0 | 81.6 | 47 | 246.3 | 112.6 |
| YOLOv3+SE (GMP) | 91.4 | 75.1 | 81.6 | 82.7 | 49 | 251.6 | 115.3 |
| YOLOv3+SE (GAP) | 91.2 | 75.5 | 82.2 | 83.0 | 49 | 251.6 | 115.3 |
| YOLOv3+SE (GAP and GMP) | 91.7 | 75.6 | 82.8 | 83.4 | 50 | 257.0 | 117.2 |
| YOLOv3+ACA (ours) | 92.3 | 77.0 | 83.4 | 84.2 | 50 | 262.3 | 119.9 |
| YOLOv3+ACA+ASA (ours) | 92.6 | 77.4 | 83.6 | 84.5 | 51 | 262.5 | 120.0 |
Comparison of the proposed model with recent works on the KITTI dataset.
| Method | Car | Pedestrian | Cyclist | mAP | Inference time (ms) |
|---|---|---|---|---|---|
| Gaussian YOLOv3 | 87.3 | 79.9 | 83.6 | 83.6 | 47 |
| RefineDet | 92.7 | 78.5 | 83.6 | 81.9 | 72 |
| SSD | 85.1 | 48.1 | 50.7 | 61.3 | 69 |
| RFBNet | 86.4 | 61.6 | 71.7 | 73.4 | 51 |
| SqueezeDet+ | 85.5 | 73.7 | 82.0 | 80.4 | 31 |
| MS-CNN[ | 87.4 | 80.4 | 86.3 | 84.7 | 246 |
| YOLOv3+adaptive attention (ours) | 91.2 | 75.0 | 81.3 | 84.5 | 51 |
Figure 6Loss curves.
Figure 7AP curves for each class.
Evaluation results on the PASCAL VOC dataset.
| Method | mAP | Inference time (ms) |
|---|---|---|
| YOLOv3 | 81.0 | 49 |
| YOLOv3+adaptive attention (ours) | 82.2 | 54 |
Figure 8MobileNetv2 with modified SSD detector model. ASA is adaptive spatial-wise attention. ACA is adaptive channel-wise attention.
Figure 9Inverted residual module. ASA or ACA is used to recalibrate the feature maps (blue cube) generated by the point-wise convolution layer within the IRM module.
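Figure 9 places the attention unit after the point-wise expansion inside MobileNetv2's inverted residual module (IRM). A minimal NumPy sketch of that placement follows; the depthwise convolution between expansion and projection is omitted for brevity, and the channel gate (GAP, linear map, sigmoid) is a simplified stand-in for the adaptive attention unit.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def pointwise_conv(x, w):
    # 1x1 convolution: mixes channels at every spatial location.
    # x: (C_in, H, W), w: (C_out, C_in) -> (C_out, H, W)
    return np.einsum("oc,chw->ohw", w, x)

def inverted_residual_with_attention(x, w_expand, w_project, w_gate):
    """Expand, recalibrate the expanded maps, project, skip-connect."""
    h = np.maximum(pointwise_conv(x, w_expand), 0.0)  # expansion + ReLU
    gate = sigmoid(w_gate @ h.mean(axis=(1, 2)))      # channel gates in (0, 1)
    h = h * gate[:, None, None]                       # recalibrate expanded maps
    out = pointwise_conv(h, w_project)                # linear projection
    return out + x                                    # residual shortcut
```

The key design point from the figure is that recalibration acts on the wide, expanded representation, where channel redundancy is highest, before the linear projection narrows it again.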
MobileNetv2 with modified SSD detector evaluation results on the PASCAL VOC dataset.
| Method | mAP | Inference time (ms) |
|---|---|---|
| MobileNetv2 with modified SSD detector | 68.4 | 3.66 |
| MobileNetv2 with modified SSD detector + adaptive attention | 70.1 | 3.92 |