Mahdi Maktab Dar Oghaz, Manzoor Razaak, Paolo Remagnino.
Abstract
A common issue in object detection in aerial imagery is the small size of objects relative to the overall image size. This is mainly caused by the high camera altitude and the wide-angle lenses commonly used on drones to maximize coverage. State-of-the-art general-purpose object detectors tend to underperform on small objects because of the loss of spatial features, the weak feature representation of small objects, and the sheer imbalance between objects and background. This paper addresses small object detection in aerial imagery with a Convolutional Neural Network (CNN) model that uses the Single Shot multi-box Detector (SSD) as the baseline network and extends its small object detection performance with feature enhancement modules, including super-resolution, deconvolution, and feature fusion. Together, these modules aim to improve the feature representation of small objects at the prediction layer. The performance of the proposed model is evaluated on three datasets, including two aerial image datasets that mainly consist of small objects. The proposed model is compared with state-of-the-art small object detectors. Experimental results demonstrate improvements in mean Average Precision (mAP) and Recall over the state-of-the-art small object detectors investigated in this study.
Keywords: SSD; deconvolution; feature fusion; small object detection; super-resolution
Year: 2022 PMID: 35746125 PMCID: PMC9228717 DOI: 10.3390/s22124339
Source DB: PubMed Journal: Sensors (Basel) ISSN: 1424-8220 Impact factor: 3.847
Figure 1. Objects in UAV images are usually small (relative to the total image size), and general-purpose object detectors are not designed to cope with this [11].
Figure 2. Poor feature representation of small objects at deeper layers of typical Convolutional Neural Networks, usually caused by repeated pooling and strides greater than 1.
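The effect described in Figure 2 can be illustrated with simple arithmetic. The sketch below assumes that every pooling or stride-2 layer halves the spatial resolution; the 16-pixel object and the helper name `feature_extent` are hypothetical and are not taken from the paper.

```python
def feature_extent(object_px: float, num_stride2_stages: int) -> float:
    """Approximate side length, in feature-map cells, of an object after
    `num_stride2_stages` layers that each halve the spatial resolution."""
    cumulative_stride = 2 ** num_stride2_stages
    return object_px / cumulative_stride


if __name__ == "__main__":
    # Hypothetical 16 x 16 pixel object, e.g. a sheep seen from a drone.
    for stages in range(6):
        cells = feature_extent(16, stages)
        print(f"after {stages} stride-2 stages: {cells:g} cells per side")
```

Under these assumptions, the object covers a single feature-map cell after four halvings and less than one cell after five, which is why deeper layers carry little information about it.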
A summary of selected works on CNN-based object detection for small objects in images.
| Strategy | Authors | Model Features | Data | Results |
|---|---|---|---|---|
| Two-stage detectors | [ | Feature extraction CNN combined with the R-CNN framework | Mobile Mapping Systems (MMS) images | mAP of up to 85%; 12% higher accuracy than ResNet-152 |
| | [ | R-CNN network combined with Tiny-Net, global attention block followed by a final classification block | Remote sensing images | Higher detection accuracy than R-CNN variants |
| | [ | An R-CNN network combined with a deconvolution layer | Remote sensing images | Higher accuracy than Faster R-CNN is reported |
| | [ | A region proposal network combined with a fusion network that concatenates spatial and semantic information | Remote sensing | Improved detection accuracy compared to state-of-the-art |
| | [ | Multi-block SSD consisting of three stages: patching, detection, and stitching | Railway scene dataset | Improved detection rate of small objects by 23.2% in comparison with the baseline object detectors |
| Single stage detectors | [ | Various configurations of SSD architecture, including stride elimination at different parts of the network | MS COCO dataset | Better detection accuracy for small objects in the COCO dataset when compared to baseline SSD |
| | [ | Tiling-based approach for training and inference on an SSD network | Micro aerial vehicle imagery | Improved detection performance on small objects when compared with full-frame approaches |
| | [ | Modification of YOLOv3 model for multi-scale feature representation | UAV imagery | Improvement in small object detection when compared to base YOLOv3 model |
| | [ | YOLO model with multi-scale feature fusion | Traffic imagery for car accident detection | Able to detect car accidents in 0.04 seconds with 90% accuracy |
| | [ | Feature fusion and feature dilation combined with YOLO model | Vehicle imagery | Improved accuracy, ranging between 80% and 88% on different datasets |
| | [ | YOLOv3 residual blocks optimized by concatenating two ResNet units that have the same width and height | UAV imagery | Improved IoU to over 70–80% across different datasets compared with the baseline models |
| | [ | Region Context Network attention mechanism shortlists the most promising regions and discards the rest of the input image to keep high-resolution feature maps in deeper layers | USC-GRAD-STD and MS COCO dataset | Improvement in average precision from 50.8% in baseline models to 57.4% |
| | [ | Feature fusion and spatial attention-based Multi-block SSD | LAKE-BOAT dataset | 79.3% mean average precision |
| Super-resolution | [ | Patch-based and pixel-based CNN architectures for image segmentation to identify small objects | Remote sensing images | Classification accuracy of 87% reported |
| | [ | A super-resolution-based generator network for up-sampling small objects | COCO dataset | Improved detection performance on small objects when compared with R-CNN models |
| | [ | Super-resolution method for feature enhancement to improve small object detection accuracy | Several RGB image datasets | Better detection accuracy compared to other super-resolution-based methods |
| | [ | A super-resolution-based Generative Adversarial Network (GAN) for small object detection | Several RGB image datasets | Achieved higher detection accuracy in comparison to R-CNN variants |
| Feature Pyramids | [ | Extended feature pyramid network which employs large-scale super-resolution features with rich regional details to decouple small and medium object detection | Small traffic-sign Tsinghua-Tencent and MS COCO dataset | Better accuracy across both datasets compared to the state-of-the-art methods |
| | [ | A two-stage detector (similar to Faster R-CNN) which first adopts the feature pyramid architecture with lateral connections, then uses specialized anchors to detect small objects in high-resolution images | Small traffic-sign Tsinghua-Tencent dataset | Significant accuracy improvement compared with state-of-the-art methods |
| | [ | A parallel feature pyramid network constructed by widening the network instead of increasing its depth. Spatial pyramid pooling is adopted to generate a pool of features | MS COCO dataset | 7.8% better average precision than the latest variant of SSD |
| | [ | Multi-branch parallel feature pyramid network (MPFPN) used to boost feature extraction of small objects. The parallel branch is designed to recover the features that are missed in the deeper layers, and a supervised spatial attention module is used to suppress background interference | VisDrone-DET dataset | Competitive performance compared with other state-of-the-art methods |
| | [ | Feature fusion and scaling-based SSD network with spatial context analysis | UAV imagery | Achieved 65.84% accuracy on PASCAL Visual Object Classes dataset. High accuracy on small objects in UAV images |
Figure 3. Network architecture of the proposed object detection model. The SSD model is used as the baseline network and extended with a deconvolution module (orange), a super-resolution module (green), and a shallow-layer feature fusion module (+).
Figure 4. Schematic comparison of approaches used by different types of object detectors. (a) Single stage detectors (e.g., YOLO). (b) Multi-level features used at prediction layer (e.g., SSD). (c) Approach of supplying shallow layer features for prediction (e.g., FSSD). (d) Deconvolutional layers for improved feature representation (e.g., DSSD). (e) Network schema of our proposed approach.
Figure 5. A deconvolutional module unit used after the SSD layers.
Figure 6. A deconvolutional module unit merged with an SSD layer using an element-wise sum operation.
Figure 7. The super-resolution module.
Figure 8. Concatenation of the and feature layers.
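The two fusion operations named in Figures 6 and 8 (element-wise sum and concatenation) can be contrasted with a minimal sketch. It is written in plain Python on nested lists and is not the authors' implementation: the helper names are hypothetical, and fixed nearest-neighbour upsampling stands in for the learned deconvolution so that the deeper feature map matches the spatial size of the shallower one.

```python
from typing import List, Tuple

Channel = List[List[float]]   # one H x W feature channel
FeatureMap = List[Channel]    # C channels, i.e. shape (C, H, W)


def shape(fmap: FeatureMap) -> Tuple[int, int, int]:
    """Return (channels, height, width) of a feature map."""
    return (len(fmap), len(fmap[0]), len(fmap[0][0]))


def upsample_nearest(fmap: FeatureMap, factor: int) -> FeatureMap:
    """Nearest-neighbour upsampling (assumed stand-in for a learned deconvolution)."""
    return [
        [[v for v in row for _ in range(factor)]
         for row in channel for _ in range(factor)]
        for channel in fmap
    ]


def fuse_sum(a: FeatureMap, b: FeatureMap) -> FeatureMap:
    """Element-wise sum: both maps must have identical (C, H, W)."""
    if shape(a) != shape(b):
        raise ValueError("element-wise sum needs identical shapes")
    return [
        [[x + y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(ch_a, ch_b)]
        for ch_a, ch_b in zip(a, b)
    ]


def fuse_concat(a: FeatureMap, b: FeatureMap) -> FeatureMap:
    """Channel concatenation: only H and W must match; channels are stacked."""
    if shape(a)[1:] != shape(b)[1:]:
        raise ValueError("concatenation needs matching spatial size")
    return a + b


# Toy example: a shallow 1 x 4 x 4 map and a deep 1 x 2 x 2 map.
shallow: FeatureMap = [[[1.0] * 4 for _ in range(4)]]
deep: FeatureMap = [[[1.0, 2.0], [3.0, 4.0]]]
deep_up = upsample_nearest(deep, 2)

summed = fuse_sum(shallow, deep_up)        # shape stays (1, 4, 4)
stacked = fuse_concat(shallow, deep_up)    # shape becomes (2, 4, 4)
```

The sum keeps the channel count fixed but mixes the two sources, whereas concatenation preserves both sets of features at the cost of more channels for the following layer.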
Figure 9. Example images from the livestock dataset. In these images, sheep are small targets for the object detectors.
The mAP, Recall, and FPS comparison of the proposed model with the state-of-the-art small object detectors on our custom livestock dataset. Some of the general purpose object detectors have been included in this comparison.
| Model | FPS | Recall (%) | mAP (%) |
|---|---|---|---|
| SSD300 | 36.50 | 88.20 | 74.80 |
| SSD512 | 19.25 | 91.32 | 75.20 |
| CenterNet++ | 4.70 | 92.44 | 76.18 |
| YOLOv3 | | 78.23 | 69.40 |
| Faster R-CNN | 7.40 | 83.60 | 71.20 |
| DSSD | 10.30 | 93.15 | 76.40 |
| FS-SSD | 17.35 | 93.91 | 77.14 |
| FF-SSD | 41.36 | 91.01 | 75.93 |
| MPFPN | 2.04 | 86.18 | 72.94 |
| EFPN | 4.14 | 90.23 | 74.81 |
| Proposed model | 8.75 | | 79.12 |
Figure 10. Comparison of small object detection between the proposed network (bottom row), the SSD network (middle row) and the FS-SSD network (top row) on the custom livestock dataset.
Figure 11. Qualitative evaluation of bounding box predictions by the proposed network on the custom livestock dataset. Blue boxes correspond to the ground truth labels and red boxes are the predicted bounding boxes.
The mAP, Recall, and FPS comparison of the proposed model with state-of-the-art small object detectors on a subset of the Stanford Drone Dataset (SDD) containing aerial images. Some of the general purpose object detectors have been included in this comparison.
| Model | FPS | Recall (%) | mAP (%) |
|---|---|---|---|
| SSD300 | 36.40 | 81.45 | 64.31 |
| SSD512 | 19.35 | 83.58 | 65.24 |
| CenterNet++ | 4.72 | 83.91 | 66.01 |
| YOLOv3 | | 78.64 | 57.42 |
| Faster R-CNN | 7.40 | 80.75 | 59.60 |
| DSSD | 10.30 | | 66.20 |
| FS-SSD | 18.05 | 85.88 | 66.02 |
| FF-SSD | 42.51 | 83.66 | 65.36 |
| MPFPN | 2.35 | 79.32 | 61.79 |
| EFPN | 4.33 | 82.11 | 63.94 |
| Proposed model | 8.75 | 85.95 | |
Figure 12. Comparison of small object detection of our proposed network (bottom row) with the SSD network (top row) on the Pedestrian category from the Stanford drone dataset.
Figure 13. Comparison of small object detection of our proposed network (bottom row) with the SSD network (top row) on a custom dataset acquired under the MONICA project.
Ablation study of the proposed network on the livestock dataset. Different combinations of the feature fusion methods, super-resolution, and deconvolution were evaluated based on mAP and FPS.
| Feature Fusion | Deconvolution | Super-Resolution | mAP (%) | FPS |
|---|---|---|---|---|
| NA | NA | NA | 74.80 | 36.50 |
| Element-wise sum | NA | NA | 75.70 | 22.86 |
| Concatenation | NA | NA | 76.10 | 22.42 |
| Concatenation | NA | YES | 77.20 | 17.64 |
| Concatenation | YES | NA | 77.90 | 14.52 |
| Concatenation | YES | YES | 79.12 | 8.75 |
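The trade-off in the ablation table can be made explicit by expressing each configuration relative to the first row (no modules). The following sketch only re-uses the mAP and FPS values from the table above; the configuration labels and the helper `relative_to_baseline` are illustrative names, not part of the paper.

```python
# (configuration label, mAP in %, FPS), copied from the ablation table.
ablation = [
    ("no modules", 74.80, 36.50),
    ("element-wise sum", 75.70, 22.86),
    ("concatenation", 76.10, 22.42),
    ("concatenation + super-resolution", 77.20, 17.64),
    ("concatenation + deconvolution", 77.90, 14.52),
    ("concatenation + deconvolution + super-resolution", 79.12, 8.75),
]


def relative_to_baseline(rows):
    """Return (label, mAP gain in points, fraction of baseline FPS retained)."""
    _, base_map, base_fps = rows[0]
    return [
        (label, round(m_ap - base_map, 2), round(fps / base_fps, 3))
        for label, m_ap, fps in rows
    ]


if __name__ == "__main__":
    for label, gain, fps_ratio in relative_to_baseline(ablation):
        print(f"{label:50s} mAP {gain:+.2f} pts, {fps_ratio:.0%} of baseline FPS")
```

Read this way, the full configuration gains 4.32 mAP points over the first row while retaining roughly a quarter of its frame rate.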
Figure 14. Speed and accuracy comparison of the proposed method with the state-of-the-art methods on the livestock dataset. The proposed model surpasses the other approaches in terms of mAP, which makes it suitable for applications where accuracy is critically important.
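One way to read the speed and accuracy comparison is to ask which detectors are not beaten on both axes at once. The sketch below applies a standard Pareto-dominance check to the FPS and mAP values of the livestock comparison table. It is an illustrative analysis rather than one performed in the paper: the function name is hypothetical, YOLOv3 is left out because its FPS value is missing from the table, and the mAP of the proposed model is taken from the last row of the ablation table.

```python
# (model, FPS, mAP in %) on the livestock dataset.
models = [
    ("SSD300", 36.50, 74.80),
    ("SSD512", 19.25, 75.20),
    ("CenterNet++", 4.70, 76.18),
    ("Faster R-CNN", 7.40, 71.20),
    ("DSSD", 10.30, 76.40),
    ("FS-SSD", 17.35, 77.14),
    ("FF-SSD", 41.36, 75.93),
    ("MPFPN", 2.04, 72.94),
    ("EFPN", 4.14, 74.81),
    ("Proposed model", 8.75, 79.12),
]


def pareto_front(entries):
    """Return names of models that no other model matches or beats on both
    FPS and mAP while being strictly better on at least one of them."""
    front = []
    for name, fps, m_ap in entries:
        dominated = any(
            other_fps >= fps and other_map >= m_ap
            and (other_fps > fps or other_map > m_ap)
            for _, other_fps, other_map in entries
        )
        if not dominated:
            front.append(name)
    return front


if __name__ == "__main__":
    print(pareto_front(models))
```

With these values, the non-dominated set is FS-SSD, FF-SSD, and the proposed model, the last of which occupies the high-accuracy, low-speed end of the trade-off.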
Figure 15. Comparison of the mAP of the proposed model with that of other object detectors on the SDD dataset. The Car, Bicyclist, and Pedestrian categories were considered in this comparison.
Figure 16. Comparison of mAP at IoU thresholds of 0.5 and 0.75 on the Pedestrian category of the SDD dataset.
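The IoU thresholds in Figure 16 decide whether a predicted box counts as a correct detection. A minimal sketch of that test is given below, assuming axis-aligned boxes in `(x1, y1, x2, y2)` pixel coordinates; the box values and function names are hypothetical examples, not data from the paper.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), with x2 > x1 and y2 > y1


def iou(a: Box, b: Box) -> float:
    """Intersection over Union of two axis-aligned boxes."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def is_match(pred: Box, truth: Box, threshold: float) -> bool:
    """A prediction counts as a true positive if its IoU reaches the threshold."""
    return iou(pred, truth) >= threshold


# Hypothetical 10 x 10 pixel pedestrian and a prediction shifted by 2 pixels.
truth = (0.0, 0.0, 10.0, 10.0)
pred = (2.0, 0.0, 12.0, 10.0)

print(f"IoU = {iou(pred, truth):.3f}")
print("match at 0.50:", is_match(pred, truth, 0.50))
print("match at 0.75:", is_match(pred, truth, 0.75))
```

In this example a 2-pixel shift on a 10-pixel object yields an IoU of about 0.667, which passes the 0.5 threshold but fails at 0.75. This illustrates why the stricter threshold is harder to meet for small objects.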