Dong Wang, Huaming Wu.
Abstract
Object detection frameworks commonly assume that training and testing samples follow consistent distributions for the two main tasks, classification and bounding box regression. This paradigm is popular as a sampling strategy for training an object detector because it is intuitive and practical. For the task of localization quality estimation, there are two ways of sampling: reusing the same sampling as the main tasks, or uniform sampling obtained by manually augmenting the ground truth. The first method is simple but inconsistent for the task of quality estimation. The second covers all IoU levels but is more complex and harder to train. In this paper, we propose an H+L-Sampling strategy, which selects high- and low-IoU samples simultaneously, to train the quality estimation branch simply and effectively. This strategy inherits the effectiveness of consistent sampling and reduces the training difficulty of uniform sampling. Finally, we introduce an accurate detection confidence, which combines the classification probability and the localization accuracy, as the ranking keyword of NMS. Extensive experiments show that our method resolves the misalignment between classification confidence and localization accuracy and improves detection performance.
Keywords: IoU regression; Non-Maximum Suppression; R-CNN; detection confidence; object detection
Year: 2021 PMID: 34203469 PMCID: PMC8271873 DOI: 10.3390/s21134433
Source DB: PubMed Journal: Sensors (Basel) ISSN: 1424-8220 Impact factor: 3.576
Figure 1. The framework of IoU-aware R-CNN. The main branches and the auxiliary branch are performed on and , respectively. The auxiliary branch estimates localization quality and leaves the original Faster R-CNN network almost unchanged; it only changes the ranking confidence used in the NMS process. Unless otherwise stated, we use simple class-agnostic IoU regression.
Figure 2. For the main branches and the auxiliary branch, we take samples separately from the RPN proposals, adopting different sampling strategies. Note that Transform refers to using the bounding box regressor in the main branches to refine the L-Samples and obtain the H-Samples.
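The sampling flow in Figure 2 can be illustrated with a minimal sketch. It assumes a single ground-truth box, an even split between L-Samples and H-Samples, and a toy `toy_refine` function standing in for the bounding box regressor; none of these details come from the paper, and the names `hl_sampling` and `toy_refine` are hypothetical.

```python
import random


def iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def hl_sampling(proposals, gt_box, refine, num_samples, seed=0):
    """Build IoU-regression training pairs from low- and high-IoU boxes.

    Assumption (not from the source): half of the budget goes to L-Samples
    and the other half to their refined counterparts, the H-Samples.
    """
    rng = random.Random(seed)
    k = min(num_samples // 2, len(proposals))
    l_samples = rng.sample(proposals, k)            # raw proposals (lower IoU)
    h_samples = [refine(box) for box in l_samples]  # "Transform" step (higher IoU)
    boxes = l_samples + h_samples
    targets = [iou(box, gt_box) for box in boxes]   # IoU regression targets
    return boxes, targets


# Toy data: one ground-truth box and four hypothetical RPN proposals.
gt = (10.0, 10.0, 50.0, 50.0)
proposals = [
    (0.0, 0.0, 40.0, 40.0),
    (20.0, 20.0, 70.0, 70.0),
    (12.0, 8.0, 48.0, 52.0),
    (30.0, 30.0, 90.0, 90.0),
]


def toy_refine(box):
    """Hypothetical regressor: move each coordinate halfway to the ground truth."""
    return tuple(0.5 * (c + g) for c, g in zip(box, gt))


boxes, targets = hl_sampling(proposals, gt, toy_refine, num_samples=4)
```

In this toy setup, the second half of `targets` (the H-Samples) is always larger than the first half (the L-Samples), so the IoU branch sees both ends of the IoU range in one batch. A real detector would match each proposal to its own ground-truth box and use the learned regressor in place of `toy_refine`.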
Figure 3. (top) vs. (bottom). The distributions of samples are visibly different, and the regressed have higher IoU with objects than .
Figure 4. The overall pipeline of our two-stage detector, IoU-Aware R-CNN. We change how the top-N detections are chosen. This simple but effective branch significantly improves detection performance.
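The change to detection ranking can be sketched as follows. The source states only that the detection confidence combines the classification probability and the localization accuracy; the product used in `detection_confidence` is an assumption made for illustration, and the boxes and scores are made-up toy values.

```python
def iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def detection_confidence(cls_scores, pred_ious):
    """Assumed combination: classification score times predicted IoU."""
    return [s * q for s, q in zip(cls_scores, pred_ious)]


def nms(boxes, scores, iou_thr=0.5):
    """Greedy NMS: visit boxes by descending score, drop heavy overlaps."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_thr for j in keep):
            keep.append(i)
    return keep


# Toy detections: boxes 0 and 1 overlap heavily, box 2 is far away.
boxes = [(10, 10, 50, 50), (12, 12, 52, 52), (100, 100, 140, 140)]
cls_scores = [0.9, 0.8, 0.7]   # classification probabilities
pred_ious = [0.5, 0.9, 0.8]    # hypothetical outputs of the IoU branch

keep_cls = nms(boxes, cls_scores)                                   # rank by classification score
keep_det = nms(boxes, detection_confidence(cls_scores, pred_ious))  # rank by detection confidence
```

Ranking by classification score keeps box 0, whose predicted localization is poor, while ranking by the combined confidence keeps box 1 instead. This is the misalignment that the auxiliary branch is meant to correct.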
Results of selecting different training samples for IoU regression. Quantitative results show that our H+L-Sampling strategy is effective in resolving the misalignment problem.
| Method | AP | AP50 | AP60 | AP70 | AP80 | AP90 |
|---|---|---|---|---|---|---|
| Baseline FPN | 37.7 | 58.5 | 54.2 | 46.4 | 33.2 | 11.1 |
| | 38.5 | 57.7 | 53.4 | 46.9 | 35.4 | 14.5 |
| | 38.8 | 57.9 | 53.7 | 47.1 | 35.5 | 15.2 |
| | 39.0 | 57.8 | 53.6 | 47.2 | 36.1 | 15.5 |
The impact of the number of samples for IoU regression.
| Numbers | AP | AP50 | AP60 | AP70 | AP80 | AP90 |
|---|---|---|---|---|---|---|
| 32 | 38.9 | 57.8 | 53.5 | 47.1 | 36.1 | 15.3 |
| 64 | 39.0 | 57.8 | 53.6 | 47.2 | 36.1 | 15.5 |
| 128 | 39.1 | 57.9 | 53.8 | 47.2 | 36.4 | 15.5 |
Detailed comparison of NMS with different confidences of and on multiple popular backbones.
| Backbone | Method | NMS | NMS | AP | AP50 | AP60 | AP70 | AP80 | AP90 |
|---|---|---|---|---|---|---|---|---|---|
| R-50 | FPN | ✓ | | 37.7 | 58.5 | 54.2 | 46.4 | 33.2 | 11.1 |
| R-50 | IoU-aware R-CNN | ✓ | | 37.6 | 58.2 | 53.6 | 46.2 | 33.4 | 11.7 |
| R-50 | IoU-aware R-CNN | | ✓ | 39.0 | 57.8 | 53.6 | 47.2 | 36.1 | 15.5 |
| R-101 | FPN | ✓ | | 39.4 | 60.1 | 55.6 | 48.3 | 35.6 | 12.9 |
| R-101 | IoU-aware R-CNN | ✓ | | 39.6 | 60.0 | 55.6 | 48.3 | 36.1 | 13.6 |
| R-101 | IoU-aware R-CNN | | ✓ | 41.0 | 59.7 | 55.5 | 49.2 | 38.8 | 17.7 |
| X-101-32x4d | FPN | ✓ | | 41.2 | 62.2 | 57.8 | 50.6 | 37.8 | 14.6 |
| X-101-32x4d | IoU-aware R-CNN | ✓ | | 41.3 | 62.0 | 57.6 | 50.4 | 38.3 | 14.9 |
| X-101-32x4d | IoU-aware R-CNN | | ✓ | 42.6 | 61.6 | 57.3 | 51.2 | 40.4 | 19.1 |
Inference speed of different backbones on a single TITAN X GPU.
| Backbone | R-50 | R-50 | R-101 | R-101 | X-101-32x4d | X-101-32x4d |
|---|---|---|---|---|---|---|
| IoU regression | ✗ | ✓ | ✗ | ✓ | ✗ | ✓ |
| Speed (sec./image) | 0.114 | 0.134 | 0.149 | 0.175 | 0.185 | 0.213 |
The effect of detection confidence on soft-NMS.
| Backbone | R-50 | R-50 | R-101 | R-101 | X-101-32x4d | X-101-32x4d |
|---|---|---|---|---|---|---|
| IoU regression | ✗ | ✓ | ✗ | ✓ | ✗ | ✓ |
| NMS | 37.7 | 39.0 (↑1.3) | 39.4 | 41.0 (↑1.6) | 41.2 | 42.6 (↑1.4) |
| soft-NMS | 38.3 | 39.6 (↑1.3) | 40.1 | 41.6 (↑1.5) | 42.0 | 43.3 (↑1.3) |
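A minimal sketch of how a detection confidence could be fed to soft-NMS is shown below. It assumes the Gaussian-decay variant of soft-NMS and the same product-style confidence as a stand-in; the source does not specify either choice, and the names `soft_nms`, `sigma`, and `min_score` are illustrative.

```python
import math


def iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def soft_nms(boxes, scores, sigma=0.5, min_score=0.001):
    """Gaussian soft-NMS: decay overlapping scores instead of discarding boxes.

    Returns (index, final_score) pairs in the order the boxes were selected.
    """
    scores = list(scores)                 # work on a copy
    remaining = list(range(len(boxes)))
    kept = []
    while remaining:
        best = max(remaining, key=lambda i: scores[i])
        if scores[best] < min_score:      # everything left is negligible
            break
        kept.append((best, scores[best]))
        remaining.remove(best)
        for i in remaining:
            overlap = iou(boxes[best], boxes[i])
            scores[i] *= math.exp(-(overlap ** 2) / sigma)
    return kept


# Toy detections: boxes 0 and 1 overlap heavily, box 2 is far away.
boxes = [(10, 10, 50, 50), (12, 12, 52, 52), (100, 100, 140, 140)]
cls_scores = [0.9, 0.8, 0.7]
pred_ious = [0.5, 0.9, 0.8]    # hypothetical outputs of the IoU branch

# Assumed detection confidence: classification score times predicted IoU.
det_conf = [s * q for s, q in zip(cls_scores, pred_ious)]
kept = soft_nms(boxes, det_conf)
```

Here the poorly localized box 0 is not removed outright; its score is decayed because it overlaps the better-localized box 1, so it ends up last in the ranking.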
Detection results on PASCAL VOC 2007 test.
| Backbone | IoU Regression | Speed (sec./Image) | AP | AP50 | AP60 | AP70 | AP80 | AP90 |
|---|---|---|---|---|---|---|---|---|
| R-50 | ✗ | 0.079 | 50.4 | 80.1 | 74.5 | 64.3 | 43.0 | 11.2 |
| R-50 | ✓ | 0.102 | 54.1 | 80.5 | 76.0 | 66.1 | 48.9 | 18.6 |
| R-101 | ✗ | 0.102 | 54.3 | 82.1 | 77.8 | 67.9 | 48.1 | 16.1 |
| R-101 | ✓ | 0.125 | 56.8 | 81.3 | 77.4 | 68.8 | 53.3 | 23.2 |
Comparison with IoU-Net [9] on MS COCO validation. Ours means that the result is evaluated on a smaller score_thr of 0.001, resulting in more detection boxes in the candidate list of NMS.
| Method | AP | AP50 | AP60 | AP70 | AP80 | AP90 |
|---|---|---|---|---|---|---|
| FPN [ | 38.5 | 60.3 | 55.5 | 47.6 | 33.8 | 11.3 |
| IoU-Net [9] | 40.6 | 59.0 | 55.2 | 49.0 | 38.0 | 17.1 |
| FPN | 39.4 | 60.1 | 55.6 | 48.3 | 35.6 | 12.9 |
| Ours | 41.6 | 59.9 | 55.8 | 49.5 | 39.3 | 19.6 |
| Ours | 42.0 | 60.7 | 56.6 | 50.0 | 39.5 | 19.6 |
Comparisons with other detectors on MS COCO test-dev. “MS” denotes multi-scale training; all other entries use single-scale training. All experiments with our method set score_thr to 0.001, which slightly improves detection performance without a speed reduction. IoU-Aware R-CNN means that the trainval dataset is used to train the detector and soft-NMS is employed at inference.
| Method | Backbone | MS | AP | AP50 | AP75 | APS | APM | APL |
|---|---|---|---|---|---|---|---|---|
| one-stage detectors | | | | | | | | |
| SSD [ | ResNet-101 | | 31.2 | 50.4 | 33.3 | 10.2 | 34.5 | 49.8 |
| RefineDet [ | ResNet-101 | | 36.4 | 57.5 | 39.5 | 16.6 | 39.9 | 51.4 |
| RetinaNet [ | ResNet-101 | | 39.1 | 59.1 | 42.3 | 21.8 | 42.7 | 50.2 |
| FSAF [ | ResNet-101 | ✓ | 40.9 | 61.5 | 44.0 | 24.0 | 44.2 | 51.3 |
| FSAF [ | ResNeXt-101-64x4d | ✓ | 42.9 | 63.8 | 46.3 | 26.6 | 46.2 | 52.7 |
| FCOS [ | ResNet-101 | ✓ | 41.5 | 60.7 | 45.0 | 24.4 | 44.8 | 51.6 |
| FCOS [ | ResNeXt-101-64x4d | ✓ | 44.7 | 64.1 | 48.4 | 27.6 | 47.5 | 55.6 |
| FoveaBox [ | ResNet-101 | ✓ | 40.8 | 61.4 | 44.0 | 24.1 | 45.3 | 53.2 |
| FoveaBox [ | ResNeXt-101 | ✓ | 42.3 | 62.9 | 45.4 | 25.3 | 46.8 | 55.0 |
| LTM [ | ResNeXt-101-64x4d | ✓ | 44.9 | 64.7 | 48.3 | 26.9 | 47.8 | 55.8 |
| ATSS [ | ResNeXt-101-32x8d | ✓ | 45.1 | 63.9 | 49.1 | 27.9 | 48.2 | 54.6 |
| two-stage detectors | | | | | | | | |
| Faster R-CNN [ | ResNet-101 | | 34.9 | 55.7 | 37.4 | 15.6 | 38.7 | 50.9 |
| Faster R-CNN w/ FPN [ | ResNet-101 | | 36.2 | 59.1 | 39.0 | 18.2 | 39.0 | 48.2 |
| Mask R-CNN [ | ResNeXt-101 | | 39.8 | 62.3 | 43.4 | 22.1 | 43.2 | 51.2 |
| Libra R-CNN [ | ResNet-101 | | 41.1 | 62.1 | 44.7 | 23.4 | 43.7 | 52.5 |
| Libra R-CNN [ | ResNeXt-101-64x4d | | 43.0 | 64.0 | 47.0 | 25.3 | 45.6 | 54.6 |
| Grid R-CNN [ | ResNet-101 | | 41.5 | 60.9 | 44.5 | 23.3 | 44.9 | 53.1 |
| Faster R-CNN w/ PISA [ | ResNeXt-101 | | 42.3 | 62.9 | 46.8 | 24.8 | 45.5 | 53.1 |
| Cascade R-CNN [ | ResNet-101 | | 42.8 | 62.1 | 46.3 | 23.7 | 45.5 | 55.2 |
| TridentNet [ | ResNet-101 | ✓ | 42.7 | 63.6 | 46.5 | 23.9 | 46.6 | 56.6 |
| IoU-Aware R-CNN | ResNet-50 | | 40.7 | 59.8 | 44.0 | 22.9 | 43.5 | 51.2 |
| IoU-Aware R-CNN | ResNet-101 | | 42.3 | 61.3 | 45.7 | 23.3 | 45.5 | 54.5 |
| IoU-Aware R-CNN | ResNeXt-101-32x4d | | 43.4 | 62.8 | 46.8 | 24.7 | 46.7 | 55.1 |
| IoU-Aware R-CNN | ResNeXt-101-32x4d | | 44.3 | 62.9 | 48.3 | 25.6 | 47.5 | 56.5 |