Xinglei He, Xiaohan Zhang, Yichun Wang, Hongzeng Ji, Xiuhui Duan, Fen Guo.
Abstract
Accurately perceiving occluded objects is a challenging problem for autonomous vehicles. Human vision quickly locates important object regions in complex scenes while analysing other regions only roughly or ignoring them; this behaviour is known as the visual attention mechanism. The perception system of an autonomous vehicle, however, cannot know which part of the point cloud lies in the region of interest. It is therefore worthwhile to explore how the visual attention mechanism can be used in the perception system for autonomous driving. In this paper, we propose a spatial attention frustum model to address object occlusion in 3D object detection. The spatial attention frustum suppresses unimportant features and allocates limited neural computing resources to the critical parts of the scene, thereby providing greater relevance and easier processing for higher-level perceptual reasoning tasks. To ensure that our method retains good reasoning ability when only a partial structure of an occluded object is visible, we propose a local feature aggregation module that captures more complex local features of the point cloud. Finally, we discuss the projection constraint between the 3D bounding box and the 2D bounding box and propose a joint anchor box projection loss function, which helps to improve the overall performance of our method. Results on the KITTI dataset show that the proposed method effectively improves the detection accuracy for occluded objects. Our method achieves 89.46%, 79.91% and 75.53% detection accuracy at the easy, moderate and hard difficulty levels of the car category, with a 6.97% improvement at the hard level, where the degree of occlusion is high. Our one-stage method does not rely on an additional refinement stage, yet its accuracy is comparable to that of two-stage methods.
Keywords: 3D object detection; autonomous vehicles; multi-sensor fusion; occluded object detection; visual attention mechanism
Year: 2022 PMID: 35336536 PMCID: PMC8955271 DOI: 10.3390/s22062366
Source DB: PubMed Journal: Sensors (Basel) ISSN: 1424-8220 Impact factor: 3.576
Figure 1. Common occlusion scene in autonomous driving.
Figure 2. When the frustum has the same length at each scale, the unimportant point cloud and the attention point cloud cannot be effectively distinguished. The ‘+’ denotes the concatenation operation.
Figure 3. The feature vector of an unimportant object seriously degrades the expression of the feature vector of the object of interest in the feature map. The ‘+’ denotes the concatenation operation.
Figure 4. The frustum with spatial attention improves the feature expression of the focused objects in the feature map. The ‘+’ denotes the concatenation operation.
Figure 5. Projection relationship between ground truth and image.
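The projection relationship in Figure 5 can be illustrated with a minimal sketch: the eight corners of a 3D box are projected through a pinhole camera, and the tightest enclosing rectangle gives the corresponding 2D box. The function names, the KITTI-style box parameterisation (camera coordinates, bottom-face center, yaw about the y axis) and the intrinsics used below are assumptions for illustration; this is not the paper's implementation, nor its joint anchor box projection loss.

```python
import math

def box3d_corners(center, size, yaw):
    """Return the 8 corners of a 3D box in camera coordinates.

    Assumed convention (KITTI-like): x right, y down, z forward.
    `center` is the center of the bottom face, `size` is (h, w, l),
    and `yaw` is the rotation about the y axis in radians.
    """
    cx, cy, cz = center
    h, w, l = size
    c, s = math.cos(yaw), math.sin(yaw)
    corners = []
    for dx in (l / 2, -l / 2):
        for dy in (0.0, -h):              # bottom face, then top face
            for dz in (w / 2, -w / 2):
                # Rotate the (dx, dz) offset about y, then translate.
                x = cx + c * dx + s * dz
                z = cz - s * dx + c * dz
                corners.append((x, cy + dy, z))
    return corners

def project_to_image(points, fx, fy, u0, v0):
    """Pinhole projection; assumes every point has z > 0."""
    return [(fx * x / z + u0, fy * y / z + v0) for x, y, z in points]

def enclosing_box2d(pixels):
    """Tightest axis-aligned rectangle (u_min, v_min, u_max, v_max)."""
    us = [u for u, _ in pixels]
    vs = [v for _, v in pixels]
    return (min(us), min(vs), max(us), max(vs))

# Hypothetical car 10 m ahead and hypothetical intrinsics.
corners = box3d_corners(center=(0.0, 1.5, 10.0), size=(1.5, 2.0, 4.0), yaw=0.0)
box2d = enclosing_box2d(project_to_image(corners, 700.0, 700.0, 600.0, 180.0))
print(box2d)
```

A projection-based constraint could then compare such a projected rectangle with the detected 2D box, for example through their overlap; how the paper actually formulates this term is not stated in this record.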
The division scale and number of the frustum.
| Scale | Num A | Num B |
|---|---|---|
| 1 | | |
| 1 | | |
| 1 | | |
| 1 | | |
Figure 6. The local feature aggregation (LFA) module.
Difficulty levels officially defined by the KITTI dataset.
| Level | Min Bounding Box Height | Max Occlusion Level | Max Truncation |
|---|---|---|---|
| Easy | 40 Px | Fully visible | 15% |
| Moderate | 25 Px | Partly occluded | 30% |
| Hard | 25 Px | Difficult to see | 50% |
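As a minimal sketch of how the thresholds in this table are applied, the function below returns the easiest level whose three limits an object satisfies. It assumes occlusion is encoded with the KITTI label integers (0 for fully visible, 1 for partly occluded, 2 for difficult to see) and truncation is a fraction between 0 and 1; the function name and return convention are hypothetical.

```python
# Thresholds copied from the table above: (name, min box height in px,
# max occlusion code, max truncation fraction).
# Assumed occlusion codes: 0 = fully visible, 1 = partly occluded,
# 2 = difficult to see.
LEVELS = [
    ("Easy", 40.0, 0, 0.15),
    ("Moderate", 25.0, 1, 0.30),
    ("Hard", 25.0, 2, 0.50),
]

def kitti_difficulty(box_height_px, occlusion, truncation):
    """Return the easiest level the object qualifies for, or None."""
    for name, min_height, max_occlusion, max_truncation in LEVELS:
        if (box_height_px >= min_height
                and occlusion <= max_occlusion
                and truncation <= max_truncation):
            return name
    return None  # too small, too occluded, or too truncated to be evaluated

print(kitti_difficulty(45.0, 0, 0.10))  # Easy
print(kitti_difficulty(30.0, 2, 0.45))  # Hard
```

Because the limits are nested, an object that qualifies as Easy also satisfies the Moderate and Hard limits, so scores reported at the harder levels are computed over a superset of the easier objects.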
Effects of using different modules.
| Backbone | SAF | LFA | PL | Easy | Mod | Hard |
|---|---|---|---|---|---|---|
| Yes | | | | 87.95 | 76.37 | 68.56 |
| Yes | Yes | | | 87.62 (−0.33) | 78.59 (+2.22) | 72.74 (+4.18) |
| Yes | | Yes | | 88.84 (+0.89) | 78.11 (+1.74) | 71.33 (+2.77) |
| Yes | | | Yes | 88.91 (+0.96) | 77.48 (+1.11) | 68.09 (−0.47) |
| Yes | Yes | Yes | | 88.71 (+0.76) | 79.52 (+3.15) | 75.69 (+7.13) |
| Yes | Yes | | Yes | 88.48 (+0.53) | 78.89 (+2.52) | 72.81 (+4.25) |
| Yes | | Yes | Yes | 89.72 (+1.77) | 78.27 (+1.90) | 71.17 (+2.61) |
| Yes | Yes | Yes | Yes | 89.46 (+1.51) | 79.91 (+3.54) | 75.53 (+6.97) |
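The values in parentheses are differences from the backbone-only row. A short check of that arithmetic for the row with all modules enabled is sketched below; the variable and function names are illustrative only.

```python
# (Easy, Mod, Hard) scores taken from the table above.
BASELINE = (87.95, 76.37, 68.56)      # backbone only
FULL_MODEL = (89.46, 79.91, 75.53)    # backbone with all three modules

def gains(result, baseline=BASELINE):
    """Per-level difference from the baseline, rounded to two decimals."""
    return tuple(round(r - b, 2) for r, b in zip(result, baseline))

print(gains(FULL_MODEL))  # (1.51, 3.54, 6.97)
```

The Hard-level difference of 6.97 is the improvement quoted in the abstract.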
Figure 7. The effect of different modules on performance improvement.
Figure 8. The precision-recall curves for car 3D detection at all levels of difficulty.
Figure 9. Qualitative results on the KITTI dataset.
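Precision-recall curves such as those in Figure 8 are usually summarised as a single average precision (AP) value. The sketch below computes an interpolated AP over evenly spaced recall levels, with 11 levels as the default. This record does not state which recall sampling the paper uses, so the sampling, the function name and the toy curve are all assumptions.

```python
def interpolated_ap(recalls, precisions, num_points=11):
    """Average the interpolated precision at evenly spaced recall levels.

    The interpolated precision at recall r is the best precision reached
    at any recall >= r, or 0 if the curve never reaches r.
    """
    total = 0.0
    for i in range(num_points):
        r = i / (num_points - 1)
        reachable = [p for rec, p in zip(recalls, precisions) if rec >= r - 1e-9]
        total += max(reachable, default=0.0)
    return total / num_points

# Toy curve (not data from the paper).
ap = interpolated_ap([0.0, 0.5, 1.0], [1.0, 0.8, 0.6])
print(round(100 * ap, 2))
```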
Performance comparison between our method and state-of-the-art methods that use a 2D detector to generate the frustum, on the Cars category of the KITTI validation set.
| Method | Stage | Number of Parameters | Runtime (s) | AP3D Easy | AP3D Mod | AP3D Hard | APBEV Easy | APBEV Mod | APBEV Hard |
|---|---|---|---|---|---|---|---|---|---|
| F-pointnet | Two | - | - | 83.76 | 70.92 | 63.65 | 88.16 | 84.02 | 76.44 |
| Backbone + Refine | Two | 6,633,554 | 0.49 | 88.98 | 78.66 | 72.23 | 90.08 | 88.84 | 80.10 |
| Backbone | One | 3,316,777 | 0.26 | 87.95 | 76.37 | 68.56 | 89.88 | 87.48 | 78.99 |
| Ours | One | 3,724,013 | 0.29 | 89.46 | 79.91 | 75.53 | 91.27 | 89.63 | 85.75 |
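The cost of the added modules relative to the backbone, and relative to the two-stage variant, can be read off the parameter and runtime columns. The sketch below does that arithmetic with the figures from the Cars table; the dictionary layout and function name are illustrative.

```python
# Parameter counts and runtimes copied from the table above.
MODELS = {
    "Backbone + Refine": {"params": 6_633_554, "runtime_s": 0.49},
    "Backbone": {"params": 3_316_777, "runtime_s": 0.26},
    "Ours": {"params": 3_724_013, "runtime_s": 0.29},
}

def overhead(name, reference):
    """Compare one model's size and runtime against a reference model."""
    model, ref = MODELS[name], MODELS[reference]
    return {
        "extra_params": model["params"] - ref["params"],
        "param_ratio": model["params"] / ref["params"],
        "runtime_ratio": model["runtime_s"] / ref["runtime_s"],
    }

print(overhead("Ours", "Backbone"))
print(overhead("Ours", "Backbone + Refine"))
```

With these figures, the full one-stage model has 407,236 more parameters than the backbone (about 12% more) and runs in roughly 59% of the time of the two-stage variant.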
Performance comparison between our method and state-of-the-art methods that use a 2D detector to generate the frustum, on the Pedestrians category of the KITTI validation set.
| Method | Stage | Number of Parameters | Runtime (s) | AP3D Easy | AP3D Mod | AP3D Hard | APBEV Easy | APBEV Mod | APBEV Hard |
|---|---|---|---|---|---|---|---|---|---|
| F-pointnet | Two | - | - | 70.00 | 61.32 | 53.59 | 72.38 | 66.39 | 59.57 |
| Backbone + Refine | Two | 6,633,554 | 0.49 | 70.88 | 62.24 | 53.37 | 72.59 | 67.05 | 58.68 |
| Backbone | One | 3,316,777 | 0.26 | 68.47 | 60.63 | 50.80 | 70.31 | 66.14 | 56.09 |
| Ours | One | 3,724,013 | 0.29 | 70.61 | 61.84 | 53.93 | 72.24 | 66.58 | 59.11 |
Performance comparison between our method and state-of-the-art methods that use a 2D detector to generate the frustum, on the Cyclists category of the KITTI validation set.
| Method | Stage | Number of Parameters | Runtime (s) | AP3D Easy | AP3D Mod | AP3D Hard | APBEV Easy | APBEV Mod | APBEV Hard |
|---|---|---|---|---|---|---|---|---|---|
| F-pointnet | Two | - | - | 77.15 | 56.49 | 53.37 | 81.82 | 60.03 | 56.32 |
| Backbone + Refine | Two | 6,633,554 | 0.49 | 81.69 | 69.55 | 59.87 | 83.28 | 70.10 | 61.79 |
| Backbone | One | 3,316,777 | 0.26 | 75.88 | 64.63 | 55.74 | 80.37 | 63.24 | 57.52 |
| Ours | One | 3,724,013 | 0.29 | 77.24 | 65.21 | 56.15 | 80.79 | 66.47 | 57.86 |
Performance comparison between our method and the state of the art on the Cars category of the KITTI validation set.
| Method | Modality | AP3D Easy | AP3D Mod | AP3D Hard | APBEV Easy | APBEV Mod | APBEV Hard |
|---|---|---|---|---|---|---|---|
| VoxelNet | LiDAR | 81.97 | 65.46 | 62.85 | 89.60 | 84.81 | 78.57 |
| SECOND | LiDAR | 87.43 | 76.48 | 69.10 | 89.96 | 87.07 | 79.66 |
| PointRCNN | LiDAR | 88.88 | 78.63 | 77.38 | 90.21 | 87.89 | 85.51 |
| ContFuse | LiDAR + RGB | 86.32 | 73.25 | 67.81 | 95.44 | 87.34 | 82.43 |
| AVODFPN | LiDAR + RGB | 84.41 | 74.44 | 68.65 | - | - | - |
| F-pointnet | LiDAR + RGB | 83.76 | 70.92 | 63.65 | 88.16 | 84.02 | 76.44 |
| FconvNet | LiDAR + RGB | 89.02 | 78.80 | 77.09 | 90.23 | 88.79 | 86.84 |
| Ours | LiDAR + RGB | 89.46 | 79.91 | 75.53 | 91.27 | 89.63 | 85.75 |