Minghui Liu, Jinming Ma, Qiuping Zheng, Yuchen Liu, Gang Shi.
Abstract
Three-dimensional object detection in point clouds can provide more accurate object data for autonomous driving. In this paper, we propose a method named MA-MFFC that uses an attention mechanism and a multi-scale feature fusion network with a ConvNeXt module to improve the accuracy of object detection. The multi-attention (MA) module contains point-channel attention and voxel attention, which are used in voxelization and in the 3D backbone, respectively. By weighting features both point-wise and channel-wise, the point-channel attention enhances the information of key points in voxels, suppresses background points during voxelization, and improves the robustness of the network. The voxel attention module is used in the 3D backbone to obtain more robust and discriminative voxel features. The MFFC module contains the multi-scale feature fusion network and the ConvNeXt module: the multi-scale feature fusion network extracts rich feature information and improves detection accuracy, and the convolutional layer is replaced with the ConvNeXt module to enhance the feature extraction capability of the network. Experimental results on the KITTI dataset show an average accuracy of 64.60% for pedestrians and 80.92% for cyclists, which is 1.33% and 2.1% higher, respectively, than the baseline network, enabling more accurate detection and localization of more difficult objects.
Keywords: 3D object detection; ConvNeXt module; attention module; multi-scale feature fusion; voxelization
Year: 2022 PMID: 35632344 PMCID: PMC9142975 DOI: 10.3390/s22103935
Source DB: PubMed Journal: Sensors (Basel) ISSN: 1424-8220 Impact factor: 3.847
Figure 1. The structure of the Voxel R-CNN network with the multi-attention module and the MFF-ConvNeXt module.
Figure 2. The structure of the point-channel attention module.
Figure 3. The structure of the voxel attention module. The reshape operation permutes the dimensions of the tensor from to .
Figure 4. The structure of the ConvNeXt block.
Figure 5. The structure of the MFF-ConvNeXt module.
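As a rough illustration of the point-wise and channel-wise weighting described in the abstract, the following Python sketch reweights the features of a single voxel. It is not the authors' implementation: the function name `point_channel_attention`, the max pooling along each axis, and the plain sigmoid used in place of learned layers are all assumptions made for illustration.

```python
import math
from typing import List

# Rows are the points inside one voxel, columns are feature channels.
Matrix = List[List[float]]


def sigmoid(x: float) -> float:
    """Logistic function mapping a score to a weight in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def point_channel_attention(voxel: Matrix) -> Matrix:
    """Reweight a (points x channels) voxel feature matrix.

    Assumption: max pooling summarises each axis and a plain sigmoid stands
    in for the learned layers that a trained network would use.
    """
    num_channels = len(voxel[0])
    # Point-wise score: strongest response of each point across its channels.
    point_scores = [sigmoid(max(row)) for row in voxel]
    # Channel-wise score: strongest response of each channel across all points.
    channel_scores = [
        sigmoid(max(row[c] for row in voxel)) for c in range(num_channels)
    ]
    # The weight of entry (p, c) is the product of its point and channel scores.
    return [
        [value * point_scores[p] * channel_scores[c] for c, value in enumerate(row)]
        for p, row in enumerate(voxel)
    ]


# Hypothetical voxel: one strongly responding point and one weak point.
example_voxel = [[2.0, 1.0], [-3.0, -2.0]]
reweighted = point_channel_attention(example_voxel)
```

In this toy setting the weakly responding point is scaled down far more than the strong one, which is the intended effect of suppressing background points. A trained network would learn the mapping from pooled descriptors to weights rather than apply a fixed sigmoid.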
The difficulty levels defined by the KITTI dataset.
| Level | Min. Bounding Box Height | Max. Occlusion | Max. Truncation |
|---|---|---|---|
| Easy | 40 px | Fully visible | |
| Moderate | 25 px | Partly occluded | |
| Hard | 25 px | Difficult to see | |
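The difficulty levels can be read as a cascade of thresholds, which the following Python sketch makes explicit. The function name `kitti_difficulty` is hypothetical. The occlusion codes (0 fully visible, 1 partly occluded, 2 largely occluded) follow the KITTI label format, and the truncation limits of 15%, 30%, and 50% are assumed from the public KITTI benchmark description, since the corresponding column is not available here.

```python
from typing import Optional

# (level name, min box height in px, max occlusion code, max truncation fraction)
# Occlusion codes follow the KITTI label format:
#   0 = fully visible, 1 = partly occluded, 2 = largely occluded.
# Truncation limits are assumed from the public KITTI benchmark description.
LEVELS = [
    ("Easy", 40.0, 0, 0.15),
    ("Moderate", 25.0, 1, 0.30),
    ("Hard", 25.0, 2, 0.50),
]


def kitti_difficulty(
    box_height_px: float, occlusion: int, truncation: float
) -> Optional[str]:
    """Return the easiest level whose limits the object satisfies, else None."""
    for name, min_height, max_occlusion, max_truncation in LEVELS:
        if (
            box_height_px >= min_height
            and occlusion <= max_occlusion
            and truncation <= max_truncation
        ):
            return name
    # The object is too small, too occluded, or too truncated for any level.
    return None
```

For example, a 30 px tall, partly occluded object with 20% truncation would fall into the Moderate level under these assumptions, while an object only 20 px tall would match no level.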
3D object detection performance: average precision (AP) (in %) and mean average precision (mAP) (in %) for 3D object boxes in the KITTI validation set.
| Method | Car 3D Easy | Car 3D Moderate | Car 3D Hard | Cyclists 3D Easy | Cyclists 3D Moderate | Cyclists 3D Hard | Pedestrians 3D Easy | Pedestrians 3D Moderate | Pedestrians 3D Hard | 3D mAP |
|---|---|---|---|---|---|---|---|---|---|---|
| VoxelNet | | | | | | | | | | |
| PointPillars | | | | | | | | | | |
| SECOND | | | | | | | | | | |
| PointRCNN | | | | | | | | | | |
| Part- | | | | | | | | | | |
| PV-RCNN | | | | | | | | | | |
| TANet | | | | | | | | | | |
| Voxel R-CNN | | | | | | | | | | |
| BtC | | | | | | | | | | |
| Ours | | | | | | | | | | |
Figure 6. Results of 3D detection on the KITTI validation set. The method proposed in this paper can accurately detect objects in the point cloud. The red boxes represent the ground truth boxes, the green boxes represent the detection results for cars, and the blue and yellow boxes represent the detection results for pedestrians and cyclists, respectively.
Figure 7. Visualization of 3D object detection results produced by Voxel R-CNN and by the method proposed in this paper. The red boxes represent the ground truth boxes, the green boxes represent the detection results for cars, and the blue and yellow boxes represent the detection results for pedestrians and cyclists, respectively. (a) Detection results in occluded scenes. (b) Detection results in a simple traffic scenario.
Performance of the proposed method with different configurations on the KITTI validation set (PCA: point-channel attention; VA: voxel attention). The results are evaluated with the mAP calculated over 40 recall positions for all classes.
| Method | Cars 3D | Cyclists 3D | Pedestrians 3D | 3D mAP |
|---|---|---|---|---|
| Baseline | | | | |
| PCA | | | | |
| VA | | | | |
| MFF-ConvNeXt | | | | |
| PCA + VA | | | | |
| PCA + MFF-ConvNeXt | | | | |
| Ours | | | | |
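To make the "40 recall positions" concrete, the following Python sketch computes an interpolated average precision over the recall levels 1/40, 2/40, ..., 1. It is a simplified illustration, not the official KITTI evaluation code: the function name `average_precision_r40` is hypothetical, and it assumes detections have already been sorted by descending confidence and matched to ground truth, so each one is simply flagged as a true or false positive.

```python
from typing import List, Sequence


def average_precision_r40(
    is_true_positive: Sequence[bool], num_ground_truth: int
) -> float:
    """Interpolated AP over 40 recall positions.

    Assumptions: `is_true_positive` lists detections in descending score
    order, and `num_ground_truth` is a positive count of annotated objects.
    """
    precisions: List[float] = []
    recalls: List[float] = []
    true_positives = 0
    for rank, hit in enumerate(is_true_positive, start=1):
        true_positives += int(hit)
        precisions.append(true_positives / rank)
        recalls.append(true_positives / num_ground_truth)

    total = 0.0
    for k in range(1, 41):
        threshold = k / 40
        # Interpolated precision: best precision at any recall >= threshold,
        # or 0 if that recall level is never reached.
        reachable = [p for p, r in zip(precisions, recalls) if r >= threshold]
        total += max(reachable, default=0.0)
    return total / 40


def mean_average_precision(per_class_ap: Sequence[float]) -> float:
    """Mean of per-class AP values (for example cars, cyclists, pedestrians)."""
    return sum(per_class_ap) / len(per_class_ap)
```

With two ground truth objects and two detections of which only the first is correct, recall never exceeds 0.5, so only the first 20 recall positions contribute and the sketch returns 0.5.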
A comparison of 3D object detection results (mAP) in the KITTI validation set before and after adding the attention module.
| Method | Cars 3D | Cyclists 3D | Pedestrians 3D | 3D mAP |
|---|---|---|---|---|
| PV-RCNN | | | | |
| PV-RCNN + Attention | | | | |
| Voxel R-CNN | | | | |
| Voxel R-CNN + Attention | | | | |
Inference time comparison between Voxel R-CNN and our proposed method in the KITTI validation set.
| Method | 3D mAP | Inference Time |
|---|---|---|
| Voxel R-CNN | | |
| Ours | | |