Abstract
To understand driving environments effectively, sensor-based intelligent vehicle systems must detect and classify objects accurately. Object detection localizes objects, whereas object classification recognizes object classes from the detected object regions. For accurate object detection and classification, fusing the information from multiple sensors is a key component of the representation and perception processes. In this paper, we propose a new object-detection and classification method using decision-level fusion. We fuse the classification outputs of independent unary classifiers for 3D point clouds and image data using a convolutional neural network (CNN). The unary classifier for each sensor is a five-layer CNN that uses more than two pre-trained convolutional layers so that the data representation covers local to global features. To build this representation, we apply region-of-interest (ROI) pooling to the outputs of each layer over the object candidate regions generated by our object proposal generation, which is realized through color flattening and semantic grouping for the charge-coupled device (CCD) and Light Detection And Ranging (LiDAR) sensors. We evaluate the proposed method on the KITTI benchmark dataset for three object classes: cars, pedestrians and cyclists. The evaluation results show that the proposed method achieves better performance than previous methods. Our proposal generation extracts approximately 500 proposals per 1226 × 370 image, whereas the original selective search method extracts approximately 10⁶ × n proposals. We obtain a classification performance of 77.72% mean average precision over all classes at the moderate detection level of the KITTI benchmark dataset.
Keywords: CCD; LiDAR; decision level fusion; multiple sensor fusion; object classification; object detection; object recognition
Year: 2017 PMID: 28117742 PMCID: PMC5298778 DOI: 10.3390/s17010207
Source DB: PubMed Journal: Sensors (Basel) ISSN: 1424-8220 Impact factor: 3.576
Figure 1. Overview of our work. Red arrows denote the processing of the unary classifier for each sensor; green arrows denote the fusion processing.
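As a reading aid, the decision-level pipeline of Figure 1 can be sketched in code. The following is a minimal PyTorch-style sketch, not the paper's implementation: the module names (`UnaryCNN`, `FusionNet`), the layer sizes and the single-feature-map ROI pooling are assumptions (the paper pools the outputs of several convolutional layers; see Figure 4).

```python
import torch
import torch.nn as nn
from torchvision.ops import roi_pool

NUM_CLASSES = 4  # car, pedestrian, cyclist, background (assumed label set)

class UnaryCNN(nn.Module):
    """Per-sensor unary classifier: backbone features + ROI pooling + linear head."""
    def __init__(self, backbone, channels, pool=7):
        super().__init__()
        self.backbone = backbone   # e.g., a pre-trained VGG16 convolutional stack
        self.pool = pool
        self.head = nn.Linear(channels * pool * pool, NUM_CLASSES)

    def forward(self, x, rois):
        # rois: (K, 5) tensor of (batch_idx, x1, y1, x2, y2) in image coordinates
        fmap = self.backbone(x)    # one shared feature map per frame
        feats = roi_pool(fmap, rois, output_size=(self.pool, self.pool),
                         spatial_scale=fmap.shape[-1] / x.shape[-1])
        return self.head(feats.flatten(1))   # class scores per proposal

class FusionNet(nn.Module):
    """Decision-level fusion: a small network over the two unary score vectors."""
    def __init__(self, hidden=64):
        super().__init__()
        self.fuse = nn.Sequential(nn.Linear(2 * NUM_CLASSES, hidden), nn.ReLU(),
                                  nn.Linear(hidden, NUM_CLASSES))

    def forward(self, scores_ccd, scores_lidar):
        # Fusing classifier *outputs* (not features) is what makes this
        # decision-level rather than early/feature-level fusion.
        return self.fuse(torch.cat([scores_ccd, scores_lidar], dim=1))
```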
Notation of the parameters and functions used.

| Section | Description |
|---|---|
| 4.1 | Energy function used to generate the color-flattened image. |
| 4.1 | Data term of the energy function for pixel-wise intrinsic similarity. |
| 4.1 | A concatenated vector of all pixel values in the transformed image. |
| 4.1 | A concatenated vector of all pixel values in the original image. |
| 4.1 | Smoothness term of the energy function. |
| 4.1 | A 3-dimensional vector of the RGB values at a pixel position. |
| 4.1 | Weights to the difference between … |
| 4.1 | A 3-dimensional vector of the CIELab color space of … |
| 4.1 | A constant related to the luminance variations. |
| 4.1 | The … |
| 4.1 | Intermediate variables of the split Bregman method. |
| 4.2 | The … |
| 4.2 | The number of voxels in a 3D point cloud. |
| 4.2 | The possible number of reflectance particles in a voxel. |
| 5.1 | A set of segmented partitions of the color-flattened image. |
| 5.1 | The number of segmented partitions. |
| 5.1 | Set of spatially-connected neighborhood partitions of the … |
| 5.1 | The dissimilarity function used to group the adjacent partitions. |
| 5.1 | A weight constant for the color dissimilarity. |
| 5.1 | 75-bin color histogram measured from the mean image. |
| 5.1 | The texture dissimilarity between the adjacent partitions. |
| 5.1 | A weight constant for the texture dissimilarity. |
| 5.1 | 240-bin SIFT histogram of the original image. |
| 5.1 | A threshold value for grouping adjacent partitions. |
| 5.1 | The ground-truth segmentation and the segmentation inferred by the proposed method. |
| 5.1 | The number of training images used to find … |
| 5.1 | The structural loss between the ground truth and the inferred segmented partition. |
| 6.2 | The classification results of each bounding box provided from the image. |
| 6.2 | The classification results of each bounding box provided from the 3D point clouds. |
| 6.2 | The association component between … |
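The symbols themselves are not reproduced in this record. For orientation, a generic Section 4.1 color-flattening energy consistent with the descriptions above (a data term for pixel-wise intrinsic similarity plus a weighted smoothness term, minimized with the split Bregman method) would take a form such as:

```latex
E(\mathbf{u}) = \underbrace{\lVert \mathbf{u} - \mathbf{g} \rVert_2^2}_{\text{data term}}
              + \lambda \underbrace{\sum_{i} \sum_{j \in \mathcal{N}(i)}
                w_{ij}\, \lVert \mathbf{u}_i - \mathbf{u}_j \rVert_1}_{\text{smoothness term}}
```

Here \(\mathbf{u}\) and \(\mathbf{g}\) are the concatenated pixel vectors of the transformed and original images, \(\mathbf{u}_i\) is the RGB 3-vector at pixel \(i\), and the weights \(w_{ij}\) would be derived from CIELab color differences; the exact norms and weighting used in the paper may differ.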
Figure 2. Procedure from pre-processing to semantic grouping on CCD image data. (a) Input image data; (b) color-flattened image; (c) segmented image using the graph-segmentation method; (d) semantic grouping using the dissimilarity cost function.
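The grouping step in Figure 2d can be illustrated with a small sketch. Per the notation table above, the dissimilarity cost mixes a 75-bin color histogram with a 240-bin SIFT histogram through two weight constants and a threshold learned from training images; the specific weights, the L1 histogram distance, the averaging of merged histograms and the greedy merge loop below are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np

def dissimilarity(hc_p, hc_q, ht_p, ht_q, alpha=0.5, beta=0.5):
    """Weighted sum of L1 distances between the 75-bin color histograms (hc_*)
    and the 240-bin SIFT histograms (ht_*) of two adjacent partitions."""
    return alpha * np.abs(hc_p - hc_q).sum() + beta * np.abs(ht_p - ht_q).sum()

def semantic_grouping(color_hists, sift_hists, adjacency, tau):
    """Greedily merge spatially adjacent partitions while their cost is below tau.
    color_hists: (n, 75) float array; sift_hists: (n, 240) float array;
    adjacency: list of (i, j) index pairs of spatially-connected partitions."""
    parent = list(range(len(color_hists)))

    def find(i):                       # union-find with path halving
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    changed = True
    while changed:
        changed = False
        for i, j in adjacency:
            ri, rj = find(i), find(j)
            if ri != rj and dissimilarity(color_hists[ri], color_hists[rj],
                                          sift_hists[ri], sift_hists[rj]) < tau:
                parent[rj] = ri        # merge the two groups
                color_hists[ri] = (color_hists[ri] + color_hists[rj]) / 2
                sift_hists[ri] = (sift_hists[ri] + sift_hists[rj]) / 2
                changed = True
    return [find(i) for i in range(len(color_hists))]
```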
Figure 3. Segment generation on 3D point clouds. (a) 2D occupancy grid mapping results; (b) segmentation result on 3D point clouds.
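Figure 3a's occupancy grid can be reproduced in a few lines. A minimal sketch, assuming an illustrative cell size and sensing range (the paper's voxel parameters are not reproduced in this record):

```python
import numpy as np

def occupancy_grid(points, cell=0.1, x_range=(0.0, 60.0), y_range=(-30.0, 30.0)):
    """Project 3D LiDAR points onto a 2D ground-plane grid and count hits per cell.
    points: (N, 3) array of x, y, z coordinates. Parameter values are illustrative."""
    nx = int((x_range[1] - x_range[0]) / cell)
    ny = int((y_range[1] - y_range[0]) / cell)
    grid = np.zeros((nx, ny), dtype=np.int32)
    ix = ((points[:, 0] - x_range[0]) / cell).astype(int)
    iy = ((points[:, 1] - y_range[0]) / cell).astype(int)
    keep = (ix >= 0) & (ix < nx) & (iy >= 0) & (iy < ny)
    np.add.at(grid, (ix[keep], iy[keep]), 1)   # hit count per cell
    return grid > 0                             # boolean occupancy mask
```

Connected components of the returned mask (e.g., via `scipy.ndimage.label`) then give the point-cloud segments shown in Figure 3b.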
Figure 4. Proposed network architecture used as the unary classifiers.
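The ConvCube representation referenced in the tables below pools the output of every convolutional layer over each proposal, so that a box carries local-to-global features. A minimal sketch, where the function name `conv_cube`, the fixed 7 × 7 pool and the channel-wise stacking are our assumptions:

```python
import torch
from torchvision.ops import roi_pool

def conv_cube(feature_maps, rois, out_size=(7, 7)):
    """ROI-pool the output of each convolutional stage for every proposal and
    stack the results channel-wise.
    feature_maps: list of (fmap, spatial_scale) pairs, fmap of shape (1, C_l, H_l, W_l);
    spatial_scale maps image coordinates onto that layer's resolution.
    rois: (K, 5) tensor of (batch_idx, x1, y1, x2, y2) in image coordinates."""
    pooled = [roi_pool(fmap, rois, output_size=out_size, spatial_scale=scale)
              for fmap, scale in feature_maps]
    return torch.cat(pooled, dim=1)   # (K, sum of C_l, 7, 7)
```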
Figure 5. Architecture of the fusion network. Bbox denotes bounding box (Section 6.2).
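The fusion network consumes the per-box classification results of both sensors together with an association component between the two sets of boxes (Section 6.2). One plausible association step pairs CCD boxes with image-projected LiDAR boxes by intersection-over-union; the greedy matching and the 0.5 threshold below are assumptions, not the paper's exact rule.

```python
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def associate(ccd_boxes, lidar_boxes, thr=0.5):
    """Greedily pair each CCD box with its best-overlapping LiDAR box."""
    pairs, used = [], set()
    for i, cb in enumerate(ccd_boxes):
        candidates = [j for j in range(len(lidar_boxes)) if j not in used]
        if not candidates:
            break
        j = max(candidates, key=lambda j: iou(cb, lidar_boxes[j]))
        if iou(cb, lidar_boxes[j]) >= thr:
            pairs.append((i, j))
            used.add(j)
    return pairs
```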
Comparison models used to evaluate the proposed method. ConvCube, convolutional cube; CIOP, category independent object proposals; CPMC, constrained parametric min-cuts; MCG, multiscale combinatorial grouping; TBM, transferable belief model; CRF, conditional random field; 3DOP, 3D object proposal.
| Model | Proposal Generator | Representation | Representation Usage | Modality | Fusion Scheme |
|---|---|---|---|---|---|
| | Sliding Window | VGG16 | ConvCube | CCD + LiDAR | CNN |
| | CIOP | VGG16 | ConvCube | CCD + LiDAR | CNN |
| | Objectness | VGG16 | ConvCube | CCD + LiDAR | CNN |
| | Selective Search | VGG16 | ConvCube | CCD + LiDAR | CNN |
| | CPMC | VGG16 | ConvCube | CCD + LiDAR | CNN |
| | MCG | VGG16 | ConvCube | CCD + LiDAR | CNN |
| | EdgeBox | VGG16 | ConvCube | CCD + LiDAR | CNN |
| | Proposed Generator | AlexNet | ConvCube | CCD + LiDAR | CNN |
| | Proposed Generator | VGG16 | conv1 | CCD + LiDAR | CNN |
| | Proposed Generator | VGG16 | conv5 | CCD + LiDAR | CNN |
| | Proposed Generator | VGG16 | fc7 | CCD + LiDAR | CNN |
| | Proposed Generator | VGG16 | conv5 + fc7 | CCD + LiDAR | CNN |
| Ours (CCD) | Proposed Generator | VGG16 | ConvCube | CCD | × |
| Ours (LiDAR) | Proposed Generator | VGG16 | ConvCube | LiDAR | × |
| Ours (TBM) | Proposed Generator | VGG16 | ConvCube | CCD + LiDAR | Decision-TBM |
| Ours (CRF) | Proposed Generator | VGG16 | ConvCube | CCD + LiDAR | Decision-CRF |
| 3DOP | 3DOP | 3DOP | | CCD + LiDAR | Feature-3DOP |
| Ours | Proposed Generator | VGG16 | ConvCube | CCD + LiDAR | CNN |
Recall (%) of each object-region proposal method. # of Bbox denotes the number of bounding boxes in the KITTI data. The bold numbers in the recall columns represent the highest recall, excluding the sliding window (which always reaches 100%). The bold value in the # of Bbox column represents the smallest number among all of the methods.
| Method | Cars | Pedestrians | Cyclists | # of Bbox |
|---|---|---|---|---|
| Sliding window | 100 | 100 | 100 | |
| CIOP | 64.4 | 59.8 | 59.9 | |
| Objectness | 66.9 | 60.4 | 60.1 | |
| Selective search | 70.4 | 66.8 | 68.7 | |
| CPMC | 71.7 | 67.4 | 68.6 | |
| MCG | 76.6 | 78.9 | 74.8 | |
| EdgeBox | 85.2 | 84.3 | 82.5 | |
| Ours (CCD) | **88.4** | **85.4** | **84.8** | |
| Ours (LiDAR) | 71.8 | 63.3 | 64.2 | |
| Ours | | | | **500** |
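For reference, the recall figures above measure how well the proposals cover the ground-truth boxes. A minimal sketch, with a single IoU threshold as a simplification (KITTI officially requires 0.7 overlap for cars and 0.5 for pedestrians and cyclists):

```python
def proposal_recall(gt_boxes, proposals, thr=0.5):
    """Fraction (%) of ground-truth boxes hit by at least one proposal with
    IoU >= thr. Boxes are (x1, y1, x2, y2) tuples."""
    def iou(a, b):
        x1, y1 = max(a[0], b[0]), max(a[1], b[1])
        x2, y2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
        union = ((a[2] - a[0]) * (a[3] - a[1])
                 + (b[2] - b[0]) * (b[3] - b[1]) - inter)
        return inter / union if union > 0 else 0.0

    hit = sum(any(iou(g, p) >= thr for p in proposals) for g in gt_boxes)
    return 100.0 * hit / len(gt_boxes) if gt_boxes else 0.0
```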
Comparison of the proposed method with design-varied models. The best scores are boldfaced.
| Model | Cars Easy | Cars Mod. | Cars Hard | Ped. Easy | Ped. Mod. | Ped. Hard | Cyc. Easy | Cyc. Mod. | Cyc. Hard |
|---|---|---|---|---|---|---|---|---|---|
| Sliding window | 90.98 | 88.64 | 79.88 | 82.84 | 69.55 | 66.42 | 82.12 | 71.48 | 64.55 |
| CIOP | 90.70 | 83.67 | 79.78 | 80.54 | 68.07 | 65.23 | 80.86 | 68.59 | 63.54 |
| Objectness | 91.34 | 85.28 | 77.42 | 81.71 | 68.54 | 61.19 | 78.21 | 68.77 | 63.77 |
| Selective search | 85.88 | 87.74 | 79.01 | 79.59 | 68.45 | 62.66 | 82.65 | 65.12 | 61.38 |
| CPMC | 91.39 | 87.78 | 75.70 | 75.25 | 66.35 | 61.27 | 76.24 | 66.93 | 63.39 |
| MCG | 89.42 | 82.94 | 77.10 | 80.94 | 67.93 | 61.58 | 79.07 | 66.67 | 63.27 |
| EdgeBox | 85.68 | 87.82 | 79.57 | 81.53 | 65.02 | 65.94 | 78.67 | 67.89 | 60.81 |
| AlexNet | 87.43 | 84.44 | 75.42 | 73.20 | 65.28 | 64.55 | 77.51 | 66.74 | 60.15 |
| conv1 | 86.29 | 81.26 | 73.52 | 72.86 | 63.04 | 60.31 | 74.69 | 61.30 | 56.16 |
| conv5 | 74.87 | 80.98 | 75.85 | 77.57 | 60.61 | 62.79 | 70.12 | 62.49 | 59.21 |
| fc7 | 77.00 | 82.37 | 75.50 | 77.54 | 60.43 | 56.30 | 73.37 | 64.23 | 56.84 |
| conv5 + fc7 | 88.59 | 83.08 | 77.30 | 79.17 | 64.54 | 64.34 | 75.69 | 66.35 | 59.58 |
| Ours (CCD) | 88.84 | 84.77 | 73.81 | 77.92 | 68.81 | 59.33 | 72.60 | 67.32 | 57.21 |
| Ours (LiDAR) | 70.32 | 67.97 | 59.62 | 64.96 | 59.29 | 37.28 | 63.45 | 58.34 | 30.22 |
| Ours (TBM) | 84.25 | 81.66 | 74.48 | 69.49 | 67.81 | 62.14 | 70.81 | 68.11 | 60.25 |
| Ours (CRF) | 83.48 | 82.71 | 70.55 | 78.34 | 68.97 | 60.38 | 72.84 | 68.42 | 61.01 |
| 3DOP | 93.04 | 88.64 | 79.10 | 81.78 | 67.47 | 64.70 | 78.39 | 68.94 | 61.37 |
Average precision (AP) (%) on the KITTI Object Detection Benchmark dataset. L, C and S in the "Sensor" column denote the LiDAR, CCD and stereo vision sensors, respectively. DPM, deformable part model; LSVM-MDPM, latent support vector machine-modified discriminative part-based model; ICF, integrated channel features; BB, bounding box regression.
| Method | Fusion | Sensor | Cars Easy | Cars Mod. | Cars Hard | Ped. Easy | Ped. Mod. | Ped. Hard | Cyc. Easy | Cyc. Mod. | Cyc. Hard |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Vote3D | × | L | 56.80 | 47.99 | 42.57 | 44.48 | 35.74 | 33.72 | 41.43 | 31.24 | 28.62 |
| LSVM-MDPM | × | C | 68.02 | 56.48 | 44.18 | 47.74 | 39.36 | 35.95 | 35.04 | 27.50 | 26.21 |
| SquaresICF | × | C | - | - | - | 57.33 | 44.42 | 40.08 | - | - | - |
| MDPM-un-BB | × | C | 71.19 | 62.16 | 48.48 | - | - | - | - | - | - |
| DPM-C8B1 | × | S | 74.33 | 60.99 | 47.16 | 38.96 | 29.03 | 25.61 | 43.49 | 29.04 | 26.20 |
| DPM-VOC+VP | × | C | 74.95 | 64.71 | 48.76 | 59.48 | 44.86 | 40.37 | 42.43 | 31.08 | 28.23 |
| OC-DPM | × | C | 74.94 | 65.95 | 53.86 | - | - | - | - | - | - |
| AOG | × | C | 84.36 | 71.88 | 59.27 | - | - | - | - | - | - |
| SubCat | × | C | 84.14 | 75.46 | 59.71 | 54.67 | 42.34 | 37.95 | - | - | - |
| DA-DPM | × | C | - | - | - | 56.36 | 45.51 | 41.08 | - | - | - |
| Faster R-CNN | × | C | 86.71 | 81.84 | 71.12 | 78.86 | 65.90 | 61.18 | 72.26 | 63.35 | 55.90 |
| FilteredICF | × | C | - | - | - | 61.14 | 53.98 | 49.29 | - | - | - |
| pAUCEnsT | × | C | - | - | - | 65.26 | 54.49 | 48.60 | 51.62 | 38.03 | 33.38 |
| 3DVP | × | C | 87.46 | 75.77 | 65.38 | - | - | - | - | - | - |
| Regionlets | × | C | 84.75 | 76.45 | 59.70 | 73.14 | 61.15 | 55.21 | 70.41 | 58.72 | 51.83 |
| uickitti | × | C | 90.83 | 89.23 | 79.46 | 83.49 | | 67.00 | 78.40 | 70.90 | 62.54 |
| Fusion-DPM | Decision | L + C | - | - | - | 59.51 | 46.67 | 42.05 | - | - | - |
| MV-RGBD-RF | Early | L + C | 76.40 | 69.92 | 57.47 | 73.30 | 56.59 | 49.63 | 52.97 | 42.61 | 37.42 |
| 3DOP | Early | S + C | 93.04 | 88.64 | 79.10 | 81.78 | 67.47 | 64.70 | 78.39 | 68.94 | 61.37 |
| Ours (CCD) | × | C | 88.84 | 84.77 | 73.81 | 77.92 | 68.81 | 59.33 | 72.60 | 67.32 | 57.21 |
| Ours (LiDAR) | × | L | 70.32 | 67.97 | 59.62 | 64.96 | 59.29 | 37.28 | 63.45 | 58.34 | 30.22 |
| Ours (TBM) | Decision | L + C | 84.25 | 81.66 | 74.48 | 69.49 | 67.81 | 62.14 | 70.81 | 68.11 | 60.25 |
| Ours (CRF) | Decision | L + C | 83.48 | 82.71 | 70.55 | 78.34 | 68.97 | 60.38 | 72.84 | 68.42 | 61.01 |
| Ours | Decision | L + C | 70.84 | | | | | | | | |
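The 77.72% figure quoted in the abstract is the mean of the three per-class AP values at the moderate difficulty level:

```latex
\mathrm{mAP}_{\mathrm{moderate}}
  = \frac{1}{3}\left(\mathrm{AP}^{\mathrm{mod}}_{\mathrm{car}}
  + \mathrm{AP}^{\mathrm{mod}}_{\mathrm{ped}}
  + \mathrm{AP}^{\mathrm{mod}}_{\mathrm{cyc}}\right) = 77.72\%
```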
Figure 6. Qualitative results of our proposed method. We projected the classification results onto the image data. (a) Results of the CCD unary classifier; (b) results of the LiDAR unary classifier; (c) results of …; (d) results of the proposed method. Each box indicates the following. Yellow boxes: correctly detected and classified objects; red boxes: failures; green boxes: undetected objects.