Jianwei Li, Wei Gao, Heping Li, Fulin Tang, Yihong Wu.
Abstract
3D scene reconstruction is an important topic in computer vision. A complete scene is reconstructed from views acquired along the camera trajectory, each of which covers only a small part of the scene. Camera tracking in textureless scenes is a well-known difficulty, and obtaining accurate 3D models quickly remains a major challenge for existing systems. For robotics applications, we propose a robust CPU-based approach that reconstructs indoor scenes efficiently with a consumer RGB-D camera. The approach combines feature-based camera tracking with volumetric data integration and performs well in terms of both robustness and efficiency. Its key points are: (i) a robust and fast camera tracking method combining points and edges, which improves tracking stability in textureless scenes; (ii) an efficient data fusion strategy that selects camera views and integrates RGB-D images on multiple scales, which improves the efficiency of volumetric integration; (iii) a novel RGB-D scene reconstruction system that can be implemented quickly on a standard CPU. Experimental results demonstrate that our approach reconstructs scenes with higher robustness and efficiency than state-of-the-art reconstruction systems.
Keywords: 3D reconstruction; camera tracking; simultaneous localization and mapping (SLAM); volumetric integration
Year: 2018 PMID: 30373281 PMCID: PMC6263609 DOI: 10.3390/s18113652
Source DB: PubMed Journal: Sensors (Basel) ISSN: 1424-8220 Impact factor: 3.576
Figure 1. The pipeline of the proposed CPU-based 3D reconstruction system, which consists of two main stages: robust camera tracking and efficient volumetric integration. TSDF, Truncated Signed Distance Function.
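The volumetric integration stage in Figure 1 relies on a TSDF. The record here does not spell out the update rule, so the following is only a minimal sketch, assuming the standard weighted running-average TSDF update for a single voxel. The names `Voxel`, `integrate`, `trunc`, `obs_weight` and `max_weight` are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class Voxel:
    tsdf: float = 1.0    # truncated signed distance, normalised to [-1, 1]
    weight: float = 0.0  # accumulated observation weight


def integrate(voxel, sdf, trunc, obs_weight=1.0, max_weight=100.0):
    """Fuse one signed-distance observation (in metres) into a voxel.

    Assumption: standard running weighted average; not the paper's exact rule.
    """
    if sdf < -trunc:
        # Far behind the observed surface: treated as occluded, left unchanged.
        return voxel
    d = min(1.0, sdf / trunc)  # truncate and normalise the observation
    new_weight = voxel.weight + obs_weight
    fused = (voxel.tsdf * voxel.weight + d * obs_weight) / new_weight
    # Capping the weight keeps the voxel responsive to later observations.
    return Voxel(fused, min(new_weight, max_weight))


v = Voxel()
v = integrate(v, sdf=0.02, trunc=0.04)  # first observation, in front of the surface
v = integrate(v, sdf=0.00, trunc=0.04)  # second observation, on the surface
print(v)
```

Each depth frame would apply this update to every voxel inside the truncation band along a camera ray; the reconstructed surface is then the zero crossing of the fused values.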
Figure 2. (a,b) Comparison of ORB feature extraction and edge extraction on an RGB image. Green denotes the ORB feature points; blue denotes edges with low uncertainty; red denotes edges with high uncertainty. (c) Variation of the camera tracking accuracy (Absolute Trajectory Error (ATE) RMSE in centimeters) with the number of ORB features extracted per frame.
Accuracy of camera tracking (ATE RMSE in centimeters) for different tracking methods on the TUM RGB-D dataset (with an Intel Core i7-4790 CPU). Bold shows the best results. X denotes that tracking failed to initialize or was lost.
| Sequence | Point: ORB-SLAM (Monocular) | Point: ORB-SLAM (RGB-D) | Edge: Edge VO (Monocular) | Edge: Edge SLAM (Monocular) | Point and Line: PL-SLAM (Monocular) | Point and Edge: Our Method (RGB-D) |
|---|---|---|---|---|---|---|
| fr1_xyz | | 1.07 | 16.51 | 1.31 | 1.21 | 0.91 |
| fr2_desk | | 0.90 | 33.67 | 1.75 | - | 0.92 |
| fr2_xyz | | 0.40 | 21.41 | 0.49 | 0.43 | 0.37 |
| fr3_st_near | 1.58 | 1.10 | 47.63 | 1.12 | 1.25 | |
| fr3_st_far | 0.77 | 1.06 | 121.00 | | 0.89 | 1.02 |
| fr3_snt_near | X | X | 101.03 | 8.29 | - | |
| fr3_snt_far | X | 6.71 | 41.76 | 6.71 | - | |
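The ATE RMSE figures in the table above summarize the translational error between an estimated trajectory and the ground truth. As a minimal sketch, assuming the two trajectories are already time-associated and rigidly aligned (the full metric normally performs that alignment first), the RMSE step could look like this. The function name `ate_rmse` and the sample positions are illustrative only.

```python
import math


def ate_rmse(estimated, ground_truth):
    """RMSE of the translational error between two trajectories.

    Assumption: both lists hold 3D positions that are already
    time-associated and expressed in the same (aligned) frame.
    """
    if not estimated or len(estimated) != len(ground_truth):
        raise ValueError("trajectories must be non-empty and equally long")
    squared_errors = [
        sum((e - g) ** 2 for e, g in zip(p_est, p_gt))
        for p_est, p_gt in zip(estimated, ground_truth)
    ]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Made-up positions in metres, purely for illustration.
gt = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
est = [(0.0, 0.0, 0.03), (1.0, 0.04, 0.0)]
print(f"ATE RMSE: {ate_rmse(est, gt) * 100:.2f} cm")
```

Multiplying by 100 converts metres to the centimeters used in the table.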
Mean computing speed of camera tracking on the TUM RGB-D dataset (with an Intel Core i7-4790 CPU). Bold shows the best results.
| Operation | Point: ORB-SLAM | Point and Line: PL-SLAM | Point and Edge: Our Method |
|---|---|---|---|
| Features Extraction (ms) | | 31.32 | Point: |
| | | | Edge: |
| Initial Pose Estimation (ms) | 7.16 | 7.16 | |
| Track Local Map (ms) | | 12.58 | |
| Total (fps) | 50 Hz | 20 Hz | |
Accuracy of camera tracking (ATE RMSE in centimeters) via points and edges on the ICL-NUIM [38] and Augmented ICL-NUIM [12] datasets. Bold shows the best results. X denotes that tracking failed to initialize or was lost.
| Dataset | Sequence | Point: ORB-SLAM (RGB-D) | Edge: Edge VO (RGB-D) | Point and Edge: Our Method (RGB-D) |
|---|---|---|---|---|
| ICL-NUIM | kt0 | X | 39.6 | |
| | kt1 | 0.77 | 27.7 | |
| | kt2 | 1.29 | 64.8 | |
| | kt3 | | 114 | 0.94 |
| ICL-NUIM | kt0 | | X | 3.48 |
| | kt1 | X | 92.9 | |
| | kt2 | | 44.5 | 1.96 |
| | kt3 | 1.36 | 27.3 | |
| Augmented ICL-NUIM | Living Room 1 | 3.71 | X | |
| | Living Room 2 | 1.09 | 112 | |
| | Office 1 | X | 176 | |
| | Office 2 | | 178 | 3.13 |
Figure 3. Variation of the number of camera views and of the data fusion time. (a) Number of camera views before and after view selection with different velocity thresholds. (b) Data fusion time before and after view selection (ICL-NUIM and our datasets: v = 0.2; TUM RGB-D dataset: v = 0.5).
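The caption above refers to view selection with a velocity threshold v, but the selection rule itself is not given in this record. One possible reading, shown here purely as a hedged sketch, is that a frame is kept for fusion only when the estimated camera speed since the previous frame is at most v. The function `select_views`, its arguments, and the units of the threshold are assumptions, not the paper's definition.

```python
import math


def select_views(positions, timestamps, v_max):
    """Return the indices of frames whose estimated camera speed is <= v_max.

    Assumption: speed is the distance between consecutive camera positions
    divided by the time between them; the first frame is always kept.
    """
    kept = [0] if positions else []
    for i in range(1, len(positions)):
        dt = timestamps[i] - timestamps[i - 1]
        if dt <= 0:
            continue  # skip frames with unusable timestamps
        speed = math.dist(positions[i], positions[i - 1]) / dt
        if speed <= v_max:
            kept.append(i)
    return kept


# Made-up camera positions (metres) and timestamps (seconds).
positions = [(0.0, 0.0, 0.0), (0.01, 0.0, 0.0), (0.5, 0.0, 0.0), (0.51, 0.0, 0.0)]
timestamps = [0.0, 0.1, 0.2, 0.3]
print(select_views(positions, timestamps, v_max=0.2))
```

Under this reading, a lower threshold discards more frames, which would reduce the number of views to fuse and therefore the data fusion time.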
Figure 4. Enlarged views of the reconstruction results with different numbers (100, 200, 500, 1000, 2000 and 3000) of camera views on the fr2_xyz sequence of the TUM RGB-D dataset.
Figure 5. Comparison of the reconstruction results for a desk scene before and after camera view selection. Note that the mesh model is fused with the standard TSDF.
Accuracy of the estimated camera trajectories (ATE RMSE in centimeters) and surface reconstruction (median distance in centimeters) on the ICL-NUIM living room sequences [38]. Bold shows the best results.
| Platform | Method | Trajectory kt0 | Trajectory kt1 | Trajectory kt2 | Trajectory kt3 | Trajectory Average | Surface kt0 | Surface kt1 | Surface kt2 | Surface kt3 | Surface Average |
|---|---|---|---|---|---|---|---|---|---|---|---|
| GPU | Kintinuous | 7.2 | 0.5 | 1.0 | 35.5 | 11.05 | 1.1 | 0.8 | 0.9 | 15.0 | 4.45 |
| | Choi et al. | 1.4 | 7.0 | 1.0 | 3.0 | 1.53 | 1.0 | 1.4 | 1.0 | 1.9 | 1.33 |
| | ElasticFusion | 0.9 | 0.9 | 1.4 | 10.6 | 3.45 | 0.7 | 0.7 | 0.8 | 2.8 | 1.25 |
| | InfiniTAM v3 | 0.9 | 2.9 | 0.9 | 4.1 | 2.20 | 1.3 | 1.1 | | 1.4 | 0.98 |
| | BundleFusion | 0.6 | | | 1.1 | | | | 0.7 | 0.8 | |
| CPU | DVO SLAM | 10.4 | 2.9 | 19.1 | 15.2 | 11.90 | 3.2 | 6.1 | 11.9 | 5.3 | 6.63 |
| | Our method | | 0.7 | 1.2 | | 0.80 | | 0.7 | 1.0 | | 0.73 |
Figure 6. Reconstruction results of the proposed approach on the ICL-NUIM living room sequences: the first row shows the estimated camera trajectories compared with the ground truth, and the second row shows the surface reconstruction models.
Accuracy of the estimated camera trajectories (ATE RMSE in meters) on the Augmented ICL-NUIM dataset [12]. X denotes that tracking was lost. Bold shows the best results.
| Platform | Method | Living Room 1 | Living Room 2 | Office 1 | Office 2 | Average |
|---|---|---|---|---|---|---|
| GPU | Kintinuous | 0.27 | 0.28 | 0.19 | 0.26 | 0.250 |
| | Choi et al. | 0.10 | 0.13 | 0.13 | 0.09 | 0.113 |
| | ElasticFusion | 0.62 | 0.37 | 0.13 | 0.13 | 0.313 |
| | InfiniTAM v3 | X | X | X | X | X |
| | BundleFusion | | | 0.15 | | 0.045 |
| CPU | DVO SLAM | 1.02 | 0.14 | 0.11 | 0.11 | 0.345 |
| | Our method | 0.04 | | | 0.03 | |
Accuracy of the estimated camera trajectories (ATE RMSE in centimeters) and mean speed (fps) of data fusion on the TUM RGB-D dataset [35]. Bold shows the best results.
| Method | RMSE fr1_desk | RMSE fr2_xyz | RMSE fr3_office | RMSE fr3_nst | RMSE Average | Mean Speed, GPU | Mean Speed, CPU |
|---|---|---|---|---|---|---|---|
| Kintinuous | 3.7 | 2.9 | 3.0 | 3.1 | 3.18 | 15 Hz | - |
| Choi et al. | 39.6 | 29.4 | 8.1 | - | 25.7 | offline | - |
| ElasticFusion | 2.0 | 1.1 | 1.7 | 1.6 | 1.60 | 32 Hz | - |
| InfiniTAM v3 | 1.8 | 2.1 | 2.2 | 2.0 | 2.03 | | - |
| BundleFusion | | 1.1 | 2.2 | | 1.53 | 36 Hz | - |
| DVO SLAM | 2.1 | 1.8 | 3.5 | 1.8 | 2.30 | - | 30 Hz |
| Our method | | | | 1.9 | | - | |
Figure 7. Comparison of the reconstruction results on real-world scenes. Fr3_snt_far and fr3_snt_near were scanned manually with the Asus Xtion sensor. Corridor and room were scanned by a robot equipped with a Microsoft Kinect.