Claudio Cimarelli, Hriday Bavle, Jose Luis Sanchez-Lopez, Holger Voos.
Abstract
Unsupervised learning for monocular camera motion and 3D scene understanding has gained popularity over traditional methods, which rely on epipolar geometry or non-linear optimization. Notably, deep learning can overcome many issues of monocular vision, such as perceptual aliasing, low-textured areas, scale drift, and degenerate motions. In addition, compared with supervised learning, it can fully leverage video stream data without the need for depth or motion labels. However, in this work, we note that rotational motion can limit the accuracy of unsupervised pose networks more than the translational component. Therefore, we present RAUM-VO, an approach based on a model-free epipolar constraint for frame-to-frame motion estimation (F2F) to adjust the rotation during training and online inference. To this end, we match 2D keypoints between consecutive frames using pre-trained deep networks, Superpoint and Superglue, while training a network for depth and pose estimation using an unsupervised training protocol. Then, we adjust the predicted rotation with the motion estimated by F2F, using the 2D matches and initializing the solver with the pose network prediction. Ultimately, RAUM-VO shows a considerable accuracy improvement over other unsupervised pose networks on the KITTI dataset, while reducing the complexity of other hybrid or traditional approaches and achieving results comparable to the state of the art.
Keywords: deep learning; depth estimation; unsupervised learning; visual odometry
Year: 2022 PMID: 35408264 PMCID: PMC9003133 DOI: 10.3390/s22072651
Source DB: PubMed Journal: Sensors (Basel) ISSN: 1424-8220 Impact factor: 3.576
Figure 1. RAUM-VO block diagram. The figure shows the flow of information inside RAUM-VO from the input image sequence to the final estimated pose between each pair of consecutive image frames.
Figure 2. Diagram of RAUM-VO training. A sequence of images and the 2D matches between image pairs are the input for the training. The depth network takes only a single image and outputs a disparity map. The pose network outputs the 3D rigid transformation, as rotation and translation, between two input images that are temporally ordered and concatenated along the channel dimension. The matches are the input to the frame-to-frame rotation algorithm, whose output guides the training and adjusts the pose network estimate at test time.
Figure 3. KITTI train trajectories. Estimated trajectories for the KITTI odometry sequences 00 to 08. Poses are given in the camera frame; thus, positive x points to the right and positive z points forward. Best viewed in color.
Figure 4. KITTI test trajectories. Estimated trajectories for the KITTI odometry sequences 09 and 10. Poses are given in the camera frame; thus, positive x points to the right and positive z points forward. Best viewed in color.
Odometry quantitative evaluation. Results obtained on KITTI odometry sequences 00–10. Data are retrieved from [16]. Best results are highlighted in bold, second best with an underline.
| Category | Method | Metric | 00 | 01 | 02 | 03 | 04 | 05 | 06 | 07 | 08 | 09 | 10 | Train Avg. Err. | Tot. Avg. Err. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Geometric | ORB-SLAM2 | t_err (%) | 11.43 | 107.57 | 10.34 | | | 9.04 | 14.56 | 9.77 | 11.46 | 9.30 | 2.57 | 19.604 | 17.119 |
| | | r_err (°/100 m) | | 0.89 | | | | | | | | | | | |
| | | ATE | 40.65 | 502.20 | 47.82 | | | 29.95 | 40.82 | 16.04 | 43.09 | 38.77 | | 80.312 | 69.727 |
| | | RPE (m) | 0.169 | 2.970 | 0.172 | 0.031 | 0.078 | 0.140 | 0.237 | 0.105 | 0.192 | 0.128 | | 0.455 | 0.388 |
| | | RPE (°) | 0.079 | 0.098 | 0.072 | 0.055 | 0.079 | 0.058 | 0.055 | 0.047 | 0.061 | 0.061 | 0.065 | 0.067 | 0.066 |
| | VISO2 | t_err (%) | 10.53 | 61.36 | 18.71 | 30.21 | 34.05 | 13.16 | 17.69 | 10.80 | 13.85 | 18.06 | 26.10 | 23.373 | 23.138 |
| | | r_err (°/100 m) | 2.73 | 7.68 | 1.19 | 2.21 | 1.78 | 3.65 | 1.93 | 4.67 | 2.52 | 1.25 | 3.26 | 3.151 | 2.988 |
| | | ATE | 79.24 | 494.60 | 70.13 | 52.36 | 38.33 | 66.75 | 40.72 | 18.32 | 61.49 | 52.62 | 57.25 | 102.438 | 93.801 |
| | | RPE (m) | 0.221 | 1.413 | 0.318 | 0.226 | 0.496 | 0.213 | 0.343 | 0.191 | 0.234 | 0.284 | 0.442 | 0.406 | 0.398 |
| | | RPE (°) | 0.141 | 0.432 | 0.108 | 0.157 | 0.103 | 0.131 | 0.118 | 0.176 | 0.128 | 0.125 | 0.154 | 0.166 | 0.161 |
| Unsupervised | SfM-Learner | t_err (%) | 21.32 | 22.41 | 24.10 | 12.56 | 4.32 | 12.99 | 15.55 | 12.61 | 10.66 | 11.32 | 15.25 | 15.169 | 14.826 |
| | | r_err (°/100 m) | 6.19 | 2.79 | 4.18 | 4.52 | 3.28 | 4.66 | 5.58 | 6.31 | 3.75 | 4.07 | 4.06 | 4.584 | 4.490 |
| | | ATE | 104.87 | 109.61 | 185.43 | 8.42 | 3.10 | 60.89 | 52.19 | 20.12 | 30.97 | 26.93 | 24.09 | 63.956 | 56.965 |
| | | RPE (m) | 0.282 | 0.660 | 0.365 | 0.077 | 0.125 | 0.158 | 0.151 | 0.081 | 0.122 | 0.103 | 0.118 | 0.225 | 0.204 |
| | | RPE (°) | 0.227 | 0.133 | 0.172 | 0.158 | 0.108 | 0.153 | 0.119 | 0.181 | 0.152 | 0.159 | 0.171 | 0.156 | 0.158 |
| | SC-SfMLearner | t_err (%) | 11.01 | 27.09 | 6.74 | 9.22 | 4.22 | 6.70 | 5.36 | 8.29 | 8.11 | 7.64 | 10.74 | 9.638 | 9.556 |
| | | r_err (°/100 m) | 3.39 | 1.31 | 1.96 | 4.93 | 2.01 | 2.38 | 1.65 | 4.53 | 2.61 | 2.19 | 4.58 | 2.752 | 2.867 |
| | | ATE | 93.04 | 85.90 | 70.37 | 10.21 | 2.97 | 40.56 | 12.56 | 21.01 | 56.15 | 15.02 | 20.19 | 43.641 | 38.907 |
| | | RPE (m) | 0.139 | 0.888 | 0.092 | 0.059 | 0.073 | 0.070 | 0.069 | 0.075 | 0.085 | 0.095 | 0.105 | 0.172 | 0.159 |
| | | RPE (°) | 0.129 | 0.075 | 0.087 | 0.068 | 0.055 | 0.069 | 0.066 | 0.074 | 0.074 | 0.102 | 0.107 | 0.077 | 0.082 |
| | Simple-Mono-VO | t_err (%) | 9.365 | 8.920 | 6.830 | 3.697 | 2.570 | 4.964 | 3.138 | 3.568 | 7.125 | 13.625 | 11.131 | 5.575 | 6.812 |
| | | r_err (°/100 m) | 2.840 | 0.562 | 1.582 | 2.478 | 0.566 | 2.083 | 0.959 | 1.866 | 2.608 | 3.146 | 4.784 | 1.727 | 2.134 |
| | | ATE | 94.949 | 30.004 | 83.155 | 4.112 | 2.377 | 30.227 | 8.726 | 8.872 | 59.887 | 66.591 | 18.792 | 35.812 | 37.063 |
| | | RPE (m) | 0.090 | 0.304 | 0.087 | 0.037 | 0.055 | 0.041 | 0.051 | 0.044 | 0.074 | 0.166 | 0.077 | 0.087 | 0.093 |
| | | RPE (°) | 0.072 | 0.042 | 0.057 | 0.048 | 0.036 | 0.049 | 0.040 | 0.048 | 0.052 | 0.067 | 0.083 | 0.049 | 0.054 |
| Hybrid | DF-VO | t_err (%) | | 39.46 | | | | | | | | | | 5.969 | |
| | | r_err (°/100 m) | | | | | | | | | | | | | |
| | | ATE | | 117.40 | | | | | | | | | | | |
| | | RPE (m) | | 1.554 | | | | | | | | | | 0.205 | |
| | | RPE (°) | | | | | | | | | | | | | |
| | RAUM-VO | t_err (%) | 2.548 | 8.354 | 2.578 | 3.217 | 2.860 | 3.045 | 3.033 | 2.390 | 3.632 | 2.927 | 5.843 | 3.517 | 3.675 |
| | | r_err (°/100 m) | 0.775 | 0.868 | 0.582 | 1.334 | 0.645 | 1.153 | 0.837 | 1.037 | 1.074 | 0.318 | 0.683 | 0.923 | 0.846 |
| | | ATE | 16.272 | 23.748 | 16.139 | 2.602 | 2.283 | 17.470 | 9.234 | 2.164 | 16.303 | 8.664 | 12.297 | 11.802 | 11.561 |
| | | RPE (m) | 0.040 | 0.257 | 0.050 | 0.030 | 0.052 | 0.038 | 0.046 | 0.028 | 0.053 | 0.068 | 0.078 | 0.066 | 0.067 |
| | | RPE (°) | 0.059 | 0.062 | 0.048 | 0.048 | 0.035 | 0.044 | 0.042 | 0.058 | 0.045 | 0.042 | 0.051 | 0.049 | 0.049 |
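The two RPE rows are easier to read with a concrete definition in mind. The record itself does not define the metrics, so the following is only a minimal sketch, assuming the commonly used relative pose error in which the error transform between a ground-truth relative pose `gt` and a predicted one `pred` is `gt^-1 * pred`. The function name and the `(R, t)` tuple layout are illustrative choices, not taken from the paper.

```python
import math
from typing import List, Tuple

Matrix = List[List[float]]
Vector = List[float]
Pose = Tuple[Matrix, Vector]  # (3x3 rotation R, translation t) of one frame pair


def relative_pose_error(gt: Pose, pred: Pose) -> Tuple[float, float]:
    """Return (translation error, rotation error in degrees) for one frame pair.

    Assumption: both poses are relative motions between the same two frames,
    and the translation is expressed in metres.
    """
    r_gt, t_gt = gt
    r_pred, t_pred = pred
    # The rotation part of gt^-1 * pred is R_gt^T R_pred; only its trace is needed.
    trace = sum(r_gt[k][i] * r_pred[k][i] for i in range(3) for k in range(3))
    # Clamp to the valid acos domain to absorb floating-point round-off.
    cos_angle = max(-1.0, min(1.0, (trace - 1.0) / 2.0))
    angle_deg = math.degrees(math.acos(cos_angle))
    # The translation part is R_gt^T (t_pred - t_gt); a rotation keeps the norm,
    # so the norm of the plain difference is sufficient.
    diff = [t_pred[i] - t_gt[i] for i in range(3)]
    return math.sqrt(sum(d * d for d in diff)), angle_deg
```

Averaging these per-pair values over a sequence would give one number per column; whether the reported figures use a mean or an RMSE is not stated here.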
Margins for improvement in unsupervised pose predictions. We alternately substitute the ground-truth translations and the ground-truth rotations into the pose network estimates and report the resulting change in the relevant metrics for the KITTI test sequences 09 and 10.
| Poses Source | Metrics | 09 | 10 |
|---|---|---|---|
| Simple-Mono-VO | t_err (%) | 13.625 | 11.131 |
| | r_err (°/100 m) | 3.146 | 4.784 |
| | ATE | 66.591 | 18.792 |
| | RPE (m) | 0.166 | 0.077 |
| | RPE (°) | 0.067 | 0.083 |
| Ground-Truth translation | t_err (%) | 13.325 | 11.409 |
| | r_err (°/100 m) | 3.146 | 4.784 |
| | ATE | 65.081 | 20.715 |
| | RPE (m) | 0.162 | 0.028 |
| | RPE (°) | 0.067 | 0.083 |
| Ground-Truth rotation | t_err (%) | 3.029 | 6.038 |
| | r_err (°/100 m) | 0.010 | 0.014 |
| | ATE | 9.026 | 12.894 |
| | RPE (m) | 0.070 | 0.080 |
| | RPE (°) | 0.005 | 0.005 |
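The substitution experiment in the table above can be sketched in a few lines. This is a hypothetical illustration, not the authors' evaluation code: it assumes each relative pose is stored as an `(R, t)` pair and that predicted and ground-truth poses are already expressed at the same scale; the name `substitute_component` is introduced here for illustration only.

```python
from typing import List, Tuple

Matrix = List[List[float]]
Vector = List[float]
Pose = Tuple[Matrix, Vector]  # (3x3 rotation R, translation t) of one frame pair


def substitute_component(predicted: List[Pose], ground_truth: List[Pose],
                         component: str) -> List[Pose]:
    """Copy the predicted poses, taking one component from the ground truth."""
    if component not in ("translation", "rotation"):
        raise ValueError("component must be 'translation' or 'rotation'")
    if len(predicted) != len(ground_truth):
        raise ValueError("pose lists must have the same length")
    mixed: List[Pose] = []
    for (r_pred, t_pred), (r_gt, t_gt) in zip(predicted, ground_truth):
        if component == "translation":
            # Keep the network rotation, swap in the ground-truth translation.
            mixed.append(([row[:] for row in r_pred], list(t_gt)))
        else:
            # Keep the network translation, swap in the ground-truth rotation.
            mixed.append(([row[:] for row in r_gt], list(t_pred)))
    return mixed
```

Evaluating the two mixed trajectories with the same metrics as the unmodified prediction isolates how much of the error comes from each component.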
F2F solver initialization. Comparison of different initialization approaches for the Levenberg–Marquardt scheme that solves the frame-to-frame motion. Overall, the rotation from the pose network is the best, followed by a constant motion model.
| Initialization | Metrics | 00 | 01 | 02 | 03 | 04 | 05 | 06 | 07 | 08 | 09 | 10 | Avg. Train | Avg. All |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Identity | t_err (%) | 6.192 | 8.023 | 5.888 | 3.919 | 2.860 | 7.659 | 9.100 | 10.969 | 5.402 | 3.851 | 9.475 | 6.668 | 6.667 |
| | r_err (°/100 m) | 2.222 | 1.025 | 1.670 | 1.909 | 0.645 | 3.340 | 2.926 | 6.565 | 1.926 | 0.742 | 2.605 | 2.470 | 2.325 |
| | ATE | 39.195 | 21.231 | 91.621 | 2.651 | 2.283 | 40.192 | 19.682 | 20.592 | 30.142 | 12.939 | 13.399 | 29.732 | 26.721 |
| | RPE (m) | 0.040 | 0.259 | 0.060 | 0.030 | 0.052 | 0.039 | 0.046 | 0.036 | 0.052 | 0.069 | 0.077 | 0.068 | 0.069 |
| | RPE (°) | 0.100 | 0.101 | 0.072 | 0.082 | 0.035 | 0.083 | 0.059 | 0.158 | 0.067 | 0.070 | 0.088 | 0.084 | 0.083 |
| Constant Motion | t_err (%) | 6.062 | 12.009 | 5.823 | 6.606 | 2.860 | 5.877 | 3.033 | 2.481 | 19.533 | 3.255 | 5.843 | 7.143 | 6.671 |
| | r_err (°/100 m) | 2.128 | 1.833 | 1.728 | 3.119 | 0.645 | 2.105 | 0.837 | 1.150 | 7.772 | 0.862 | 0.683 | 2.368 | 2.078 |
| | ATE | 58.308 | 49.099 | 79.710 | 6.678 | 2.283 | 29.920 | 9.234 | 2.258 | 99.024 | 11.190 | 12.297 | 37.390 | 32.727 |
| | RPE (m) | 0.044 | 0.265 | 0.056 | 0.030 | 0.052 | 0.039 | 0.046 | 0.028 | 0.160 | 0.069 | 0.078 | 0.080 | 0.079 |
| | RPE (°) | 0.075 | 0.086 | 0.059 | 0.066 | 0.035 | 0.060 | 0.042 | 0.068 | 0.702 | 0.072 | 0.051 | 0.133 | 0.120 |
| Pose Network | t_err (%) | 2.548 | 8.354 | 2.578 | 3.217 | 2.860 | 3.045 | 3.033 | 2.390 | 3.632 | 2.927 | 5.843 | 3.517 | 3.675 |
| | r_err (°/100 m) | 0.775 | 0.868 | 0.582 | 1.334 | 0.645 | 1.153 | 0.837 | 1.037 | 1.074 | 0.318 | 0.683 | 0.923 | 0.846 |
| | ATE | 16.272 | 23.748 | 16.139 | 2.602 | 2.283 | 17.470 | 9.234 | 2.164 | 16.303 | 8.664 | 12.297 | 11.802 | 11.561 |
| | RPE (m) | 0.040 | 0.257 | 0.050 | 0.030 | 0.052 | 0.038 | 0.046 | 0.028 | 0.053 | 0.068 | 0.078 | 0.066 | 0.067 |
| | RPE (°) | 0.059 | 0.062 | 0.048 | 0.048 | 0.035 | 0.044 | 0.042 | 0.058 | 0.045 | 0.042 | 0.051 | 0.049 | 0.049 |
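To make the three initialization strategies compared above concrete, the sketch below pairs a starting-rotation selector with the textbook epipolar residual that a frame-to-frame solver would drive toward zero. It is only an illustration under stated assumptions: the motion convention is X2 = R X1 + t, keypoints are normalized homogeneous image coordinates, and the names `initial_rotation` and `epipolar_residual` are hypothetical. The actual F2F cost and its Levenberg–Marquardt implementation are not given in this record.

```python
from typing import List, Optional

Matrix = List[List[float]]
Vector = List[float]


def identity() -> Matrix:
    return [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]


def matvec(m: Matrix, v: Vector) -> Vector:
    return [sum(m[i][j] * v[j] for j in range(3)) for i in range(3)]


def cross(a: Vector, b: Vector) -> Vector:
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]


def epipolar_residual(rotation: Matrix, translation: Vector,
                      x1: Vector, x2: Vector) -> float:
    """x2 . (t x (R x1)), which vanishes for an exact match when X2 = R X1 + t."""
    rotated = matvec(rotation, x1)
    normal = cross(translation, rotated)
    return sum(x2[i] * normal[i] for i in range(3))


def initial_rotation(strategy: str, previous: Optional[Matrix] = None,
                     network: Optional[Matrix] = None) -> Matrix:
    """Pick the starting rotation handed to an iterative solver."""
    if strategy == "identity":
        return identity()
    if strategy == "constant_motion":
        # Reuse the rotation estimated for the previous frame pair, if any.
        return [row[:] for row in previous] if previous is not None else identity()
    if strategy == "pose_network":
        if network is None:
            raise ValueError("pose_network strategy needs a network prediction")
        return [row[:] for row in network]
    raise ValueError("unknown strategy: " + strategy)
```

A starting point closer to the true rotation begins with smaller residuals, which is consistent with the pose network initialization giving the lowest errors in the table.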
PnP vs. pose network. Comparison of the trajectory estimated by PnP combined with the depth network and the poses predicted by our trained network.
| Poses Source | Metrics | 00 | 01 | 02 | 03 | 04 | 05 | 06 | 07 | 08 | 09 | 10 | Avg. Train | Avg. All |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Pose Network | t_err (%) | 9.365 | 8.920 | 6.830 | 3.697 | 2.570 | 4.964 | 3.138 | 3.568 | 7.125 | 13.625 | 11.131 | 5.575 | 6.812 |
| | r_err (°/100 m) | 2.840 | 0.562 | 1.582 | 2.478 | 0.566 | 2.083 | 0.959 | 1.866 | 2.608 | 3.146 | 4.784 | 1.727 | 2.134 |
| | ATE | 94.949 | 30.004 | 83.155 | 4.112 | 2.377 | 30.227 | 8.726 | 8.872 | 59.887 | 66.591 | 18.792 | 35.812 | 37.063 |
| | RPE (m) | 0.090 | 0.304 | 0.087 | 0.037 | 0.055 | 0.041 | 0.051 | 0.044 | 0.074 | 0.166 | 0.077 | 0.087 | 0.093 |
| | RPE (°) | 0.072 | 0.042 | 0.057 | 0.048 | 0.036 | 0.049 | 0.040 | 0.048 | 0.052 | 0.067 | 0.083 | 0.049 | 0.054 |
| PnP | t_err (%) | 6.808 | 17.627 | 6.319 | 4.046 | 2.627 | 4.629 | 2.981 | 3.013 | 6.360 | 7.019 | 6.708 | 6.045 | 6.194 |
| | r_err (°/100 m) | 2.190 | 1.195 | 1.339 | 2.364 | 0.582 | 1.863 | 0.781 | 1.691 | 2.317 | 2.029 | 2.644 | 1.591 | 1.727 |
| | ATE | 79.125 | 63.596 | 76.800 | 4.402 | 2.424 | 29.000 | 8.660 | 7.106 | 52.700 | 35.664 | 9.576 | 35.979 | 33.550 |
| | RPE (m) | 0.061 | 0.636 | 0.086 | 0.033 | 0.055 | 0.039 | 0.049 | 0.040 | 0.067 | 0.082 | 0.073 | 0.118 | 0.111 |
| | RPE (°) | 0.060 | 0.057 | 0.049 | 0.042 | 0.029 | 0.039 | 0.032 | 0.036 | 0.043 | 0.068 | 0.085 | 0.043 | 0.049 |
| F2F rotation w/ | t_err (%) | 2.796 | 15.552 | 2.775 | 3.482 | 3.123 | 3.008 | 3.164 | 2.373 | 3.876 | 3.072 | 4.343 | 4.461 | 4.324 |
| | r_err (°/100 m) | 0.775 | 0.868 | 0.582 | 1.334 | 0.645 | 1.146 | 0.837 | 0.861 | 1.074 | 0.318 | 0.683 | 0.902 | 0.829 |
| | ATE | 17.662 | 41.782 | 15.194 | 2.342 | 2.459 | 17.203 | 9.451 | 3.983 | 16.741 | 8.288 | 8.909 | 14.091 | 13.092 |
| | RPE (m) | 0.043 | 0.527 | 0.053 | 0.034 | 0.055 | 0.040 | 0.050 | 0.035 | 0.055 | 0.071 | 0.073 | 0.099 | 0.094 |
| | RPE (°) | 0.059 | 0.062 | 0.048 | 0.048 | 0.035 | 0.045 | 0.042 | 0.059 | 0.046 | 0.042 | 0.051 | 0.049 | 0.049 |
| F2F rotation w/ | t_err (%) | 2.829 | 9.870 | 2.766 | 4.146 | 3.080 | 3.029 | 3.177 | 2.802 | 3.804 | 3.130 | 5.875 | 3.945 | 4.046 |
| | r_err (°/100 m) | 0.775 | 0.868 | 0.582 | 1.334 | 0.645 | 1.146 | 0.837 | 0.861 | 1.074 | 0.318 | 0.683 | 0.902 | 0.829 |
| | ATE | 18.339 | 28.499 | 15.497 | 2.468 | 2.419 | 17.363 | 9.502 | 4.732 | 16.426 | 9.033 | 12.410 | 12.805 | 12.426 |
| | RPE (m) | 0.043 | 0.307 | 0.053 | 0.037 | 0.055 | 0.041 | 0.051 | 0.036 | 0.056 | 0.070 | 0.079 | 0.075 | 0.075 |
| | RPE (°) | 0.059 | 0.062 | 0.048 | 0.048 | 0.035 | 0.045 | 0.042 | 0.059 | 0.046 | 0.042 | 0.051 | 0.049 | 0.049 |
Scale alignment. Results of the scale alignment procedure applied to the translation vectors of the motions estimated by F2F and by the essential matrix.
| Poses Source | Metrics | 09 | 10 |
|---|---|---|---|
| F2F | t_err (%) | 4.14 | 5.68 |
| | ATE | 12.91 | 11.67 |
| | RPE (m) | 0.114 | 0.091 |
| Essential Matrix | t_err (%) | 4.02 | 5.99 |
| | ATE | 11.77 | 12.42 |
| | RPE (m) | 0.124 | 0.099 |
| Pose Network | t_err (%) | 2.927 | 5.843 |
| | ATE | 8.664 | 12.297 |
| | RPE (m) | 0.068 | 0.078 |
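A translation recovered from 2D matches alone is defined only up to scale, which is why a scale alignment step is needed before it can be compared with the pose network output. The record does not describe the procedure the authors use, so the sketch below shows just one generic option: a least-squares scale that projects a reference translation (assumed here to come from a source with usable scale, such as the pose network) onto the up-to-scale direction. The function name `align_scale` is hypothetical.

```python
from typing import List, Tuple

Vector = List[float]


def align_scale(direction: Vector, reference: Vector) -> Tuple[float, Vector]:
    """Return the scale s minimising ||s * direction - reference||^2 and s * direction."""
    denom = sum(d * d for d in direction)
    if denom == 0.0:
        raise ValueError("direction must be non-zero")
    # Closed-form least-squares solution: s = <direction, reference> / <direction, direction>.
    scale = sum(d * r for d, r in zip(direction, reference)) / denom
    return scale, [scale * d for d in direction]
```

For example, `align_scale([0.0, 0.0, 1.0], [0.1, 0.0, 2.0])` yields a scale of 2.0 and keeps the original direction, discarding the sideways component of the reference.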