Faisal Khan, Saqib Salahuddin, Hossein Javidnia
Abstract
Monocular depth estimation from Red-Green-Blue (RGB) images is a well-studied ill-posed problem in computer vision which has been investigated intensively over the past decade using Deep Learning (DL) approaches. The recent approaches for monocular depth estimation mostly rely on Convolutional Neural Networks (CNN). Estimating depth from two-dimensional images plays an important role in various applications including scene reconstruction, 3D object detection, robotics and autonomous driving. This survey provides a comprehensive overview of this research topic including the problem representation and a short description of traditional methods for depth estimation. Relevant datasets and 13 state-of-the-art deep learning-based approaches for monocular depth estimation are reviewed, evaluated and discussed. We conclude this paper with a perspective towards future research work requiring further investigation in monocular depth estimation challenges.
Keywords: CNN monocular depth; monocular depth estimation; single image depth estimation
Year: 2020 PMID: 32316336 PMCID: PMC7219073 DOI: 10.3390/s20082272
Source DB: PubMed Journal: Sensors (Basel) ISSN: 1424-8220 Impact factor: 3.576
Datasets for monocular depth estimation.
| Dataset | Labelled Images | Annotation | Brief Description |
|---|---|---|---|
| NYU-v2 [ ] | 1449 | Depth + Segmentation | RGB and depth images taken from indoor scenes. |
| Make3D [ ] | 534 | Depth | RGB and depth images taken from outdoor scenes. |
| KITTI [ ] | 94K | Depth aligned with RAW data + Optical Flow | RGB and depth images from 394 road scenes. |
| Pandora [ ] | 250K | Depth + Annotation | RGB and depth images. |
| SceneFlow [ ] | 39K | Depth + Disparity + Optical Flow + Segmentation Map | Stereo image sets rendered from synthetic data with ground-truth depth, disparity and optical flow. |
Categories of deep learning-based monocular depth estimation methods (FC: fully convolutional; CNN: convolutional neural networks).
| Method | Architecture | Category |
|---|---|---|
| EMDEOM [ ] | FC | Supervised |
| ACAN [ ] | Encoder-Decoder | Supervised |
| DenseDepth [ ] | Encoder-Decoder | Supervised |
| DORN [ ] | CNN | Supervised |
| VNL [ ] | Encoder-Decoder | Supervised |
| BTS [ ] | Encoder-Decoder | Supervised |
| LSIM [ ] | Encoder-Decoder | Self-supervised |
| monoResMatch [ ] | CNN | Self-supervised |
| PackNet-SfM [ ] | CNN | Self-supervised |
| VOMonodepth [ ] | Auto-Decoder | Self-supervised |
| monodepth2 [ ] | CNN | Self-supervised |
| GASDA [ ] | CNN | Semi-supervised |
Properties of the studied methods for monocular depth estimation (FC: fully convolutional; ED: encoder-decoder; AD: auto-decoder; CNN: convolutional neural networks). Cells left blank were not recoverable from the source.
| Method | Type | Optimizer | Parameters | GPU Memory | GPU Model |
|---|---|---|---|---|---|
| BTS [ ] | ED | Adam | 47M | | 1080 Ti |
| DORN [ ] | CNN | Adam | 123.4M | 12 GB | TITAN Xp |
| VNL [ ] | ED | SGD | 2.7M | N/A | N/A |
| ACAN [ ] | ED | SGD | 80M | 11 GB | 1080 Ti |
| VOMonodepth [ ] | AD | Adam | 35M | 12 GB | TITAN Xp |
| LSIM [ ] | ED | Adam | 73.3M | 12 GB | TITAN Xp |
| GASDA [ ] | CNN | Adam | 70M | N/A | N/A |
| DenseDepth [ ] | ED | Adam | 42.6M | | TITAN Xp |
| monoResMatch [ ] | CNN | Adam | 42.5M | 12 GB | TITAN Xp |
| EMDEOM [ ] | FC | Adam | 63M | 12 GB | TITAN Xp |
| PackNet-SfM [ ] | CNN | Adam | 128M | | Tesla V100 |
| monodepth2 [ ] | CNN | Adam | 70M | 12 GB | TITAN Xp |
Evaluation results on the KITTI dataset. The best value per metric is shown in bold in the original publication; cells left blank were not recoverable from the source. (RD: KITTI Raw Depth [31]; CD: KITTI Continuous Depth [31,32]; SD: KITTI Semi-Dense Depth [31,32]; ES: Eigen Split [33]; ID: KITTI Improved Depth [34]).
| Method | Train | Test | Abs Rel | Sq Rel | RMSE | RMSElog | δ < 1.25 | δ < 1.25² | δ < 1.25³ |
|---|---|---|---|---|---|---|---|---|---|
| BTS [ ] | ES(RD) | ES(RD) | 0.060 | 0.182 | | 0.092 | 0.959 | | |
| DORN [ ] | ES(RD) | ES(RD) | 0.071 | 0.268 | 2.271 | 0.116 | 0.936 | 0.985 | 0.995 |
| VNL [ ] | ES(RD) | ES(RD) | 0.072 | 0.883 | 3.258 | 0.117 | 0.938 | 0.990 | 0.998 |
| ACAN [ ] | ES(RD) | ES(RD) | 0.083 | 0.437 | 3.599 | 0.127 | 0.919 | 0.982 | 0.995 |
| VOMonodepth [ ] | ES(RD) | ES(RD) | 0.091 | 0.548 | 3.790 | 0.181 | 0.892 | 0.956 | 0.979 |
| LSIM [ ] | FT | RD | 0.169 | 0.6531 | 3.790 | 0.195 | 0.867 | 0.954 | 0.979 |
| GASDA [ ] | ES(RD) | ES(RD) | 0.143 | 0.756 | 3.846 | 0.217 | 0.836 | 0.946 | 0.976 |
| DenseDepth [ ] | ES(RD) | ES(RD) | 0.093 | 0.589 | 4.170 | 0.171 | 0.886 | 0.965 | 0.986 |
| monoResMatch [ ] | ES(RD) | ES(RD) | 0.096 | 0.673 | 4.351 | 0.184 | 0.890 | 0.961 | 0.981 |
| EMDEOM [ ] | RD, CD | SD | 0.118 | 0.630 | 4.520 | 0.209 | 0.898 | 0.966 | 0.985 |
| monodepth2 [ ] | ES(RD) | ES(RD) | 0.115 | 0.903 | 4.863 | 0.193 | 0.877 | 0.959 | 0.981 |
| PackNet-SfM [ ] | ES(RD) | ID | 0.078 | 0.420 | 3.485 | 0.121 | 0.931 | 0.986 | 0.996 |
| DeepV2D [ ] | ES(RD) | ES(RD) | | | | | | 0.993 | 0.997 |
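The error columns (Abs Rel, Sq Rel, RMSE, RMSElog) and the three threshold-accuracy columns (δ < 1.25, 1.25², 1.25³) reported in this survey are the standard Eigen-style depth-estimation metrics. A minimal sketch of how they are typically computed from a predicted and a ground-truth depth map, assuming NumPy arrays; the function name and the default depth range used for masking are illustrative, not taken from the survey:

```python
import numpy as np

def depth_metrics(pred, gt, min_depth=1e-3, max_depth=80.0):
    """Standard monocular depth metrics over valid ground-truth pixels."""
    mask = (gt > min_depth) & (gt < max_depth)  # drop invalid/missing depth
    pred, gt = pred[mask], gt[mask]
    thresh = np.maximum(pred / gt, gt / pred)   # per-pixel max ratio
    return {
        "abs_rel":  float(np.mean(np.abs(pred - gt) / gt)),
        "sq_rel":   float(np.mean((pred - gt) ** 2 / gt)),
        "rmse":     float(np.sqrt(np.mean((pred - gt) ** 2))),
        "rmse_log": float(np.sqrt(np.mean((np.log(pred) - np.log(gt)) ** 2))),
        "d1": float(np.mean(thresh < 1.25)),        # δ < 1.25
        "d2": float(np.mean(thresh < 1.25 ** 2)),   # δ < 1.25²
        "d3": float(np.mean(thresh < 1.25 ** 3)),   # δ < 1.25³
    }
```

Lower is better for the four error metrics; higher is better for the three δ accuracies, which is why a method can lead on one group and trail on the other.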
Evaluation results on the NYU-v2 dataset. The best value per metric is shown in bold in the original publication; cells left blank were not recoverable from the source.
| Method | Abs Rel | Sq Rel | RMSE | RMSElog | δ < 1.25 | δ < 1.25² | δ < 1.25³ |
|---|---|---|---|---|---|---|---|
| BTS [ ] | 0.112 | | | 0.047 | 0.882 | 0.979 | 0.995 |
| VNL [ ] | 0.113 | 0.034 | 0.364 | 0.054 | 0.815 | | 0.993 |
| DenseDepth [ ] | 0.123 | 0.045 | 0.465 | 0.053 | 0.846 | 0.970 | 0.994 |
| ACAN [ ] | 0.123 | 0.101 | 0.496 | 0.174 | 0.826 | 0.974 | 0.990 |
| DORN [ ] | 0.138 | 0.051 | 0.509 | 0.653 | 0.825 | 0.964 | 0.992 |
| monoResMatch [ ] | 1.356 | 1.156 | 0.694 | 1.125 | 0.825 | 0.965 | 0.967 |
| monodepth2 [ ] | 2.344 | 1.365 | 0.734 | 1.134 | 0.826 | 0.958 | 0.979 |
| EMDEOM [ ] | 2.035 | 1.630 | 0.620 | 1.209 | 0.896 | 0.957 | 0.984 |
| LSIM [ ] | 2.344 | 1.156 | 0.835 | 1.175 | 0.815 | 0.943 | 0.975 |
| PackNet-SfM [ ] | 2.343 | 1.158 | 0.887 | 1.234 | 0.821 | 0.945 | 0.968 |
| GASDA [ ] | 1.356 | 1.156 | 0.963 | 1.223 | 0.765 | 0.897 | 0.968 |
| VOMonodepth [ ] | 2.456 | 1.192 | 0.985 | 1.234 | 0.756 | 0.884 | 0.965 |
| DeepV2D [ ] | 0.094 | | 0.403 | | | 0.989 | |
Comparison of the models in terms of inference time (FC: fully convolutional; CNN: convolutional neural networks). The fastest method is shown in bold in the original publication.
| Method | Inference Time | Architecture |
|---|---|---|
| BTS [ ] | 0.22 s | Encoder-decoder |
| VNL [ ] | 0.25 s | Auto-decoder |
| DeepV2D [ ] | 0.36 s | CNN |
| ACAN [ ] | 0.89 s | Encoder-decoder |
| VOMonodepth [ ] | 0.34 s | CNN |
| LSIM [ ] | 0.54 s | CNN |
| GASDA [ ] | 0.57 s | Encoder-decoder |
| DenseDepth [ ] | 0.35 s | Encoder-decoder |
| monoResMatch [ ] | 0.37 s | CNN |
| EMDEOM [ ] | 0.63 s | FC |
| DORN [ ] | 0.98 s | Encoder-decoder |
| PackNet-SfM [ ] | 0.97 s | CNN |
| monodepth2 [ ] | 0.56 s | CNN |
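Per-image inference times like those above are sensitive to measurement methodology: the first runs of a model pay one-off costs (weight loading, GPU initialization, kernel compilation), so fair comparisons exclude warm-up iterations and average over many repeats. A framework-agnostic sketch of such a timing loop; `model` is assumed to be any callable mapping one image to a depth map, and the warm-up/repeat counts are illustrative, not from the survey:

```python
import time

def mean_inference_time(model, images, warmup=2, repeats=5):
    """Average per-image latency in seconds, excluding warm-up runs."""
    for img in images[:warmup]:
        model(img)  # warm-up: caches, JIT compilation, GPU init
    start = time.perf_counter()
    for _ in range(repeats):
        for img in images:
            model(img)
    elapsed = time.perf_counter() - start
    return elapsed / (repeats * len(images))
```

With GPU frameworks, a synchronization call (e.g. waiting for the device queue to drain) would also be needed before reading the clock, since kernel launches are asynchronous.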
Properties of the low-accuracy methods trained on either the KITTI or NYU-v2 dataset (FC: fully convolutional; ED: encoder-decoder; GAN: generative adversarial network; CNN: convolutional neural networks). Cells left blank were not recoverable from the source.
| Method | Type | Optimizer | Parameters | GPU Memory | RMSE | GPU Model |
|---|---|---|---|---|---|---|
| Zhou et al. [ ] | CNN | Adam | N/A | N/A | 4.975 | N/A |
| Casser et al. [ ] | CNN | Adam | N/A | 11 GB | 4.7503 | 1080 Ti |
| Guizilini et al. [ ] | FC | Adam | 86M | N/A | 4.601 | N/A |
| Godard et al. [ ] | FC | Adam | 31M | 12 GB | 4.935 | TITAN Xp |
| Eigen et al. [ ] | CNN | Adam | N/A | 6 GB | N/A | TITAN Black |
| Guizilini et al. [ ] | ED | Adam | 79M | | 4.270 | Tesla V100 |
| Tang et al. [ ] | CNN | RMSprop | 80M | 12 GB | N/A | N/A |
| Ramamonjisoa et al. [ ] | ED | Adam | 69M | 11 GB | 0.401 | 1080 Ti |
| Riegler et al. [ ] | ED | Adam | N/A | N/A | N/A | N/A |
| Ji et al. [ ] | ED | Adam | N/A | 12 GB | 0.704 | TITAN Xp |
| Almalioglu et al. [ ] | GAN | RMSprop | 63M | 12 GB | 5.448 | TITAN V |
| Pillai et al. [ ] | CNN | Adam | 97M | | 4.958 | Tesla V100 |
| Wofk et al. [ ] | ED | SGD | N/A | N/A | 0.604 | N/A |
| Watson et al. [ ] | ED | SGD | N/A | N/A | N/A | N/A |
| Chen et al. [ ] | ED | Adam | N/A | 11 GB | 3.871 | 1080 Ti |
| Lee et al. [ ] | CNN | SGD | 61M | N/A | 0.538 | N/A |
Figure 1. Qualitative comparison of five state-of-the-art monocular depth estimation methods. From left to right: Input Image, Ground Truth, BTS [49], DeepV2D [50], DenseDepth [47], MonoResMatch [38] and DORN [18].