Kun Zhou, Xiangxi Meng, Bo Cheng.
Abstract
Stereo vision is a flourishing field that attracts the attention of many researchers. Recently, leveraging the development of deep learning, stereo matching algorithms have achieved remarkable performance, far exceeding traditional approaches. This review presents an overview of different stereo matching algorithms based on deep learning. For convenience, we classify the algorithms into three categories: (1) non-end-to-end learning algorithms, (2) end-to-end learning algorithms, and (3) unsupervised learning algorithms. We provide comprehensive coverage of the remarkable approaches in each category and summarize their respective strengths, weaknesses, and major challenges. Speed, accuracy, and time consumption are adopted to compare the different algorithms.
Year: 2020 PMID: 32273887 PMCID: PMC7125450 DOI: 10.1155/2020/8562323
Source DB: PubMed Journal: Comput Intell Neurosci
Figure 1. Geometry of epipolar lines, where C1 and C2 are the left and right camera lens centers, respectively. Point P1 in one image plane may have arisen from any of the points on the line C1P1 and may appear in the alternate image plane at any point on the epipolar line E2.
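The epipolar geometry of Figure 1 is also what turns a disparity estimate into depth: for a rectified pair, triangulation reduces to Z = f·B/d. A minimal sketch (the focal length and baseline below are roughly KITTI-like illustrative numbers, not values taken from this review):

```python
import numpy as np

def depth_from_disparity(disparity, focal_px, baseline_m):
    """Triangulate depth Z = f * B / d for a rectified stereo pair.

    disparity : array of pixel disparities (d = x_left - x_right)
    focal_px  : focal length in pixels
    baseline_m: distance between the two camera centers in meters
    """
    disparity = np.asarray(disparity, dtype=np.float64)
    depth = np.full_like(disparity, np.inf)
    valid = disparity > 0                 # zero disparity -> point at infinity
    depth[valid] = focal_px * baseline_m / disparity[valid]
    return depth

# Example with an assumed KITTI-like setup (f ~ 721 px, baseline ~ 0.54 m)
print(depth_from_disparity([64.0, 32.0, 0.0], 721.0, 0.54))
```

Note the inverse relationship: halving the disparity doubles the depth, which is why distant points are the hardest to estimate accurately.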
Figure 2. Depth map comparison between SGM (a) and GC-Net (b).
Comparison of the three frameworks.
| Framework | Representative methods | Advantages | Disadvantages |
|---|---|---|---|
| Non-end-to-end | MC-CNN, Content-CNN, SGM-Net | (1) Simple; (2) better performance than traditional methods | (1) High computational burden; (2) limited receptive field and lack of context information; (3) still relies on postprocessing |
| End-to-end | PSMNet, GC-Net | (1) High-quality disparity maps; (2) easy to design | (1) Heavy computational burden and large memory footprint; (2) long runtime; (3) needs ground-truth data |
| Unsupervised | LR-consistency check | (1) No ground-truth data needed | (1) Poor performance |
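The "LR-consistency check" in the table compares the disparity maps predicted for the left and right views: a pixel is trusted only when the two predictions agree at corresponding locations. A minimal NumPy sketch of that check (array shapes and the 1 px threshold are illustrative assumptions, not values from the paper):

```python
import numpy as np

def lr_consistency_mask(disp_left, disp_right, threshold=1.0):
    """Flag pixels whose left and right disparity predictions agree.

    For each left-image pixel at column x, look up the right-image
    disparity at the matched column x - d_left(x); a pixel is consistent
    when |d_left(x) - d_right(x - d_left(x))| <= threshold.
    """
    h, w = disp_left.shape
    xs = np.arange(w)[None, :].repeat(h, axis=0)
    # matched column in the right image, clipped to stay in bounds
    x_right = np.clip(np.round(xs - disp_left).astype(int), 0, w - 1)
    d_right = np.take_along_axis(disp_right, x_right, axis=1)
    return np.abs(disp_left - d_right) <= threshold

# Toy example: a constant 2 px disparity field is self-consistent
dl = np.full((2, 8), 2.0)
dr = np.full((2, 8), 2.0)
print(lr_consistency_mask(dl, dr).all())   # -> True
```

Pixels failing the check are typically occlusions or mismatches and are masked out or inpainted in a postprocessing step.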
Comparison of unsupervised stereo matching methods on the KITTI stereo 2015 benchmark.
| Method | Abs rel | Sq rel | RMSE | RMSE log | δ < 1.25 | δ < 1.25² | δ < 1.25³ | Runtime (s) | Environment |
|---|---|---|---|---|---|---|---|---|---|
| Luo et al. | 0.094 | — | — | 0.177 | 0.891 | 0.965 | 0.984 | — | — |
| Garg et al. | 0.169 | 1.080 | 5.104 | 0.273 | 0.740 | 0.904 | 0.962 | — | — |
| Godard et al. | — | 0.835 | 4.392 | — | — | — | — | 0.035 | Nvidia Titan X |
| Zhou et al. | 0.208 | 1.768 | 6.856 | 0.283 | 0.678 | 0.885 | 0.957 | — | — |
| Yin and Shi | 0.155 | 1.296 | 5.857 | 0.233 | 0.793 | 0.931 | 0.973 | — | Nvidia Titan X |
Figure 3. Two Siamese network structures: (a) the basic Siamese network structure to estimate the similarity between two image patches; (b) the accelerated Siamese network employing a dot layer.
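A rough sketch of the Figure 3(b) idea: both patches pass through the same shared-weight encoder, and the matching score is simply a dot product of the two embeddings, which is much cheaper than fully connected decision layers because each patch's embedding can be computed once and reused across disparities. The random projection below is only a stand-in for a learned CNN branch:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 81))     # stand-in for a learned 9x9-patch encoder

def embed(patch):
    """Shared-branch 'network': flatten a 9x9 patch and project to 64-D."""
    v = W @ patch.reshape(-1)
    return v / np.linalg.norm(v)      # normalize so the dot product is a cosine score

def similarity(patch_left, patch_right):
    """Fig. 3(b)-style matching score: dot product of the two embeddings."""
    return float(embed(patch_left) @ embed(patch_right))

p = rng.standard_normal((9, 9))
print(similarity(p, p))               # identical patches give (numerically) 1.0
print(similarity(p, rng.standard_normal((9, 9))))
```

Because both branches share `W`, the left and right feature maps can be precomputed for the whole image; scoring a candidate disparity is then a single inner product per pixel.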
Comparison of stereo matching methods using CNN for cost calculation on the KITTI stereo 2012 benchmark.
| Methods | >2 px Non-occ (%) | >2 px All (%) | >3 px Non-occ (%) | >3 px All (%) | >4 px Non-occ (%) | >4 px All (%) | >5 px Non-occ (%) | >5 px All (%) | Mean error Non-occ | Mean error All | Runtime (s) | Environment |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Deep Embed | 5.05 | 6.47 | 3.10 | 4.24 | 2.32 | 3.25 | 1.92 | 2.68 | 0.9 px | 1.1 px | 3 | Nvidia GTX Titan (CUDA, Caffe) |
| MC-CNN-acrt | — | — | — | — | — | — | — | — | 0.9 px | — | 67 | Nvidia GTX Titan (CUDA, Lua/Torch7) |
| Content-CNN | 4.98 | 6.51 | 3.07 | 4.29 | 2.39 | 3.36 | 2.03 | 2.82 | 0.8 px | 1.0 px | — | Nvidia Titan X (CUDA) |
| OCV-SGBM | 9.47 | 10.86 | — | — | — | — | — | — | — | — | 1.1 | 2.5 GHz CPU (C++) |
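Whether the per-pixel matching cost comes from a CNN (as in the methods above) or from a hand-crafted measure, it is organized into a cost volume over candidate disparities before aggregation and postprocessing. A toy NumPy sketch using absolute intensity differences as a stand-in for learned costs, with a plain winner-take-all readout (image sizes and the disparity range are illustrative):

```python
import numpy as np

def cost_volume(left, right, max_disp):
    """Per-pixel matching cost for every candidate disparity.

    Returns an (H, W, max_disp) volume; here the cost is a simple
    absolute intensity difference standing in for a learned CNN score.
    """
    h, w = left.shape
    volume = np.full((h, w, max_disp), np.inf)
    for d in range(max_disp):
        # compare left pixel (y, x) with right pixel (y, x - d)
        volume[:, d:, d] = np.abs(left[:, d:] - right[:, :w - d])
    return volume

def winner_take_all(volume):
    """Pick the lowest-cost disparity at each pixel (no aggregation)."""
    return np.argmin(volume, axis=2)

# Toy pair: scene content appears 3 px further left in the right image
# (np.roll wraps at the border, so only columns x >= 3 are meaningful)
left = np.tile(np.arange(16.0), (4, 1))
right = np.roll(left, -3, axis=1)
vol = cost_volume(left, right, max_disp=6)
print(winner_take_all(vol)[0, 3:])   # 3 everywhere in the valid region
```

Winner-take-all on raw costs is noisy on real images; this is exactly the gap that aggregation (e.g., SGM) and the postprocessing steps listed in the tables are meant to close.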
Comparison of non-end-to-end stereo matching methods using CNN for cost aggregation and postprocessing on the KITTI stereo 2012 benchmark.
| Methods | >2 px Non-occ (%) | >2 px All (%) | >3 px Non-occ (%) | >3 px All (%) | >4 px Non-occ (%) | >4 px All (%) | >5 px Non-occ (%) | >5 px All (%) | EPE Noc (px) | Runtime (s) | Environment |
|---|---|---|---|---|---|---|---|---|---|---|---|
| SGM-Net | — | 5.15 | — | 3.50 | — | 2.80 | — | 2.36 | 0.7 | — | Nvidia (R) Titan X (Torch7) |
| Displets | 3.90 | — | 2.37 | — | 1.97 | — | 1.72 | — | 0.7 | 265 | 8+ cores at 3.0 GHz (Matlab + C/C++) |
Comparison of non-end-to-end stereo matching methods using CNN for cost aggregation and postprocessing on the KITTI stereo 2015 benchmark.
| Methods | D1-bg All (%) | D1-fg All (%) | D1-all All (%) | D1-bg Non-occ (%) | D1-fg Non-occ (%) | D1-all Non-occ (%) | Runtime (s) | Environment |
|---|---|---|---|---|---|---|---|---|
| Displets | 3.00 | — | 3.43 | 2.73 | 4.95 | 3.09 | 265 | 8+ cores at 3.0 GHz (Matlab + C/C++) |
| SGM-Net | 2.66 | 8.64 | 3.66 | — | 7.44 | 3.09 | 67 | Nvidia (R) Titan X (Torch7) |
| DRR | — | 6.04 | — | 2.34 | — | — | — | Nvidia (R) Titan X (--) |
| CNN + CRF | — | — | 5.50 | — | — | 4.84 | 1.3 | C++/CUDA |
Figure 4. The two popular basic architectures for end-to-end disparity estimation: (a) 2D encoder-decoder structure; (b) 3D regularization structure.
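Architectures of the Figure 4(b) type typically regress disparity from the regularized cost volume with a differentiable soft argmin, as popularized by GC-Net: a softmax over negated costs weights each candidate disparity, yielding a sub-pixel estimate that can be trained end to end. A NumPy sketch of that operation (the cost values are made-up illustrations):

```python
import numpy as np

def soft_argmin(cost, axis=-1):
    """Differentiable disparity regression over a cost-volume slice.

    A softmax over the negated costs gives a probability per candidate
    disparity; the prediction is the probability-weighted average of the
    disparity indices, so it can take sub-pixel values.
    """
    cost = np.asarray(cost, dtype=np.float64)
    neg = -cost
    e = np.exp(neg - np.max(neg, axis=axis, keepdims=True))  # stable softmax
    p = e / e.sum(axis=axis, keepdims=True)
    d = np.arange(cost.shape[axis], dtype=np.float64)
    return (p * d).sum(axis=axis)

# A cost curve with a sharp minimum at disparity 2
print(soft_argmin([10.0, 10.0, 0.0, 10.0]))   # close to 2.0
```

Unlike a hard argmin, this operation has useful gradients everywhere, which is what lets the 3D regularization stack be supervised directly with a disparity regression loss.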
Comparison of end-to-end stereo matching methods on the KITTI stereo 2012 benchmark.
| Methods | >2 px Non-occ (%) | >2 px All (%) | >3 px Non-occ (%) | >3 px All (%) | >4 px Non-occ (%) | >4 px All (%) | >5 px Non-occ (%) | >5 px All (%) | EPE Noc (px) | Runtime (s) | Environment |
|---|---|---|---|---|---|---|---|---|---|---|---|
| PSMNet | 2.44 | 3.01 | 1.49 | 1.89 | 1.12 | 1.42 | 0.90 | 1.15 | 0.5 | 0.41 | Nvidia Titan Xp (CUDA) |
| SegStereo | 2.66 | 3.19 | 1.68 | 2.03 | 1.25 | 1.52 | 1.00 | 1.21 | 0.5 | 0.6 | Caffe |
| iResNet | 2.69 | 3.34 | 1.71 | 2.16 | 1.30 | 1.63 | 1.06 | 1.32 | 0.5 | 0.12 | Nvidia Titan X (Caffe) |
| GC-Net | 2.71 | 3.46 | 1.77 | 2.30 | 1.36 | 1.77 | 1.12 | 1.46 | 0.6 | 0.9 | Nvidia Titan X (--) |
| PDSNet | 3.82 | 4.64 | 1.92 | 2.53 | 1.38 | 1.85 | 1.12 | 1.51 | 0.9 | 0.5 | Nvidia Titan X |
| L-ResMatch | 3.64 | 5.06 | 2.27 | 3.40 | 1.76 | 2.67 | 1.50 | 2.26 | 0.7 | 48 | Nvidia Titan X |
| DispNet | 7.38 | 8.11 | 4.11 | 4.65 | 2.77 | 3.20 | 2.05 | 2.39 | 0.9 | — | Nvidia Titan X |
| EdgeStereo | 2.32 | 2.88 | 1.46 | 1.83 | — | — | 0.83 | 1.04 | — | 0.32 | Nvidia GTX 1080Ti (Caffe) |
| GwcNet-gc | — | — | — | — | — | — | — | — | 0.5 | 0.32 | Nvidia Titan Xp (--) |
Comparison of end-to-end stereo matching methods on the KITTI stereo 2015 benchmark.
| Methods | D1-bg All (%) | D1-fg All (%) | D1-all All (%) | D1-bg Non-occ (%) | D1-fg Non-occ (%) | D1-all Non-occ (%) | Runtime (s) | Environment |
|---|---|---|---|---|---|---|---|---|
| PSMNet | 1.86 | 4.62 | 2.32 | 1.71 | 4.31 | 2.14 | 0.41 | Nvidia Titan Xp (CUDA) |
| SegStereo | 1.88 | 4.07 | 2.25 | 1.76 | 3.70 | 2.08 | 0.6 | Caffe |
| iResNet | 2.25 | 3.40 | 2.44 | 2.07 | — | 2.19 | 0.12 | Nvidia Titan X (Caffe) |
| GC-Net | 2.21 | 6.16 | 2.87 | 2.02 | 5.58 | 2.61 | 0.9 | Nvidia Titan X (--) |
| PDSNet | 2.29 | 4.05 | 2.58 | 2.09 | 3.68 | 2.36 | 0.5 | Nvidia Titan X |
| L-ResMatch | 2.72 | 6.95 | 3.42 | 2.35 | 5.74 | 2.91 | 48 | Nvidia Titan X |
| EdgeStereo | 1.84 | — | — | 1.69 | 2.94 | — | 0.32 | Nvidia GTX 1080Ti (Caffe) |
| CRL | 2.48 | 3.59 | 2.67 | 2.32 | 3.12 | 2.45 | 0.47 | Nvidia GTX 1080 |
| LRCR | 2.55 | 5.42 | 3.03 | 2.23 | 4.19 | 2.55 | 49.2 | — |
| DispNet | 4.32 | 4.41 | 4.34 | 4.11 | 3.72 | 4.05 | — | Nvidia Titan X |
| GwcNet-gc | — | 3.93 | 2.11 | — | 3.49 | 1.92 | 0.32 | Nvidia Titan Xp (--) |
| SCV-Net | 2.22 | 4.53 | 2.61 | 2.04 | 4.28 | 2.41 | 0.36 | Nvidia GTX 1080Ti |
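For reference, the D1 columns used by the KITTI 2015 benchmark count a pixel as erroneous when its disparity error exceeds both 3 px and 5% of the ground-truth disparity; D1-bg, D1-fg, and D1-all restrict the average to background, foreground, or all labeled pixels. A sketch of the core computation (the array values are made up for illustration):

```python
import numpy as np

def d1_error(disp_est, disp_gt, valid=None):
    """KITTI 2015 D1 metric: percentage of pixels whose disparity error
    exceeds both 3 px and 5% of the ground-truth disparity."""
    disp_est = np.asarray(disp_est, dtype=np.float64)
    disp_gt = np.asarray(disp_gt, dtype=np.float64)
    if valid is None:
        valid = disp_gt > 0                    # pixels with ground truth
    err = np.abs(disp_est - disp_gt)
    bad = (err > 3.0) & (err > 0.05 * disp_gt)
    return 100.0 * bad[valid].mean()           # percentage, as in the tables

gt = np.array([[10.0, 100.0, 50.0, 0.0]])      # 0 = no ground truth
est = np.array([[10.5, 90.0, 54.0, 7.0]])
print(d1_error(est, gt))
```

The relative 5% term keeps large-disparity (near-field) pixels from being penalized for errors that are small in depth terms.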
Figure 5. Standard pipeline of unsupervised stereo matching algorithms.
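Pipelines like the one in Figure 5 are usually trained with a photometric reconstruction loss: warp one view toward the other using the predicted disparity and penalize the appearance difference, so no ground-truth disparity is needed. A 1-D NumPy sketch on single image rows with a plain L1 penalty (real methods typically add SSIM and smoothness terms; the signals below are made up):

```python
import numpy as np

def warp_right_to_left(right_row, disp_row):
    """Reconstruct a left-image row by sampling the right row at x - d(x),
    with linear interpolation for sub-pixel disparities."""
    w = right_row.shape[0]
    x = np.clip(np.arange(w) - disp_row, 0, w - 1)
    x0 = np.floor(x).astype(int)
    x1 = np.minimum(x0 + 1, w - 1)
    frac = x - x0
    return (1 - frac) * right_row[x0] + frac * right_row[x1]

def photometric_loss(left_row, right_row, disp_row):
    """Mean L1 difference between the left row and its reconstruction."""
    return float(np.abs(left_row - warp_right_to_left(right_row, disp_row)).mean())

# Toy rows related by a constant disparity of 2 px: the correct disparity
# gives a small loss (only clipped border pixels contribute)
left = np.sin(0.3 * np.arange(2, 18))
right = np.sin(0.3 * np.arange(4, 20))
print(photometric_loss(left, right, np.full(16, 2.0)))  # small
print(photometric_loss(left, right, np.zeros(16)))      # much larger
```

Because the warp is differentiable in the disparity, minimizing this loss by gradient descent drives the network toward geometrically consistent disparities without any labeled data.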